Traditional Chinese medicine diagnosis and treatment collaboration method based on knowledge graph and large model double driving
By combining the construction of a TCM knowledge graph and a large language model, the problems of false drug combinations and lack of traceability in TCM auxiliary diagnosis and treatment are solved, thus achieving compliance and accuracy in TCM diagnosis and treatment and providing interpretable auxiliary diagnosis and treatment reference information.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- SICHUAN FUTAIMEI TECH CO LTD
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
AI Technical Summary
Existing large language models used for TCM-assisted diagnosis and treatment lack strict pathological knowledge and drug incompatibility rules. As a result, they produce hallucinated drug combinations and outputs without traceable evidence, which makes effective collaboration in real clinical environments difficult.
By constructing a TCM knowledge graph, using a state machine conflict detection mechanism to filter newly added knowledge triples, and combining a large language model to retrieve the logical chain of deduction from syndrome to method, method to prescription, and prescription to medicine, structured medical evidence is generated. This evidence is then concatenated with contextual prompts to generate TCM auxiliary diagnostic and treatment reference information.
The method makes TCM diagnosis and treatment output traceable and compliant, reduces the probability of hallucinated drug combinations, improves the accuracy and transparency of prescription recommendations, and provides interpretable medical evidence.
[Figure: abstract drawing, CN122245648A_ABST]
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence and smart medical information processing technology, specifically a collaborative method for TCM diagnosis and treatment based on a dual-drive approach of knowledge graphs and large models.
Background Technology
[0002] Leveraging artificial intelligence (AI) to empower healthcare has become a significant trend in digital healthcare. However, Traditional Chinese Medicine (TCM) is strongly empirical, and the underlying logic of "syndrome differentiation and treatment" is highly complex. The implicit and fragmented nature of TCM knowledge severely restricts the in-depth application of AI-assisted diagnosis and treatment systems. Most existing TCM-assisted solutions rely solely on conventional pre-trained large language models, which are essentially text generators driven by probability distributions. Applied directly to TCM, they often fail to internalize its rigorous deductive logic, which progresses from syndrome to method, method to prescription, and prescription to medicine. Without rigorous pathological knowledge and drug incompatibility rules as underlying constraints, large language models are prone to hallucinating drug combinations that contradict medical common sense when outputting prescription suggestions, creating serious medical compliance and medication safety risks.
[0003] Meanwhile, real outpatient clinics see many locally characteristic syndromes, special constitutions, and medical records of difficult and complicated diseases. Such multi-dimensional, fragmented TCM clinical data is typically long-tailed, small-sample data at the model pre-training stage. Conventional large language models therefore fit complex disease progression poorly and lack the ability to generalize and reason by analogy. In addition, most current TCM AI systems output unconstrained free-form natural language text, detached from the standard digital coding system on which hospitals' existing electronic medical record and prescription circulation systems rely. Such plain-text output, lacking pre-structured constraints and cross-modal semantic alignment, is difficult for clinical business systems to parse and call directly, and it cannot give practitioners a transparent, traceable basis for reasoning. As a result, the technology has remained at the proof-of-concept stage, and true diagnosis and treatment collaboration in real high-concurrency outpatient environments is difficult to achieve.
Summary of the Invention
[0004] To address the shortcomings of the prior art, this invention provides a collaborative method for TCM diagnosis and treatment based on a dual-drive approach of knowledge graph and large model. The method addresses the problem that existing large language models in TCM auxiliary diagnosis and treatment lack underlying deductive logic and pharmacological contraindication constraints, which easily leads to hallucinated drug combinations and to diagnostic and treatment outputs without traceable evidence.
[0005] To achieve the above objectives, the present invention provides the following technical solution: a collaborative method for TCM diagnosis and treatment based on a dual-drive approach of knowledge graph and large model, comprising the following steps: acquiring TCM electronic medical record text data to be processed, extracting natural language terms from the text data, and mapping the natural language terms to a preset set of TCM standard digital codes to obtain structured TCM electronic medical record data; extracting TCM knowledge triples from multi-source TCM data, which includes ancient TCM books, clinical literature, and medical case data, wherein during extraction a state machine conflict detection mechanism is invoked to detect and filter newly added knowledge triples according to preset pathology and drug property contraindication rules, and a TCM knowledge graph containing disease and syndrome, treatment method, prescription, and Chinese herbal medicine entity nodes is constructed and dynamically updated; converting the structured TCM electronic medical record data into contextual prompt text representing the patient's current disease status; matching, from the TCM knowledge graph, the disease and syndrome node entities corresponding to the contextual prompt text as the retrieval starting point, and performing path retrieval from that starting point along the deductive logic chain from syndrome to method, method to prescription, and prescription to medicine to extract the corresponding structured medical evidence; and using the structured medical evidence as retrieval-augmented prompt text, concatenating it with the contextual prompt text, and inputting the result into a pre-trained large language model for inference, which outputs TCM auxiliary diagnosis and treatment reference information.
[0006] Furthermore, mapping the natural language terms to the preset set of TCM standard digital codes to obtain structured TCM electronic medical record data includes: extracting the semantic feature vectors of the natural language terms by calling a pre-trained language model; and calculating the cosine similarity between the semantic feature vector and the feature vector of each standard code in the preset set of TCM standard digital codes, using the following similarity formula: $\mathrm{Sim}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$; where $\mathrm{Sim}(A, B)$ is the cosine similarity between the semantic feature vector of the input term and the standard code feature vector; $A$ is the semantic feature vector of the natural language term output by the pre-trained language model; and $B$ is the standard code feature vector pre-stored in the standard library, extracted in advance from historical standardized TCM terminology data. The standard digital code corresponding to the maximum cosine similarity is selected as the optimal mapping code to generate the structured TCM electronic medical record data.
[0007] Furthermore, the construction and dynamic updating of a TCM knowledge graph containing entity nodes for diseases, treatments, prescriptions, and Chinese herbal medicines includes establishing a TCM knowledge ontology framework: The TCM knowledge ontology framework is based on the logical chain of deduction from syndrome to method, method to prescription, and prescription to medicine. In the schema design of the graph database, entity types and relationships between entities are defined, including viscera, meridians, etiology, syndromes, and Chinese medicine. State attribute fields and constraint transformation rules representing the disease course relationship are configured in the node model of the corresponding entity.
[0008] Furthermore, the extraction of TCM knowledge triples based on multi-source TCM data includes: extracting, from the multi-source TCM data, multiple candidate knowledge representations corresponding to the same entity; extracting the preset source weight identifier corresponding to each of the multi-source TCM data; and calculating the final knowledge confidence of the entity by weighting and summing the preset source weight identifiers and the extraction confidence scores of the corresponding data sources, using the following weighted fusion formula: $C = \sum_{i=1}^{n} w_i \cdot s_i$; where $C$ is the final knowledge confidence of the same entity across multiple data sources; $n$ is the number of candidate knowledge sources extracted for this entity; $s_i$ is the confidence score output by the algorithm model when extracting the entity from the $i$-th data source; and $w_i$ is the preset source weight identifier of the $i$-th data source, set in advance from historical experience data according to the reliability level of the data source. For example, the weight of national clinical guidelines is higher than that of ordinary medical case data. Entities with a final knowledge confidence greater than a preset confidence threshold are extracted for semantic disambiguation and fusion. The preset confidence threshold is set in advance based on historical model evaluation experience.
[0009] Furthermore, the detection and filtering of newly added knowledge triples by the state machine conflict detection mechanism, based on preset pathological and drug incompatibility rules, includes: retrieving the set of newly added knowledge triples using an event-driven pattern; calling the state machine conflict detection function to check whether the combination of head entity, relation, and tail entity in a newly added knowledge triple triggers a pathological or drug incompatibility rule; if no rule is triggered, merging the newly added knowledge triple into the current version of the TCM knowledge graph; and if a rule is triggered, marking the newly added knowledge triple as a conflict and generating an entry verification task instruction, so that conflicting data is prevented from being entered into the database.
[0010] Furthermore, before the structured medical evidence and the contextual prompt text are input into the pre-trained large language model, the method further includes: constructing a multimodal TCM corpus from the structured TCM electronic medical record data, clinical medical record corpus, and TCM classic texts; preprocessing the collected raw text data by data cleaning, deduplication, privacy anonymization, and terminology standardization; systematically annotating the preprocessed text data, based on preset TCM professional rules, with a four-dimensional structure comprising an entity layer, a relation layer, an attribute layer, and a derivation logic layer; obtaining the model pre-labeled labels and the manual quality inspection benchmark labels corresponding to the sampled test batches; calculating the accuracy index, the Kappa coefficient consistency index, and the ontology concept coverage index between the model pre-labeled labels and the manual quality inspection benchmark labels, and, when any index is lower than the preset quality evaluation threshold, triggering re-cleaning and supplementary labeling of the text data of the corresponding test batch; and fine-tuning the initial large language model on the multimodal TCM corpus to obtain the pre-trained large language model.
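The quality check in this paragraph can be illustrated with a minimal sketch. The source names a "Kappa coefficient" without specifying the variant; the code below assumes Cohen's kappa for two label sequences. The function names, the sample labels, and the 0.8 threshold are hypothetical, and the ontology concept coverage index is omitted because the source does not define it.

```python
from collections import Counter

def accuracy(pred, gold):
    """Fraction of positions where the pre-label equals the benchmark label."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def cohen_kappa(pred, gold):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(gold)
    p_observed = accuracy(pred, gold)
    pred_counts, gold_counts = Counter(pred), Counter(gold)
    p_expected = sum(pred_counts[label] * gold_counts[label]
                     for label in gold_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # assumption: both sides used a single identical label
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical sampled batch: model pre-labels vs. manual benchmark labels.
pre_labels = ["A", "A", "B", "B"]
benchmark = ["A", "B", "B", "B"]
QUALITY_THRESHOLD = 0.8  # hypothetical preset quality evaluation threshold

metrics = {
    "accuracy": accuracy(pre_labels, benchmark),
    "kappa": cohen_kappa(pre_labels, benchmark),
}
# Any index below the threshold triggers re-cleaning and supplementary labeling.
needs_relabel = any(value < QUALITY_THRESHOLD for value in metrics.values())
```

In this toy batch both indices fall below the assumed threshold, so the batch would be sent back for re-cleaning and supplementary labeling.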
[0011] Furthermore, matching the disease and syndrome node entities corresponding to the contextual prompt text as the retrieval starting point and extracting the corresponding structured medical evidence includes: calculating the similarity between the contextual prompt text and the semantic feature vectors corresponding to the disease and syndrome node entities in the TCM knowledge graph; establishing the disease and syndrome node entities whose similarity is greater than a specified threshold as the retrieval starting point; and, starting from the retrieval starting point, taking the associated triples along the connection paths that follow the deductive logic chain as the structured medical evidence.
[0012] Furthermore, the step of acquiring the TCM electronic medical record text data to be processed and extracting natural language terms from the text data includes: A preset pre-structure template is loaded into the system interface, and the pre-structure template defines the categorized symptom, syndrome, method, prescription, and medicine input fields; The initial input text corresponding to each field is obtained through the pre-structure template. The natural language processing program is called to perform word segmentation and part-of-speech tagging on the initial input text to extract the natural language terms.
[0013] This invention provides a collaborative method for TCM diagnosis and treatment based on a dual-drive approach of knowledge graph and large-scale model. It has the following beneficial effects: 1. This invention generates structured medical record data by extracting natural language terms from medical records and mapping them to a set of standard digital codes for traditional Chinese medicine. This setting eliminates the data format differences between text descriptions and business systems, enabling the output auxiliary diagnostic reference information to be parsed and accessed by clinical electronic medical record systems.
[0014] 2. When extracting multi-source TCM knowledge triples, this invention calls the state machine conflict detection mechanism to filter newly added node combinations based on pathological and drug incompatibility rules. This step prevents conflicting data that does not conform to TCM compatibility standards from being written into the database, thus eliminating potential causes of medical risk at the underlying knowledge source.
[0015] 3. In this invention, the system takes the disease and syndrome nodes matched in the medical records as the starting point for retrieval, and traverses the graph path according to the deductive logic chain from syndrome to method, method to prescription, and prescription to medicine to extract medical evidence. This method provides entity association basis for the process of syndrome differentiation and treatment, so that the output prescription information has a traceable and coherent reasoning path.
[0016] 4. In this invention, the extracted structured medical evidence is established as retrieval enhancement prompt text, which is then assembled with the medical record context and input into a pre-trained large language model. This mechanism uses the correlation between graph nodes to define the character generation boundary of the model, reducing the probability of drug compatibility hallucinations generated by autoregressive inference calculations.
[0017] 5. In this invention, structured electronic medical record data is converted into contextual prompt text representing the patient's disease status. This text sequence serves as a bidirectional input source for synchronously triggering graph node matching and the calculation of underlying parameters of the large model, thus unifying the data feature structure between the path retrieval algorithm and the probability generation task.
Attached Figure Description
[0018] Figure 1 is the overall flowchart of the TCM diagnosis and treatment collaboration method of the present invention; Figure 2 is a flowchart of the structured TCM electronic medical record data mapping and processing of the present invention; Figure 3 is a schematic diagram illustrating the construction and dynamic updating of the TCM knowledge graph of the present invention; Figure 4 is a flowchart illustrating the construction of the multimodal TCM corpus and the fine-tuning of the large model in this invention; Figure 5 is a schematic diagram illustrating the splicing of prompt instruction templates and the generation of large model reasoning in this invention; Figure 6 is a schematic diagram of the hardware structure of the electronic device of the present invention.
Detailed Implementation
[0019] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0020] Referring to Figures 1 to 6, this invention provides a collaborative method for TCM diagnosis and treatment based on a dual-drive approach of knowledge graph and large language model, executed on a server configured with a graphics processing unit (GPU) computing cluster and a graph database. In this embodiment, the graph database is the Neo4j native graph database, and the large language model is a locally deployed DeepSeek large language model, accelerated by a dedicated AI computing server.
[0021] In this invention, when the server performs data acquisition operations, it triggers the ethics compliance verification module. This module reads the authorization certificate table of the data source to confirm that all accessed TCM electronic medical record text data has obtained informed consent from patients and has been approved by the hospital's ethics committee. Regarding algorithmic fairness, the system's weight configuration file incorporates a non-discrimination constraint script, shielding input based on patient gender, age, and regional attributes, ensuring that inference calculations are performed solely based on objective physical signs from the four diagnostic methods and TCM syndrome differentiation logic. Regarding controllable social impact, the system mandates that the output TCM auxiliary diagnostic reference information be marked as auxiliary decision-making basis, instructing the hospital terminal device to transmit this information to the operating interface of licensed TCM physicians, where human physicians perform the final medical prescription review, forming a closed-loop risk control mechanism.
[0022] This collaborative method for TCM diagnosis and treatment, driven by both a knowledge graph and a large model, may include the following steps: acquiring TCM electronic medical record text data to be processed, extracting natural language terms from the text data, and mapping the natural language terms to a preset set of TCM standard digital codes to obtain structured TCM electronic medical record data; extracting TCM knowledge triples from multi-source TCM data, which includes ancient TCM books, clinical literature, and medical case data, wherein during extraction a state machine conflict detection mechanism is invoked to detect and filter newly added knowledge triples according to preset pathology and drug property contraindication rules, and a TCM knowledge graph containing disease and syndrome, treatment method, prescription, and Chinese herbal medicine entity nodes is constructed and dynamically updated; converting the structured TCM electronic medical record data into contextual prompt text representing the patient's current disease status; matching, from the TCM knowledge graph, the disease and syndrome node entities corresponding to the contextual prompt text as the retrieval starting point, and performing path retrieval from that starting point along the deductive logic chain from syndrome to method, method to prescription, and prescription to medicine to extract the corresponding structured medical evidence; and using the structured medical evidence as retrieval-augmented prompt text, concatenating it with the contextual prompt text, and inputting the result into a pre-trained large language model for inference, which outputs TCM auxiliary diagnosis and treatment reference information.
[0023] The server establishes a data receiving interface, which uses a security protocol to securely interface with the hospital's existing HIS (Hospital Information System) and EMR (Electronic Medical Record System) to read TCM electronic medical record text data. The server calls a text parser resident in memory, which extracts natural language terms from the TCM electronic medical record text data. The server executes a terminology mapping program, which runs using a feature vector matching calculation method.
[0024] The server calls the language model to process natural language terms and outputs multi-dimensional numerical vectors. The server then performs the cosine similarity calculation, measuring the similarity between the current natural language term vector and each standard code vector in the library.
[0025] The server establishes a knowledge graph construction engine, which reads ancient Chinese medicine books, clinical literature, and medical case data. The engine extracts entities and the paths between entities, generating knowledge triples. The server loads a state machine module, which runs conflict detection logic. The conflict detection logic reads pre-set pathology and drug incompatibility rule files. The state machine module intercepts triples that trigger incompatibilities. The server uses a graph database to store triples that do not trigger incompatibilities. The graph database records disease symptoms, treatment methods, prescriptions, and Chinese medicine node entities.
[0026] The server executes a data conversion program, which assembles structured TCM electronic medical record data into contextual prompt text. This contextual prompt text serves as the input data source for the graph retrieval module, which locates disease and syndrome node entities in the TCM knowledge graph. The graph retrieval module establishes this disease and syndrome node entity as the starting point for retrieval, traversing the network path along a pre-defined derivation logic chain. The traversal path follows the node jump order from syndrome to method, method to prescription, and prescription to medicine, extracting the entity structure along the traversal path to generate structured medical evidence.
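The traversal described above can be sketched as a relation-ordered walk over an adjacency map. This is only an illustration under stated assumptions: the embodiment stores the graph in Neo4j, whereas the sketch uses an in-memory dictionary, and the relation names (`treated_by`, `uses_prescription`, `contains_herb`) and all node names are hypothetical placeholders.

```python
# Hypothetical relation names for the chain:
# syndrome -> method -> prescription -> medicine.
CHAIN = ["treated_by", "uses_prescription", "contains_herb"]

# Toy adjacency map standing in for the graph database.
GRAPH = {
    ("syndrome_X", "treated_by"): ["method_Y"],
    ("method_Y", "uses_prescription"): ["prescription_Z"],
    ("prescription_Z", "contains_herb"): ["herb_A", "herb_B"],
}

def retrieve_evidence(start, graph=GRAPH, chain=CHAIN):
    """Follow the chain relations in order, collecting every traversed triple."""
    evidence = []
    frontier = [start]
    for relation in chain:
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get((node, relation), []):
                evidence.append((node, relation, neighbor))
                next_frontier.append(neighbor)
        frontier = next_frontier
    return evidence

evidence = retrieve_evidence("syndrome_X")
```

Because each hop only follows the next relation in the chain, every returned triple lies on a path that starts at the matched syndrome node, which is what makes the resulting evidence traceable.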
[0027] A large language model inference program resides in the server's memory. The server performs string concatenation operations, combining structured medical evidence with contextual prompts to generate target prompt text. The large language model inference program reads this target prompt text, performs low-level parameter calculations and character sequence prediction, and outputs TCM auxiliary diagnostic reference information. The terminal device receives and displays this TCM auxiliary diagnostic reference information.
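The string concatenation step might look like the following sketch. The prompt layout, the section labels, and the function name `build_target_prompt` are assumptions made for illustration; the source only states that the structured medical evidence and the contextual prompt text are concatenated into the target prompt text.

```python
def build_target_prompt(evidence, context_prompt):
    """Concatenate evidence triples and the patient context into one prompt string."""
    lines = ["Structured medical evidence:"]
    lines += [f"- {head} --{relation}--> {tail}" for head, relation, tail in evidence]
    lines += ["", "Patient context:", context_prompt]
    return "\n".join(lines)

# Hypothetical evidence triples and context text.
sample_evidence = [
    ("syndrome_X", "treated_by", "method_Y"),
    ("method_Y", "uses_prescription", "prescription_Z"),
]
target_prompt = build_target_prompt(sample_evidence, "Chief complaint: placeholder text.")
```

Placing the evidence before the patient context is a design choice of this sketch, not something the source prescribes.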
[0028] In this embodiment, the server loads a preset pre-structure template on the system interface of the terminal device. This pre-structure template defines independent symptom, syndrome, method, prescription, and medicine input fields within the database system table structure. The system receives input operations from the client interface and obtains the initial input text corresponding to each input field.
[0029] The server invokes a natural language processing (NLP) program to process the initial input text. The NLP program, relying on its internal word segmentation algorithm, performs sequence segmentation, dividing the sentence into discrete word sequences. The NLP program loads a Traditional Chinese Medicine (TCM) corpus dictionary and performs part-of-speech tagging on the segmented word sequences. The NLP program executes a conditional filtering function to remove stop words that do not meet the matching rules and extracts natural language terms with entity attributes.
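As a rough illustration of segmentation followed by dictionary-based filtering, the sketch below uses forward maximum matching. The source does not name its word segmentation algorithm, so this technique is an assumption, as are the tiny dictionary, its tags, and the function names.

```python
# Hypothetical TCM term dictionary with part-of-speech style tags.
TERM_DICT = {"头痛": "symptom", "发热": "symptom"}  # headache, fever
MAX_LEN = max(len(word) for word in TERM_DICT)

def segment(text):
    """Forward maximum matching: prefer the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in TERM_DICT:
                tokens.append(piece)
                i += size
                break
    return tokens

def extract_terms(text):
    """Keep only tokens that carry an entity tag in the dictionary."""
    return [(token, TERM_DICT[token]) for token in segment(text) if token in TERM_DICT]

tokens = segment("头痛发热三天")  # "headache and fever for three days"
terms = extract_terms("头痛发热三天")
```

Tokens without a dictionary tag play the role of the filtered stop words here; a production system would use a dedicated stop-word list and a trained tagger.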
[0030] In this invention, the server invokes a pre-trained language model residing in the hardware's video memory. The system converts natural language terms into input tensors and feeds them into the pre-trained language model. The model performs matrix multiplication, extracts the contextual semantic information of the natural language terms, and outputs the corresponding semantic feature vectors. The server reads a pre-defined set of TCM standard digital codes from the read-only storage area. This set is constructed with reference to the Unified Medical Language System (UMLS), and a data integration mapping table is established based on the HL7 FHIR and OMOP Common Data Model standards.
[0031] The server starts the compute node to calculate the cosine similarity between the semantic feature vector and the feature vector of each standard code in the TCM standard digital coding set. The compute node uses the following formula: $\mathrm{Sim}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$, where $\mathrm{Sim}(A, B)$ is the cosine similarity between the semantic feature vector of a natural language term and the feature vector of a specific standard code; $A$ is the semantic feature vector of the natural language term, computed by the forward propagation of the pre-trained language model; and $B$ is the standard code feature vector pre-stored in the TCM standard digital coding set. The system obtains $B$ by inputting historical standardized TCM terminology data into a language model with the same structure and stores it in the vector index library.
[0032] The comparison unit inside the server receives a set of cosine similarity results and performs a sorting and comparison operation. The comparison unit selects the maximum cosine similarity value. The system queries the relational database to locate the standard digital code to which the maximum value belongs and establishes it as the optimal mapping code. The server uses this optimal mapping code to construct structured record objects and generate structured TCM electronic medical record data. The server writes the generated structured TCM electronic medical record data to the memory cache.
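The similarity calculation and maximum selection described in the preceding paragraphs can be sketched in plain Python. The code identifiers such as `TCM-SYN-001` and the three-dimensional vectors are hypothetical; in the embodiment the vectors come from the pre-trained language model and the vector index library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def best_mapping_code(term_vector, code_vectors):
    """Return (code, similarity) for the standard code with the highest similarity."""
    scored = ((code, cosine_similarity(term_vector, vec))
              for code, vec in code_vectors.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical standard codes with toy feature vectors.
CODE_VECTORS = {
    "TCM-SYN-001": [0.9, 0.1, 0.0],
    "TCM-SYN-002": [0.1, 0.9, 0.2],
}
code, score = best_mapping_code([0.8, 0.2, 0.1], CODE_VECTORS)
```

With a large code set, an approximate nearest-neighbor index would typically replace the linear scan, but the selection rule stays the same.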
[0033] In this embodiment, the server performs the construction and dynamic updating of the TCM knowledge graph. The system establishes a TCM knowledge ontology framework, defining a logical chain from syndrome to method, method to prescription, and prescription to medicine as the basic structure of this ontology framework. At the schema design level of the graph database, entity types are defined, covering viscera, meridians, etiology, syndromes, and TCM entity nodes. The graph database engine uses Neo4j, and state attribute fields representing specific disease progression relationships are configured in the node models of corresponding entities. A graph schema containing state attributes and constraint transformation rules is also designed.
[0034] In this invention, the information extraction algorithm component reads multi-source traditional Chinese medicine (TCM) data and extracts multiple candidate knowledge representations corresponding to the same entity. The system then extracts the preset source weight identifier corresponding to each multi-source TCM data set. The computing node performs a weighted summation over the preset source weight identifiers and the extraction confidence scores of the corresponding data sources to calculate the final knowledge confidence of the entity. The system uses the following formula: $C = \sum_{i=1}^{n} w_i \cdot s_i$, where $C$ is the final knowledge confidence of a specific entity after merging multi-source data; $n$ is the total number of candidate knowledge sources hit when extracting this entity; $s_i$ is the extraction confidence score output by the knowledge extraction model when extracting the entity from the $i$-th data source; and $w_i$ is the preset source weight identifier of the $i$-th data source. The system takes the value of $w_i$ from the historical document grading standard, sets it as a parameter, and writes it into a memory vector. The weight of a data source corresponding to national-level clinical guidelines is greater than that of a data source corresponding to ordinary medical journals.
[0035] The server's built-in comparator receives the final knowledge confidence score and compares it with a preset confidence threshold in the system memory. This confidence threshold is preset based on the accuracy evaluation index of the historical model data. When the final knowledge confidence score is determined to be greater than the preset confidence threshold, the data fusion program extracts the aforementioned entities for semantic disambiguation, merges entity objects with the same semantic reference, and outputs a set of new knowledge triples to be added to the database.
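A minimal sketch of the weighted fusion and threshold comparison follows. The weights, confidence scores, and the 0.6 threshold are hypothetical values; the source does not state whether the weights are normalized, and the example simply assumes weights that sum to 1.

```python
def fuse_confidence(candidates):
    """Weighted sum over (source_weight, extraction_confidence) pairs."""
    return sum(weight * confidence for weight, confidence in candidates)

# Hypothetical candidates for one entity: (source weight, extraction confidence).
candidates = [
    (0.5, 0.9),  # e.g. a national-level clinical guideline (highest weight)
    (0.3, 0.8),  # e.g. clinical literature
    (0.2, 0.5),  # e.g. an ordinary medical case record
]
CONFIDENCE_THRESHOLD = 0.6  # hypothetical preset confidence threshold

final_confidence = fuse_confidence(candidates)
# Only entities above the threshold proceed to semantic disambiguation and fusion.
accepted = final_confidence > CONFIDENCE_THRESHOLD
```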
[0036] The system's underlying event listener adopts an event-driven model: it captures the set of newly added knowledge triples and triggers the state machine module. In parallel, a resident scheduled task in the server background executes periodic scan commands at a preset interval, retrieves historical node data in the graph database that has exceeded a preset validity period, and pushes it to the state machine module. Together these form a hybrid update mechanism combining event-driven and periodic scanning. The state machine module calls its internal conflict detection function, which reads the preset pathology and drug incompatibility rule files and the latest clinical guideline version library from the rule storage system. The incompatibility rule files contain structured records of the compatibility contraindications stipulated in traditional Chinese medicine and of the drug contraindications that apply under specific pathological syndrome states.
[0037] The state machine module parses the newly added knowledge triples, checking whether the logical combination of the head entity, relation, and tail entity within the triples triggers preset pathological and medicinal contraindication rules, or whether it logically conflicts with the latest clinical guidelines. If the system comparison result does not trigger any rules, the database write interface calls the Cypher statement to directly merge the corresponding newly added knowledge triples into the current version of the TCM knowledge graph. If the comparison result triggers relevant contraindication rules, the state machine module immediately suspends the current data write transaction, adds a conflict identifier field to the newly added knowledge triples that triggered the rules, generates an inbound verification task instruction containing data traceability information, and blocks the inbound of conflicting data. For historical node data that conflicts with the latest guidelines, the state machine module adds an outdated knowledge tag and generates a new node version number. The graph database engine archives the nodes carrying outdated knowledge tags to the historical data partition and establishes traceability edges pointing to the new version nodes, realizing version management and data traceability of the knowledge graph.
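The merge-or-block decision for newly added triples can be sketched as a small state machine. Everything named here is an assumption for illustration: the three states, the `Triple` class, the rule table keyed on exact (head, relation, tail) combinations, and the placeholder herb names. The embodiment writes to Neo4j through Cypher, which the sketch replaces with an in-memory set, and the versioning of outdated nodes is not covered.

```python
from dataclasses import dataclass
from enum import Enum

class TripleState(Enum):
    PENDING = "pending"
    MERGED = "merged"
    CONFLICT = "conflict"

@dataclass
class Triple:
    head: str
    relation: str
    tail: str
    source: str  # provenance kept for the verification task
    state: TripleState = TripleState.PENDING

# Hypothetical rule table: combinations that must never enter the graph.
CONTRAINDICATION_RULES = {("herb_A", "combined_with", "herb_B")}

def process_new_triples(triples, graph, review_queue):
    """Move each pending triple to MERGED or CONFLICT according to the rules."""
    for triple in triples:
        key = (triple.head, triple.relation, triple.tail)
        if key in CONTRAINDICATION_RULES:
            triple.state = TripleState.CONFLICT
            # Blocked from the graph; a verification task records the provenance.
            review_queue.append({"triple": key, "source": triple.source})
        else:
            triple.state = TripleState.MERGED
            graph.add(key)

graph, review_queue = set(), []
new_triples = [
    Triple("syndrome_X", "treated_by", "method_Y", "clinical_literature"),
    Triple("herb_A", "combined_with", "herb_B", "medical_case"),
]
process_new_triples(new_triples, graph, review_queue)
```

After the call, the first triple is in the graph and the second sits in the review queue with its source attached, mirroring the conflict identifier and traceability information described above.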
[0038] In this embodiment, the server performs the task of constructing a multimodal TCM corpus. The system integrates the aforementioned structured TCM electronic medical record data, clinical medical record corpus, and external TCM classic texts to generate a raw text dataset. The data sources of the raw text dataset cover the People's Medical Publishing House medical text knowledge base and the TCM-specific knowledge base, including standardized TCM ancient books, classic medical records of famous veteran TCM doctors, and multi-disease knowledge bases. The data preprocessing program reads the raw text dataset and performs data cleaning and deduplication operations by executing a character-level regular expression matching algorithm. The privacy protection component scans the text character stream, extracts and masks patient personal identification characters, and performs privacy desensitization operations. The terminology normalization engine compares with the standard medical dictionary trie, replaces non-standard aliases with standard terms, and completes the preprocessing operations.
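The preprocessing chain (cleaning, deduplication, masking, and term normalization) might look like the sketch below. The alias dictionary, the two identifier patterns, and the placeholder tokens are assumptions made for illustration; the text specifies a trie-backed standard dictionary, for which a plain dict is substituted here.

```python
import re

# Hypothetical alias dictionary standing in for the standard medical dictionary.
ALIASES = {"ban xia": "Pinellia ternata", "fu ling": "Poria cocos"}

# Hypothetical identifier patterns: an 18-character ID and an 11-digit phone number.
ID_PATTERN = re.compile(r"\b\d{17}[\dXx]\b")
PHONE_PATTERN = re.compile(r"\b1\d{10}\b")


def clean(text):
    """Strip control characters and collapse runs of whitespace."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def mask(text):
    """Replace personal identifiers with placeholder tokens."""
    text = ID_PATTERN.sub("[ID]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)


def normalize_terms(text):
    """Replace non-standard aliases with their standard terms."""
    for alias, standard in ALIASES.items():
        text = re.sub(re.escape(alias), standard, text, flags=re.IGNORECASE)
    return text


def preprocess(records):
    """Clean, mask, and normalize each record, then drop exact duplicates."""
    seen, out = set(), []
    for raw in records:
        text = normalize_terms(mask(clean(raw)))
        if text not in seen:
            seen.add(text)
            out.append(text)
    return out
```

Deduplicating after normalization, rather than before, lets records that differ only in spacing or alias choice collapse into one entry.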
[0039] In this invention, the annotation engine receives the preprocessed text data and performs four-dimensional structural annotation based on the TCM professional rules configured in the system. At the entity layer, the annotation engine extracts and annotates the organ and syndrome entity names in the text; at the relation layer, it records and annotates the directed mapping relationships between entities; at the attribute layer, it adds and annotates the cold, hot, warm, and cool property parameters of the entities; and at the derivation logic layer, it records and annotates the step sequence labels from the initial symptoms to the final prescription.
[0040] The system runs a quality assessment program and enters the feedback optimization phase. The sampling program extracts test batch text data from the full set of labeled data according to a preset step size. The data extraction component reads the database table structure, obtains the model pre-label file corresponding to the sampled test batch, and the server establishes a network communication connection to send data packages to the manual quality inspection terminal and receive the manual quality inspection benchmark label file returned by the terminal.
[0041] The server's built-in compute nodes run evaluation algorithms that compare the model's pre-annotated labels with the manually inspected benchmark labels and calculate the accuracy, Kappa coefficient consistency, and ontology concept coverage metrics. The compute nodes calculate the Kappa coefficient consistency metric with the following formula:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

where $\kappa$ is the consistency score between the model's pre-annotation results and the manually annotated benchmark; $P_o$ is the ratio of the number of data entries on which the two sets of labels match completely to the total number of entries in the test batch; and $P_e$ is the expected rate of chance agreement. The system pre-calculates $P_e$ by reading the label frequency statistics from historical corpus annotation work and persists it in the server configuration table.
[0042] The server invokes the comparison logic unit to read the preset quality evaluation threshold from the system memory space; this threshold is set based on the lowest corpus standard used for training historically available models. The system compares the various quality indicators generated above with this preset quality evaluation threshold. When the logic judgment step detects one or more indicators that are lower than the preset quality evaluation threshold, the server instruction scheduling module generates a rollback trigger signal, initiating a process of re-cleaning and supplementing the annotation of the corresponding test batch of text data. After the system confirms that the indicators are compliant, it serializes and outputs the data, generating a multimodal TCM corpus.
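As a sketch of the Kappa calculation and the threshold comparison, assuming the standard form (P_o - P_e) / (1 - P_e) that matches the three quantities defined in [0041]. The function names and the example threshold values in the usage note are hypothetical; as in the text, the chance-agreement rate is passed in as a precomputed value rather than derived from the batch.

```python
def observed_agreement(model_labels, benchmark_labels):
    """P_o: share of entries on which the two label sets match exactly."""
    if not model_labels or len(model_labels) != len(benchmark_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(m == b for m, b in zip(model_labels, benchmark_labels))
    return matches / len(model_labels)


def kappa(p_o, p_e):
    """Kappa = (P_o - P_e) / (1 - P_e), with P_e read from configuration."""
    if not 0.0 <= p_e < 1.0:
        raise ValueError("p_e must lie in [0, 1)")
    return (p_o - p_e) / (1.0 - p_e)


def failing_metrics(metrics, thresholds):
    """Names of metrics below their threshold; non-empty means rollback."""
    return sorted(name for name, value in metrics.items()
                  if value < thresholds[name])
```

For example, with three of four labels matching and an assumed chance-agreement rate of 0.5, `kappa(0.75, 0.5)` evaluates to 0.5; against a hypothetical Kappa threshold of 0.6, `failing_metrics` would report that metric and the batch would be sent back for re-cleaning.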
[0043] The system enters the model fine-tuning phase. The server executes a data augmentation script to perform entity replacement and logical reorganization on medical case data of difficult and complicated diseases and special constitutions in the multimodal TCM corpus where the sample size is below a preset threshold. This is done by calling synonym relationships and syndrome evolution rules from the TCM knowledge graph to generate augmented sample data. The system then merges the augmented sample data with the original multimodal TCM corpus to construct an enhanced training set.
[0044] The model fine-tuning framework loads the enhanced training set as the training data source. The server's GPU memory loads the initial DeepSeek large language model architecture file and initial weight matrix. The system configures the model's network layer structure, setting the number of Transformer decoder layers (including self-attention mechanism) to the preset number, and setting the hidden layer feature dimension and the number of multi-head attention heads. The system runs the gradient descent algorithm component to read the input-output feature pairs from the enhanced training set. The computation nodes use the cross-entropy loss function to calculate the loss value between the model's predicted word probability distribution and the real clinical label sequence. Subsequently, the system calls the AdamW optimizer algorithm, performs backpropagation based on the loss value, and dynamically adjusts the node connection weight matrix of the hidden layers according to the preset learning rate and weight decay rate to fine-tune the initial large language model. After multiple rounds of iterative training, the system monitors the convergence status of the loss value on the validation set; when the loss value does not decrease significantly for several consecutive rounds, the system triggers an early stopping mechanism, saves the final network weight parameters, and obtains the pre-trained large language model.
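The early-stopping rule at the end of this paragraph can be illustrated independently of any training framework. The sketch below assumes that "does not decrease significantly" means an improvement smaller than a margin `min_delta`, and that "several consecutive rounds" is a `patience` count; both parameter names and their default values are assumptions, not values given in the source.

```python
class EarlyStopping:
    """Track validation loss and signal when it has stopped improving."""

    def __init__(self, patience=3, min_delta=1e-3):
        self.patience = patience      # rounds without improvement tolerated
        self.min_delta = min_delta    # smallest decrease counted as improvement
        self.best = float("inf")
        self.bad_rounds = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience


def stopping_round(losses, patience=3, min_delta=1e-3):
    """Return the 1-based round at which training stops, or None if it never does."""
    stopper = EarlyStopping(patience, min_delta)
    for round_no, loss in enumerate(losses, start=1):
        if stopper.step(loss):
            return round_no
    return None
```

In a real fine-tuning loop, `step` would be called once per validation pass, and the weights from the best round would be the ones saved.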
[0045] The model fine-tuning framework loads the enhanced training set as the training data source. The server loads the initial large language model architecture file and the initial weight matrix. The system runs the gradient descent algorithm component to read the data features within the enhanced training set, adjusts the node connection weight matrix of the hidden layers of the network, and fine-tunes the initial large language model. After multiple rounds of iterative training, the system saves the final network weight parameters, resulting in the pre-trained large language model.
[0046] In this embodiment, the server executes a graph path retrieval mechanism based on the underlying logic of traditional Chinese medicine. The system reads the structured TCM electronic medical record data in the memory buffer, and the server runs a text conversion script to parse the structure dictionary in the structured TCM electronic medical record data. The text conversion script combines the extracted field values in sequence to generate contextual prompt text that represents the patient's current disease status.
[0047] In this invention, the server utilizes hardware acceleration resources to load a pre-trained language model. The system inputs the contextual prompt text into the input layer of the pre-trained language model, performs feature encoding through a multi-layer network, and outputs a text semantic feature vector representing the text. Simultaneously, the graph database interface reads the node index file to obtain the node semantic feature vectors corresponding to all disease and syndrome node entities in the Traditional Chinese Medicine knowledge graph.
[0048] The computing array inside the server receives the text semantic feature vector and the semantic feature vector of each node. The computing array performs a similarity calculation task, determining the similarity between the text semantic feature vector and the semantic feature vector of each disease/syndrome node in the Traditional Chinese Medicine knowledge graph. The system uses the following calculation formula:

$$\mathrm{Sim}(T, N_i) = \frac{\mathbf{v}_T \cdot \mathbf{v}_{N_i}}{\lVert \mathbf{v}_T \rVert \, \lVert \mathbf{v}_{N_i} \rVert}$$

where $\mathrm{Sim}(T, N_i)$ is the semantic similarity value between the contextual prompt text and a specific disease/syndrome node entity; $\mathbf{v}_T$ is the text semantic feature vector output by the system after feature encoding of the contextual prompt text; and $\mathbf{v}_{N_i}$ is the semantic feature vector of the node corresponding to the pre-defined disease/syndrome node entity in the graph database. The graph initialization program extracts the text description of each disease/syndrome node and inputs it into the language model to pre-calculate $\mathbf{v}_{N_i}$, which is then persistently stored in the node attribute fields of the graph system.
[0049] The system starts the comparison logic unit, reads the specified threshold from the configuration table. This specified threshold parameter is preset based on historical knowledge retrieval hit probability distribution curve experience data. The comparison logic unit compares each semantic similarity value with the specified threshold. The system filters nodes below the specified threshold and establishes disease node entities with similarity greater than the specified threshold as the retrieval starting point. The graph database query engine receives the identifier of the retrieval starting point and enters the path extension stage.
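The similarity scoring and threshold filtering can be sketched as below, assuming cosine similarity as the measure (the measure named for term mapping in claim 2). The function names and the toy two-dimensional vectors in the usage note are hypothetical; real feature vectors would come from the language model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_starting_points(text_vec, node_vecs, threshold):
    """Return (node, score) pairs scoring above `threshold`, best first."""
    scored = ((name, cosine_similarity(text_vec, vec))
              for name, vec in node_vecs.items())
    hits = [(name, score) for name, score in scored if score > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

With a text vector `[1, 0]` and node vectors `[1, 0]`, `[0, 1]`, and `[1, 1]`, a threshold of 0.8 keeps only the first node, while a threshold of 0.5 also admits the third (score about 0.707).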
[0050] The graph database query engine runs a graph traversal algorithm, which strictly follows the logical chain of deduction from syndrome to method, method to prescription, and prescription to medicine as defined by the TCM knowledge ontology framework for path retrieval. During traversal, the system concurrently calls a rule engine script in memory. This script reads the four diagnostic methods and physical signs data from the medical record context, calls the system's preset multi-dimensional feature matrix, calculates the conditional association probability between the combined feature vector of the current four diagnostic methods and physical signs and the attributes of each treatment node in the graph, and achieves hypersemantic modeling through this probability distribution, thereby accurately capturing the association constraints between various symptoms. Based on these association constraints, the graph traversal algorithm identifies different personalized treatment paths. First, it takes the retrieval starting point as the root node of the graph structure, matches edge vectors carrying association attributes, and extends the traversal to adjacent treatment node entities.
[0051] The graph database query engine continuously changes the current traversal cursor, using the treatment node entity as the base node, querying logically connected relationships in the relation constraint table, tracing to the prescription node entity, and the query engine sends a query request from the prescription node entity, addressing to the underlying Chinese medicine node entity along the matching relationship edges. The system records all node identifiers and edge types traversed by the query cursor.
[0052] The server executes the data extraction logic. Starting from the retrieval point, the system extracts all entity and relation data that follow the connected path of the derivation logic chain, combines them to generate a set of association triples, and outputs the association triples as the corresponding structured medical evidence. The server writes the structured medical evidence into a shared memory pool for the large language model inference program to call.
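A minimal in-memory sketch of the chain-constrained traversal follows. The three relation names in `CHAIN` and the edge list are hypothetical stand-ins for the graph database schema; the node names are taken from the worked example later in the description, and the off-chain `has_nature` edge is included only to show that it is ignored.

```python
# Hypothetical edge types for syndrome -> method -> prescription -> medicine.
CHAIN = ["treated_by", "realized_by", "contains"]

EDGES = [
    ("Phlegm-dampness accumulating in the lungs", "treated_by",
     "Drying dampness and resolving phlegm"),
    ("Drying dampness and resolving phlegm", "realized_by", "Er Chen Tang"),
    ("Er Chen Tang", "contains", "Pinellia ternata"),
    ("Er Chen Tang", "contains", "Poria cocos"),
    ("Er Chen Tang", "contains", "Ginger"),
    ("Pinellia ternata", "has_nature", "warm"),  # off-chain edge, never followed
]


def retrieve_evidence(start, edges, chain=CHAIN):
    """Follow the relation types in `chain` order and collect the traversed triples."""
    frontier = [start]
    evidence = []
    for relation in chain:
        next_frontier = []
        for head, rel, tail in edges:
            if rel == relation and head in frontier:
                evidence.append((head, rel, tail))
                if tail not in next_frontier:
                    next_frontier.append(tail)
        frontier = next_frontier
    return evidence
```

Advancing one relation type per hop is what keeps the result on the derivation chain: an edge is collected only if it has the relation expected at that step and starts from a node reached in the previous step.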
[0053] In this embodiment, the server performs knowledge-constrained assembly of the enhanced prompt text and large language model inference. The system process reads the structured medical evidence stored in the shared memory pool. The data formatting script parses the set of associated triples within the structured medical evidence. According to preset syntax rules, the data formatting script maps the node and edge attributes covering syndromes, treatment methods, prescriptions, and medicines to natural language description strings. The system then establishes this string as the retrieval-enhanced prompt text.
[0054] In this invention, the server calls a template loader to read a preset prompt instruction template from a configuration file. In the memory data structure, the prompt instruction template is divided into a background knowledge area, a patient status area, and a task instruction area. Within the patient status area, the system further allocates independent labels for the four diagnostic methods (inspection, listening and smelling, inquiry, and palpation), vital signs data, and past medical history. A string concatenation function performs the text merging operation: it writes the retrieval-enhanced prompt text into the background knowledge area, and writes the contextual prompt text generated in the aforementioned steps and stored in the cache into the respective independent labels within the patient status area according to their attribute correspondences. The system combines the contents of these areas to generate an input prompt sequence with fixed logical boundaries.
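The three-area template and the concatenation step might be sketched as follows. The section markers, the field names in the `patient` dictionary, and the triple-to-sentence rule are all assumptions chosen for illustration; the source does not specify the template text.

```python
# Hypothetical template with the three areas named in the text.
TEMPLATE = (
    "[Background knowledge]\n{background}\n\n"
    "[Patient status]\n"
    "Four examinations: {four_examinations}\n"
    "Vital signs: {vital_signs}\n"
    "Past history: {past_history}\n\n"
    "[Task instruction]\n{task}"
)


def triples_to_text(triples):
    """Render each (head, relation, tail) triple as one plain-language line."""
    return "\n".join(
        f"{head} {relation.replace('_', ' ')} {tail}."
        for head, relation, tail in triples
    )


def build_prompt(triples, patient, task):
    """Fill the template areas; missing patient fields are marked 'not recorded'."""
    return TEMPLATE.format(
        background=triples_to_text(triples),
        four_examinations=patient.get("four_examinations", "not recorded"),
        vital_signs=patient.get("vital_signs", "not recorded"),
        past_history=patient.get("past_history", "not recorded"),
        task=task,
    )
```

Keeping the retrieved evidence and the patient data in separately labeled areas gives the model fixed boundaries between graph-derived knowledge and case-specific facts.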
[0055] The server uses the computing bus interface to transmit the input prompt sequence to the GPU cluster's memory. The pre-trained large language model loads this input prompt sequence, parses the sequence encoding matrix in the hidden layers of its network, and performs autoregressive inference. The model's inference program predicts the target token at the current time step based on the input prompt sequence and the tokens generated so far. The pre-trained large language model runs the following conditional probability generation formula:

$$P(Y \mid X) = \prod_{t=1}^{T} P\left(y_t \mid y_{<t}, X\right)$$

where $P(Y \mid X)$ is the overall probability of generating the target text sequence given the input prompt sequence; $X$ is the input prompt sequence processed by the encoder, corresponding to the semantic features formed by concatenating the background knowledge area, patient status area, and task instruction area mentioned above; $Y$ is the complete target text sequence output by the pre-trained large language model; $y_t$ is the single token generated at the $t$-th time step; $y_{<t}$ is the set of all tokens that have been output before the current time step; and $T$ is the total number of steps in the generated sequence. The system performs forward propagation based on the network weight matrix fixed when the large language model was trained on the multimodal TCM corpus, obtains the probability distribution over candidate tokens, and selects the token with the maximum probability.
[0056] The large language model inference program executes the above generation instructions in a loop until it captures a preset marker indicating the end, at which point it stops the autoregressive calculation. The memory control unit receives the complete target text sequence data block, the format parsing component processes the target text sequence data block, extracts field information containing etiology analysis, treatment principles and methods, and recommended prescriptions, the system serializes the above field information, outputs TCM auxiliary diagnosis and treatment reference information, and the network communication interface encapsulates the TCM auxiliary diagnosis and treatment reference information into network data packets and delivers them to the hospital information system terminal for presentation.
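The generate-until-end-marker loop with maximum-probability selection can be sketched with a toy stand-in for the model. The lookup table, the `<eos>` marker string, and the function names are hypothetical; a real system would obtain the per-step distribution from a forward pass of the fine-tuned model.

```python
import math

EOS = "<eos>"  # hypothetical end marker


def toy_next_token_probs(prompt, history):
    """Hypothetical stand-in for the model: a fixed table keyed on the last token."""
    table = {
        None: {"Dry": 0.9, "Warm": 0.1},
        "Dry": {"dampness": 0.8, EOS: 0.2},
        "dampness": {EOS: 1.0},
    }
    last = history[-1] if history else None
    return table.get(last, {EOS: 1.0})


def greedy_decode(prompt, next_token_probs, max_steps=32):
    """Pick the most probable token at each step until EOS or `max_steps`.

    Returns the generated tokens and the product of the chosen step
    probabilities, accumulated in log space for numerical stability.
    """
    history, log_prob = [], 0.0
    for _ in range(max_steps):
        probs = next_token_probs(prompt, history)
        token, p = max(probs.items(), key=lambda kv: kv[1])
        log_prob += math.log(p)
        if token == EOS:
            break
        history.append(token)
    return history, math.exp(log_prob)
```

The returned probability is the product of the per-step conditional probabilities of the chosen tokens, which is the sequence probability under the factorization described in [0055].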
[0057] This embodiment illustrates the clinical application steps of the present invention. The system front-end receives medical record data entered by the operator, and the terminal device sends the medical record text to the server communication interface. The medical record text records the chief complaint symptoms, including cough and sputum production. The system runs a text parsing component to read the medical record text and extract terminology. A natural language processing program processes the terminology and outputs a feature vector. The system compares the feature vector with the standard encoding vectors in the dictionary table and extracts the mapping code; the system establishes that the mapping code denotes the syndrome of phlegm-dampness accumulating in the lungs.
[0058] The graph query engine reads the syndrome of phlegm-dampness accumulating in the lungs from the memory space and sets it as the retrieval starting point. The engine then obtains the path topology relationships associated with the retrieval starting point and calculates the product of the confidence scores along each derivation path. The system executes the following mathematical formula:

$$C(P) = \prod_{e \in P} w_e$$

where $C(P)$ is the confidence product of the derivation path; $P$ is the set of nodes and edges in the graph database that extends from the disease/syndrome node to the Chinese medicine nodes; $e$ is an edge connecting nodes on the path; and $w_e$ is the weight score of that edge. The system extracts $w_e$ by reading the attribute data fields within the knowledge graph edge model structure.
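The path confidence is a plain product of edge weights, sketched below. The restriction of weights to the interval [0, 1] and the example values in the usage note are assumptions; the source says only that each edge carries a weight score.

```python
import math


def path_confidence(edge_weights):
    """Multiply the weight scores of the edges along one derivation path."""
    if not edge_weights:
        raise ValueError("a derivation path needs at least one edge")
    if any(not 0.0 <= w <= 1.0 for w in edge_weights):
        # Assumption: weights behave like confidences in [0, 1].
        raise ValueError("edge weights are assumed to lie in [0, 1]")
    return math.prod(edge_weights)


def rank_paths(paths):
    """Sort {path name: edge weights} by confidence product, highest first."""
    scored = [(name, path_confidence(ws)) for name, ws in paths.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For a three-edge path with hypothetical weights 0.9, 0.8, and 0.95, the product is 0.684. Because every factor is at most 1, longer paths can only score lower, which favors short, well-supported derivations.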
[0059] In this invention, the graph query engine traverses the node connections to reach the treatment method node and extracts the treatment method term "drying dampness and resolving phlegm". The graph query engine follows the path to locate the prescription node and extracts the prescription term "Er Chen Tang". The graph query engine then navigates to the Chinese herbal medicine nodes along the compatibility relationship edges and extracts the Chinese herbal medicine terms, including Pinellia ternata, Poria cocos, and ginger. The system calls the Chinese herbal medicine digital encoding library to convert the extracted Pinellia ternata, Poria cocos, and ginger entity terms into a unified structured digital code containing specifications, grades, and pharmacopoeia aliases, and performs cross-modal semantic alignment operations. The graph query engine then combines the entity terms to generate associated triples.
[0060] The state machine module captures associated triples and reads the Chinese herbal medicine names within the prescription sequence. The state machine module then retrieves the contraindication rule configuration file. The system detects the compatibility relationships between Pinellia ternata and other Chinese herbal medicine names. If the system determines that the prescription sequence does not trigger the interception conditions in the rule configuration file, the system control data interface module allows the aforementioned data records to pass.
[0061] The system assembles association triples into retrieval enhancement prompt text and concatenates it with medical record text. The system constructs a prompt character sequence. A pre-trained large language model loads the prompt character sequence; the large language model performs matrix multiplication operations and outputs auxiliary diagnostic reference information text. The system sends the auxiliary diagnostic reference information text to the terminal device's display panel via a network communication interface, where it is displayed in a split-screen visualization: the main screen displays the auxiliary diagnostic reference information text generated by the large language model, while the secondary screen synchronously maps and highlights the association triple derivation path from disease symptoms to traditional Chinese medicine extracted from the graph database. This provides practicing physicians with transparent and interpretable medical traceability evidence, achieving true human-machine collaboration.
[0062] Based on actual operational test data, the system in this embodiment improves various quantitative indicators of TCM auxiliary diagnosis and treatment through a knowledge graph path retrieval mechanism and an enhanced prompt instruction collaborative algorithm. The system improves the prescription recommendation accuracy for local characteristic syndromes (such as phlegm-dampness accumulation in the lungs) by about 15%. Due to the use of a state machine to pre-execute contraindication rules for filtering, the probability of the large model generating false drug compatibility hallucinations is reduced by about 30%. At the same time, the precise positioning of the starting point of the graph database retrieval compresses the length of the prompt word sequence, and reduces the latency of the large model in generating auxiliary diagnosis and treatment reference information in a single inference by about 20%, meeting the real-time response requirements of high-concurrency clinical outpatient scenarios.
[0063] It should be noted that, in order to realize the above-mentioned collaborative TCM diagnosis and treatment method driven by both knowledge graphs and large models, this embodiment of the invention also provides an electronic device. The electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor. The memory stores computer program instructions executable by the at least one processor, which, when executed by the at least one processor, enable the electronic device to perform the steps described in the above embodiments of the invention.
[0064] Furthermore, embodiments of the present invention also provide a non-volatile computer-readable storage medium storing computer program instructions thereon. When executed by a processor of a computer or server, these computer program instructions implement the steps of the TCM diagnosis and treatment collaborative method based on a dual-driven knowledge graph and large model described in any of the above embodiments of the present invention. Those skilled in the art should understand that various operations of the present invention can be implemented through hardware, software, firmware, or any combination thereof.
Claims
1. A collaborative TCM diagnosis and treatment method driven by both a knowledge graph and a large model, characterized in that it includes the following steps: acquiring text data of TCM electronic medical records to be processed, extracting natural language terms from the text data, and mapping the natural language terms to a preset set of TCM standard digital codes to obtain structured TCM electronic medical record data; extracting TCM knowledge triples from multi-source TCM data, wherein the multi-source TCM data includes ancient TCM books, clinical literature, and medical case data, and wherein, during the extraction process, a state machine conflict detection mechanism is invoked to detect and filter the newly added knowledge triples based on preset pathology and drug incompatibility rules, so as to construct and dynamically update a TCM knowledge graph containing disease syndrome, treatment method, prescription, and Chinese medicine entity nodes; converting the structured TCM electronic medical record data into contextual prompt text representing the patient's current disease status; matching, from the TCM knowledge graph, the disease and syndrome node entities corresponding to the contextual prompt text as the retrieval starting point; performing, from the retrieval starting point, path retrieval along the deductive logic chain from syndrome to method, method to prescription, and prescription to medicine to extract the corresponding structured medical evidence; and using the structured medical evidence as retrieval-enhanced prompt text, concatenating it with the contextual prompt text, and inputting the result into a pre-trained large language model for inference generation, which outputs TCM auxiliary diagnosis and treatment reference information.
2. The TCM diagnosis and treatment collaboration method based on knowledge graph and large model dual-driven approach according to claim 1, characterized in that, The process of mapping the natural language terms to a preset set of TCM standard digital codes to obtain structured TCM electronic medical record data includes: The semantic feature vectors of the natural language terms are extracted by calling a pre-trained language model; Calculate the cosine similarity between the semantic feature vector and each standard coding feature vector in the preset set of TCM standard digital codes; The standard digital code corresponding to the maximum cosine similarity is selected as the optimal mapping code to generate the structured TCM electronic medical record data.
3. The TCM diagnosis and treatment collaboration method based on knowledge graph and large model dual-driven approach according to claim 1, characterized in that, The construction and dynamic updating of a TCM knowledge graph, which includes entity nodes for diseases, treatments, prescriptions, and Chinese herbal medicines, includes establishing a TCM knowledge ontology framework: The TCM knowledge ontology framework is based on the logical chain of deduction from syndrome to method, method to prescription, and prescription to medicine. In the schema design of the graph database, entity types and relationships between entities are defined, including viscera, meridians, etiology, syndromes, and Chinese medicine. State attribute fields and constraint transformation rules representing the disease course relationship are configured in the node model of the corresponding entity.
4. The TCM diagnosis and treatment collaboration method based on knowledge graph and large model dual-driven approach according to claim 1, characterized in that, The method for extracting TCM knowledge triplets based on multi-source TCM data includes: Multiple candidate knowledge representations corresponding to the same entity are extracted from the multi-source TCM data; Extract the preset source weight identifiers corresponding to each of the multi-source TCM data; The final knowledge confidence of the entity is calculated by weighting and summing the preset source weight identifier and the extraction confidence score of the corresponding data source. Entities whose final knowledge confidence is greater than a preset confidence threshold are extracted for semantic disambiguation and fusion.
5. The TCM diagnosis and treatment collaboration method based on knowledge graph and large model dual-driven approach according to claim 1, characterized in that, The aforementioned state machine conflict detection mechanism detects and filters newly added knowledge triples based on preset pathology and drug incompatibilities rules, including: Retrieve the set of newly added knowledge triples using an event-driven pattern; Call the state machine conflict detection function to check whether the combination relationship of the head entity, relation and tail entity in the newly added knowledge triple set triggers the pathological and medicinal incompatibilities rule; If not triggered, the corresponding newly added knowledge triplet will be merged into the current version of the TCM knowledge graph; if triggered, the corresponding newly added knowledge triplet will be marked as conflict and an entry verification task instruction will be generated to prevent conflicting data from being entered into the database.
6. The TCM diagnosis and treatment collaboration method based on knowledge graph and large model dual-driven approach according to claim 1, characterized in that, before inputting the structured medical evidence and the contextual prompt text into the pre-trained large language model, the method further includes: constructing a multimodal TCM corpus based on the structured TCM electronic medical record data, clinical medical record corpus, and TCM classic texts; and fine-tuning the initial large language model on the multimodal TCM corpus to obtain the pre-trained large language model.
7. The TCM diagnosis and treatment collaboration method based on knowledge graph and large model dual-driven approach according to claim 6, characterized in that, The construction of the multimodal TCM corpus includes: The collected raw text data is preprocessed by data cleaning, deduplication, privacy anonymization, and terminology standardization. Based on preset TCM professional rules, the preprocessed text data is structurally labeled with a four-dimensional structure including entity layer, relation layer, attribute layer and derivation logic layer.
8. The TCM diagnosis and treatment collaboration method based on knowledge graph and large model dual-driven approach according to claim 7, characterized in that, After completing the structural annotation of the four-dimensional architecture, the method also includes feedback optimization of the multimodal TCM corpus: Obtain the model pre-labeled labels and manual quality inspection benchmark labels corresponding to the sampling test batches; Calculate the accuracy index, Kappa coefficient consistency index, and ontology concept coverage index of the model pre-labeled labels and the manual quality inspection benchmark labels; When an indicator falls below the preset quality evaluation threshold, a process of re-cleaning and supplementing the annotation of the corresponding test batch of text data is triggered.
9. The TCM diagnosis and treatment collaboration method based on knowledge graph and large model dual-driven approach according to claim 1, characterized in that, The matching of the disease-related node entities corresponding to the contextual prompt text serves as the starting point for retrieval, extracting the corresponding structured medical evidence, including: Calculate the similarity between the contextual prompt text and the semantic feature vectors corresponding to the disease and syndrome node entities in the TCM knowledge graph; The disease / symptom node entities with a similarity greater than a specified threshold are established as the retrieval starting point; Starting from the retrieval starting point, the associated triples that follow the derivation logic chain connection path will be used as the structured medical evidence.
10. The TCM diagnosis and treatment collaboration method based on knowledge graph and large model dual-driven approach according to claim 1, characterized in that, The process of acquiring the TCM electronic medical record text data to be processed and extracting natural language terms from the text data includes: A preset pre-structure template is loaded into the system interface, and the pre-structure template defines the categorized symptom, syndrome, method, prescription, and medicine input fields; The initial input text corresponding to each field is obtained through the pre-structure template. The natural language processing program is called to perform word segmentation and part-of-speech tagging on the initial input text to extract the natural language terms.