An automated medical knowledge graph construction method and system
By employing a whitelist short-circuit priority hybrid decision algorithm and a dynamic constraint injection layer for cascaded verification, combined with a multi-source traceability incremental fusion algorithm, the invention addresses the hallucination and disordered-data problems that arise when large language models are used to construct medical knowledge graphs. The result is efficient and stable construction of medical knowledge graphs, with improved system robustness and adaptability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZOE SOFT CORP LTD
- Filing Date
- 2026-05-14
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, large language models (LLMs) are prone to generating non-existent entities or relationships when constructing medical knowledge graphs. They struggle to strictly follow user-defined schemas and lack rigorous verification mechanisms, resulting in messy data and low construction efficiency.
A hybrid decision-making algorithm based on whitelist short-circuit priority and a dynamic constraint injection layer are used for cascaded verification. Combined with a multi-source traceability incremental fusion algorithm, a complete automated medical knowledge graph construction process is formed through medical text segmentation, generative extraction and multi-level verification, which reduces the calls to large language models and improves the accuracy of judgment and the efficiency of construction.
It achieves stable, efficient, and high-quality construction of medical knowledge graphs, reduces the risk of hallucinations, improves verification consistency and coverage, ensures the integrity and credibility of the graph, and enhances the system's generalization ability and domain adaptability through a closed-loop optimization mechanism.
Figure CN122240856A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of medical knowledge graph technology, specifically to an automated method and system for constructing medical knowledge graphs.
Background Technology
[0002] Knowledge graph technology aims to transform unstructured data into structured networks of entities and relationships. With the emergence of large language models (LLMs), LLM-based information extraction has reduced the reliance on labeled data. However, existing technologies face the following challenges: (1) Hallucination: LLMs are prone to generating non-existent entities or fabricated relationships.
[0003] (2) Poor standardization: The extraction results often fail to strictly follow the user-defined schema, resulting in messy data.
[0004] (3) Lack of rigorous verification: Traditional large model applications are mostly one-way generation and lack a human-like two-stage "generate, then check" mechanism.
Summary of the Invention
[0005] To address these issues, this invention proposes an automated method and system for constructing medical knowledge graphs.
[0006] According to one aspect of the present invention, an automated method for constructing a medical knowledge graph is proposed, comprising the following steps:
S1, read the medical text and segment it, then call the large language model generator to read the segmented medical text and extract a set of medical knowledge candidate triples, each comprising a head entity, a relation, and a tail entity;
S2, perform cascaded verification on each medical knowledge candidate triple in the set to obtain valid medical knowledge triples. The cascaded verification includes a first-level verification and a second-level verification, specifically:
S201, upload a domain-standard medical glossary as the whitelist for the first-level verification, and use the whitelist short-circuit priority hybrid decision algorithm to perform first-level verification on the candidate triples. Candidate triples that pass the first-level verification are marked as valid medical knowledge triples. The hybrid decision algorithm determines whether a candidate triple hits the whitelist.
S202, if a candidate triple fails the first-level verification, input it into a large language model discriminator for second-level verification. The discriminator includes a dynamic constraint injection layer, which constructs three-dimensional constraint information to build prompt words. Based on the prompt words, the discriminator judges the semantic plausibility of the candidate triple to determine whether it is a valid medical knowledge triple. The three-dimensional constraint information includes context anchor constraint information, industry discrimination standard constraint information, and output protocol constraint information.
S3, based on the multi-source traceability incremental fusion algorithm, construct the valid medical knowledge triples into a medical knowledge graph. The multi-source traceability incremental fusion algorithm incrementally fuses the attributes of the medical entity nodes or medical relationship edges constructed in the medical knowledge graph.
[0007] Medical text segmentation, generative extraction, two-level cascaded verification, and multi-source incremental fusion together form a complete automated medical knowledge graph construction process. Whitelist short-circuit verification reduces calls to the large language model (LLM), and the three-dimensional constraints improve discrimination accuracy. This addresses the problems of hallucination in traditional extraction, unreliable verification, and low construction efficiency, achieving stable, efficient, and high-quality automated construction.
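The short-circuit control flow of the two-level cascade can be illustrated with a minimal Python sketch. The names `Triple`, `cascade_validate`, `whitelist_hit`, and `llm_judge` are hypothetical and do not come from the specification; the two checks are passed in as plain callables so that the sketch stays independent of any particular whitelist or model.

```python
from typing import Callable, Iterable, NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def cascade_validate(
    candidates: Iterable[Triple],
    whitelist_hit: Callable[[Triple], bool],   # hypothetical first-level check
    llm_judge: Callable[[Triple], bool],       # hypothetical second-level check
) -> tuple[list[Triple], int]:
    """Return (valid triples, number of discriminator calls)."""
    valid: list[Triple] = []
    llm_calls = 0
    for triple in candidates:
        # First level: a whitelist hit short-circuits the expensive check.
        if whitelist_hit(triple):
            valid.append(triple)
            continue
        # Second level: only whitelist misses reach the LLM discriminator.
        llm_calls += 1
        if llm_judge(triple):
            valid.append(triple)
    return valid, llm_calls
```

Counting the discriminator calls makes the cost saving visible: every candidate resolved at the first level is one model call avoided.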
[0008] Specifically, the entity types of the head entity or tail entity include at least diseases, drugs, and symptoms, and the relationship types of the relationship include at least treatment, induction, relief, and inhibition.
[0009] Specifically, the dynamic constraint injection layer described in S202, which constructs three-dimensional constraint information to build prompt words, includes: Configure the target medical knowledge triple structure and parse the target medical knowledge triple structure into a structured index table. The structured index table contains a set of entity types and a set of relation types. Based on the structured index table, extract the entity types of the head entity and tail entity in the candidate medical knowledge triple and the relation types of the relation, and perform type constraints to construct the context anchor constraint information. The entity names corresponding to the head or tail entities in the candidate triples of medical knowledge are constrained to be standardized medical terms. The relationship between the head and tail entities must conform to medical logic. The head, tail entities and relationships are also constrained to be explicitly recorded in the medical text in a literal or semantically equivalent form to construct the industry discrimination standard constraint information. A mandatory JSON output protocol is defined, and the output of the large language model discriminator must include a valid boolean identifier field and a string-type decision criterion field to construct the output protocol constraint information.
[0010] Constructing the prompt words precisely through three types of constraints (context anchors, industry three-dimensional standards, and output protocol) limits the LLM's discrimination space and reduces irrelevant outputs and logical errors. This aligns the discrimination more closely with domain rules, improves validation consistency and reliability, and reduces the probability of hallucinations and false positives.
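As an illustration only, the assembly of a prompt from the three constraint dimensions might look like the following Python sketch. The function name `build_system_prompt`, the `SCHEMA_INDEX` layout, the prompt wording, and the JSON field names `is_valid` and `reason` are assumptions made for this example; the specification only requires a boolean validity field and a string field for the decision basis.

```python
# Hypothetical structured index table parsed from the configured schema.
SCHEMA_INDEX = {
    "entity_types": {"disease", "drug", "symptom"},
    "relation_types": {"treatment", "induction", "relief", "inhibition"},
}


def build_system_prompt(head, relation, tail, head_type, tail_type,
                        index=SCHEMA_INDEX):
    """Assemble a discriminator prompt from the three constraint dimensions."""
    # Type constraint: reject anything outside the configured schema.
    for entity_type in (head_type, tail_type):
        if entity_type not in index["entity_types"]:
            raise ValueError(f"unknown entity type: {entity_type}")
    if relation not in index["relation_types"]:
        raise ValueError(f"unknown relation type: {relation}")

    # Dimension 1: context anchor built from the entity and relation types.
    context_anchor = (
        "Context anchor:\n"
        f"- Head entity type: {head_type}\n"
        f"- Relation type: {relation}\n"
        f"- Tail entity type: {tail_type}"
    )
    # Dimension 2: fixed industry discrimination criteria.
    criteria = (
        "Criteria:\n"
        "1. Entity names must be standardized medical terms.\n"
        "2. The relation between head and tail must conform to medical logic.\n"
        "3. Entities and relation must be recorded in the source text, "
        "literally or in a semantically equivalent form."
    )
    # Dimension 3: mandatory JSON output protocol (field names are assumed).
    output_protocol = (
        "Output protocol: reply with JSON only, for example "
        '{"is_valid": true, "reason": "..."}'
    )
    question = f'Is the triple ("{head}", "{relation}", "{tail}") valid?'
    return "\n\n".join([context_anchor, criteria, output_protocol, question])
```

Calling `build_system_prompt("hypertension", "treatment", "amlodipine", "disease", "drug")` yields a single prompt string in which the context anchor changes per candidate while the criteria and output protocol stay fixed.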
[0011] Specifically, in S201, a hybrid decision-making algorithm combining whitelist and short-circuit priority is used to perform a first-level verification of the candidate triples of medical knowledge, which includes: After normalizing and preprocessing the entity names corresponding to the head or tail entities in the candidate medical knowledge triples, a first precise match is performed based on the whitelist. If both the head and tail entity names in the candidate medical knowledge triples successfully complete the first precise match, the first-level verification is passed. If it fails, the entity names corresponding to the head and tail entities in the candidate medical knowledge triples are transformed based on a pre-loaded domain standard medical thesaurus synonym mapping table and a second precise match is performed. If both the head and tail entity names in the candidate medical knowledge triples successfully complete the second precise match, the first-level verification is passed. If it fails, the similarity between the entity names corresponding to the head and tail entities in the candidate medical knowledge triples and the whitelisted words is calculated based on a similarity formula. If the similarity reaches a preset similarity threshold, the first-level verification is passed. If it fails, the second-level semantic discrimination verification is performed. This three-level progressive verification—precise matching, synonym mapping matching, and similarity matching—ensures the accuracy of terminology recognition while improving compatibility with variants, aliases, and slightly erroneous entities. Stepwise filtering reduces LLM dependency and expands coverage, achieving a balance between verification accuracy and breadth, and improving the overall robustness of the system.
[0012] Specifically, the large language model discriminator also includes a robust protocol parsing layer. This layer performs an initial JSON format parsing and field validation. If both parsing and validation are successful, a successful parsing result is output. If it fails, the robust protocol parsing layer uses regular expressions to locate the JSON boundaries of the failed parsing result, extracts the content enclosed in the outermost brackets, and performs a second JSON format parsing and field validation. If both parsing and validation are successful, a successful parsing result is output. If it fails, an exception circuit breaker mechanism is activated. Keyword semantic analysis is used to statistically analyze the positive or negative tendency of the failed parsing result. When a clear judgment cannot be made, a parsing result with a valid identifier field set to True is returned by default. Through a three-layer fallback parsing and exception circuit breaker mechanism, it is compatible with non-standard LLM output, ensuring stable extraction of Boolean validation results from unordered medical text. A conservative strategy avoids erroneous data from being entered into the database, improving the reliability of the automated pipeline and enabling stable operation of the system under abnormal conditions.
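A minimal Python sketch of such a three-layer fallback parser follows. The field names `is_valid` and `reason`, the keyword lists, and the function name `parse_verdict` are assumptions for illustration. The fallback value for undecidable output is exposed as the parameter `default_valid` and initialised to the value the text above states, so that a deployment can choose a stricter default.

```python
import json
import re

# Hypothetical keyword lists; \b keeps "invalid" from counting as "valid".
POSITIVE_RE = re.compile(r"\b(?:valid|reasonable|correct|true)\b")
NEGATIVE_RE = re.compile(r"\b(?:invalid|unreasonable|incorrect|false)\b")


def _try_json(text):
    """Parse text as JSON and validate the two required fields."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if (isinstance(obj, dict)
            and isinstance(obj.get("is_valid"), bool)
            and isinstance(obj.get("reason"), str)):
        return obj
    return None


def parse_verdict(raw: str, default_valid: bool = True) -> dict:
    # Layer 1: direct JSON parsing plus field validation.
    parsed = _try_json(raw)
    if parsed is not None:
        return parsed
    # Layer 2: locate the outermost braces with a greedy regex and retry.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        parsed = _try_json(match.group(0))
        if parsed is not None:
            return parsed
    # Layer 3: circuit breaker, count positive versus negative keywords.
    text = raw.lower()
    pos = len(POSITIVE_RE.findall(text))
    neg = len(NEGATIVE_RE.findall(text))
    if pos != neg:
        return {"is_valid": pos > neg, "reason": "keyword tendency fallback"}
    return {"is_valid": default_valid, "reason": "undetermined, default applied"}
```

For example, `parse_verdict('Sure! {"is_valid": true, "reason": "ok"} Hope this helps.')` fails the first layer and is recovered by the second, while free text such as `"The triple is invalid."` is resolved by the keyword count.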
[0013] Specifically, this also includes S4, which collects parsing failure results during the cascading verification process as boundary cases for manual review, and feeds back the correct samples after manual review to the construction of the prompt words for closed-loop optimization. By collecting boundary cases and manually reviewing feedback, a closed-loop optimization mechanism is formed. Continuously supplementing high-quality samples improves the extraction and verification effects, enabling the system to iterate continuously in long-term use, enhancing its generalization ability and domain adaptability, and improving the overall intelligence level.
[0014] Specifically, in S3, constructing the valid medical knowledge triples into a medical knowledge graph based on the multi-source traceability incremental fusion algorithm includes the following. The valid medical knowledge triples are stored in a graph database. The unique fingerprint of the head entity or tail entity of a valid triple is calculated as:
fingerprint = MD5(Project_id + node_type + node_name)
The unique fingerprint of the relation of a valid triple is calculated as:
fingerprint = MD5(Project_id + rel_source_name + rel_target_name + rel_type)
where:
- Project_id is a globally unique identifier;
- node_type is the type of the medical entity node constructed in the medical knowledge graph for the head entity or tail entity;
- node_name is the name of the medical entity node constructed in the medical knowledge graph for the head entity or tail entity;
- rel_source_name is the name of the medical entity node constructed for the head entity of the valid triple;
- rel_target_name is the name of the medical entity node constructed for the tail entity of the valid triple;
- rel_type is the relation type of the medical relationship edge constructed for the relation of the valid triple.
Collision detection with complexity is performed using the index of the graph database. When a collision is detected between the unique fingerprints of entities or relationships, incremental fusion is performed on the attributes of the medical entity nodes or medical relationship edges constructed in the medical knowledge graph. This includes source index fusion, appending of evidence medical text, and aggregation of location information.
When no collision is detected, new medical entity nodes or medical relationship edges are created. Independent fingerprints and collision detection for entities and medical relationship edges achieve incremental fusion rather than overwriting or discarding. All source, evidence, and location information is retained, achieving zero information loss in the medical knowledge graph, supporting end-to-end traceability, and improving the integrity and traceability of the graph.
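The fingerprint and fusion logic can be sketched in Python as follows. This is a simplified illustration: `GraphStore` is a hypothetical in-memory dictionary standing in for a graph database with an index on the fingerprint, and the attribute names `sources`, `evidence`, and `positions` are assumed.

```python
import hashlib


def node_fingerprint(project_id: str, node_type: str, node_name: str) -> str:
    # MD5 over the plain concatenation, as in the formula above.
    raw = f"{project_id}{node_type}{node_name}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def edge_fingerprint(project_id: str, source_name: str,
                     target_name: str, rel_type: str) -> str:
    raw = f"{project_id}{source_name}{target_name}{rel_type}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


class GraphStore:
    """In-memory stand-in for a graph database indexed by fingerprint."""

    def __init__(self):
        self.records = {}

    def upsert(self, fingerprint, source_id, evidence, position):
        record = self.records.get(fingerprint)
        if record is None:
            # No collision: create a new node or edge record.
            self.records[fingerprint] = {
                "sources": [source_id],
                "evidence": [evidence],
                "positions": [position],
            }
            return "created"
        # Collision: fuse incrementally instead of overwriting.
        if source_id not in record["sources"]:
            record["sources"].append(source_id)
        record["evidence"].append(evidence)
        record["positions"].append(position)
        return "merged"
```

Because the fields are concatenated without a separator, different field combinations can in principle produce the same input string (for example `"ab" + "c"` and `"a" + "bc"`); an implementation might add a delimiter, which would be a departure from the formula as written.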
[0015] Specifically, it also includes S5, which assigns an initial confidence score to each valid triple of medical knowledge, the initial confidence score being determined based on the cascaded validation path of the valid triple of medical knowledge; Traverse all medical entity nodes in the medical knowledge graph. For each medical entity node, extract all valid medical knowledge triples directly connected to the medical entity node to form an adjacency triple set for each medical entity node. Detect whether there are triple pairs sharing the medical entity node in the adjacency triple set. The triple pair includes a first triple and a second triple, wherein the first triple connects the medical entity node to a first adjacent entity node, and the second triple connects the medical entity node to a second adjacent entity node. There is an intermediate path between the first adjacent entity node and the second adjacent entity node, and the length of the intermediate path is within a preset range. When the triplet pair is detected, the medical entity node, the first triplet, the intermediate path, and the second triplet are combined to form a local reasoning structure. The initial confidence score of the valid medical knowledge triplet is iteratively updated according to the type of the local reasoning structure to obtain the final confidence score.
[0016] By constructing a local reasoning structure consisting of medical entity nodes, triple pairs, and their mediating paths, the originally isolated triple confidence scores are associated as logical units in the graph structure, providing a structural basis for the propagation and collaborative correction of confidence scores in the graph.
[0017] Specifically, the iterative update of the initial confidence score of the effective triples of medical knowledge based on the type of local reasoning structure to obtain the final confidence score includes: For the local reasoning structure, obtain the relation type of the first triplet and the relation type of the second triplet, and obtain all relation types on the mediation path; The direct medical influence direction is determined based on the relationship type of the first triplet, the second medical influence direction is determined based on the relationship type of the second triplet, and it is determined whether the influence directions of all relationship types on the mediation path are consistent. If they are consistent, the indirect medical influence direction is equal to the second medical influence direction; otherwise, the indirect medical influence direction is marked as unreliable. Based on whether the directions of direct medical influence, indirect medical influence, and the influence directions of all relation types on the mediation path are consistent, the local reasoning structure is classified into contradictory structure, feasible structure, or undetermined structure. For each local reasoning structure, a correction factor is assigned to the corresponding triple according to its classification. A comprehensive correction coefficient is calculated based on the mean of all correction factors corresponding to each valid medical knowledge triple in the same iteration. The confidence score of the triple is updated according to the comprehensive correction coefficient until the iteration stopping condition is met. The confidence score at the time of iteration stopping is taken as the final confidence score.
[0018] Based on relation types, the local reasoning structure is classified as contradictory, feasible, or pending confirmation, and the confidence level is iteratively updated to achieve accurate self-consistent optimization of the knowledge graph.
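The classification and update loop can be sketched in Python under several loudly stated assumptions. The mapping from relation type to influence direction, the rule that equates "feasible" with agreement between the direct and indirect directions, the correction factor values, the multiplicative update, and the stopping condition are all hypothetical choices made for illustration; the specification does not fix any of them.

```python
from statistics import mean

# Assumed direction mapping: +1 promotes the target, -1 suppresses it.
DIRECTION = {"induction": 1, "treatment": -1, "relief": -1, "inhibition": -1}
# Assumed correction factors for each class of local reasoning structure.
FACTOR = {"feasible": 1.1, "contradictory": 0.8, "undetermined": 1.0}


def classify(first_rel, second_rel, path_rels):
    """Classify one local reasoning structure by influence directions."""
    direct = DIRECTION[first_rel]
    second = DIRECTION[second_rel]
    path_dirs = {DIRECTION[r] for r in path_rels}
    if len(path_dirs) != 1:
        # Inconsistent mediation path: indirect direction is unreliable.
        return "undetermined"
    indirect = second  # consistent path: indirect equals the second direction
    return "feasible" if direct == indirect else "contradictory"


def update_confidence(scores, structures, max_iter=10, tol=1e-4):
    """scores: triple id -> confidence.
    structures: list of (triple ids in the structure, class label)."""
    scores = dict(scores)
    for _ in range(max_iter):
        # Collect every correction factor that applies to each triple.
        factors = {}
        for triple_ids, label in structures:
            for tid in triple_ids:
                factors.setdefault(tid, []).append(FACTOR[label])
        largest_change = 0.0
        for tid, values in factors.items():
            # Comprehensive coefficient: mean of the factors this iteration.
            updated = min(1.0, scores[tid] * mean(values))
            largest_change = max(largest_change, abs(updated - scores[tid]))
            scores[tid] = updated
        if largest_change < tol:  # assumed stopping condition
            break
    return scores
```

With these assumed factors, a triple that appears in one feasible and one contradictory structure receives a coefficient of about 0.95 per iteration, so a single contradiction outweighs a single confirmation.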
[0019] According to one aspect of the present invention, an automated medical knowledge graph construction system applying the method described in any one of the first aspects is proposed, comprising the following modules: The medical knowledge candidate triple extraction module is configured to read medical text and perform medical text segmentation, and call the large language model generator to read the segmented medical text to extract a set of medical knowledge candidate triples. The medical knowledge candidate triples include head entity, relation and tail entity. The cascaded verification module is configured to perform cascaded verification on each candidate triplet of the medical knowledge triplet set to obtain valid medical knowledge triplets. The cascaded verification includes a first-level verification and a second-level verification, specifically including: The first-level verification submodule is configured to upload a domain-standard medical glossary as the whitelist for the first-level verification. It uses a hybrid decision algorithm of whitelist short-circuit priority to perform first-level verification on the candidate triples of medical knowledge. The candidate triples of medical knowledge that pass the first-level verification are marked as valid triples of medical knowledge. The hybrid decision algorithm of whitelist short-circuit priority is used to determine whether the candidate triples of medical knowledge match the whitelist. A secondary verification submodule is configured to input the medical knowledge candidate triplet into a large language model discriminator for secondary verification if the candidate triplet fails the primary verification. The large language model discriminator includes a dynamic constraint injection layer that constructs three-dimensional constraint information to build prompt words. 
Based on the prompt words, the large language model discriminator performs semantic rationality judgment on the medical knowledge candidate triplet to determine whether it is a valid medical knowledge triplet. The three-dimensional constraint information includes context anchor constraint information, industry discrimination standard constraint information, and output protocol constraint information. The medical knowledge graph construction module constructs a medical knowledge graph from the effective triples of medical knowledge based on a multi-source traceability incremental fusion algorithm. The multi-source traceability incremental fusion algorithm is used to incrementally fuse the attributes of medical entity nodes or medical relationship edges constructed in the medical knowledge graph.
[0020] The advantages of this invention are as follows. It achieves fully automated construction of a medical knowledge graph. The cascaded verification mechanism, combined with the three-level matching strategy of whitelist short-circuit priority, significantly reduces the cost of calling large language models, balances verification accuracy and coverage, and effectively reduces the risk of hallucinations during extraction. Dynamic constraint injection and robust protocol parsing improve semantic discrimination accuracy and output stability, preventing unstructured output from disrupting the automated process. The multi-source incremental fusion algorithm achieves zero information loss and full-link traceability, ensuring the integrity and credibility of the graph. The closed-loop optimization mechanism continuously improves performance over successive iterations, enhancing the system's generalization ability and domain adaptability, and thus improving the overall efficiency, quality, and stability of medical knowledge graph construction.
Attached Figure Description
[0021] The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and, together with the description, serve to explain the principles of the invention. Other embodiments and many anticipated advantages of the embodiments will be readily recognized as they become better understood through reference to the following detailed description. Elements in the drawings are not necessarily to scale. The same reference numerals refer to corresponding similar parts.
[0022] Figure 1 is a flowchart of an automated medical knowledge graph construction method according to the present invention; Figure 2 is a schematic diagram of an automated medical knowledge graph construction system according to the present invention; Figure 3 is a schematic diagram of a computer system architecture suitable for implementing the embodiments of this application.
Detailed Implementation
[0023] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.
[0024] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.
[0025] Figure 1 shows an automated method for constructing a medical knowledge graph, including the following steps:
S1, read the medical text and segment it, then call the large language model generator to read the segmented medical text and extract a set of medical knowledge candidate triples, each comprising a head entity, a relation, and a tail entity;
S2, perform cascaded verification on each medical knowledge candidate triple in the set to obtain valid medical knowledge triples. The cascaded verification includes a first-level verification and a second-level verification, specifically:
S201, upload a domain-standard medical glossary as the whitelist for the first-level verification, and use the whitelist short-circuit priority hybrid decision algorithm to perform first-level verification on the candidate triples. Candidate triples that pass the first-level verification are marked as valid medical knowledge triples. The hybrid decision algorithm determines whether a candidate triple hits the whitelist.
S202, if a candidate triple fails the first-level verification, input it into a large language model discriminator for second-level verification. The discriminator includes a dynamic constraint injection layer, which constructs three-dimensional constraint information to build prompt words. Based on the prompt words, the discriminator judges the semantic plausibility of the candidate triple to determine whether it is a valid medical knowledge triple. The three-dimensional constraint information includes context anchor constraint information, industry discrimination standard constraint information, and output protocol constraint information.
S3, based on the multi-source traceability incremental fusion algorithm, construct the valid medical knowledge triples into a medical knowledge graph. The multi-source traceability incremental fusion algorithm incrementally fuses the attributes of the medical entity nodes or medical relationship edges constructed in the medical knowledge graph.
[0026] In a specific embodiment, the system configures a target medical knowledge triplet structure (Schema). Users read medical text, segment the text into blocks, and upload a domain-standard medical thesaurus (such as the MeSH medical thesaurus) as a whitelist for primary verification. Simultaneously, the system fine-tunes the PromptTemplate based on business needs, sets the extraction tone, and calls a large language model generator to read the medical text blocks. Based on the Schema instructions, it extracts a set of candidate medical knowledge triplets. Then, it performs cascading verification on the candidate triplet set to filter valid medical knowledge triplets one by one, thereby constructing a medical knowledge graph.
[0027] Specifically, in S201, a hybrid decision-making algorithm combining whitelist and short-circuit priority is used to perform a first-level verification of the candidate triples of medical knowledge, which includes: After normalizing and preprocessing the entity names corresponding to the head or tail entities in the candidate medical knowledge triplet, a first precise match is performed based on the whitelist. If the entity names corresponding to both the head and tail entities in the candidate medical knowledge triplet successfully complete the first precise match, the first-level verification is passed. If it fails, the entity names corresponding to the head and tail entities in the candidate medical knowledge triplet are transformed based on the preloaded domain standard medical thesaurus synonym mapping table and a second precise match is performed. If the entity names corresponding to both the head and tail entities in the candidate medical knowledge triplet successfully complete the second precise match, the first-level verification is passed. If it fails, the similarity between the entity names corresponding to the head and tail entities in the candidate medical knowledge triplet and the whitelist words is calculated based on the similarity formula. If the similarity reaches the preset similarity threshold, the first-level verification is passed. If it fails, the second-level semantic discrimination verification is performed.
[0028] If the whitelist is hit, the triple is marked as Valid; if the relation type does not match, rule-based correction is attempted; if the whitelist is not hit, second-level semantic discrimination verification is performed.
[0029] In a specific embodiment, during initialization, the system parses the user-configured schema into a structured index table and stores it in an in-memory hash map. The system first searches for candidate words in the in-memory hash table, and the whitelist hit determination employs a three-level matching strategy, progressing layer by layer to balance accuracy and fault tolerance: The first level is exact matching, using a hash table for O(1) time complexity lookup. Entity names are first normalized (removing leading and trailing spaces and punctuation, unifying case, and converting between simplified and traditional Chinese characters), and then matched exactly against a whitelist. This level covers approximately 90% of standard medical terms, with a response time of less than 1 millisecond.
[0030] The second level is synonym mapping matching, which is activated when an exact match fails. The system preloads a synonym mapping table extracted from standard medical thesauri (such as MeSH and ICD-10), converts the entity names to their standard forms, and then matches them against the whitelist again. For example, variant names such as "high blood pressure" and "primary hypertension" are mapped to the standard term "hypertension". This level covers approximately 5% of common variants, with a response time of less than 2 milliseconds.
[0031] The third level is fuzzy matching, which uses the edit distance (Levenshtein distance) algorithm to calculate the similarity between entity names and whitelist entries.
[0032] In this specific embodiment, the threshold is set to 0.85. The similarity is calculated as 1 minus the edit distance divided by the length of the longer of the two words. A threshold of 0.85 can identify minor spelling errors (such as the misspelling "asprin" matching "aspirin") while avoiding mismatches (such as "high blood sugar" being mismatched as "high blood pressure"). This level covers approximately 3% of the edge cases, with a response time of less than 10 milliseconds.
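Written out, the similarity described in this paragraph takes the following form, where the symbols $a$, $b$, and $d_{\mathrm{lev}}$ are notation introduced here for illustration. The worked example uses a hypothetical one-letter misspelling.

```latex
\[
  \operatorname{sim}(a, b) \;=\; 1 - \frac{d_{\mathrm{lev}}(a, b)}{\max\bigl(|a|,\,|b|\bigr)}
\]
% Worked example: a = "asprin" (6 letters), b = "aspirin" (7 letters).
% One insertion turns a into b, so d_lev(a, b) = 1.
\[
  \operatorname{sim}(\text{asprin}, \text{aspirin}) \;=\; 1 - \frac{1}{7} \;\approx\; 0.857 \;\ge\; 0.85
\]
```

A single-character error therefore passes the 0.85 threshold only when the longer word has at least seven characters, since $1 - 1/6 \approx 0.833$ falls below it.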
[0033] If none of the three levels produces a match, the candidate is determined to be outside the whitelist and is submitted to the LLM discriminator for semantic verification, using its generalization ability to handle the remaining long-tail terms. With this three-level strategy, most queries are resolved by short-circuit determination in a short time and only a few require an LLM call, which significantly reduces computational cost and avoids misjudgment of core medical terms, so that precision and breadth complement each other.
[0034] The secondary LLM discriminator is activated only when a match is missed. Its generalization ability is used to process long-tail words, forming a precision-breadth complementary filter. This ensures zero misjudgment and millisecond-level response for core medical terms while also taking into account the semantic recall of new words.
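The three-level whitelist strategy can be sketched in Python as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, the whitelist and synonym table are toy data, normalization omits the simplified/traditional Chinese conversion, and each entity is run through the three levels independently rather than level by level for the head and tail together.

```python
import string


def normalize(name: str) -> str:
    # Trim surrounding whitespace and punctuation, then unify case.
    return name.strip().strip(string.punctuation).strip().lower()


def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def match_entity(name, whitelist, synonyms, threshold=0.85):
    """Return 'exact', 'synonym', or 'fuzzy', or None on a whitelist miss."""
    term = normalize(name)
    if term in whitelist:                      # level 1: hash lookup
        return "exact"
    if synonyms.get(term) in whitelist:        # level 2: synonym mapping
        return "synonym"
    if any(similarity(term, w) >= threshold for w in whitelist):
        return "fuzzy"                         # level 3: edit distance
    return None


def first_level_pass(head, tail, whitelist, synonyms):
    # Both entity names must hit the whitelist at some level.
    return (match_entity(head, whitelist, synonyms) is not None
            and match_entity(tail, whitelist, synonyms) is not None)
```

The fuzzy level scans the whole whitelist, so it is the slowest of the three; placing it last means it only runs for names that the two constant-time lookups could not resolve.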
[0035] Specifically, the dynamic constraint injection layer described in S202, which constructs three-dimensional constraint information to build prompt words, includes: Configure the target medical knowledge triple structure and parse it into a structured index table. The structured index table contains a set of entity types and a set of relation types. Based on the structured index table, extract the entity types of the head entity and tail entity in the candidate medical knowledge triple and the relation type of the relation, and apply type constraints to construct the context anchor constraint information. The entity names corresponding to the head or tail entities in the candidate medical knowledge triple are constrained to be standardized medical terms. The relationship between the head and tail entities must conform to medical logic. The head entity, tail entity, and relationship are also constrained to be explicitly recorded in the medical text in a literal or semantically equivalent form, which constructs the industry discrimination standard constraint information. A mandatory JSON output protocol is defined, requiring the output of the large language model discriminator to include a boolean validity identifier field and a string-type judgment criterion field, which constructs the output protocol constraint information. A three-dimensional constraint vector is dynamically loaded based on the candidate words and injected into the system prompt words. Constructing the prompt words precisely through three types of constraints (context anchors, industry three-dimensional standards, and output protocol) limits the LLM's discrimination space and reduces irrelevant output and logical errors. This aligns the discrimination more closely with domain rules, improves validation consistency and credibility, and reduces the probability of hallucinations and misjudgments.
[0036] In a specific embodiment, the large language model discriminator of the invention is not a simple question-and-answer call as in the prior art, but a dedicated software module that encapsulates a constraint injection layer and a protocol parsing layer. The network interaction structure of the dynamic constraint injection layer is as follows: Before the call, three-dimensional constraint information is dynamically loaded based on candidate words and injected into the system prompt word SystemPrompt. The three-dimensional constraint information includes the following dimensions: ①Context Dimension: Injecting the current context anchor (e.g., node type). The extraction of context anchors uses a dynamic anchoring algorithm based on schema mapping. The specific process is as follows: During initialization, the system parses the user-configured schema (e.g., {entity type: ["disease", "drug", "symptom"], relation type: ["treatment", "induction", "relief"]}) into a structured index table and stores it in an in-memory hash map. When the Generator generates medical knowledge candidate triples (head entity head_entity, relation relationship, tail entity tail_entity), the system extracts the head entity type head_entity.type and tail entity type tail_entity.type fields.
[0037] In one embodiment, the medical knowledge candidate triple is ("hypertension", "treatment", "amlodipine", {"head_type": "disease", "relation": "treatment", "tail_type": "drug"}), and the anchor injection logic uses the extracted node types as the context anchor to dynamically construct the system prompt word SystemPrompt for the discriminator Validator:
[0038] You are validating a medical knowledge triplet.
Context anchor:
- Head entity type: Disease
- Relationship type: Treatment
- Tail entity type: Drug
Based on the above type constraints, please determine whether the triple ("hypertension", "treatment", "amlodipine") is reasonable.
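The schema-mapping step and the context-anchor prompt can be sketched as below. This is an illustrative sketch only: the schema keys `entity_type` and `relation_type`, the function names, and the exact prompt wording are assumptions rather than details confirmed by the specification.

```python
from typing import Dict, List, Set


def build_schema_index(schema: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Parse the user-configured schema into hash-based lookup sets."""
    return {
        "entity_types": set(schema["entity_type"]),
        "relation_types": set(schema["relation_type"]),
    }


def build_context_anchor(index: Dict[str, Set[str]], head: str, relation: str,
                         tail: str, meta: Dict[str, str]) -> str:
    """Render the context-anchor part of the system prompt for one candidate."""
    # Type constraint: every type on the candidate must exist in the schema.
    for field, allowed in (
        ("head_type", "entity_types"),
        ("tail_type", "entity_types"),
        ("relation", "relation_types"),
    ):
        if meta[field] not in index[allowed]:
            raise ValueError(f"{field}={meta[field]!r} is not in the schema")
    return "\n".join([
        "You are validating a medical knowledge triplet.",
        "Context anchor:",
        f"- Head entity type: {meta['head_type']}",
        f"- Relationship type: {meta['relation']}",
        f"- Tail entity type: {meta['tail_type']}",
        "Based on the above type constraints, please determine whether the "
        f'triple ("{head}", "{relation}", "{tail}") is reasonable.',
    ])


schema = {
    "entity_type": ["disease", "drug", "symptom"],
    "relation_type": ["treatment", "induction", "relief"],
}
index = build_schema_index(schema)
prompt = build_context_anchor(
    index, "hypertension", "treatment", "amlodipine",
    {"head_type": "disease", "relation": "treatment", "tail_type": "drug"},
)
```

Rejecting out-of-schema types before the prompt is built means a malformed candidate never reaches the LLM call.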
[0039] ② Criteria dimension: Injecting fixed industry-standard criteria (i.e., the three-dimensional operators of "accuracy, medical context, and completeness") to define the logical reasoning space of the model. These three-dimensional operators are injected into the Prompt in the form of medical text rules. An example of one implementation is shown below: The three-dimensional standard injected into LLMValidator: Please verify the triplet according to the following three-dimensional standard: I. Accuracy: Entities must be standard medical terms and colloquialisms must not be used (e.g., "high blood pressure" should be "hypertension"); II. Medical Context: Relationships must conform to medical logic (e.g., aspirin is an antiplatelet drug, so you cannot say "treats hypertension"); III. Completeness: Entities and relationships must be clearly documented in the original text and cannot be inferred.
[0040] In one embodiment, the verification example is as follows. Original text: "Patient has high blood pressure and takes aspirin to prevent thrombosis", medical knowledge candidate triple: ("high blood pressure", "treatment", "aspirin").
[0041] LLM judgment: { "valid": false, "reason": "Violates medical context: aspirin is an antiplatelet drug, not a blood pressure treatment.", "violated": "medical context" }
[0042] The three-dimensional operators enable the LLM to identify pharmacological logic errors and reject hallucinated triples.
[0043] ③Protocol dimension: Mandate defining the output protocol (JSONSchema), requiring the model output to include a valid identifier field of boolean type valid(Boolean) and a judgment basis field of string type reason(String), transforming the non-deterministic generation task into a deterministic classification task.
[0044] All three types of constraint information are constructed as natural-language prompt words and passed, together with the candidate triples to be verified, as input to the API of a general-purpose large language model. Internally, the input text is tokenized by a tokenizer and then mapped to high-dimensional vectors by an embedding layer. The multi-head self-attention mechanism in the multi-layer Transformer decoder then attends to the constraints carried in the prompt words within the context window. Using the medical-domain language understanding learned during pre-training, the model infers and judges the validity of the candidate triples. Finally, the language model head (LMHead) autoregressively generates, token by token, a judgment result conforming to the JSON output protocol (including the boolean validity identifier and the string judgment basis). By systematically constructing this constraint prompt strategy, the invention uses cascaded verification to reduce calling costs, dynamic constraint injection to suppress hallucinations, and a robust protocol parsing layer to tolerate non-standard output, turning a general-purpose large language model into a domain-specific, high-precision knowledge graph verification tool.
[0045] Specifically, the large language model discriminator also includes a robust protocol parsing layer. This layer performs a first JSON format parsing and field validation. If both parsing and validation are successful, it outputs a successful parsing result. If it fails, the robust protocol parsing layer uses regular expressions to locate the JSON boundaries of the failed parsing result, extracts the content enclosed in the outermost brackets, and performs a second JSON format parsing and field validation. If both parsing and validation are successful, it outputs a successful parsing result. If it fails, it activates an exception circuit breaker mechanism, uses keyword semantic analysis to statistically analyze the positive or negative tendency of the failed parsing result, and adopts a conservative strategy to output the validation result. When a clear judgment cannot be made, it defaults to returning a result of valid as True.
[0046] When JSON parsing fails, if the raw text returned by the LLM starts with "VALID", it is treated as positive (returns valid); if it starts with "INVALID", it is treated as negative (returns invalid). If neither matches or any exception occurs, the result is treated as undeterminable. In that case the exception circuit breaker mechanism is triggered, and the current code returns valid (True) by default to avoid wrongly discarding valid triples.
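The three-stage parsing flow (strict parse, boundary location, keyword fallback with a default) can be sketched as follows. This is a minimal sketch assuming the two-field protocol described in the text; the function names and the wording of the fallback reasons are hypothetical.

```python
import json
import re
from typing import Optional


def _validate(data: object) -> Optional[dict]:
    """Field validation: 'valid' must be a bool and 'reason' must be a str."""
    if (isinstance(data, dict)
            and isinstance(data.get("valid"), bool)
            and isinstance(data.get("reason"), str)):
        return {"valid": data["valid"], "reason": data["reason"]}
    return None


def _try_parse(text: str) -> Optional[dict]:
    try:
        return _validate(json.loads(text))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


def parse_discriminator_output(raw: str) -> dict:
    # Stage 1: strict JSON parsing plus field validation.
    result = _try_parse(raw)
    if result is not None:
        return result
    # Stage 2: locate the outermost braces (greedy match) and parse again.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        result = _try_parse(match.group(0))
        if result is not None:
            return result
    # Stage 3: circuit breaker. Check INVALID first, since only the longer
    # word is unambiguous once the text is known not to start with it.
    text = raw.strip().upper()
    if text.startswith("INVALID"):
        return {"valid": False, "reason": "keyword fallback: INVALID"}
    if text.startswith("VALID"):
        return {"valid": True, "reason": "keyword fallback: VALID"}
    # Undeterminable: default to valid=True, as the text specifies.
    return {"valid": True, "reason": "undetermined; default applied"}


strict = parse_discriminator_output('{"valid": false, "reason": "x"}')
wrapped = parse_discriminator_output('Result: {"valid": true, "reason": "ok"} Hope this helps.')
keyword = parse_discriminator_output("INVALID: aspirin is not a blood pressure drug")
unknown = parse_discriminator_output("I am not sure about this one.")
```

Every call returns the same two-field dictionary, so downstream code never has to branch on which stage produced the verdict.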
[0047] In a specific implementation, a dedicated parsing algorithm is deployed at the model output. Without this parsing layer, ordinary model output cannot be directly used by the code logic. This layer includes a JSON boundary location algorithm and an exception circuit breaker mechanism (for handling non-standard outputs), ensuring that deterministic Boolean results can still be extracted even when the model generates unstructured medical text, guaranteeing the stability of the automated pipeline. For example, when the LLM outputs unstructured medical text such as "This triple is invalid because aspirin is not a blood pressure medication," the algorithm detects the keyword "invalid" and automatically extracts it as {"valid": false, "reason": "The conclusion is drawn through semantic extraction"}. This design allows the system to tolerate the non-deterministic output of the LLM, converting it into the deterministic Boolean values required by the automated pipeline, ensuring the stability and robustness of the entire medical knowledge graph construction process. All original outputs that fail to parse are retained for manual review, ensuring the traceability of boundary cases.
[0048] In a specific embodiment, to address the information loss caused by simple "deduplicate and discard" or "overwrite update" in existing technologies, this invention employs an incremental fusion algorithm based on atomic transactions in a graph database. In S3, constructing the valid medical knowledge triples into a medical knowledge graph based on the multi-source traceability incremental fusion algorithm specifically includes the following. The valid medical knowledge triples are stored in a graph database. The unique fingerprint of the head or tail entity of each valid triple is calculated as fingerprint = MD5(Project_id + node_type + node_name). The unique fingerprint of the relation of each valid triple is calculated as fingerprint = MD5(Project_id + rel_source_name + rel_target_name + ... rel_type). Here Project_id is a globally unique identifier; node_type is the type of the medical entity node constructed in the medical knowledge graph for the head or tail entity; node_name is the name of that medical entity node, i.e., the normalized entity name of the head or tail entity; rel_source_name is the name of the medical entity node constructed for the head entity of the valid triple, i.e., the normalized entity name of the head entity; rel_target_name is the name of the medical entity node constructed for the tail entity of the valid triple, i.e., the normalized entity name of the tail entity; and rel_type is the relation type of the medical relationship edge constructed for the relation of the valid triple. Collision detection is performed using the index of the graph database. 
When a collision is detected between the unique fingerprints of entities or the unique fingerprints of the relationships, incremental fusion is performed on the attributes of the medical entity nodes or medical relationship edges constructed by the medical knowledge graph, including source index fusion, appending of evidence medical text, and aggregation of location information. When no collision is detected, new medical entity nodes or medical relationship edges are created.
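The fingerprint-and-fuse behaviour can be sketched with an in-memory dictionary standing in for the graph database index. This is a sketch under assumptions: the store layout, the function names, and the example file names are hypothetical, and a real deployment would run the same logic inside a graph database transaction.

```python
import hashlib
from typing import Dict, Tuple


def node_fingerprint(project_id: str, node_type: str, node_name: str) -> str:
    """fingerprint = MD5(Project_id + node_type + node_name), as a hex digest."""
    joined = project_id + node_type + node_name
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def upsert_node(store: Dict[str, dict], project_id: str, node_type: str,
                node_name: str, source_file: str, evidence: str,
                location: Tuple[int, int]) -> str:
    """Create the node on first sight; otherwise fuse attributes into it."""
    key = node_fingerprint(project_id, node_type, node_name)
    node = store.get(key)  # dictionary lookup stands in for the database index
    if node is None:
        store[key] = {
            "name": node_name,
            "type": node_type,
            "source_file": [source_file],
            "evidence": evidence,
            "location": [location],  # (text block index, page index)
        }
        return "created"
    # Collision: append rather than overwrite, so no source is lost.
    if source_file not in node["source_file"]:
        node["source_file"].append(source_file)
    node["evidence"] += "\n" + evidence
    node["location"].append(location)
    return "merged"


store: Dict[str, dict] = {}
r1 = upsert_node(store, "CardiovascularKG_2024", "disease", "hypertension",
                 "guideline_a.pdf", "Amlodipine treats hypertension.", (3, 12))
r2 = upsert_node(store, "CardiovascularKG_2024", "disease", "hypertension",
                 "review_b.pdf", "Hypertension induces stroke.", (7, 40))
r3 = upsert_node(store, "OtherKG", "disease", "hypertension",
                 "notes_c.pdf", "Hypertension is common.", (1, 1))
```

Plain concatenation follows the formula in the text. Joining fields without a delimiter can make different field combinations produce the same string, so an implementation may want a separator, which the source does not specify.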
[0049] The Project_id is configured by the user during system initialization and serves as a globally unique identifier for medical knowledge graph projects (e.g., "CardiovascularKG_2024"). All documents imported in the same batch share the same Project_id, ensuring correct merging of entities across documents. Entities with different Project_ids will generate different fingerprints even if they have the same name, achieving multi-project isolation.
[0050] Specifically, before calculating the unique fingerprint of the head entity or tail entity, a synonym preprocessing layer is used to uniformly map different variants of the entity name of the head entity or tail entity to the standard form in the domain standard medical thesaurus to obtain the corresponding normalized entity name; The source index fusion specifically includes: appending the new source file name to the source file name (source_file) field of the medical entity node; The appending of medical evidence specifically includes appending new contextual evidence, separated by newline characters, to the evidence field of the medical entity node. The location information aggregation specifically includes: listing and merging the new medical text block index and page index into the location information field of the medical entity node.
[0051] In one embodiment, there are entity names that differ in form but share the same meaning (e.g., "hypertension" vs. "hypertensive disease" vs. "Hypertension"). The synonym preprocessing layer maps these to a single form before fingerprint calculation. The system preloads a synonym dictionary from a standard medical thesaurus and normalizes all variants into the standard form, so that different expressions generate the same fingerprint and thereby trigger fusion. During fusion, the evidence medical texts, file names, and location information from multiple sources are all appended to the same node.
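A minimal sketch of the synonym preprocessing step is shown below. The dictionary contents, the case-folding and whitespace rules, and the function names are assumptions for illustration; the specification only states that variants are mapped to a standard form before the fingerprint is computed.

```python
import hashlib
from typing import Dict

# Hypothetical excerpt of a synonym dictionary: lowercase variant -> standard form.
SYNONYMS: Dict[str, str] = {
    "hypertensive disease": "hypertension",
    "high blood pressure": "hypertension",
}


def normalize_entity_name(name: str, synonyms: Dict[str, str] = SYNONYMS) -> str:
    """Collapse whitespace, fold case, then map known variants to the standard form."""
    key = " ".join(name.split()).lower()
    return synonyms.get(key, key)


def normalized_fingerprint(project_id: str, node_type: str, name: str) -> str:
    """Normalize first, so every variant hashes to the same fingerprint."""
    node_name = normalize_entity_name(name)
    return hashlib.md5((project_id + node_type + node_name).encode("utf-8")).hexdigest()


variants = ["hypertension", "hypertensive disease", "Hypertension"]
fingerprints = {
    normalized_fingerprint("CardiovascularKG_2024", "disease", v) for v in variants
}
```

All three variants collapse to one fingerprint, which is what triggers fusion instead of creating three separate nodes.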
[0052] In one embodiment, the same entity participates in relationships of different types (e.g., the same "hypertension" entity can appear in both ("hypertension", "induction", "stroke") and ("hypertension", "treatment", "amlodipine")). Therefore, entities and relationships are stored independently. Medical entity node fingerprints are based solely on MD5(Project_id + node_type + node_name), while relationship fingerprints are based on MD5(Project_id + rel_source_name + rel_target_name + rel_type). Thus, medical entity nodes are merged (their source information is combined), but edges of different relationship types are created separately, preserving the complete knowledge network structure and achieving zero information loss.
[0053] Specifically, this also includes S4, which collects parsing failure results during the cascading verification process as boundary cases for manual review, and feeds back the correct samples after manual review to the construction of the prompt words for closed-loop optimization. By collecting boundary cases and manually reviewing feedback, a closed-loop optimization mechanism is formed. Continuously supplementing high-quality samples improves the extraction and verification effects, enabling the system to iterate continuously in long-term use, enhancing its generalization ability and domain adaptability, and improving the overall intelligence level.
[0054] In a specific embodiment, it also includes S5, assigning an initial confidence score to each valid medical knowledge triplet. The initial confidence score is determined according to the cascaded verification path of the valid medical knowledge triplet. For example, a first confidence score is assigned to a valid medical knowledge triplet that passes the first-level verification, and a second confidence score is assigned to a triplet that passes the second-level verification.
[0055] Traverse all medical entity nodes in the medical knowledge graph. For each medical entity node, extract all valid medical knowledge triples directly connected to the medical entity node to form an adjacency triple set for each medical entity node. Detect whether there are triple pairs sharing the medical entity node in the adjacency triple set. The triple pair includes a first triple and a second triple, wherein the first triple connects the medical entity node to a first adjacent entity node, and the second triple connects the medical entity node to a second adjacent entity node. There is an intermediate path between the first adjacent entity node and the second adjacent entity node, and the length of the intermediate path is within a preset range. When the triplet pair is detected, the medical entity node, the first triplet, the intermediate path, and the second triplet are combined to form a local reasoning structure. The initial confidence score of the valid medical knowledge triplet is iteratively updated according to the type of the local reasoning structure to obtain the final confidence score.
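The search for local reasoning structures can be sketched with a bounded breadth-first search. This is an illustrative sketch with several assumptions: triples are treated as directed edges, only outgoing triples of the node count as adjacent, the intermediate path runs from the second adjacent node to the first and may not pass back through the shared node, and the length limit of 4 is a placeholder for the preset range.

```python
from collections import deque
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def find_path(edges: List[Triple], start: str, goal: str,
              max_len: int, blocked: str) -> Optional[List[Triple]]:
    """Breadth-first search for a directed path of at most max_len edges."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal and path:
            return path
        if len(path) == max_len:
            continue  # do not extend paths beyond the preset range
        for head, rel, tail in edges:
            if head == node and tail not in seen and tail != blocked:
                seen.add(tail)
                queue.append((tail, path + [(head, rel, tail)]))
    return None


def local_structures(triples: List[Triple], center: str, max_len: int = 4) -> List[dict]:
    """Collect (center, first triple, intermediate path, second triple) structures."""
    adjacent = [t for t in triples if t[0] == center]
    found = []
    for first in adjacent:
        for second in adjacent:
            if first == second:
                continue
            path = find_path(triples, second[2], first[2], max_len, blocked=center)
            if path:
                found.append({"center": center, "first": first,
                              "path": path, "second": second})
    return found


triples = [
    ("Aspirin", "Treatment", "Angina"),
    ("Aspirin", "Inhibition", "Platelet Aggregation"),
    ("Platelet Aggregation", "Induction", "Thrombosis"),
    ("Thrombosis", "Induction", "Myocardial Infarction"),
    ("Myocardial Infarction", "Induction", "Angina"),
]
structures = local_structures(triples, "Aspirin")
```

Breadth-first order returns the shortest qualifying path first, which suits a scheme in which shorter paths carry more weight.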
[0056] Specifically, iteratively updating the initial confidence score of the valid medical knowledge triples according to the type of the local reasoning structure to obtain the final confidence score includes the following. For the local reasoning structure, the relation type of the first triple, the relation type of the second triple, and all relation types on the intermediate path are obtained. The direct medical influence direction is determined from the relation type of the first triple, the second medical influence direction is determined from the relation type of the second triple, and it is determined whether the influence directions of all relation types on the intermediate path are consistent. If they are consistent, the indirect medical influence direction equals the second medical influence direction; otherwise, the indirect medical influence direction is marked as unreliable. Based on the direct medical influence direction, the indirect medical influence direction, and whether the influence directions of all relation types on the intermediate path are consistent, the local reasoning structure is classified as a contradictory structure, a feasible structure, or an undetermined structure.
[0057] In one embodiment, if the influence directions of all relation types on the intermediate path are consistent, the indirect medical influence direction equals the second medical influence direction, and the intermediate path is marked as a reasonable transmission. If any relation types on the intermediate path have inconsistent influence directions, the intermediate path is marked as an unreasonable transmission. When the intermediate path is a reasonable transmission, the local reasoning structure is marked as a feasible structure if the direct medical influence direction is the same as the indirect medical influence direction, and as a contradictory structure if the two directions are opposite. When the intermediate path is an unreasonable transmission, the local reasoning structure is marked as an undetermined structure regardless of whether the direct and indirect medical influence directions are the same or opposite.
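The classification rule can be written as a small function. This is a sketch, not the claimed implementation: the direction table (+1 for a positive influence, -1 for a negative one) and the relation names in it are illustrative assumptions, as are the function name and return labels.

```python
from typing import Dict, List

# Assumed direction table: +1 = positive influence, -1 = negative influence.
DIRECTION: Dict[str, int] = {"Treatment": 1, "Inhibition": 1, "Induction": -1}


def classify_structure(first_rel: str, second_rel: str, path_rels: List[str],
                       direction: Dict[str, int] = DIRECTION) -> str:
    """Return 'feasible', 'contradictory', or 'undetermined'."""
    direct = direction[first_rel]    # direct medical influence direction
    second = direction[second_rel]   # second medical influence direction
    path_directions = {direction[rel] for rel in path_rels}
    if len(path_directions) > 1:
        # Unreasonable transmission: the indirect direction is unreliable.
        return "undetermined"
    # Reasonable transmission: indirect direction equals the second direction.
    indirect = second
    return "feasible" if direct == indirect else "contradictory"


same = classify_structure("Treatment", "Inhibition", ["Induction"] * 3)
opposite = classify_structure("Induction", "Treatment", ["Induction"])
mixed = classify_structure("Treatment", "Inhibition", ["Induction", "Treatment"])
```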
[0058] For each local reasoning structure, a correction proposal for the corresponding triple is generated according to its classification. All correction proposals received by each valid triple of medical knowledge in the same round of iteration are collected, and the corresponding correction factor is assigned according to the correction proposal.
[0059] For local reasoning structures marked as contradictory structures, the confidence scores of the first and second triples are reduced by assigning a low-value correction factor. For local reasoning structures marked as feasible structures, the confidence scores of the first and second triples are increased by assigning a high-value correction factor. For local reasoning structures marked as undetermined structures, the correction factor for the first and second triples is set to 1.
[0060] The transmissibility of medical causality decreases sharply and non-linearly as path length increases. Therefore, the weight of a correction proposal is set to be inversely related to the length of the corresponding intermediate path, and the length of the intermediate path is capped so that noise from long paths, which far outnumber short ones, does not drown out the effective signal of short paths.
[0061] In one embodiment, for a feasible structure, a positive correction proposal is generated with a correction factor of 1 + α × (1 - L / 4), where α is a preset gain coefficient (0 < α ≤ 0.2) and L is the length of the intermediate path (0 ≤ L ≤ 4). For a contradictory structure, a negative correction proposal is generated with a correction factor of 1 - β × (1 - L / 4), where β is a preset penalty coefficient (0 < β ≤ 0.3).
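The correction factors can be computed as in the sketch below. The default values alpha = 0.2 and beta = 0.3 are taken as the upper ends of the preset ranges, the bound of 4 on the path length is read off the (1 - L / 4) term so the weight stays between 0 and 1, and the function name and labels are hypothetical.

```python
def correction_factor(kind: str, path_len: int,
                      alpha: float = 0.2, beta: float = 0.3) -> float:
    """Correction factor for one local reasoning structure.

    kind: 'feasible', 'contradictory', or 'undetermined'.
    path_len: length L of the intermediate path.
    """
    if not 0 <= path_len <= 4:
        raise ValueError("path length must lie in [0, 4]")
    weight = 1 - path_len / 4  # shorter paths carry more weight
    if kind == "feasible":
        return 1 + alpha * weight
    if kind == "contradictory":
        return 1 - beta * weight
    return 1.0  # undetermined structures leave the score unchanged


gain = correction_factor("feasible", 3)          # 1 + 0.2 * 0.25
penalty = correction_factor("contradictory", 1)  # 1 - 0.3 * 0.75
neutral = correction_factor("undetermined", 2)
```

At L = 4 the weight is zero, so a path at the length limit contributes a neutral factor of 1 whichever way the structure is classified.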
[0062] In each iteration round, the mean of all correction factors received by each triple is calculated to obtain its comprehensive correction coefficient. The confidence score of each triple is updated with this coefficient in each iteration until the stopping condition is met (for example, the ratio of the initial confidence score to the current iteration's confidence score meets a threshold, or the maximum number of iterations is reached). The confidence score at the time the iteration stops is taken as the final confidence score.
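The iterative update can be sketched as below. Several details are assumptions rather than statements from the text: the score is updated by multiplying it with the comprehensive correction coefficient, scores are capped at 1.0, the correction factors are held fixed across rounds, and the loop stops when the largest change falls below a tolerance or a round limit is reached (a simplified stand-in for the ratio-based threshold).

```python
from statistics import mean
from typing import Dict, List


def iterate_confidence(initial: Dict[str, float],
                       factors_by_triple: Dict[str, List[float]],
                       max_rounds: int = 5, tol: float = 1e-3) -> Dict[str, float]:
    """Repeatedly apply the mean correction factor to each triple's score."""
    scores = dict(initial)
    for _ in range(max_rounds):
        previous = dict(scores)
        for triple, factors in factors_by_triple.items():
            if not factors:
                continue  # no proposals received: score stays as it is
            coefficient = mean(factors)  # comprehensive correction coefficient
            scores[triple] = min(1.0, scores[triple] * coefficient)
        change = max((abs(scores[t] - previous[t]) for t in scores), default=0.0)
        if change < tol:
            break
    return scores


initial = {"t1": 0.9, "t2": 0.5, "t3": 0.7}
factors = {"t1": [1.05, 1.05], "t2": [0.775, 1.0], "t3": []}
one_round = iterate_confidence(initial, factors, max_rounds=1)
many_rounds = iterate_confidence(initial, factors, max_rounds=50)
```

The triple keys here are opaque labels. In practice they would be the relationship fingerprints or the triples themselves.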
[0063] In one embodiment, taking the medical entity node "Aspirin" as an example, its adjacency triple set includes the first triple ("Aspirin", "Treatment", "Angina") and the second triple ("Aspirin", "Inhibition", "Platelet Aggregation"). A graph lookup finds an intermediate path from the second adjacent entity node "Platelet Aggregation" to the first adjacent entity node "Angina". This path consists of three triples connected in series: ("Platelet Aggregation", "Induction", "Thrombosis"), ("Thrombosis", "Induction", "Myocardial Infarction"), and ("Myocardial Infarction", "Induction", "Angina"). The system defines the influence direction of each relationship: "Treatment" is positive, "Inhibition" is positive, and "Induction" is negative. Since the influence directions of all relationships on the intermediate path are negative and therefore consistent, the intermediate path is a reasonable transmission and the indirect medical influence direction equals the influence direction of the second triple, which is positive. The direct medical influence direction is also positive, so the two directions are the same. According to the classification rules, when the intermediate path is a reasonable transmission and the direct and indirect medical influence directions are the same, the local reasoning structure is marked as a feasible structure. The system then applies a confidence gain to the first and second triples in this structure.
[0064] According to one aspect of the present invention, an automated medical knowledge graph construction system applying the method described in any one of the first aspects is proposed, comprising the following modules: The medical knowledge candidate triple extraction module M1 is configured to read medical text and complete medical text segmentation, and call the large language model generator to read the segmented medical text to extract a set of medical knowledge candidate triples. The medical knowledge candidate triples include head entity, relation and tail entity. The cascaded verification module M2 is configured to perform cascaded verification on each candidate triplet of medical knowledge in the set of candidate triplets of medical knowledge to obtain valid triplets of medical knowledge. The cascaded verification includes a first-level verification and a second-level verification, specifically including: The first-level verification submodule M201 is configured to upload a domain-standard medical glossary as the whitelist for the first-level verification. It uses a hybrid decision algorithm of whitelist short-circuit priority to perform first-level verification on the candidate triples of medical knowledge. The candidate triples of medical knowledge that pass the first-level verification are marked as valid triples of medical knowledge. The hybrid decision algorithm of whitelist short-circuit priority is used to determine whether the candidate triples of medical knowledge match the whitelist. The secondary verification submodule M202 is configured to input the medical knowledge candidate triplet into a large language model discriminator for secondary verification if the candidate triplet fails the primary verification. The large language model discriminator includes a dynamic constraint injection layer that constructs three-dimensional constraint information to build prompt words. 
Based on the prompt words, the large language model discriminator performs semantic rationality judgment on the medical knowledge candidate triplet to determine whether it is a valid medical knowledge triplet. The three-dimensional constraint information includes context anchor constraint information, industry discrimination standard constraint information, and output protocol constraint information. The medical knowledge graph construction module M3 constructs a medical knowledge graph from the effective triples of medical knowledge based on the multi-source traceability incremental fusion algorithm. The multi-source traceability incremental fusion algorithm is used to incrementally fuse the attributes of the medical entity nodes or medical relationship edges constructed in the medical knowledge graph.
[0065] Reference is now made to Figure 3, which shows a schematic diagram of the structure of a computer system 300 suitable for implementing the electronic device of the embodiments of the present application. The electronic device shown in Figure 3 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.
[0066] As shown in Figure 3, the computer system 300 includes a central processing unit (CPU) 301, which performs various appropriate actions and processes based on programs stored in read-only memory (ROM) 302 or programs loaded from storage section 309 into random access memory (RAM) 304. RAM 304 also stores various programs and data required for the operation of system 300. CPU 301, ROM 302, ROM 303, and RAM 304 are interconnected via bus 305. Input / output (I / O) interface 306 is also connected to bus 305.
[0067] The following components are connected to I / O interface 306: an input section 307 including a keyboard, mouse, etc.; an output section 308 including a liquid crystal display (LCD) and speakers, etc.; a storage section 309 including a hard disk, etc.; and a communication section 310 including a network interface card such as a LAN card and a modem, etc. The communication section 310 performs communication processing via a network such as the Internet. A drive 311 is also connected to I / O interface 306 as needed. A removable medium 312, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on drive 311 as needed so that computer programs read from it can be installed into storage section 309 as needed.
[0068] Specifically, according to embodiments of this disclosure, the processes described above with reference to the flowcharts are implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program is downloaded and installed from a network via communication section 310, and / or installed from removable medium 312. When the computer program is executed by central processing unit (CPU) 301, it performs the functions defined in the methods of this application.
[0069] It should be noted that the computer-readable medium of this application may be a computer-readable signal medium, a computer-readable storage medium, or any combination thereof. A computer-readable storage medium is, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this application, a computer-readable storage medium is any tangible medium that contains or stores a program used by or in connection with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium includes a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals take various forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and it may transmit, propagate, or transfer a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to: wireless, wireline, optical fiber, RF, etc., or any suitable combination thereof.
[0070] Computer program code for performing the operations of this application is written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code executes entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer is connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or connected to an external computer (e.g., via the Internet using an Internet service provider).
[0071] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram represents a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually execute substantially in parallel, and they may sometimes execute in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, is implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0072] The modules described in the embodiments of this application are implemented in software or hardware.
[0073] In another aspect, this application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist independently without being assembled into the electronic device. The computer-readable storage medium carries one or more programs, which, when executed by the electronic device, cause the electronic device to: S1, read medical text and complete medical text segmentation, call a large language model generator to read the segmented medical text and extract a set of medical knowledge candidate triples, wherein the medical knowledge candidate triples include a head entity, a relation, and a tail entity; S2, perform cascaded verification on each medical knowledge candidate triple in the set of medical knowledge candidate triples to obtain valid medical knowledge triples, wherein the cascaded verification includes a first-level verification and a second-level verification, specifically including: S201, uploading a domain-standard medical lexicon as a whitelist for the first-level verification, using a hybrid decision algorithm of whitelist short-circuit priority to perform a first-level verification on the medical knowledge candidate triples, and marking the medical knowledge candidate triples that pass the first-level verification as valid medical knowledge triples, wherein the hybrid decision algorithm of whitelist short-circuit priority is used to determine... S202, if the medical knowledge candidate triplet does not pass the first-level verification, the medical knowledge candidate triplet is input into a large language model discriminator for second-level verification. The large language model discriminator includes a dynamic constraint injection layer, which constructs three-dimensional constraint information to construct prompt words. 
The large language model discriminator performs semantic rationality judgment on the medical knowledge candidate triplet based on the prompt words to determine whether the medical knowledge candidate triplet is a valid medical knowledge triplet. The three-dimensional constraint information includes context anchor constraint information, industry discrimination standard constraint information, and output protocol constraint information. S3, the valid medical knowledge triplet is constructed into a medical knowledge graph based on a multi-source traceability incremental fusion algorithm. The multi-source traceability incremental fusion algorithm is used to incrementally fuse the attributes of medical entity nodes or medical relationship edges constructed in the medical knowledge graph.
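The cascaded flow of S2 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function name `cascade_verify`, the stub discriminator, and the sample data are all hypothetical, and the first level is reduced to a plain set lookup. The point it shows is that a whitelist hit short-circuits the flow, so the large language model discriminator is called only for candidates that miss the whitelist.

```python
from typing import Callable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def cascade_verify(
    candidates: Iterable[Triple],
    whitelist: Set[str],
    discriminator: Callable[[Triple], bool],
) -> Tuple[List[Triple], int]:
    """Return the valid triples and the number of discriminator calls made."""
    valid: List[Triple] = []
    llm_calls = 0
    for head, relation, tail in candidates:
        # First level: short-circuit when both entities hit the whitelist.
        if head in whitelist and tail in whitelist:
            valid.append((head, relation, tail))
            continue
        # Second level: only whitelist misses reach the discriminator.
        llm_calls += 1
        if discriminator((head, relation, tail)):
            valid.append((head, relation, tail))
    return valid, llm_calls


# Hypothetical data; the lambda stands in for a real LLM discriminator.
demo_whitelist = {"aspirin", "headache"}
demo_candidates = [
    ("aspirin", "relief", "headache"),
    ("aspirin", "treatment", "moon fever"),
]
demo_valid, demo_calls = cascade_verify(
    demo_candidates, demo_whitelist, lambda triple: False
)
```

In this toy run only the second candidate reaches the discriminator, which is how the cascade reduces calls to the large language model.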
[0074] The above description is merely a preferred embodiment of this application and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in this application is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described inventive concept. For example, the scope also covers technical solutions formed by substituting the above features with (but not limited to) technical features with similar functions disclosed in this application.
Claims
1. An automated method for constructing a medical knowledge graph, characterized in that it includes the following steps: S1, read the medical text and complete the medical text segmentation, call a large language model generator to read the segmented medical text and extract a set of medical knowledge candidate triples, each medical knowledge candidate triple including a head entity, a relation, and a tail entity; S2, perform cascaded verification on each medical knowledge candidate triple in the set of medical knowledge candidate triples to obtain valid medical knowledge triples, the cascaded verification including a first-level verification and a second-level verification, specifically including: S201, upload a domain-standard medical lexicon as the whitelist for the first-level verification, and use a hybrid decision algorithm with whitelist short-circuit priority to perform the first-level verification on the medical knowledge candidate triples, wherein the medical knowledge candidate triples that pass the first-level verification are marked as valid medical knowledge triples, and the hybrid decision algorithm with whitelist short-circuit priority is used to determine whether the medical knowledge candidate triples hit the whitelist; S202, if a medical knowledge candidate triple fails the first-level verification, input the medical knowledge candidate triple into a large language model discriminator for the second-level verification, wherein the large language model discriminator includes a dynamic constraint injection layer that constructs three-dimensional constraint information to build a prompt, the large language model discriminator performs a semantic rationality judgment on the medical knowledge candidate triple based on the prompt to determine whether the medical knowledge candidate triple is a valid medical knowledge triple, and the three-dimensional constraint information includes context anchor constraint information, industry discrimination standard constraint information, and output protocol constraint information; S3, construct the valid medical knowledge triples into a medical knowledge graph based on a multi-source traceability incremental fusion algorithm, wherein the multi-source traceability incremental fusion algorithm is used to incrementally fuse the attributes of medical entity nodes or medical relationship edges constructed in the medical knowledge graph.
2. The automated method for constructing a medical knowledge graph according to claim 1, characterized in that the entity types of the head entity or the tail entity include at least disease, drug, and symptom, and the relation types of the relation include at least treatment, induction, relief, and inhibition.
3. The automated method for constructing a medical knowledge graph according to claim 2, characterized in that, in S202, the dynamic constraint injection layer constructing three-dimensional constraint information to build a prompt specifically includes: configuring a target medical knowledge triple structure and parsing the target medical knowledge triple structure into a structured index table, the structured index table containing a set of entity types and a set of relation types; based on the structured index table, extracting the entity types of the head entity and the tail entity in the medical knowledge candidate triple and the relation type of the relation, and applying type constraints, so as to construct the context anchor constraint information; constraining the entity names corresponding to the head entity or the tail entity in the medical knowledge candidate triple to be standardized medical terms, constraining the relation between the head entity and the tail entity to conform to medical logic, and constraining the head entity, the tail entity, and the relation to be explicitly recorded in the medical text in a literal or semantically equivalent form, so as to construct the industry discrimination standard constraint information; and defining a mandatory JSON output protocol under which the output of the large language model discriminator must include a Boolean validity identifier field and a string-type decision criterion field, so as to construct the output protocol constraint information.
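A minimal Python sketch of how the three constraint dimensions of claim 3 might be assembled into a prompt is given below. The function names `build_index_table` and `build_prompt`, the JSON field names `is_valid` and `reason`, the prompt wording, and the choice to raise `ValueError` on a type outside the schema are all assumptions made for illustration; the claim does not specify them.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def build_index_table(schema: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Parse the configured triple structure into sets of allowed types."""
    return {
        "entity_types": set(schema["entity_types"]),
        "relation_types": set(schema["relation_types"]),
    }


def build_prompt(
    triple: Triple,
    head_type: str,
    tail_type: str,
    index_table: Dict[str, Set[str]],
    source_text: str,
) -> str:
    head, relation, tail = triple
    # Type constraint (assumed behavior): reject types outside the schema.
    for entity_type in (head_type, tail_type):
        if entity_type not in index_table["entity_types"]:
            raise ValueError(f"unknown entity type: {entity_type}")
    if relation not in index_table["relation_types"]:
        raise ValueError(f"unknown relation type: {relation}")

    # Dimension 1: context anchor (typed triple plus the source passage).
    context_anchor = (
        f"Head entity: {head} (type: {head_type})\n"
        f"Relation: {relation}\n"
        f"Tail entity: {tail} (type: {tail_type})\n"
        f"Source text: {source_text}"
    )
    # Dimension 2: industry discrimination standards.
    standards = (
        "Judge the triple as valid only if: (1) both entity names are "
        "standardized medical terms; (2) the relation conforms to medical "
        "logic; (3) the entities and the relation are recorded in the source "
        "text literally or in a semantically equivalent form."
    )
    # Dimension 3: output protocol (field names are hypothetical).
    output_protocol = (
        'Reply with a single JSON object only: '
        '{"is_valid": <true|false>, "reason": "<string>"}'
    )
    return "\n\n".join([context_anchor, standards, output_protocol])
```

Keeping the three dimensions as separate strings makes it straightforward to revise one of them, for example the standards text, without touching the others.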
4. The automated method for constructing a medical knowledge graph according to claim 3, characterized in that the large language model discriminator further includes a robust protocol parsing layer, which performs a first JSON format parsing and field validation on the output; if both the parsing and the validation succeed, a successful parsing result is output; if either fails, the robust protocol parsing layer uses regular expressions to locate the JSON boundaries in the output that failed to parse, extracts the content enclosed in the outermost brackets, and performs a second JSON format parsing and field validation; if both the parsing and the validation succeed, a successful parsing result is output; if either fails, an exception circuit breaker mechanism is activated, in which keyword semantic analysis is used to count the positive or negative tendency of the output that failed to parse, and, when a clear judgment cannot be made, a parsing result with the validity identifier field set to True is returned by default.
5. The automated method for constructing a medical knowledge graph according to claim 4, characterized in that the method further includes S4: collecting the parsing-failure results produced during the cascaded verification as boundary cases for manual review, and feeding the samples confirmed as correct by the manual review back into the construction of the prompt for closed-loop optimization.
6. The automated method for constructing a medical knowledge graph according to claim 1, characterized in that, in S201, using the hybrid decision algorithm with whitelist short-circuit priority to perform the first-level verification on the medical knowledge candidate triples specifically includes: after normalizing and preprocessing the entity names corresponding to the head entity and the tail entity in the medical knowledge candidate triple, performing a first exact match against the whitelist; if the entity names corresponding to both the head entity and the tail entity successfully complete the first exact match, the first-level verification is passed; otherwise, transforming the entity names corresponding to the head entity and the tail entity based on a preloaded synonym mapping table of the domain-standard medical lexicon and performing a second exact match; if the entity names corresponding to both the head entity and the tail entity successfully complete the second exact match, the first-level verification is passed; otherwise, calculating the similarity between the entity names corresponding to the head entity and the tail entity and the whitelist terms based on a similarity formula; if the similarity reaches a preset similarity threshold, the first-level verification is passed; otherwise, the second-level verification is performed.
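The three matching stages of claim 6 can be sketched in Python as below. Several points are assumptions rather than content of the claim: the similarity formula is not specified, so `difflib.SequenceMatcher.ratio()` from the standard library is used as a stand-in; the threshold of 0.9 is arbitrary; the whitelist is assumed to be stored in normalized form; and the similarity stage is applied to the synonym-mapped names.

```python
from difflib import SequenceMatcher
from typing import Dict, Set


def normalize(name: str) -> str:
    """Assumed preprocessing: trim, lowercase, collapse inner whitespace."""
    return " ".join(name.strip().lower().split())


def first_level_verify(
    head: str,
    tail: str,
    whitelist: Set[str],
    synonyms: Dict[str, str],
    threshold: float = 0.9,
) -> bool:
    names = [normalize(head), normalize(tail)]
    # Stage 1: first exact match; both entities must hit the whitelist.
    if all(name in whitelist for name in names):
        return True
    # Stage 2: map through the synonym table, then second exact match.
    mapped = [synonyms.get(name, name) for name in names]
    if all(name in whitelist for name in mapped):
        return True

    # Stage 3: best similarity against any whitelist term.
    def best_similarity(name: str) -> float:
        return max(
            (SequenceMatcher(None, name, term).ratio() for term in whitelist),
            default=0.0,
        )

    # A False result means the triple goes on to second-level verification.
    return all(best_similarity(name) >= threshold for name in mapped)
```

Ordering the stages from cheapest to most expensive means that most triples built from standard terms exit after a set lookup, and the linear scan over the whitelist runs only for the remainder.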
7. The automated method for constructing a medical knowledge graph according to claim 1, characterized in that, in S3, constructing the valid medical knowledge triples into a medical knowledge graph based on the multi-source traceability incremental fusion algorithm specifically includes: storing the valid medical knowledge triples in a graph database; calculating the unique fingerprint of the head entity or the tail entity of a valid medical knowledge triple as fingerprint = MD5(Project_id + node_type + node_name); calculating the unique fingerprint of the relation of a valid medical knowledge triple as fingerprint = MD5(Project_id + rel_source_name + rel_target_name + rel_type); where Project_id represents a globally unique identifier, node_type represents the type of the medical entity node constructed in the medical knowledge graph for the head entity or the tail entity, node_name represents the name of the medical entity node constructed in the medical knowledge graph for the head entity or the tail entity, rel_source_name represents the name of the medical entity node constructed in the medical knowledge graph for the head entity of the valid medical knowledge triple, rel_target_name represents the name of the medical entity node constructed in the medical knowledge graph for the tail entity of the valid medical knowledge triple, and rel_type represents the relation type of the medical relationship edge constructed in the medical knowledge graph for the relation of the valid medical knowledge triple. Collision detection of complexity is performed using the index of the graph database. When a collision is detected between the unique fingerprints of entities or between the unique fingerprints of relations, incremental fusion is performed on the attributes of the medical entity nodes or medical relationship edges constructed in the medical knowledge graph, including source index fusion, appending of evidence medical text, and aggregation of location information. When no collision is detected, new medical entity nodes or medical relationship edges are created.
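The fingerprint formulas and the fuse-or-create rule of claim 7 can be sketched in Python as follows. A plain dictionary stands in for the graph database index, and the attribute names `sources`, `evidence`, and `positions`, along with the function names, are hypothetical.

```python
import hashlib
from typing import Dict, List, Union

Record = Dict[str, List[Union[str, int]]]


def node_fingerprint(project_id: str, node_type: str, node_name: str) -> str:
    """fingerprint = MD5(Project_id + node_type + node_name)."""
    raw = project_id + node_type + node_name
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def rel_fingerprint(
    project_id: str, source_name: str, target_name: str, rel_type: str
) -> str:
    """fingerprint = MD5(Project_id + rel_source_name + rel_target_name + rel_type)."""
    raw = project_id + source_name + target_name + rel_type
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def upsert(
    store: Dict[str, Record],
    fingerprint: str,
    source_id: str,
    evidence: str,
    position: int,
) -> bool:
    """Create a record, or fuse into an existing one. Return True if created."""
    record = store.get(fingerprint)  # dict lookup stands in for the DB index
    if record is None:
        store[fingerprint] = {
            "sources": [source_id],
            "evidence": [evidence],
            "positions": [position],
        }
        return True
    # Collision: fuse attributes instead of creating a duplicate.
    if source_id not in record["sources"]:
        record["sources"].append(source_id)  # source index fusion
    record["evidence"].append(evidence)      # append evidence text
    record["positions"].append(position)     # aggregate location information
    return False
```

One design point worth noting: concatenating fields without a separator, as the formulas are written, means that distinct field combinations such as ("ab", "c") and ("a", "bc") produce the same input string, so an implementation may wish to add a delimiter between fields.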
8. The automated method for constructing a medical knowledge graph according to claim 1, characterized in that the method further includes S5: assigning an initial confidence score to each valid medical knowledge triple, the initial confidence score being determined based on the cascaded verification path taken by the valid medical knowledge triple; traversing all medical entity nodes in the medical knowledge graph and, for each medical entity node, extracting all valid medical knowledge triples directly connected to the medical entity node to form an adjacency triple set for that medical entity node; detecting whether the adjacency triple set contains a triple pair sharing the medical entity node, the triple pair including a first triple and a second triple, wherein the first triple connects the medical entity node to a first adjacent entity node, the second triple connects the medical entity node to a second adjacent entity node, an intermediate path exists between the first adjacent entity node and the second adjacent entity node, and the length of the intermediate path is within a preset range; when such a triple pair is detected, combining the medical entity node, the first triple, the intermediate path, and the second triple to form a local reasoning structure; and iteratively updating the initial confidence score of the valid medical knowledge triple according to the type of the local reasoning structure to obtain a final confidence score.
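The detection of local reasoning structures in claim 8 can be sketched in Python as below. The sketch makes assumptions the claim leaves open: triples are treated as directed edges from head to tail, the shared node is the head of both triples, the intermediate path must not pass back through the shared node, and the preset range is taken to be "at most `max_len` edges". All function names are hypothetical.

```python
from collections import deque
from itertools import permutations
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)
Structure = Tuple[str, Triple, List[Triple], Triple]


def find_path(
    adjacency: Dict[str, List[Triple]],
    start: str,
    goal: str,
    banned: str,
    max_len: int,
) -> Optional[List[Triple]]:
    """Breadth-first search for a directed path of at most max_len edges."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) == max_len:
            continue  # do not extend paths beyond the preset length
        for triple in adjacency.get(node, []):
            nxt = triple[2]
            if nxt != banned and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [triple]))
    return None


def local_structures(triples: List[Triple], max_len: int = 2) -> List[Structure]:
    adjacency: Dict[str, List[Triple]] = {}
    for triple in triples:
        adjacency.setdefault(triple[0], []).append(triple)
    found: List[Structure] = []
    for center, outgoing in adjacency.items():
        # Ordered pairs: (first, second) and (second, first) are both tried.
        for first, second in permutations(outgoing, 2):
            if first[2] == second[2]:
                continue  # both triples reach the same neighbor
            path = find_path(adjacency, first[2], second[2], center, max_len)
            if path is not None:
                found.append((center, first, path, second))
    return found
```

For example, given the hypothetical triples (drugA, treatment, diseaseX), (drugA, relief, symptomY), and (diseaseX, induction, symptomY), the sketch reports one structure centered on drugA, with the third triple as the intermediate path.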
9. The automated method for constructing a medical knowledge graph according to claim 8, characterized in that iteratively updating the initial confidence score of the valid medical knowledge triple according to the type of the local reasoning structure to obtain the final confidence score specifically includes: for the local reasoning structure, obtaining the relation type of the first triple, the relation type of the second triple, and all relation types on the intermediate path; determining a direct medical influence direction based on the relation type of the first triple, determining a second medical influence direction based on the relation type of the second triple, and determining whether the influence directions of all relation types on the intermediate path are consistent; if they are consistent, setting the indirect medical influence direction equal to the second medical influence direction; otherwise, marking the indirect medical influence direction as unreliable; classifying the local reasoning structure as a contradictory structure, a feasible structure, or an undetermined structure based on whether the direct medical influence direction, the indirect medical influence direction, and the influence directions of all relation types on the intermediate path are consistent; for each local reasoning structure, assigning a correction factor to the corresponding triples according to its classification; calculating a comprehensive correction coefficient as the mean of all correction factors assigned to each valid medical knowledge triple in the same iteration; and updating the confidence score of the triple according to the comprehensive correction coefficient until an iteration stopping condition is met, the confidence score at the time the iteration stops being taken as the final confidence score.
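One possible reading of the classification and update rule of claim 9 is sketched below in Python. Every numeric choice is hypothetical: the sign assigned to each relation type in `DIRECTION`, the correction factors in `FACTOR`, the multiplicative update capped at 1.0, and the stopping condition (a maximum iteration count or a change below `tol`). The claim does not fix any of these.

```python
from statistics import mean
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)
Structure = Tuple[str, Triple, List[Triple], Triple]

# Hypothetical influence directions: -1 suppresses, +1 promotes.
DIRECTION = {"treatment": -1, "relief": -1, "inhibition": -1, "induction": 1}
# Hypothetical correction factors per classification.
FACTOR = {"feasible": 1.1, "contradictory": 0.8, "undetermined": 1.0}


def classify(structure: Structure) -> str:
    _, first, path, second = structure
    direct = DIRECTION.get(first[1])
    second_direction = DIRECTION.get(second[1])
    path_directions = {DIRECTION.get(triple[1]) for triple in path}
    # The path is consistent when all its relations share one known direction.
    consistent = len(path_directions) == 1 and None not in path_directions
    indirect = second_direction if consistent else None  # None = unreliable
    if direct is None or indirect is None:
        return "undetermined"
    return "feasible" if direct == indirect else "contradictory"


def update_confidence(
    initial: Dict[Triple, float],
    structures: List[Structure],
    max_iter: int = 3,
    tol: float = 1e-3,
) -> Dict[Triple, float]:
    confidence = dict(initial)
    for _ in range(max_iter):
        factors: Dict[Triple, List[float]] = {}
        for structure in structures:
            label = classify(structure)
            # Assumed: the factor applies to the first and second triples.
            for triple in (structure[1], structure[3]):
                factors.setdefault(triple, []).append(FACTOR[label])
        # Comprehensive coefficient: mean of the factors for each triple.
        updated = {
            triple: min(1.0, score * mean(factors[triple]))
            if triple in factors
            else score
            for triple, score in confidence.items()
        }
        delta = max(
            (abs(updated[t] - confidence[t]) for t in confidence), default=0.0
        )
        confidence = updated
        if delta < tol:
            break
    return confidence
```

Averaging the factors per triple means that a triple participating in both a feasible and a contradictory structure receives a dampened adjustment rather than two conflicting ones.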
10. An automated medical knowledge graph construction system applying the method described in any one of claims 1 to 9, characterized in that it includes the following modules: a medical knowledge candidate triple extraction module, configured to read medical text and perform medical text segmentation, and to call a large language model generator to read the segmented medical text and extract a set of medical knowledge candidate triples, each medical knowledge candidate triple including a head entity, a relation, and a tail entity; a cascaded verification module, configured to perform cascaded verification on each medical knowledge candidate triple in the set of medical knowledge candidate triples to obtain valid medical knowledge triples, the cascaded verification including a first-level verification and a second-level verification, and the cascaded verification module specifically including: a first-level verification submodule, configured to upload a domain-standard medical lexicon as the whitelist for the first-level verification and to use a hybrid decision algorithm with whitelist short-circuit priority to perform the first-level verification on the medical knowledge candidate triples, wherein the medical knowledge candidate triples that pass the first-level verification are marked as valid medical knowledge triples, and the hybrid decision algorithm with whitelist short-circuit priority is used to determine whether the medical knowledge candidate triples hit the whitelist; and a second-level verification submodule, configured to input a medical knowledge candidate triple into a large language model discriminator for the second-level verification if the medical knowledge candidate triple fails the first-level verification, wherein the large language model discriminator includes a dynamic constraint injection layer that constructs three-dimensional constraint information to build a prompt, the large language model discriminator performs a semantic rationality judgment on the medical knowledge candidate triple based on the prompt to determine whether it is a valid medical knowledge triple, and the three-dimensional constraint information includes context anchor constraint information, industry discrimination standard constraint information, and output protocol constraint information; and a medical knowledge graph construction module, configured to construct the valid medical knowledge triples into a medical knowledge graph based on a multi-source traceability incremental fusion algorithm, wherein the multi-source traceability incremental fusion algorithm is used to incrementally fuse the attributes of medical entity nodes or medical relationship edges constructed in the medical knowledge graph.