A dual-engine collaborative general medical knowledge extraction method

CN122309757A · Pending · Publication Date: 2026-06-30 · BEIJING MEIYIN INTELLIGENT DIGITAL TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING MEIYIN INTELLIGENT DIGITAL TECHNOLOGY CO LTD
Filing Date
2026-03-03
Publication Date
2026-06-30


Abstract

This invention relates to the field of knowledge extraction technology, specifically to a dual-engine collaborative general medical knowledge extraction method, comprising: using a preset first engine to extract entities from a first set of data to be processed, obtaining an atomic entity set and an information extraction set; using a preset second engine to perform logical reasoning on a second set of data to be processed and the atomic entity set, obtaining a logical conclusion set; fusing and verifying the information extraction set and the logical conclusion set to obtain passing samples and failing samples; and storing the passing samples while performing data augmentation on the failing samples and reprocessing them. Through dual-engine collaboration, the invention jointly handles medical knowledge extraction and logical reasoning. Fusing and verifying the information extraction results against the logical conclusions improves extraction accuracy and consistency, and data augmentation with iterative reprocessing of failing samples improves adaptability to medical semantics and reduces manual intervention. The method is applicable to general knowledge extraction from various types of medical texts.

Description

Technical Field

[0001] This invention relates to the field of knowledge extraction technology, specifically to a dual-engine collaborative general method for extracting medical knowledge.

Background Technology

[0002] With the advancement of medical informatization, a large amount of unstructured medical text data, such as electronic medical records, pathology reports, and imaging examination reports, has accumulated. Accurate and efficient structuring of this unstructured text is fundamental to building disease-specific databases, supporting clinical research analysis, and enabling in-depth utilization of medical data.

[0003] Existing medical text structuring techniques fall mainly into three categories. First, rule-based and dictionary-based matching methods extract information through keyword matching using regular expressions or medical terminology dictionaries. While simple to implement, these methods depend heavily on how the text is phrased, generalize poorly, struggle with complex linguistic phenomena such as negation and ambiguous expressions, and incur high maintenance costs as the rule set grows. Second, deep learning methods based on sequence labeling, such as BERT combined with a conditional random field (CRF), model extraction as a sequence labeling problem and are widely used in industry. However, these methods rely heavily on manually labeled data, have high cold-start costs in new task scenarios, and perform poorly on nested entities, discontinuous entities, and tasks requiring semantic understanding and logical reasoning. Third, end-to-end generation methods based on large language models generate structured results directly from prompts and have strong semantic understanding capabilities. However, they suffer from hallucination, unstable output formats, high inference costs, and slow response speeds, making it difficult to meet the accuracy and timeliness requirements of medical scenarios.

Summary of the Invention

[0004] (I) Purpose of the Invention The purpose of this invention is to provide a dual-engine collaborative general medical knowledge extraction method. Through dual-engine collaboration, it achieves joint processing of medical knowledge extraction and logical reasoning, and improves the accuracy and consistency of extraction by fusing and verifying the information extraction results and logical conclusions. Furthermore, it performs data augmentation and iterative processing on samples that fail verification, enhancing adaptability to complex medical semantics, reducing the cost of manual intervention, and making the method suitable for general knowledge extraction from various types of medical texts.

[0005] (II) Technical Solution To address the above problems, this invention provides a dual-engine collaborative general method for extracting medical knowledge, comprising: distributing the medical data to be processed by task to obtain first data to be processed and second data to be processed; using a preset first engine to extract entities from the first data to be processed, obtaining an atomic entity set and an information extraction set; using a preset second engine to perform logical reasoning on the second data to be processed and the atomic entity set, obtaining a logical conclusion set; fusing and verifying the information extraction set and the logical conclusion set to obtain passing samples and failing samples; and storing the passing samples, and performing data augmentation on the failing samples and reprocessing them until they become passing samples.

[0006] In another aspect of the present invention, preferably, the step of distributing the medical data to be processed to obtain the first data to be processed and the second data to be processed includes: Acquire medical data to be processed, wherein the medical data to be processed is unstructured medical text data, including medical record text, examination record text, or treatment record text; Based on a preset task distribution strategy, the medical data to be processed is divided into tasks to obtain the first data to be processed and the second data to be processed. The first data to be processed is used for entity recognition and attribute extraction, and the second data to be processed is used for semantic relationship analysis and knowledge reasoning.

[0007] In another aspect of the present invention, preferably, the preset first engine is built on a Prompt-based pointer generation network. The step of using the preset first engine to extract entities from the first data to be processed, obtaining an atomic entity set and an information extraction set, includes: segmenting the first data to be processed based on a preset sliding window size to obtain multiple sub-texts; mapping the first data to be processed to the sub-texts to obtain the offset mapping relationship between them; concatenating the medical prompts with the sub-texts to form the first input; and obtaining an atomic entity set and an information extraction set using the first input and the preset first engine.

[0008] In another aspect of the present invention, preferably, obtaining the atomic entity set and the information extraction set using the first input and a preset first engine includes: The first input is fed into the pre-trained language model encoder of the first engine to obtain the corresponding sequence feature representation; Based on the sequence feature representation, the pointer generation network of the first engine predicts the start and end positions of entities to obtain the boundary information of candidate entities; Based on the offset mapping relationship, the boundary information of the candidate entities is restored to the corresponding position of the first data to be processed to obtain the atomic entity set; Based on the atomic entity set, and combined with the entity type or attribute information corresponding to the medical prompt words, an information extraction set is obtained.

[0009] In another aspect of the present invention, preferably, the preset second engine is built based on a lightweight large model; The step of using a preset second engine to perform logical reasoning on the second data to be processed and the set of atomic entities to obtain a set of logical conclusions includes: Based on the preset focus rules in the preset second engine, the atomic entity set is filtered for relevance, and entities related to the reasoning task are retained to generate a list of clue entities. Based on the second data to be processed, the list of clue entities, and the preset task instructions, a second input is generated; The second input is logically reasoned based on a preset second engine to obtain a set of logical conclusions.

[0010] In another aspect of the present invention, preferably, the step of performing logical reasoning on the second input based on a preset second engine to obtain a set of logical conclusions includes: Based on the preset second engine, a reasoning result containing intermediate reasoning processes is generated from the second input under preset thought chain guidance conditions. The reasoning result is subjected to restricted decoding, and the output result is structurally parsed based on a predefined value range to extract the reasoning conclusions that meet the requirements and generate a set of logical conclusions.

[0011] In another aspect of the present invention, preferably, the step of fusing and verifying the information extraction set and the logical conclusion set to obtain passing samples and failing samples includes: Based on the text content of the entities in the information extraction set, the logical attributes in the logical conclusion set are attached to the corresponding entities to generate a structured fusion result. Based on a preset rule verification list, linear rule verification is performed on each of the structured fusion results. If any verification rule is not met, the corresponding sample is marked as a failed sample, and samples that meet all rules are marked as initially passed samples. The confidence information of the entities in the initially passed samples is obtained. Based on the confidence information, the initially passed samples are subjected to secondary verification to obtain passed samples and failed samples.

[0012] In another aspect of the present invention, preferably, the step of performing a secondary verification on the initially passed samples based on the confidence information to obtain passed samples and failed samples includes: When the confidence level is greater than or equal to the preset confidence threshold, the corresponding sample is determined to be a pass sample; When the confidence level is less than the preset confidence threshold, a new judgment is made using a preset verification model; Based on the results of the reassessment, the samples that initially passed will be determined as either passed or failed.

[0013] In another aspect of the present invention, preferably, storing the passing samples and performing data augmentation on the failing samples before further processing until they become passing samples includes: Samples determined to be passed samples are stored in the target database. Failed samples are de-identified using preset regular expression masking rules to generate seed samples. The seed samples are augmented based on a preset teacher model to generate synthetic samples, and corresponding pseudo-labels are generated for the synthetic samples. Processing continues based on the synthetic samples and pseudo-labels until the failed samples become passed samples.

[0014] In another aspect of the present invention, preferably, the processing based on the synthetic sample and the pseudo-label until the failed sample becomes a passed sample includes: The synthesized sample is subjected to consistency verification, and then input into the first engine to obtain the prediction result; Based on the degree of consistency between the prediction results and the pseudo-labels, a first enhanced sample is obtained; Based on the first enhanced sample, the parameters of the first engine and the second engine are updated; The failed samples are reprocessed using the updated first and second engines until they become passed samples.

[0015] (III) Beneficial Effects The above-described technical solution of the present invention has the following beneficial technical effects: This invention distributes medical data to be processed, with a first engine extracting entities and a second engine performing logical reasoning based on the atomic entity set, achieving collaborative processing of information extraction and logical analysis. By fusing and verifying the extracted information set and the logical conclusion set, it effectively reduces the problems of false and missed extraction caused by the complexity and ambiguity of medical text, improving the accuracy and consistency of medical knowledge extraction results. Simultaneously, samples that fail verification are introduced into the data augmentation and reprocessing process, enabling the system to iteratively optimize for complex and marginal samples, continuously improving extraction capabilities and reducing reliance on manual rule adjustments and annotations. This invention is applicable to various types of medical text data, such as electronic medical records, examination reports, and pathology reports, and possesses good versatility, stability, and scalability, providing high-quality structured data support for medical knowledge base construction, clinical research analysis, and intelligent medical applications.

Attached Figure Description

[0016] Figure 1 is an overall flowchart of one embodiment of the present invention.

Detailed Implementation

[0017] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to specific embodiments and the accompanying drawings. It should be understood that these descriptions are merely exemplary and not intended to limit the scope of the invention. Furthermore, descriptions of well-known structures and techniques are omitted in the following description to avoid unnecessarily obscuring the concept of the invention.

[0018] Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.

[0019] In the description of this invention, it should be noted that the terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

[0020] Furthermore, the technical features involved in the different embodiments of the present invention described below can be combined with each other as long as they do not conflict with each other.

[0021] The invention will now be described in more detail with reference to the accompanying drawings. In the various drawings, the same elements are indicated by similar reference numerals. For clarity, the various parts in the drawings are not drawn to scale.

[0022] Example 1 A dual-engine collaborative general method for extracting medical knowledge is proposed, applicable to knowledge extraction from unstructured or semi-structured medical data such as electronic medical records, examination reports, pathology reports, and medical order records. Figure 1 shows an overall flowchart of one embodiment of the present invention. As shown in Figure 1, the method includes: The medical data to be processed is distributed as tasks to obtain first data to be processed and second data to be processed. In this embodiment, this step includes: acquiring medical data to be processed, wherein the medical data to be processed is unstructured medical text data, including medical record text, examination record text, or treatment record text; and dividing the medical data to be processed into tasks based on a preset task distribution strategy to obtain the first data to be processed and the second data to be processed. The preset task distribution strategy determines how the medical data is allocated among the processing engines, and it can be formulated according to at least one of the following: data type, text structure characteristics, processing complexity, or processing purpose, so the same batch of medical data can be divided along any of these dimensions. In one embodiment, the data is divided according to data type; for example, medical record texts that are mainly descriptive and densely packed with entity information are classified as first data to be processed, while texts containing diagnostic conclusions, examination conclusions, or treatment decisions are classified as second data to be processed. In another embodiment, the data is divided according to text structure characteristics; for example, texts with relatively loose structure and more natural language expression are classified as first data to be processed, while texts with clear semantic relationships are classified as second data to be processed.

[0023] In this process, the first data to be processed is used for entity recognition and attribute extraction, while the second data to be processed is used for semantic relationship analysis and knowledge reasoning. The first data to be processed is used in subsequent entity recognition and attribute extraction to obtain basic atomic entity information; the second data to be processed is used in subsequent semantic relationship analysis and knowledge reasoning to support logical judgments and relationship inferences between entities. By dividing medical data into tasks at the pre-processing stage, different processing engines can focus on their respective strengths, thereby improving the accuracy and efficiency of the overall medical knowledge extraction process.
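As an illustration of the data-type division described above, the following is a minimal sketch, not the patent's implementation. The cue words in `CONCLUSION_CUES`, the `Dispatch` container, and the `distribute` function are all hypothetical; the source does not specify concrete distribution rules.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical cue words marking conclusion-like text; the source names no concrete rules.
CONCLUSION_CUES = ("diagnosis", "impression", "conclusion", "plan")


@dataclass
class Dispatch:
    first: list[str]   # routed to entity recognition and attribute extraction
    second: list[str]  # routed to semantic relationship analysis and reasoning


def distribute(paragraphs: list[str]) -> Dispatch:
    """Route each paragraph with a simple keyword-based distribution strategy."""
    out = Dispatch(first=[], second=[])
    for p in paragraphs:
        if any(cue in p.lower() for cue in CONCLUSION_CUES):
            out.second.append(p)
        else:
            out.first.append(p)
    return out
```

A keyword test is only one possible strategy; a classifier or section-header lookup would fit the same interface.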

[0024] A preset first engine is used to extract entities from the first data to be processed, obtaining an atomic entity set and an information extraction set. The first engine can be an entity extraction engine built on rules, dictionaries, statistical models, or deep learning models, and it identifies and extracts structured atomic entity information from medical text. The atomic entity set includes basic medical entities such as disease names, symptoms and signs, examination items, examination indicators, drug names, treatment methods, and time information. The information extraction set characterizes the attribute information, contextual relationships, or preliminary extraction results of the atomic entities in the text. In this embodiment, the preset first engine is built on a Prompt-based pointer generation network to achieve high-precision entity recognition and attribute extraction in medical text. The first engine uses medical domain prompts to constrain and guide the model's extraction target, and uses a pointer generation mechanism to locate entity positions directly in the original text, thereby avoiding entity boundary drift and semantic ambiguity.

[0025] The step of using a preset first engine to extract entities from the first data to be processed, obtaining an atomic entity set and an information extraction set, includes: Based on a preset sliding window size, the first data to be processed is segmented to obtain multiple sub-texts. The sliding window size controls the length of a single input text, which can be set according to model input constraints or text complexity. Adjacent sub-texts can overlap to avoid entities being truncated across segments. Through sliding window segmentation, long medical texts can be decomposed into multiple sub-texts suitable for model processing.

[0026] The first data to be processed is mapped to the subtext to obtain the offset mapping relationship between the two. The offset mapping relationship is used to record the start and end positions of each subtext in the original medical text, so that after the entity extraction is completed, the entity positions in the subtext can be accurately restored to the original text. By establishing the offset mapping relationship, it is ensured that the entities extracted from different subtexts can be uniformly mapped to the same text coordinate system.
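The segmentation and offset mapping of paragraphs [0025] and [0026] can be sketched as follows. This is a character-level sketch under stated assumptions: the function name `sliding_windows` and the `window` and `overlap` parameters are illustrative, and a real system would more likely window over tokenizer tokens.

```python
def sliding_windows(text: str, window: int, overlap: int) -> list[tuple[int, int, str]]:
    """Split text into overlapping sub-texts.

    Each item is (start, end, sub_text), where start and end are character
    offsets in the original text; this list is the offset mapping.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + window, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

Storing the start offset alongside each sub-text is what later allows a position found inside a sub-text to be converted back to a position in the original record.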

[0027] The medical prompts are concatenated with multiple sub-texts to form the first input. The medical prompts indicate the target type and extraction rules for entity extraction, and may include entity category descriptions such as diseases, symptoms, examination items, examination indicators, drugs, and treatment methods, as well as attribute extraction requirements or format constraints. By concatenating the medical prompts with the sub-texts, the first engine can output according to the preset entity types and extraction specifications under the guidance of the prompts during the extraction process.
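One way to picture the concatenation is the sketch below. The `[SEP]` separator follows the BERT-style convention and is an assumption, as is the helper name `build_first_inputs`; the source does not state the exact input format.

```python
def build_first_inputs(prompt: str, sub_texts: list[str], sep: str = "[SEP]") -> list[tuple[str, int]]:
    """Prefix every sub-text with the medical prompt.

    Returns (model_input, sub_text_start) pairs, where sub_text_start is the
    index at which the sub-text begins inside model_input, so that predicted
    positions can be shifted back into sub-text coordinates.
    """
    prefix = prompt + sep
    return [(prefix + sub, len(prefix)) for sub in sub_texts]
```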

[0028] Using the first input and the preset first engine, an atomic entity set and an information extraction set are obtained, including: The first input is fed into the pre-trained language model encoder of the first engine to obtain the corresponding sequence feature representation. The first input is formed by concatenating medical prompt words with subtext. The encoder is used to perform contextual semantic modeling on the input text, mapping each character or word in the text into a vector representation containing semantic information. By modeling contextual information through the pre-trained language model, the model can comprehensively consider the context before and after the entity, thereby providing a semantic basis for subsequent entity boundary prediction.

[0029] Based on the sequence feature representation, the pointer generation network of the first engine predicts the start and end positions of entities to obtain the boundary information of candidate entities. The pointer generation network is used to point to text positions that may belong to entities in the sequence feature representation. It predicts the start and end positions of entities by calculating the probability distribution of each position in the sequence. In this way, entity boundaries are output in the form of pointer indices, thus directly corresponding to specific positions in the original input text, avoiding the boundary inconsistency problems that may arise from using the label sequence method.
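A minimal sketch of turning per-position start and end probabilities into candidate spans is given below. The nearest-end pairing rule, the `threshold` value, and the `max_len` cap are assumptions chosen for illustration; the source does not describe how start and end pointers are paired.

```python
def decode_spans(start_probs, end_probs, threshold=0.5, max_len=20):
    """Pair each confident start position with the nearest confident end position.

    Returns (start, end_exclusive, score) tuples, where score is the weaker of
    the two boundary probabilities.
    """
    spans = []
    for i, p_start in enumerate(start_probs):
        if p_start < threshold:
            continue
        # Search forward from the start position, up to max_len positions.
        for j in range(i, min(i + max_len, len(end_probs))):
            if end_probs[j] >= threshold:
                spans.append((i, j + 1, min(p_start, end_probs[j])))
                break
    return spans
```

Because the output is a pair of indices rather than a tag per token, each span maps directly to a substring of the input.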

[0030] Based on the offset mapping relationship, the boundary information of candidate entities is restored to the corresponding positions in the first data to be processed, thus obtaining an atomic entity set. Since the entity boundary information is predicted within the sub-text, by introducing the offset mapping relationship, the relative positions of candidate entities in the sub-text can be mapped to their absolute positions in the first data to be processed, thereby achieving a unified expression of entity results in different sub-texts. By deduplicating, merging, or handling conflicts on the mapped entity boundaries, a complete atomic entity set is formed.
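The restoration and deduplication step might look like the following sketch. It handles only exact duplicates produced by overlapping windows, keeping the higher score; merging of partially overlapping spans and other conflict handling are left out. The input layout and the name `restore_and_merge` are assumptions.

```python
def restore_and_merge(window_spans):
    """Map window-relative spans to absolute offsets and drop exact duplicates.

    window_spans: list of (window_start, spans), where spans is a list of
    (rel_start, rel_end, label, score) tuples predicted inside that window.
    """
    best = {}
    for window_start, spans in window_spans:
        for rel_start, rel_end, label, score in spans:
            key = (window_start + rel_start, window_start + rel_end, label)
            # The same entity found in two overlapping windows keeps its best score.
            if key not in best or score > best[key]:
                best[key] = score
    return sorted((s, e, label, score) for (s, e, label), score in best.items())
```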

[0031] Based on the set of atomic entities, and combined with the entity type or attribute information corresponding to the medical prompt words, an information extraction set is obtained. Specifically, according to the preset entity category constraints or attribute extraction requirements in the medical prompt words, the atomic entities are labeled with types, have attributes completed, or are structured to form attribute information or semantic description information corresponding to the entities. The information extraction set is used to represent the extraction results after type differentiation and attribute association are completed at the entity level, so as to facilitate subsequent logical reasoning or knowledge fusion processing.

[0032] A pre-defined second engine is used to perform logical reasoning on the second set of data to be processed and the set of atomic entities to obtain a set of logical conclusions. The second engine can be a rule-based reasoning, knowledge graph reasoning, or model-based reasoning engine. It is used to analyze and infer the potential relationships between entities based on atomic entities, combined with medical knowledge rules, contextual constraints, or prior logic, thereby obtaining a set of logical conclusions. The set of logical conclusions reflects the logical consistency, causal relationships, conditional relationships, or the results of judgments on the rationality of diagnosis and treatment between entities. In this embodiment, the pre-defined second engine is built based on a lightweight large model; it is used to complete semantic understanding and logical reasoning based on the entity extraction results. The lightweight large model, while ensuring basic language understanding and reasoning capabilities, reduces computational resource consumption through model parameter scale control, reasoning strategy optimization, or the introduction of task constraints, making it adaptable to the high-frequency call requirements in medical knowledge extraction scenarios. Examples include CodeQwen-1.5-7B and TinyLlama-1.1B.

[0033] The step of using a preset second engine to perform logical reasoning on the second data to be processed and the set of atomic entities to obtain a set of logical conclusions includes: Based on the preset focus rules in the second engine, the atomic entity set is filtered for relevance, retaining entities relevant to the reasoning task and generating a clue entity list. Focus rules are used to limit the scope of entities that logical reasoning needs to focus on; they can be set according to the reasoning task type, medical scenario, or business objective, such as prioritizing disease entities, abnormal examination indicators, or key treatment behaviors. By filtering the atomic entity set, interference from entities unrelated to the current reasoning task is reduced, improving the targeting and efficiency of subsequent logical reasoning.
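Relevance filtering with focus rules can be sketched as a lookup from reasoning task to allowed entity types. The task names, the entity type labels, and the dictionary layout of an entity are all hypothetical.

```python
# Hypothetical focus rules: reasoning task -> entity types worth keeping.
FOCUS_RULES = {
    "staging": {"disease", "exam_indicator"},
    "medication_review": {"drug", "disease"},
}


def clue_entities(atomic_entities, task):
    """Keep only the entities whose type is relevant to the given reasoning task."""
    allowed = FOCUS_RULES.get(task, set())
    return [e for e in atomic_entities if e["type"] in allowed]
```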

[0034] Based on the second data to be processed, the list of clue entities, and the preset task instructions, a second input is generated. The second data to be processed provides the original semantic context, the list of clue entities clarifies the core objects of logical reasoning, and the task instructions constrain the reasoning objective and the form of the reasoning output. By combining the above information, a second input adapted to the input format of the second engine is formed, enabling the lightweight large model to perform reasoning processing with a clear reasoning objective.
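A possible layout for the second input is sketched below. The template wording and the function name `build_second_input` are assumptions; the source only states that context, clue entities, and task instructions are combined.

```python
def build_second_input(instruction: str, context: str, clues: list[dict]) -> str:
    """Combine task instruction, clue entities, and source text into one prompt."""
    clue_lines = "\n".join(f"- {c['text']} ({c['type']})" for c in clues)
    return (
        f"Task: {instruction}\n\n"
        f"Clue entities:\n{clue_lines}\n\n"
        f"Text:\n{context}\n\n"
        "Answer:"
    )
```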

[0035] Logical reasoning is performed on the second input based on a preset second engine to obtain a set of logical conclusions. The set of logical conclusions represents the reasoning results derived from atomic entities and their contextual semantics, and may include entity relationship judgment results, semantic consistency judgment results, or medical logical rationality judgment results. In this embodiment, the logical reasoning performed on the second input based on the preset second engine to obtain the set of logical conclusions includes: Based on a pre-defined second engine, the second input is processed under pre-defined thought chain guidance conditions to generate a reasoning result containing intermediate reasoning processes. The thought chain guidance conditions guide the second engine to analyze the reasoning process according to pre-defined logical steps. These conditions can be implemented through implicit or explicit reasoning constraints in the task instructions. For example, the model might first identify key entities, then analyze the relationships between entities, and finally provide a reasoning conclusion. By introducing thought chain guidance, the second engine can explicitly or implicitly form intermediate reasoning processes when generating reasoning results, thereby improving the interpretability and stability of the reasoning results.

[0036] The inference result undergoes restricted decoding, and the output result is structurally parsed based on a predefined value range to extract the inference conclusions that meet the requirements, generating a logical conclusion set. Restricted decoding limits the output content of the second engine, ensuring it is generated only within a predefined candidate value range or output format, thus avoiding the uncertainty caused by free text output. The predefined value range may include a preset set of relation types, a set of judgment tags, or a set of logical states, etc. After restricted decoding, the output result is structurally parsed, extracting the content that meets the preset format and value requirements as structured inference conclusions. The logical conclusion set represents the inference result that has undergone logical reasoning and meets the constraints; it can be represented in the form of relation triples, judgment tags, or rule matching results, etc., to facilitate subsequent information fusion, consistency verification, or result storage.
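The structured parsing against a predefined value range can be sketched as below. Note the substitution: this is post-hoc filtering of generated text, a weaker stand-in for token-level constrained decoding, which would be enforced inside the model's generation loop. The `entity => label` line format and the labels in `ALLOWED` are hypothetical.

```python
import re

# Hypothetical value range for a presence judgment.
ALLOWED = {"present", "absent", "uncertain"}

# One conclusion per line, written as "entity => label".
LINE = re.compile(r"\s*(.+?)\s*=>\s*(\w+)\s*")


def parse_conclusions(raw_output: str) -> dict[str, str]:
    """Keep only well-formed lines whose label lies in the allowed value range."""
    conclusions = {}
    for line in raw_output.splitlines():
        m = LINE.fullmatch(line)
        if m and m.group(2).lower() in ALLOWED:
            conclusions[m.group(1)] = m.group(2).lower()
    return conclusions
```

Intermediate reasoning lines and out-of-range labels are simply dropped, so downstream fusion only sees values from the predefined set.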

[0037] The extracted information set and logical conclusion set are fused and verified to obtain passed and failed samples. Specifically, the entity extraction results and logical reasoning results can be cross-validated through consistency checks, conflict detection, or credibility assessments to classify the extracted samples into passed and failed samples. Passed samples indicate that the entity extraction results and logical reasoning results meet preset consistency or credibility requirements; failed samples indicate situations where entities are missing, entities are ambiguous, logically conflicting, or credibility is insufficient.

[0038] Furthermore, in this embodiment, the step of fusing and verifying the extracted information set and the logical conclusion set to obtain passing and failing samples includes: Based on the text content of entities in the information extraction set, the logical attributes in the logical conclusion set are attached to the corresponding entities to generate a structured fusion result. Specifically, the atomic entities in the information extraction set are used as the fusion subjects. According to the text content, location information or unique identifier of the entity, the logical attributes in the logical conclusion set related to the entity are matched and associated, thereby forming a structured fusion result at the entity level that simultaneously contains entity attribute information and logical reasoning attributes.
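Attaching logical attributes by entity text can be sketched as follows. The dictionary shapes and the `logic` field name are assumptions. Matching on text alone gives every mention of the same string the same attribute, which is why the paragraph above also allows matching on location or a unique identifier.

```python
def fuse(extracted: list[dict], conclusions: dict[str, str]) -> list[dict]:
    """Attach the logical attribute for each entity, matched by entity text."""
    fused = []
    for entity in extracted:
        merged = dict(entity)  # copy so the extraction result stays unchanged
        merged["logic"] = conclusions.get(entity["text"])  # None when nothing matched
        fused.append(merged)
    return fused
```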

[0039] Based on a preset rule validation list, linear rule validation is performed on each structured fusion result. If any validation rule is not met, the corresponding sample is marked as a failed sample; samples that meet all validation rules are marked as initially passed samples. The rule validation list describes the basic consistency constraints that should be met between the entity extraction results and the logical conclusions, which may include entity integrity validation, attribute rationality validation, or logical conflict validation. By linearly executing rule validation, results that clearly do not conform to medical common sense or logical constraints can be quickly filtered out.
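A rule list evaluated in order, stopping at the first violation, might be sketched like this. The three rules and the sample fields (`text`, `logic`, `dose`) are hypothetical examples of integrity and conflict checks, not rules taken from the source.

```python
def has_text(sample):
    return bool(sample.get("text"))


def has_logic(sample):
    return sample.get("logic") is not None


def no_dose_on_absent(sample):
    # An entity judged absent should not carry a dose attribute.
    return not (sample.get("logic") == "absent" and sample.get("dose"))


RULES = [
    ("entity_integrity", has_text),
    ("logic_present", has_logic),
    ("no_dose_on_absent", no_dose_on_absent),
]


def validate(sample):
    """Run the rules in list order; return (passed, name_of_first_failed_rule)."""
    for name, rule in RULES:
        if not rule(sample):
            return False, name
    return True, None
```

Returning the name of the failed rule keeps the reason for rejection available to later steps.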

[0040] The confidence information of the entities in the initially passed samples is obtained. The confidence information characterizes the credibility of the entity extraction and attribute attachment results. It can be derived from the confidence score output by the entity extraction model, the stability score of the logical reasoning results, or a weighted sum of the two. Introducing confidence information provides a quantitative basis for the subsequent refined verification.
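The weighted-sum option in [0040] can be written as a small function. The weight of 0.6 and the requirement that both scores lie in [0, 1] are assumptions for illustration; the source does not specify either.

```python
def combined_confidence(extraction_score, stability_score, extraction_weight=0.6):
    """Weighted sum of the extraction confidence and the reasoning stability score."""
    for value in (extraction_score, stability_score, extraction_weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores and weight must lie in [0, 1]")
    return (extraction_weight * extraction_score
            + (1.0 - extraction_weight) * stability_score)
```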

[0041] Based on the confidence information, a secondary verification is performed on the initially passed samples to obtain passed samples and failed samples, including: when the confidence level is greater than or equal to a preset confidence threshold, the corresponding sample is determined to be a passed sample; when the confidence level is less than the preset confidence threshold, a preset verification model is used for re-judgment. The verification model supplements the judgment of samples in low-confidence scenarios. It can judge based on simplified reasoning strategies or auxiliary features, thereby avoiding the direct discarding of potentially valid samples.

[0042] Based on the result of the re-judgment, each initially passed sample is determined to be a passed sample or a failed sample. This approach maintains sample accuracy while reducing false positives and false negatives, thereby improving the overall quality of the final passed samples.
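The secondary verification of [0041] and [0042] reduces to a threshold test with a fallback. In the sketch below the threshold of 0.8 is an illustrative value, and `toy_verifier` is a hypothetical stand-in for the preset verification model: it accepts a sample only if the entity text appears verbatim in its source context.

```python
def secondary_verify(sample, confidence, verifier, threshold=0.8):
    """Pass directly at or above the threshold; otherwise defer to the verifier."""
    if confidence >= threshold:
        return "passed"
    return "passed" if verifier(sample) else "failed"


def toy_verifier(sample):
    # Stand-in for the verification model (assumption, not the source's model).
    return sample["text"] in sample["context"]
```

A low-confidence sample is therefore never discarded on its score alone; it fails only if the verifier also rejects it.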

[0043] The passed samples are stored, and the failed samples are augmented and processed again until they become passed samples. Failed samples undergo data augmentation before re-entering the extraction process. Data augmentation may include sample re-labeling, context completion, rule correction, or model parameter adjustment to improve the extractability and decidability of the samples. Through repeated processing and verification of failed samples until they meet the passing conditions, the results of medical knowledge extraction are continuously optimized. This embodiment includes: the data of samples determined to be passed samples are stored in a target database for subsequent clinical knowledge base construction, clinical decision support, research analysis, or continuous model training. The passed samples serve as high-quality structured medical knowledge data that can be used directly.

[0044] Samples determined to be failed samples are de-identified using preset regular expression masking rules to generate seed samples. These rules mask sensitive information, such as patient personal information, identification numbers, or specific private medical data, while preserving the semantic content useful for entity extraction and logical reasoning tasks, thus producing seed samples that can be used for data augmentation.
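A minimal sketch of regular expression masking as described in [0044] follows. The three patterns (an 18-character identification number shape, an 11-digit mobile number shape, and an ISO-style date) and the placeholder tags are illustrative assumptions; the source does not list its masking rules.

```python
import re

# Each rule pairs a compiled pattern with the placeholder that replaces a match.
MASK_RULES = [
    (re.compile(r"\b\d{17}[\dXx]\b"), "[ID]"),        # 18-character ID number shape
    (re.compile(r"\b1[3-9]\d{9}\b"), "[PHONE]"),      # 11-digit mobile number shape
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "[DATE]"),  # YYYY-MM-DD date
]


def deidentify(text):
    """Apply the masking rules in order and return the de-identified seed text."""
    for pattern, tag in MASK_RULES:
        text = pattern.sub(tag, text)
    return text


seed = deidentify(
    "Patient ID 11010519900307123X, phone 13812345678, "
    "admitted 2025-01-15 with chest pain."
)
```

The clinical content ("chest pain") is untouched, which is the property [0044] asks for: only the sensitive spans are replaced.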

[0045] Based on a preset teacher model, the seed samples are augmented to generate synthetic samples, and corresponding pseudo-labels are generated for the synthetic samples. The teacher model can be a pre-trained large language model or a domain-specific knowledge model. By transforming the seed samples in various ways, such as paraphrasing, synonym replacement, and context expansion, synthetic samples are generated that are semantically consistent with the original samples but expressed differently. At the same time, a pseudo-label is assigned to each synthetic sample according to the output of the teacher model or the information of the original sample, and is used to guide subsequent model training or validation.
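The augmentation step in [0045] can be sketched with a toy teacher. Here a small synonym table stands in for the teacher model, which in the source may be a large language model; the table contents, the function names, and the pseudo-label format are all assumptions. Pseudo-labels are derived from the original sample's labels and re-located in the rewritten text.

```python
# Illustrative synonym table; a real teacher model would generate paraphrases.
SYNONYMS = {
    "high blood pressure": "hypertension",
    "heart attack": "myocardial infarction",
}


def toy_teacher(seed_text):
    """Return (new_text, replaced_phrase, synonym) for each applicable rewrite."""
    variants = []
    for phrase, synonym in SYNONYMS.items():
        if phrase in seed_text:
            variants.append((seed_text.replace(phrase, synonym), phrase, synonym))
    return variants


def augment(seed_text, seed_labels, teacher=toy_teacher):
    """seed_labels is a list of (entity_text, entity_type) pairs from the seed."""
    synthetic = []
    for text, old, new in teacher(seed_text):
        pseudo = []
        for ent_text, ent_type in seed_labels:
            surface = new if ent_text == old else ent_text
            start = text.find(surface)
            if start >= 0:  # keep only labels that can be located in the new text
                pseudo.append({"text": surface, "type": ent_type,
                               "start": start, "end": start + len(surface)})
        synthetic.append({"text": text, "pseudo_labels": pseudo})
    return synthetic
```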

[0046] Processing is performed based on the synthetic samples and pseudo-labels until the failed samples become passed samples, including: the synthetic samples are subjected to consistency verification and then input into the first engine to obtain prediction results; based on the degree of consistency between the prediction results and the pseudo-labels, first augmented samples are obtained. Specifically, the prediction results are compared with the pseudo-labels, and the synthetic samples that meet a preset consistency threshold are selected as the first augmented samples. The consistency check ensures the reliability of the augmented samples and the accuracy of their labels, and prevents low-quality generated samples from affecting model training.
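The selection in [0046] needs a concrete measure of agreement between predictions and pseudo-labels, which the source leaves open. The sketch below assumes Jaccard overlap over (entity text, entity type) pairs and an illustrative threshold of 0.8; `predict` is a hypothetical callable standing in for the first engine. The preceding consistency verification of the synthetic text itself is not covered.

```python
def consistency(predicted, pseudo):
    """Jaccard overlap between two collections of (text, type) pairs."""
    predicted, pseudo = set(predicted), set(pseudo)
    if not predicted and not pseudo:
        return 1.0  # two empty label sets agree trivially
    return len(predicted & pseudo) / len(predicted | pseudo)


def select_augmented(samples, predict, threshold=0.8):
    """Keep synthetic samples whose predictions agree with their pseudo-labels."""
    kept = []
    for sample in samples:
        score = consistency(predict(sample["text"]), sample["pseudo_labels"])
        if score >= threshold:
            kept.append(sample)
    return kept
```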

[0047] Based on the first augmented samples, the parameters of the first engine and the second engine are updated. Using the augmented samples and their pseudo-labels, the models are fine-tuned or incrementally trained to improve their extraction and reasoning ability on edge-case samples and in complex scenarios, so that they process the failed samples with higher accuracy.

[0048] The failed samples are reprocessed using the updated first and second engines until they become passed samples. Through multiple rounds of iterative processing, the handling of the failed samples is gradually improved until their extraction results are highly reliable and consistent with the logical reasoning results, at which point they are added to the set of passed samples.
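The reprocessing loop of [0047] and [0048] can be outlined as follows. The `update` and `process` callables are hypothetical stand-ins for the engine update and for the full extract, fuse, and verify pass. The `max_rounds` cap is an assumption added here so that the loop terminates; the source describes iterating until the samples pass and does not mention a cap.

```python
def reprocess_until_passed(failed, process, update, max_rounds=5):
    """Alternate engine updates and reprocessing; return (passed, still_failed)."""
    passed = []
    for _ in range(max_rounds):
        if not failed:
            break
        update(failed)  # e.g. fine-tune on augmented samples built from `failed`
        still_failed = []
        for sample in failed:
            (passed if process(sample) else still_failed).append(sample)
        failed = still_failed
    return passed, failed


class ToyPipeline:
    # Toy stand-in: each update raises a skill level; a sample passes once the
    # skill reaches its difficulty.
    def __init__(self):
        self.skill = 0

    def update(self, failed):
        self.skill += 1

    def process(self, sample):
        return sample["difficulty"] <= self.skill
```

Samples that remain failed after the final round are returned rather than dropped, so they can be routed to another handling path.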

[0049] Through the aforementioned closed-loop data augmentation and iterative optimization mechanism, this embodiment makes full use of the failed samples to improve model capability while ensuring the accuracy of extraction and reasoning, thereby forming an adaptive, high-quality medical knowledge extraction system and continuously improving the completeness and reliability of the overall knowledge base.

[0050] This invention distributes the medical data to be processed as tasks, with the first engine extracting entities and the second engine performing logical reasoning based on the atomic entity set, achieving collaborative processing of information extraction and logical analysis. By fusing and verifying the information extraction set and the logical conclusion set, it effectively reduces the false extractions and missed extractions caused by the complexity and ambiguity of medical text, improving the accuracy and consistency of the medical knowledge extraction results. In addition, samples that fail verification are fed into the data augmentation and reprocessing process, enabling the system to iteratively optimize for complex and edge-case samples, continuously improving extraction capability and reducing reliance on manual rule adjustment and annotation. This invention is applicable to various types of medical text data, such as electronic medical records, examination reports, and pathology reports, and offers good versatility, stability, and scalability, providing high-quality structured data support for medical knowledge base construction, clinical research analysis, and intelligent medical applications.

[0051] It should be understood that the specific embodiments described above are merely illustrative or explanatory of the principles of the invention and do not constitute a limitation thereof. Therefore, any modifications, equivalent substitutions, improvements, etc., made without departing from the spirit and scope of the invention should be included within the protection scope of the invention. Furthermore, the appended claims are intended to cover all variations and modifications falling within the scope and boundaries of the appended claims, or equivalent forms of such scope and boundaries.

[0052] The present invention has been described above with reference to embodiments thereof. However, these embodiments are merely illustrative and not intended to limit the scope of the invention. The scope of the invention is defined by the appended claims and their equivalents. Various substitutions and modifications can be made by those skilled in the art without departing from the scope of the invention, and all such substitutions and modifications should fall within the scope of the invention.

[0053] Although embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and modifications can be made to the embodiments of the present invention without departing from the spirit and scope of the invention.

[0054] Obviously, the above embodiments are merely illustrative examples for clear explanation and are not intended to limit the implementation. Those skilled in the art will recognize that other variations or modifications can be made based on the above description. It is neither necessary nor possible to exhaustively list all possible implementations here. However, obvious variations or modifications derived therefrom are still within the scope of protection of this invention.

Claims

1. A dual-engine collaborative general medical knowledge extraction method, characterized in that it comprises: performing task distribution on medical data to be processed to obtain first data to be processed and second data to be processed; using a preset first engine to extract entities from the first data to be processed to obtain an atomic entity set and an information extraction set; using a preset second engine to perform logical reasoning on the second data to be processed and the atomic entity set to obtain a logical conclusion set; fusing and verifying the information extraction set and the logical conclusion set to obtain passed samples and failed samples; and storing the passed samples, and performing data augmentation on the failed samples and processing them again until they become passed samples.

2. The dual-engine collaborative general medical knowledge extraction method according to claim 1, characterized in that the step of performing task distribution on the medical data to be processed to obtain the first data to be processed and the second data to be processed includes: acquiring the medical data to be processed, wherein the medical data to be processed is unstructured medical text data, including medical record text, examination record text, or treatment record text; and, based on a preset task distribution strategy, dividing the medical data to be processed into tasks to obtain the first data to be processed and the second data to be processed, wherein the first data to be processed is used for entity recognition and attribute extraction, and the second data to be processed is used for semantic relationship analysis and knowledge reasoning.

3. The dual-engine collaborative general medical knowledge extraction method according to claim 2, characterized in that the preset first engine is built on a prompt-based pointer generation network; and the step of using the preset first engine to extract entities from the first data to be processed to obtain the atomic entity set and the information extraction set includes: segmenting the first data to be processed based on a preset sliding window size to obtain multiple sub-texts; mapping the first data to be processed to the sub-texts to obtain an offset mapping relationship between them; concatenating medical prompt words with the multiple sub-texts to form a first input; and obtaining the atomic entity set and the information extraction set using the first input and the preset first engine.

4. The dual-engine collaborative general medical knowledge extraction method according to claim 3, characterized in that the step of obtaining the atomic entity set and the information extraction set using the first input and the preset first engine includes: feeding the first input into the pre-trained language model encoder of the first engine to obtain a corresponding sequence feature representation; based on the sequence feature representation, predicting the start and end positions of entities with the pointer generation network of the first engine to obtain boundary information of candidate entities; based on the offset mapping relationship, restoring the boundary information of the candidate entities to the corresponding positions in the first data to be processed to obtain the atomic entity set; and, based on the atomic entity set, combined with the entity type or attribute information corresponding to the medical prompt words, obtaining the information extraction set.

5. The dual-engine collaborative general medical knowledge extraction method according to claim 4, characterized in that the preset second engine is built on a lightweight large model; and the step of using the preset second engine to perform logical reasoning on the second data to be processed and the atomic entity set to obtain the logical conclusion set includes: based on preset focus rules in the preset second engine, filtering the atomic entity set for relevance and retaining the entities related to the reasoning task to generate a clue entity list; generating a second input based on the second data to be processed, the clue entity list, and preset task instructions; and performing logical reasoning on the second input with the preset second engine to obtain the logical conclusion set.

6. The dual-engine collaborative general medical knowledge extraction method according to claim 5, characterized in that the step of performing logical reasoning on the second input with the preset second engine to obtain the logical conclusion set includes: based on the preset second engine, generating from the second input, under preset chain-of-thought guidance conditions, a reasoning result that contains the intermediate reasoning process; and performing constrained decoding on the reasoning result, and structurally parsing the output based on a predefined value range to extract the reasoning conclusions that meet the requirements and generate the logical conclusion set.

7. The dual-engine collaborative general medical knowledge extraction method according to claim 6, characterized in that the step of fusing and verifying the information extraction set and the logical conclusion set to obtain passed samples and failed samples includes: based on the text content of the entities in the information extraction set, attaching the logical attributes in the logical conclusion set to the corresponding entities to generate structured fusion results; based on a preset rule validation list, performing linear rule validation on each structured fusion result, wherein if any validation rule is not met, the corresponding sample is marked as a failed sample, and a sample that meets all validation rules is marked as an initially passed sample; obtaining confidence information of the entities in the initially passed samples; and, based on the confidence information, performing secondary verification on the initially passed samples to obtain passed samples and failed samples.

8. The dual-engine collaborative general medical knowledge extraction method according to claim 7, characterized in that the step of performing secondary verification on the initially passed samples based on the confidence information to obtain passed samples and failed samples includes: when the confidence level is greater than or equal to a preset confidence threshold, determining the corresponding sample to be a passed sample; when the confidence level is less than the preset confidence threshold, performing re-judgment using a preset verification model; and, based on the result of the re-judgment, determining each initially passed sample to be a passed sample or a failed sample.

9. The dual-engine collaborative general medical knowledge extraction method according to claim 8, characterized in that the step of storing the passed samples, and performing data augmentation on the failed samples and processing them again until they become passed samples, includes: storing the data of samples determined to be passed samples in a target database; de-identifying the samples determined to be failed samples using preset regular expression masking rules to generate seed samples; augmenting the seed samples based on a preset teacher model to generate synthetic samples, and generating corresponding pseudo-labels for the synthetic samples; and performing processing based on the synthetic samples and the pseudo-labels until the failed samples become passed samples.

10. The dual-engine collaborative general medical knowledge extraction method according to claim 9, characterized in that the step of performing processing based on the synthetic samples and the pseudo-labels until the failed samples become passed samples includes: subjecting the synthetic samples to consistency verification and then inputting them into the first engine to obtain prediction results; obtaining first augmented samples based on the degree of consistency between the prediction results and the pseudo-labels; updating the parameters of the first engine and the second engine based on the first augmented samples; and reprocessing the failed samples using the updated first engine and second engine until they become passed samples.