A clinical data self-adaptive cleaning method and system

CN122266601A · Pending · Publication Date: 2026-06-23 · 联通数智医疗科技有限公司

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: 联通数智医疗科技有限公司
Filing Date: 2026-03-25
Publication Date: 2026-06-23

Application Information

Patent Timeline

25 Mar 2026

Application

23 Jun 2026

Publication

CN122266601A

IPC: G16H10/40; G16H10/60; G16H30/00; G06F16/215; G06F16/25; G06F16/28; G06F16/21; G06F16/2455

AI Tagging

Application Domain

Database management systems; Medical images

Technology Topics

Software engineering; Relational table

AI Technical Summary

Technical Problem

Existing clinical data cleaning technologies are inefficient, costly, poorly standardized, and hard to reuse, which severely restricts the integration, sharing, and value mining of clinical data.

Method used

A multi-layered standardized database is constructed, including a clinical data standard library and a business dictionary library. A mapping table between metadata and the target data model is automatically generated, and a data cleaning script is generated based on this. Automated data cleaning is achieved by using an intelligent mapping model and a knowledge graph of cleaning operations.

Benefits of technology

The method significantly shortens the data cleaning cycle for a new system, improves data processing efficiency, significantly reduces implementation and maintenance costs, and improves the automation and accuracy of data cleaning.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

The application discloses a clinical data self-adaptive cleaning method and system. The method comprises the following steps: constructing a standardized database; accessing a target business system, and automatically generating, based on the metadata dictionary, a mapping relationship table from the metadata of the target business system to the standard data of the target data model; automatically generating a corresponding data cleaning script based on the mapping relationship table; and executing the data cleaning script to clean and convert the data of the target business system into standard data conforming to the target data model. By reusing the knowledge base, the application greatly shortens the data cleaning cycle for a new system and greatly improves data processing efficiency; by automating the process, the application reduces the manual workload of dictionary analysis, mapping analysis, script coding, and the like, and significantly reduces the implementation and maintenance costs of data cleaning.

Description

Technical Field

[0001] This invention relates to the field of medical information technology and data governance, specifically to an adaptive cleaning method and system for clinical data.

Background Technology

[0002] With the deepening development of medical informatization, hospitals have widely deployed various business systems such as HIS, LIS, RIS, and EMR. These systems come from different vendors and have undergone multiple version iterations, resulting in severe fragmentation and heterogeneity of clinical data, lacking unified standards in data format, field definition, and coding rules. However, existing data cleaning technologies generally suffer from low efficiency, high cost, low standardization, and poor reusability, severely restricting the integration, sharing, and value extraction of clinical data.

Summary of the Invention

[0003] The main objective of this application is to provide an adaptive clinical data cleaning method and system that aims to solve at least one of the above-mentioned technical problems.

[0004] The first aspect of this application provides an adaptive cleaning method for clinical data, including the following steps:
Step S1: Construct a standardized database, wherein the standardized database has a multi-layer structure and includes at least a clinical data standard library and a business dictionary library. The clinical data standard library is used to store target data models based on clinical data standard specifications, and the business dictionary library is used to store metadata dictionaries from different business systems;
Step S2: Access the target business system and, based on the metadata dictionary, automatically generate a mapping table from the metadata of the target business system to the standard data of the target data model;
Step S3: Based on the mapping table, automatically generate the corresponding data cleaning script;
Step S4: Execute the data cleaning script to clean and convert the data of the target business system into standard data that conforms to the target data model.

[0005] In some embodiments of this application, the standardized database further includes a mapping relationship library and a standard cleaning script library; the mapping relationship library is used to store confirmed mapping relationship pairs from metadata fields to standard data fields, and the standard cleaning script library is used to store standardized script templates corresponding to cleaning operations.

[0006] In some embodiments of this application, step S2 includes:
Step S21: Automatically extract the metadata of the target business system, and perform business semantic parsing on the metadata fields of the target business system in conjunction with the business dictionary library;
Step S22: Based on the semantic parsing results and according to the preset matching rules, determine whether there is a mapping relationship in the mapping relationship library that matches the metadata field of the target business system. If there is, the mapping relationship is directly used as the first candidate mapping relationship;
Step S23: If no matching mapping relationship exists, a second candidate mapping relationship is generated through the intelligent mapping model, wherein the intelligent mapping model is an artificial intelligence model that takes the metadata fields of the target business system as input and the mapping relationship between the metadata fields of the target business system and the standard data fields of the target data model as output;
Step S24: Generate the mapping table based on the first candidate mapping relationship and / or the second candidate mapping relationship.

[0007] In some embodiments of this application, the intelligent mapping model includes a field matching model that integrates multimodal features. By analyzing at least two of the semantic features, structural features, data distribution features, and contextual features of metadata fields, it comprehensively calculates the matching confidence between the metadata fields of the target business system and the standard data fields of the target data model to generate candidate mapping relationships, wherein:
Semantic features are semantic vectors generated from field names and business annotations;
Structural features are encoding vectors of data type and length constraints;
Data distribution features are statistical feature vectors obtained by sampling and analyzing field values;
Contextual features are context vectors generated from the data table to which a field belongs and its business domain.

[0008] In some embodiments of this application, depending on the preset threshold range in which the matching confidence falls, the field matching model that integrates multimodal features either automatically generates and outputs the second candidate mapping relationship, or outputs the generated second candidate mapping relationship only after manual review. After step S2, the method further includes using the result of the manual review as a new training sample for the field matching model that integrates multimodal features, and updating the model parameters in real time or periodically.
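
As a non-limiting illustration, the confidence calculation of [0007] and the threshold-based routing of [0008] can be sketched in Python as follows. The per-feature similarity scores, the weights, and the threshold value are hypothetical values chosen for this example; this application does not specify them, and a trained model would replace the fixed weights.

```python
from typing import Dict

# Hypothetical weights for the four feature types named in [0007].
WEIGHTS: Dict[str, float] = {
    "semantic": 0.4,
    "structural": 0.2,
    "distribution": 0.2,
    "context": 0.2,
}
AUTO_OUTPUT_THRESHOLD = 0.85  # hypothetical preset threshold


def fuse_confidence(scores: Dict[str, float]) -> float:
    """Weighted average over the feature scores supplied (at least two)."""
    if len(scores) < 2:
        raise ValueError("at least two feature types are required")
    total_weight = sum(WEIGHTS[name] for name in scores)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in scores.items())
    return weighted_sum / total_weight


def route(confidence: float) -> str:
    """Output automatically at or above the threshold, else send to manual review."""
    if confidence >= AUTO_OUTPUT_THRESHOLD:
        return "auto_output"
    return "manual_review"
```

In this sketch, a candidate routed to "manual_review" would later yield a reviewer decision that can be stored as a training sample, which corresponds to the model update described in [0008].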

[0009] In some embodiments of this application, step S3 includes:
Step S31: Construct a knowledge graph of cleaning operations, wherein the nodes of the knowledge graph include data objects, cleaning operators, and business rules, and the edges of the knowledge graph define the relationships between nodes;
Step S32: Determine whether there is a historical cleaning script that matches the vendor and version information of the target business system;
Step S33: If it exists, perform the following steps:
Step S331: Decompile the historical cleaning script into a first knowledge subgraph, and model the data pattern of the target business system into a second knowledge subgraph;
Step S332: Calculate the structural differences between the first knowledge subgraph and the second knowledge subgraph;
Step S333: Based on the structural differences and the knowledge graph of cleaning operations, modify the historical cleaning script to form the data cleaning script;
Step S34: If it does not exist, perform the following steps:
Step S341: Convert the mapping relationships in the mapping relationship table into queries on the knowledge graph of cleaning operations;
Step S342: Infer the required data cleaning operators and their execution sequence from the knowledge graph of cleaning operations;
Step S343: Based on the reasoning path, select and assemble script templates from the standard cleaning script library to form the data cleaning script.
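
As a non-limiting illustration, steps S342 and S343 can be sketched in Python as follows. The sketch reduces the knowledge graph of cleaning operations to a dependency map between operators and orders them with a topological sort; the operator names, the dependency edges, and the SQL templates are hypothetical and only stand in for the knowledge graph and the standard cleaning script library. The query of step S341 is simplified to a list of required operators.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set

# Hypothetical edges: operator -> operators that must run before it.
OPERATOR_DEPENDENCIES: Dict[str, Set[str]] = {
    "trim_whitespace": set(),
    "uppercase_code": {"trim_whitespace"},
    "count_nulls": {"uppercase_code"},
}

# Hypothetical templates standing in for the standard cleaning script library.
TEMPLATES: Dict[str, str] = {
    "trim_whitespace": "UPDATE {table} SET {column} = TRIM({column});",
    "uppercase_code": "UPDATE {table} SET {column} = UPPER({column});",
    "count_nulls": "SELECT COUNT(*) FROM {table} WHERE {column} IS NULL;",
}


def infer_execution_order(required: Iterable[str]) -> List[str]:
    """S342 (sketch): collect prerequisites, then order operators topologically."""
    needed: Set[str] = set()
    stack = list(required)
    while stack:
        operator = stack.pop()
        if operator not in needed:
            needed.add(operator)
            stack.extend(OPERATOR_DEPENDENCIES[operator])
    graph = {operator: OPERATOR_DEPENDENCIES[operator] for operator in needed}
    return list(TopologicalSorter(graph).static_order())


def assemble_script(required: Iterable[str], table: str, column: str) -> List[str]:
    """S343 (sketch): fill one template per operator, in execution order."""
    return [
        TEMPLATES[operator].format(table=table, column=column)
        for operator in infer_execution_order(required)
    ]
```

Requesting only the last operator in the chain pulls in its prerequisites, so the assembled script always runs operators in a dependency-respecting order.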

[0010] In some embodiments of this application, the method further includes updating the newly generated mapping relationship to the mapping relationship library; and updating the newly generated data cleaning script to the standard cleaning script library.

[0011] In some embodiments of this application, step S3 is followed by a step of optimizing the newly generated data cleaning script: Step S401: Receive script modification instructions input by technicians via natural language; Step S402: Parse the natural language instructions and convert them into specific modification operations for the data cleaning script; Step S403: Update the data cleaning script according to the modification operation.

[0012] In some embodiments of this application, after step S4, the method further includes: performing a multi-dimensional quality assessment on the cleaned standard data; when the assessment result does not meet the preset standard, analyzing the cleaning execution log to locate the cleaning step that caused the quality problem and generating a correction suggestion.
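
As a non-limiting illustration, the quality assessment and problem location of [0012] can be sketched in Python as follows. The two dimensions shown (completeness and validity), the minimum values, and the structure of the execution log entries are assumptions made for this example; this application does not enumerate the assessment dimensions or the log format.

```python
from typing import Dict, List, Optional, Set


def assess_quality(values: List[Optional[str]], allowed: Set[str]) -> Dict[str, float]:
    """Compute completeness (non-null share) and validity (allowed-value share)."""
    total = len(values)
    non_null = [value for value in values if value is not None]
    completeness = len(non_null) / total if total else 0.0
    valid_count = sum(1 for value in non_null if value in allowed)
    validity = valid_count / len(non_null) if non_null else 0.0
    return {"completeness": completeness, "validity": validity}


def failing_dimensions(metrics: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Return the dimensions whose score is below the preset minimum."""
    return [name for name, minimum in minimums.items() if metrics[name] < minimum]


def locate_steps(execution_log: List[Dict[str, str]], field: str) -> List[str]:
    """Return the logged cleaning steps that touched the given field."""
    return [entry["step"] for entry in execution_log if entry["field"] == field]
```

When a dimension fails, the steps returned by `locate_steps` are the candidates for which a correction suggestion would be generated.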

[0013] Another aspect of this application provides a clinical data adaptive cleaning system for implementing the aforementioned clinical data adaptive cleaning method, the system comprising:
A knowledge base management module, used to build and maintain a standardized database, wherein the standardized database has a multi-layer structure and includes at least a clinical data standard library and a business dictionary library. The clinical data standard library is used to store target data models based on clinical data standard specifications, and the business dictionary library is used to store metadata dictionaries from different business systems;
An intelligent mapping module, communicatively connected to the knowledge base management module and used to access the target business system and automatically generate, based on the metadata dictionary, a mapping relationship table from the metadata of the target business system to the standard data of the target data model;
A script generation and execution module, communicatively connected to the intelligent mapping module and used to automatically generate corresponding data cleaning scripts based on the mapping relationship table and execute cleaning operations;
A quality assessment and closed-loop learning module, communicatively connected to the intelligent mapping module, the script generation and execution module, and the knowledge base management module, and used to assess the quality of the cleaned standard data and update the newly generated mapping relationship and the newly generated data cleaning script to the standardized database.

[0014] The embodiments of this application include at least the following beneficial effects: by reusing the knowledge base, the data cleaning cycle of the new system is greatly shortened, and the data processing efficiency is greatly improved; by automating the process, the amount of manual work in dictionary sorting, mapping analysis, script coding, etc. is reduced, and the implementation and maintenance costs of data cleaning are significantly reduced.

[0015] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.

Description of the Drawings

[0016] To more clearly illustrate the technical solutions of the embodiments of this application, the relevant drawings of the embodiments of this application are described below. It should be understood that the drawings described below are only for the convenience of clearly describing some embodiments of the technical solutions of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a flowchart of the adaptive cleaning method for clinical data provided in the embodiments of this application;
Figure 2 is an exemplary physical hardware environment for the operation of the adaptive clinical data cleaning method provided in the embodiments of this application;
Figure 3 is a detailed flowchart of the mapping relationship generation steps in the embodiments of this application;
Figure 4 is a detailed flowchart of the cleaning script generation steps in the embodiments of this application;
Figure 5 is a detailed flowchart of the optimization of the newly generated data cleaning script in the embodiments of this application.

Detailed Implementation

[0018] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit the scope of this application. In the following description, when referring to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with those of this application; they are merely examples of circuits, systems, apparatus, and methods consistent with some aspects of the embodiments of this application as detailed in the appended claims.

[0019] It is understood that the terms "first," "second," etc., used in this application may be used to describe various technical features, but unless otherwise specified, these technical features are not limited by these terms. These terms are only used to distinguish one technical feature from another and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. For example, without departing from the scope of the embodiments of this application, a first element may also be referred to as a second element, and similarly, a second element may also be referred to as a first element.

[0020] Unless otherwise defined, the technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0021] As used in this application, "at least one" means one, two, or more; "multiple" means two or more; "each" refers to every one of the corresponding multiple items; and "any" refers to any one of the multiple items.

[0022] It should be understood that the terms "center," "longitudinal," "lateral," etc., indicate the orientation or positional relationship based on the accompanying drawings, and are used only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the elements referred to must have a specific orientation, or be constructed and operated in a specific orientation. The term "and / or" includes any and all combinations of one or more of the related listed items. Those skilled in the art can understand the specific meaning of the above terms in this application according to the specific circumstances.

[0023] The following describes in detail, with reference to the accompanying drawings, a clinical data adaptive cleaning method and system provided in the embodiments of this application.

[0024] Before providing a detailed description of the embodiments of this application, some of the nouns and terms involved in the embodiments of this application will be explained first. The nouns and terms involved in the embodiments of this application are subject to the following interpretations.

[0025] CDR (Clinical Data Repository) specification: A universal and core best practice specification in the field of medical informatics, which can be understood as a methodology and standard system for building clinical data centers that is widely recognized in the industry.

[0026] HIS (Hospital Information System): The core platform for hospital operation and management, mainly responsible for handling basic business processes such as administrative, financial and clinical affairs, including patient registration, billing, inpatient registration, and drug management.

[0027] LIS (Laboratory Information System): A business system designed specifically for hospital laboratories. It is mainly responsible for the entire process management of test samples, from test application, sample receipt, result entry to report review and release, ensuring the accuracy and traceability of test data.

[0028] RIS (Radiology Information System): A business system designed specifically for medical imaging departments (such as radiology and ultrasound departments) to manage the entire process of imaging examinations, including examination scheduling, equipment and personnel scheduling, patient tracking, and image report generation and storage.

[0029] EMR (Electronic Medical Record): A patient-centered clinical information recording system that comprehensively digitizes patients' medical history, diagnoses, medical orders, examination and test results, progress notes, etc., replacing traditional paper medical records.

[0030] PACS (Picture Archiving and Communication System): The working system of the radiology department, specifically used for the storage, transmission, management and display of medical images (such as CT, MRI, X-ray).

[0031] AI / ML Model Library: A component that integrates multiple pre-trained machine learning models, which can be called on demand to complete intelligent tasks such as classification, prediction, and semantic understanding.

[0032] Data Processing and Scheduler: This is a traditional, deterministic program module.

[0033] Rule engine: This is a mature software component (such as the open-source framework Drools or a self-developed rule matching program).

[0034] With the deepening development of medical informatization, hospitals have widely deployed various business systems such as HIS, LIS, RIS, and EMR. These systems come from different vendors and have undergone multiple version iterations, resulting in severe fragmentation and heterogeneity of clinical data, lacking unified standards in data format, field definition, and coding rules. However, existing data cleaning technologies generally suffer from low efficiency, high cost, low standardization, insufficient intelligence, and poor reusability, severely restricting the integration, sharing, and value extraction of clinical data.

[0035] The first aspect of this application provides an adaptive clinical data cleaning method for cleaning clinical data from medical information systems of different vendors and versions, thereby converting the clinical data into data that meets the requirements. See Figure 1. The method includes the following steps:
S1. Construct a standardized database, wherein the standardized database has a multi-layer structure and includes at least a clinical data standard library and a business dictionary library. The clinical data standard library is used to store target data models based on clinical data standard specifications, and the business dictionary library is used to store metadata dictionaries from different business systems;
S2. Access the target business system and, based on the metadata dictionary, automatically generate a mapping table from the metadata of the target business system to the standard data of the target data model;
S3. Based on the mapping table, automatically generate the corresponding data cleaning script;
S4. Execute the data cleaning script to clean and convert the data of the target business system into standard data that conforms to the target data model.
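
As a non-limiting illustration, steps S3 and S4 can be sketched in Python as follows. The sketch assumes a mapping table reduced to pairs of source column and standard column, a single TRIM cleaning operator, and an SQLite connection standing in for the execution environment; the function names and any table or column names passed to them are hypothetical and do not come from this application.

```python
import sqlite3
from typing import Dict


def generate_cleaning_sql(source_table: str, target_table: str,
                          column_mapping: Dict[str, str]) -> str:
    """Step S3 (sketch): turn a column mapping into an INSERT ... SELECT script.

    Identifiers are interpolated directly, so they are assumed to come from a
    trusted mapping table rather than from user input.
    """
    target_columns = ", ".join(column_mapping.values())
    select_columns = ", ".join(
        f"TRIM({source}) AS {target}" for source, target in column_mapping.items()
    )
    return (
        f"INSERT INTO {target_table} ({target_columns})\n"
        f"SELECT {select_columns}\n"
        f"FROM {source_table};"
    )


def execute_script(connection: sqlite3.Connection, script: str) -> None:
    """Step S4 (sketch): run the generated script against the database."""
    connection.executescript(script)
    connection.commit()
```

For a mapping such as `{"PATIENT_ID": "patient_id", "SEX_CODE": "gender"}`, the generated script copies the trimmed source columns into the correspondingly named columns of the target table.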

[0036] The clinical data standard library is based on nationally or industry-recognized Clinical Data Repository (CDR) specifications. It structurally integrates core business domains such as patient visits, costs, medical orders, examinations, laboratory tests, surgical anesthesia, and medical record summaries. For each business domain it defines a standard data model, including the field names, data types, primary and foreign key relationships, security levels, and business verification rules of key data tables (e.g., patient information tables and medical order record tables), as well as the authoritative standard coding systems used (such as ICD-10, LOINC, and SNOMED CT). This forms the target data model for standardized medical data. As the standardized target of cleaning and transformation, the clinical data standard library stores structured, compliant standard data rather than raw business data. It is implemented as a series of interrelated, well-defined data tables.

[0037] For example, the clinical data standard library adopts structured storage, where data is stored in the form of well-defined tables and fields, rather than free text, and uses standardized encoding. Key medical concepts (such as diagnosis, examination items, and clinical manifestations) are encoded using international / industry standards such as ICD-10, LOINC, and SNOMED CT, achieving global or industry-wide semantic uniformity. The business logic relationships between data are strictly maintained through primary and foreign keys to ensure data integrity, and fields include metadata definitions such as data type, value range, and validation rules to guarantee data quality.

[0038] The adaptive clinical data cleaning method provided in this application can clean and transform business data from various sources (such as metadata in the HIS system) into structured, high-quality standard data that conforms to the above-mentioned example clinical data standard library.

[0039] The business dictionary library systematically collects and organizes the data dictionaries of mainstream domestic and international HIS, LIS, and EMR vendors (such as Neusoft Group, Winning Health, and Donghua Software) across their different system versions. Missing or ambiguous annotations in a dictionary are completed and standardized manually. The business dictionary library records the vendor identifier, system version, business domain, physical field name, data type, business meaning description, and local encoding rules for each field, establishing a metadata dictionary that covers a wide range of heterogeneous systems. For example, the business dictionary library stores metadata dictionary entries in a structured table format, with each entry uniquely describing a data field in a specific vendor and version of a business system. For instance, for the HIS system V5.0 of vendor "Neusoft Group", the PATIENT_ID field in its PAT_MASTER_INDEX table is recorded as: data type VARCHAR(18), with a manually completed business meaning description of "unique patient identifier", and its local encoding rule is noted as "18-digit number, following the ID card number rule". Meanwhile, for the EMR system V3.2 of vendor "Winning Health", the DIAG_NAME field in its EMR_DIAGNOSIS table is recorded as: data type VARCHAR(100), business meaning described as "diagnosis name text description", and local encoding rule as "free text". Through systematic collection and standardization, this dictionary establishes a correspondence between field names and precise business semantics for various heterogeneous systems.
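
As a non-limiting illustration, the two metadata dictionary entries described above can be represented in Python as follows. The class name, the attribute names, the composite lookup key, and the business domain values ("patient", "diagnosis") are assumptions made for this sketch; the vendor, version, table, field, data type, meaning, and encoding rule values are taken from the example in [0039].

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

EntryKey = Tuple[str, str, str, str, str]


@dataclass(frozen=True)
class DictionaryEntry:
    vendor: str
    system: str
    version: str
    business_domain: str  # not given in the example; values below are assumed
    table_name: str
    field_name: str
    data_type: str
    meaning: str
    local_encoding_rule: str

    @property
    def key(self) -> EntryKey:
        """Composite key that identifies one field of one vendor system version."""
        return (self.vendor, self.system, self.version, self.table_name, self.field_name)


EXAMPLE_ENTRIES = [
    DictionaryEntry("Neusoft Group", "HIS", "V5.0", "patient", "PAT_MASTER_INDEX",
                    "PATIENT_ID", "VARCHAR(18)", "unique patient identifier",
                    "18-digit number, following the ID card number rule"),
    DictionaryEntry("Winning Health", "EMR", "V3.2", "diagnosis", "EMR_DIAGNOSIS",
                    "DIAG_NAME", "VARCHAR(100)", "diagnosis name text description",
                    "free text"),
]


def build_index(entries: Iterable[DictionaryEntry]) -> Dict[EntryKey, DictionaryEntry]:
    """Index entries by composite key so a field can be looked up directly."""
    return {entry.key: entry for entry in entries}
```

Keying each entry by vendor, system, version, table, and field reflects the statement that each entry uniquely describes one data field in a specific vendor and version of a business system.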

[0040] This application significantly shortens the data cleaning cycle of the new system by reusing the knowledge base, and greatly improves the data processing efficiency. By automating the process, it reduces the amount of manual work in dictionary sorting, mapping analysis, script coding, etc., and significantly reduces the implementation and maintenance costs of data cleaning.

[0041] In this embodiment, the standardized database further includes a mapping relationship library and a standard cleaning script library; the mapping relationship library is used to store verified mapping relationship pairs from metadata fields to standard data fields, and the standard cleaning script library is used to store standardized script templates corresponding to cleaning operations. For example, the standardized database can be constructed as a series of related tables in a relational database (such as MySQL) or a graph database (such as Neo4j).

[0042] The mapping relationship library stores mapping pairs of verified metadata fields to fields in standard data, such as "Source System: HIS_V5.0, Source Field: PATIENT_SEX, Standard Field: Gender, Mapping Status: Confirmed". The standard cleaning script library stores script templates. The scripts support multiple implementation languages such as SQL and Python, and are associated with specific vendor, system version, and business domain information, forming a "script component library" that can be quickly retrieved and combined.

[0043] Figure 2 shows an exemplary physical hardware environment for the adaptive clinical data cleaning method provided in this application. The environment adopts a layered architecture. The top layer is a business system network containing systems such as HIS, LIS, EMR, and PACS, which serves as the data source. Data is encrypted in transit and reaches the core data processing platform over a high-speed internal network (bandwidth ≥ 1 Gbps) after passing through a security layer equipped with firewalls, intrusion detection, and strict access control and auditing mechanisms. The platform's core computing power is provided by an application server cluster and a knowledge base management server, and it relies on a distributed storage system (10TB main memory) and a storage node cluster to support the multi-layered standardized database, including a CDR clinical data standard library, a business dictionary library, a mapping relationship library, and a standard cleaning script library. In addition, a dedicated technical management terminal is provided for system maintenance, monitoring, and report generation. The entire hardware environment is interconnected through standard network protocols, forming a secure, efficient, and scalable data processing infrastructure that supports the stable operation of the adaptive cleaning and quality optimization method. It is understood that the above example hardware environment is only one typical environment in which the method of this application can run; the method is also applicable to physical hardware environments consisting of a single high-performance server, a cloud-based virtualized computing instance, or a containerized microservice cluster.

[0044] See Figure 3. Optionally, in this embodiment of the application, step S2 includes:
Step S21: Automatically extract the metadata of the target business system, and perform business semantic parsing on the metadata fields of the target business system in conjunction with the business dictionary library;
Step S22: Based on the semantic parsing results and according to the preset matching rules, determine whether there is a mapping relationship in the mapping relationship library that matches the metadata field of the target business system. If there is, directly use the mapping relationship as the first candidate mapping relationship;
Step S23: If no matching mapping relationship exists, a second candidate mapping relationship is generated through the intelligent mapping model, wherein the intelligent mapping model is an artificial intelligence model that takes the metadata fields of the target business system as input and the mapping relationship between the metadata fields of the target business system and the standard data fields of the target data model as output;
Step S24: Generate the mapping table based on the first candidate mapping relationship and / or the second candidate mapping relationship.

[0045] For example, when determining, based on the semantic parsing results and the preset matching rules, whether the mapping relationship library contains a mapping relationship that matches a metadata field of the target business system, matching can use the vendor and version number of the target business system as conditions, checking for historical mapping records with the same vendor and version. Alternatively, matching can first check whether the mapping relationship library contains a field with the same name and then check whether that field has the same data type; if both conditions are met, the mapping relationship is a match. The matching rules are fixed and pre-configured, and the rule engine executes them to perform automated matching. After a successful match, the system running the method provided in this application directly outputs a highly reliable mapping relationship.
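
As a non-limiting illustration, the second matching rule described above (same field name, then same data type) can be sketched in Python as follows. The record layout and the function name are assumptions made for this sketch; the example record is the one given for Scenario 1 in [0047].

```python
from typing import Iterable, NamedTuple, Optional


class MappingRecord(NamedTuple):
    vendor: str
    version: str
    table: str
    field: str
    data_type: str
    standard_model: str
    standard_entity: str
    standard_field: str


MAPPING_LIBRARY = [
    MappingRecord("Donghua Software", "V5.0", "PATIENT", "PATIENT_ID", "VARCHAR(18)",
                  "Standard Model", "Patient Information", "Patient Unique Identifier"),
]


def match_by_rules(library: Iterable[MappingRecord], field: str,
                   data_type: str) -> Optional[MappingRecord]:
    """Return the first record whose field name and data type both match.

    A returned record is the first candidate mapping relationship (step S22);
    None means the intelligent mapping model of step S23 is needed.
    """
    for record in library:
        if record.field == field and record.data_type == data_type:
            return record
    return None
```

A production rule engine would evaluate configurable rules rather than a hard-coded comparison; the fixed condition here only mirrors the example rule.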

[0046] For example, generating the mapping table based on the first candidate mapping relationship and / or the second candidate mapping relationship may involve using only the first candidate mapping relationship to form the mapping table, or using only the second candidate mapping relationship to form the mapping table, or merging the first candidate mapping relationship and the second candidate mapping relationship to form the mapping table.
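
As a non-limiting illustration, the three ways of forming the mapping table in step S24 can be sketched in Python as follows. The tuple layouts and the function name are assumptions made for this sketch, as is the precedence rule: when both candidate sets contain the same source field, the confirmed first candidate is kept. This application does not state a precedence, since step S23 only runs for fields without a match.

```python
from typing import Dict, Tuple

SourceField = Tuple[str, str, str, str]  # (vendor, version, table, field)
StandardField = Tuple[str, str, str]     # (model, entity, field)
Candidates = Dict[SourceField, StandardField]


def build_mapping_table(first: Candidates, second: Candidates) -> Candidates:
    """Merge candidate mappings; either argument may be empty."""
    table: Candidates = dict(second)
    table.update(first)  # assumed precedence: rule-matched mappings win
    return table
```

Passing an empty dictionary for either argument yields a table formed from the other candidate set alone, which covers the "only the first" and "only the second" cases.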

[0047] The following is an example: Scenario 1: Target hospital A has newly connected to a "Donghua Software HIS V5.0" system. The processing flow is as follows: Step S21 (Parsing): The system running the method provided in this application parses out that the PATIENT_ID field exists in the PATIENT table of the newly connected business system, and that the data type of the field is VARCHAR(18). Step S22 (Rule-based matching): Assume that the preset matching rule is to first determine whether the mapping relationship library contains a field with the same name, and then determine whether that field has the same data type. If both conditions are met, the mapping relationship is determined to be a match. Using the field name "PATIENT_ID" and the data type "VARCHAR(18)", a matching mapping record is successfully found: (Donghua Software, V5.0, PATIENT, PATIENT_ID, VARCHAR(18)) -> (Standard Model, Patient Information, Patient Unique Identifier). The system running the method provided in this application adopts this mapping relationship as the first candidate mapping relationship. The process jumps to step S24, and a mapping relationship table is formed from the first candidate mapping relationship. Step S23 (Intelligent generation) is skipped.

[0048] Scenario 2: Target hospital B has integrated a brand new "ABC Technology EMR V2.0" system (no record of this vendor's version exists in the mapping database). The processing flow is as follows: Step S21 (Parsing): The system running the method provided in this application parses the PATIENT_INFO table of the newly integrated business system and finds a SEX_CODE field (comment "gender code", value range {'M', 'F'}). Using a business dictionary and natural language processing technology, the system parses the business semantics of this field as "patient's physiological gender". Step S22 (Rule-based Matching): The system running the method provided in this application searches the mapping database and confirms that there is no mapping relationship matching "SEX_CODE". Therefore, the first candidate mapping relationship is empty. Step S23 (Generation): The system running the method provided in this application starts the intelligent mapping model. The intelligent mapping model ultimately outputs the second candidate mapping relationship: (ABC Technology, V2.0, PATIENT_INFO, SEX_CODE) -> (Standard Model, Patient Information, Gender). Step S24 (Merging and Output): Since the first candidate mapping relationship is empty, a mapping relationship table is formed from the second candidate mapping relationship.

[0049] Scenario 3: Target hospital B has integrated a brand new "ABC Technology EMR V2.0" system (no record of this vendor version exists in the mapping database). The processing flow is as follows: Step S21 (Parsing): The system running the method provided in this application parses the newly integrated business system's PATIENT table to find the PATIENT_ID field and the PATIENT_INFO table to find the SEX_CODE field (comment "gender code", value range {'M', 'F'}). Through the business dictionary and natural language processing technology, the business semantics of the SEX_CODE field are parsed as "patient's physiological gender". Step S22 (Rule-based matching): Determine whether there is a mapping relationship matching "PATIENT_ID" and "SEX_CODE". A matching mapping relationship is successfully found: (Donghua Software, V5.0, PATIENT, PATIENT_ID) -> (Standard Model, Patient Information, Patient Unique Identifier), and it is confirmed that there is no mapping relationship matching "SEX_CODE". The system running the method provided in this application adopts the mapping relationship of the PATIENT_ID field as the first candidate mapping relationship. Step S23 (Generation): For the SEX_CODE field, the system running the method provided in this application starts the intelligent mapping model. The intelligent mapping model outputs the second candidate mapping relationship: (ABC Technology, V2.0, PATIENT_INFO, SEX_CODE) -> (Standard Model, Patient Information, Gender). Step S24 (Merging and Output): Since the first candidate mapping relationship and the second candidate mapping relationship exist simultaneously, the system running the method provided in this application will adopt both the first candidate mapping relationship and the second candidate mapping relationship, forming a mapping relationship table from the first candidate mapping relationship and the second candidate mapping relationship.
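The flow of Scenario 3 (steps S22 to S24) can be sketched as below. This is a simplified illustration, assuming the mapping relationship library is keyed by field name only and that the intelligent mapping model is replaced by a stub; `build_mapping_table` and `stub_model` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, Tuple

StandardField = Tuple[str, str, str]


def build_mapping_table(
    fields: Iterable[str],
    library: Dict[str, StandardField],
    model: Callable[[str], StandardField],
) -> Dict[str, StandardField]:
    """Reuse library mappings where possible, otherwise ask the model, then merge."""
    first_candidates: Dict[str, StandardField] = {}
    second_candidates: Dict[str, StandardField] = {}
    for name in fields:
        if name in library:
            # S22: a matching historical mapping exists and is reused directly.
            first_candidates[name] = library[name]
        else:
            # S23: no match, so the (stubbed) model generates the mapping.
            second_candidates[name] = model(name)
    # S24: merge both candidate sets into one mapping table.
    return {**first_candidates, **second_candidates}


library = {
    "PATIENT_ID": ("Standard Model", "Patient Information", "Patient Unique Identifier"),
}


def stub_model(name: str) -> StandardField:
    # Stand-in for the intelligent mapping model; a real model would infer this.
    return {"SEX_CODE": ("Standard Model", "Patient Information", "Gender")}[name]


table = build_mapping_table(["PATIENT_ID", "SEX_CODE"], library, stub_model)
```

When every field is found in the library the model is never called, which corresponds to Scenario 1; when none is found, the table consists only of second candidates, as in Scenario 2.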

[0050] By combining three techniques—semantic parsing, rule-based matching mapping reuse, and intelligent model generation—this application achieves efficient and high-precision automated mapping. First, it utilizes a business dictionary to accurately understand the semantics of the source data. Through rule-based matching, it can quickly generate mapping relationships between metadata fields of the target business system and standard data fields of the target data model, eliminating the need for calculations in a field matching model that integrates multimodal features, thus improving the execution efficiency of this method. For unknown scenarios, a reliable mapping is generated through an intelligent mapping model, ensuring coverage. Finally, a unified mapping relationship table is output through a merging strategy, significantly reducing the workload of manual configuration.

[0051] Optionally, the intelligent mapping model includes a field matching model that integrates multimodal features. By analyzing at least two of the semantic features, structural features, data distribution features, and contextual features of the metadata fields, it comprehensively calculates the matching confidence between the metadata fields of the target business system and the standard data fields of the target data model to generate candidate mapping relationships, wherein: semantic features are semantic vectors generated based on field names and business annotations; structural features are encoding vectors of the data type and length constraints; data distribution features are statistical feature vectors obtained through sampling analysis of field values; and contextual features are context vectors generated based on the data table to which a field belongs and its business domain.

[0052] This application embodiment comprehensively considers multiple dimensions of information such as semantics, structure, data distribution, and context of fields by integrating a field matching model with multimodal features. This significantly improves the accuracy and robustness of mapping and effectively overcomes the limitations of traditional methods that rely solely on field name matching. It can accurately parse complex business semantics and effectively distinguish easily confused fields, thereby greatly reducing the reliance on human experience and realizing a leap from manual judgment to automatic machine decision-making. While ensuring high accuracy, it greatly improves the efficiency of data cleaning and initialization.

[0053] Optionally, depending on the preset threshold range in which the matching confidence falls, the field matching model that integrates multimodal features either automatically generates and outputs the second candidate mapping relationship, or the generated second candidate mapping relationship is output after manual review. After step S2, the method further includes using the judgment result of the manual review as a new training sample for the field matching model that integrates multimodal features, and updating the model parameters in real time or periodically.

[0054] Understandably, the threshold ranges for matching confidence can be set as needed and include a high-confidence range and a low-confidence range. For example, a matching confidence greater than 80% is considered high confidence, and the generated mapping relationship is directly confirmed as the second candidate mapping relationship; a matching confidence less than or equal to 80% is considered low confidence and requires manual review. Alternatively, a matching confidence greater than 90% is considered high confidence, and the mapping relationship is directly confirmed as the second candidate mapping relationship; otherwise, it is considered low confidence, and the mapping relationship is submitted for manual review. This application does not limit this. Manual review includes technical personnel confirming through the interactive interface that the mapping is correct and accepting it as the second candidate mapping relationship, or technical personnel modifying the mapping through the interactive interface and then using it as the second candidate mapping relationship.
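The threshold decision can be sketched as a single function. This is a minimal illustration, assuming confidence is expressed as a fraction in [0, 1]; `route_candidate` and the returned labels are hypothetical names.

```python
def route_candidate(confidence: float, threshold: float = 0.80) -> str:
    """Decide whether a generated mapping is accepted automatically or reviewed."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    # Strictly greater than the threshold is high confidence;
    # less than or equal to the threshold goes to manual review.
    return "auto_accept" if confidence > threshold else "manual_review"
```

Passing `threshold=0.90` gives the alternative configuration mentioned above.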

[0055] By employing a confidence-based automated decision-making process and a human-machine collaborative feedback loop, the embodiments of this application achieve efficient and accurate mapping while ensuring continuous self-learning and intelligent optimization of the system.

[0056] The following is an example: Scenario 4: This scenario differs from Scenario 2 only in step S23. Step S23 (Generation): The system running the method provided in this application initiates a field matching model that integrates multimodal features. The field matching model analyzes the multimodal features of the SEX_CODE field, calculates its similarity to fields such as "gender" and "gender code" in the standard model, and obtains a 98% confidence score. Assuming a matching confidence score greater than 80% is considered high matching confidence, the generated mapping relationship is directly confirmed as the second candidate mapping relationship. Therefore, the system running the method provided in this application ultimately outputs the second candidate mapping relationship with a 98% confidence score: (ABC Technology, V2.0, PATIENT_INFO, SEX_CODE) -> (Standard Model, Patient Information, Gender).

[0057] Scenario 5: Target hospital D has integrated a "Donghua Software HIS V6.0" system (the database only contains records for V5.0). The processing flow is as follows: Step S21 (Parsing): The system running the method provided in this application parses out a new field CONTACT_MOBILE (commented "Contact Phone Number") added in version V6.0. Step S22 (Matching based on preset rules): No matching mapping relationship is found for the CONTACT_MOBILE field. The first candidate mapping relationship is empty. Step S23 (Generation): The intelligent mapping model analyzes this field, and the intelligent mapping model may calculate a confidence score of 75% for this field. Assuming that the matching confidence score is less than or equal to 80%, it is considered low confidence and requires manual review. Therefore, the system running the method provided in this application recommends mapping to the "Contact Phone Number" field of the standard model, generating a preliminary second candidate mapping relationship. Step S24 (Merging and Decision): Since the confidence score is in the low confidence range, the system running the method provided in this application does not automatically adopt it, but instead submits the candidate mapping relationship for manual review. During the manual review phase, technicians modify the mapping in the interactive interface and use it as the second candidate mapping relationship. Since the first candidate mapping relationship is empty, the system running the method provided in this application will use the manually confirmed mapping relationship as the final second candidate mapping relationship and output it, forming a mapping relationship table from the second candidate mapping relationship.

[0058] It is understood that, in some embodiments of this application, the intelligent mapping model can be a pre-trained artificial intelligence model that takes the metadata fields of the target business system as input and the mapping relationship between the metadata fields of the target business system and the standard data fields of the target data model as output. Examples include pre-trained artificial intelligence models based on the Transformer architecture, deep neural network models, generative adversarial systems, etc. The model is trained using the metadata fields of the target business system as input and the mapping relationship between the metadata fields of the target business system and the standard data fields of the target data model as output, thereby obtaining the pre-trained intelligent mapping model. Furthermore, this pre-trained model can possess a self-learning mechanism, such as using the judgment results of manual review as new training samples for the intelligent mapping model, updating the model parameters in real time or periodically. This application does not limit this aspect.

[0059] See Figure 4. Optionally, in this embodiment of the application, step S3 includes: Step S31: Construct a knowledge graph for cleaning operations, wherein the nodes of the knowledge graph include data objects, cleaning operators, and business rules, and the edges of the knowledge graph define the relationships between nodes; Step S32: Determine whether there is a historical cleaning script that matches the vendor and version information of the target business system; Step S33: If it exists, perform the following steps: Step S331: Decompile the historical cleaning script into a first knowledge subgraph, and model the data pattern of the target business system into a second knowledge subgraph; Step S332: Calculate the structural differences between the first knowledge subgraph and the second knowledge subgraph; Step S333: Based on the structural differences and the knowledge graph of the cleaning operation, modify the historical cleaning script to form the data cleaning script; Step S34: If it does not exist, perform the following steps: Step S341: Convert the mapping relationships in the mapping relationship table into queries on the knowledge graph of the cleaning operation; Step S342: Infer the required data cleaning operators and execution sequences from the knowledge graph of the cleaning operation; Step S343: Based on the reasoning path, select and assemble script templates from the standard cleaning script library to form the data cleaning script.

[0060] The following is an example: Scenario 5: When there is a reusable historical script (execute steps S331-S333).

[0061] The target hospital E has integrated a "Donghua HIS V6.0" system, requiring the generation of data cleaning scripts for it. The "standard cleaning script library" constructed in this application already contains a validated and efficient historical cleaning script S0 for the "Donghua HIS V5.0" system. Simultaneously, a general knowledge graph for "patient information" cleaning has been constructed in the standardized database, containing operator nodes such as "data type conversion" and "value domain mapping." Processing flow: Step S31 (Graph Constructed): The cleaning operation knowledge graph is ready. Steps S32 and S33 (Reusability Assessment): The system running the method provided in this application identifies that the current target business system "Donghua HIS V6.0" and the "Donghua HIS V5.0" served by the historical script S0 have the same vendor and highly similar versions, determining that "a reusable historical script exists." Step S331 (Graph Representation): The system running the method provided in this application decompiles the historical script S0 into the first knowledge subgraph G1. For example, G1 might contain a node chain: [source table.PATIENT]--(hasField)-->[source field.Sex]--(appliedBy)-->[operator.ValueMap('1'->'male', '2'->'female')]. Simultaneously, the data pattern of the current V6.0 system is modeled as a second knowledge subgraph G2. Step S332 (Calculate the difference): The system running the method provided in this application compares G1 and G2, calculating the structural difference Diff. The only difference found is that in the V6.0 system, the PATIENT table has a newly added field Birth_Place (place of origin), while G1 lacks corresponding cleaning logic. Step S333 (Modify based on the difference and knowledge graph): The system running the method provided in this application locates the position to be modified (i.e., the end of the PATIENT table) based on the Diff, and queries the cleaning operation knowledge graph. 
It is found that for fields like Birth_Place, which are "free text-type demographic information," the standard cleaning operations are "removing leading and trailing spaces" and "filtering illegal characters." Therefore, the system running the method provided in this application automatically inserts new code snippets calling these two standard operators at the corresponding locations in the S0 script. Ultimately, this generates the adapted new script S_adapted.
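Steps S332 and S333 of this scenario can be sketched as below. The sketch greatly simplifies the knowledge subgraphs, assuming each one is reduced to a set of (table, field) pairs and that the cleaning operation knowledge graph is a lookup from field category to operators; `FIELD_CATEGORY`, `OPERATORS_BY_CATEGORY`, and `plan_patch` are hypothetical names.

```python
from typing import Dict, List, Set, Tuple

Field = Tuple[str, str]  # (table, field)

# Assumed stand-in for the cleaning operation knowledge graph.
FIELD_CATEGORY: Dict[Field, str] = {
    ("PATIENT", "Birth_Place"): "free_text_demographic",
}
OPERATORS_BY_CATEGORY: Dict[str, List[str]] = {
    "free_text_demographic": ["trim_whitespace", "filter_illegal_characters"],
}


def fields_missing_cleaning(g1: Set[Field], g2: Set[Field]) -> List[Field]:
    """S332: fields present in the new schema (G2) but not covered by the script (G1)."""
    return sorted(g2 - g1)


def plan_patch(g1: Set[Field], g2: Set[Field]) -> Dict[Field, List[str]]:
    """S333: for each uncovered field, look up the standard operators to insert."""
    plan: Dict[Field, List[str]] = {}
    for field in fields_missing_cleaning(g1, g2):
        category = FIELD_CATEGORY.get(field, "")
        plan[field] = OPERATORS_BY_CATEGORY.get(category, [])
    return plan


g1 = {("PATIENT", "Sex")}
g2 = {("PATIENT", "Sex"), ("PATIENT", "Birth_Place")}
patch = plan_patch(g1, g2)
```

A fuller graph comparison would also need to detect removed or retyped fields; the set difference here only covers the newly added field case described in the scenario.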

[0062] Scenario 6: When there is no reusable historical script (execute steps S341-S343).

[0063] Steps S32 and S34 (Reusability Assessment): Assuming the target business system is a completely new "ABC EMR V1.0" with no historical scripts, a matching historical cleaning script is determined to "not exist". Step S341 (Graph Query): The system running the method provided in this application transforms the mapping relationship of the gender field (SEX_CODE -> gender) in the mapping relationship table into a query on the knowledge graph of the cleaning operation: "How to clean SEX_CODE to achieve the mapping to gender?" Step S342 (Graph Reasoning): The knowledge graph reasons based on the query and finds that SEX_CODE is a character type and the standard field gender is also a character type, but the value range needs to be standardized. Therefore, the operation path is deduced: a ValueMap operator needs to be applied first, with parameters {'M'->'Male', 'F'->'Female'}. Step S343 (Template Assembly): The system running the method provided in this application selects the corresponding "character value mapping" SQL template from the standard cleaning script library based on the deduced ValueMap operator. Next, the specific mapping relationships {'M'->'Male', 'F'->'Female'} are used as parameters to fill the template, assembling a complete data cleaning script fragment. This process is repeated for all mapping relationships, and the fragments are ultimately pieced together to form a complete, executable data cleaning script S_n, which is then stored in the standard cleaning script library.
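The template-filling step S343 can be sketched as below. This is an illustration only, assuming the "character value mapping" template is a SQL `CASE` expression; the application does not give the template text, and `render_value_map` and `sql_literal` are hypothetical names.

```python
from typing import Dict


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def render_value_map(source_field: str, target_field: str,
                     value_map: Dict[str, str]) -> str:
    """Fill an assumed CASE-expression template with ValueMap parameters.

    Field names are inserted as-is, so they are assumed to come from
    trusted metadata rather than from user input.
    """
    branches = " ".join(
        f"WHEN {sql_literal(src)} THEN {sql_literal(dst)}"
        for src, dst in value_map.items()
    )
    # Unmapped source values fall through to NULL in this sketch.
    return f"CASE {source_field} {branches} ELSE NULL END AS {target_field}"


fragment = render_value_map("SEX_CODE", "gender", {"M": "Male", "F": "Female"})
```

Fragments produced this way for each mapping relationship could then be joined into the column list of a single `SELECT` statement.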

[0064] This application significantly improves the efficiency and quality of data cleaning script generation by constructing a knowledge graph of cleaning operations and designing a "reuse-first" dual-path decision-making mechanism. When reusable historical scripts exist, the system running the method provided in this application can quickly generate optimized scripts through graph difference calculation and intelligent adaptation, greatly improving the implementation efficiency of similar projects. When entirely new scripts need to be generated, the accuracy and completeness of the script logic are ensured through knowledge graph reasoning. This design not only transforms script development from "manual coding" to "automatic assembly and optimization," but also incorporates script generation, reuse, and optimization into the same intelligent framework through the unified knowledge representation of knowledge graphs, forming a sustainable and evolving script knowledge system, ultimately achieving automation, intelligence, and standardization of script development.

[0065] In this embodiment, the method further includes updating the newly generated mapping relationship to the mapping relationship library and updating the newly generated data cleaning script to the standard cleaning script library. The step of updating the newly generated mapping relationship to the mapping relationship library can be performed after the mapping relationship is automatically generated, and the step of updating the newly generated data cleaning script to the standard cleaning script library can be performed after the cleaning script is automatically generated. By updating the newly generated mapping relationship and cleaning script to the database, the method provided in this application possesses self-evolution capabilities, achieving a virtuous cycle where processing efficiency continuously improves with the number of applications.

[0066] See Figure 5. Optionally, in this embodiment of the application, step S3 may be followed by a step of optimizing the newly generated data cleaning script: Step S401: Receive script modification instructions input by technicians via natural language; Step S402: Parse the natural language instructions and convert them into specific modification operations for the data cleaning script; Step S403: Update the data cleaning script according to the modification operation.

[0067] The following is an example: The system running the method provided in this application has automatically generated a data cleaning script for the target business system "Donghua HIS V6.0" to map and clean the patient gender field GENDER (source values 'M' and 'F') to the standard values 'Male' and 'Female'. After reviewing the script, the technicians found that according to the hospital's latest data specifications, the standard values for gender should use the full names "Male" and "Female", not the abbreviations. The technicians do not need to directly edit complex SQL code; they only need to enter the following command in the "Natural Language Command Interaction Window" provided by the system running the method provided in this application: "Change the standard values of the gender field to 'Male' and 'Female'". After receiving the command, the natural language processing module of the system running the method provided in this application immediately parses and converts it to form a precise and executable Abstract Syntax Tree (AST) modification operation. This operation is defined as: locating all nodes in the script where the value after the THEN clause is 'Male' or 'Female', and replacing them with 'Male' and 'Female' respectively. The script engine in the system running the method provided in this application receives and executes the above-mentioned AST modification operation: accurately finds two target nodes in the script's abstract syntax tree: THEN'male' and THEN'female'; replaces the values of these two nodes with 'male' and 'female' respectively; and regenerates executable script code based on the modified AST.

[0068] This application process frees technical personnel from tedious code searching and syntax modification, significantly reducing the technical threshold and time cost of script maintenance and optimization, while ensuring the accuracy and consistency of modifications, fully demonstrating the ease of use and intelligence of this application.

[0069] Optionally, after step S4, the method further includes: performing a multi-dimensional quality assessment on the cleaned standard data; when the assessment result does not meet the preset standard, analyzing the cleaning execution log to locate the cleaning step that caused the quality problem and generating a correction suggestion.

[0070] For example, this application employs the following data quality assessment and root cause analysis model to perform multi-dimensional and interconnected assessments of data quality after cleaning, and to intelligently locate the source of problems: Multi-dimensional quality assessment matrix: Defines quality dimensions such as completeness (C), accuracy (A), consistency (U), and timeliness (T). The assessment model is not a simple weighted average, but a collaborative matrix calculation. Q = W·V + Λ; In this model, Q is the overall quality score, V is the basic quality indicator vector (e.g., V=[C,A,U,T]), W is the dynamic weight matrix reflecting the interrelationships between the indicators (e.g., inconsistency U may have a negative impact on accuracy A), and Λ is the correction term. This model can more realistically reflect the overall state of data quality.
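A worked numeric instance of Q = W·V + Λ may make the coupling between indicators concrete. The application does not state the shapes of W and Λ, so this sketch assumes W is a 4×4 matrix and Λ a 4-vector, giving one adjusted score per dimension, and then assumes the overall score is their mean. All numbers and the name `quality_scores` are illustrative.

```python
from typing import List, Sequence


def quality_scores(W: Sequence[Sequence[float]], V: Sequence[float],
                   Lam: Sequence[float]) -> List[float]:
    """Compute W·V + Λ row by row (assumed shapes: W is n×n, V and Λ have length n)."""
    return [sum(w * v for w, v in zip(row, V)) + lam for row, lam in zip(W, Lam)]


# V = [C, A, U, T]: completeness, accuracy, consistency, timeliness.
V = [0.9, 0.8, 0.7, 1.0]

# Identity weights plus one assumed coupling: accuracy is penalised by
# inconsistency, A' = A - 0.2 * (1 - U), written as 0.2 * U in W and -0.2 in Λ.
W = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
Lam = [0.0, -0.2, 0.0, 0.0]

adjusted = quality_scores(W, V, Lam)
# Assumption: the overall score Q is the mean of the adjusted dimension scores.
overall = sum(adjusted) / len(adjusted)
```

With these inputs the accuracy score drops from 0.8 to about 0.74 because consistency is only 0.7, which a plain weighted average of the four indicators would not capture.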

[0071] Process tracing and root cause recommendation: When the final quality score Q is lower than the threshold, the system running the method provided in this application initiates root cause analysis. By analyzing the cleaning execution log (which records each operation and its input / output snapshots), combined with a Bayesian network or decision tree model, the posterior probability of each cleaning step leading to quality problems is calculated, and the most likely 1-3 root cause steps and correction suggestions are ranked and output.
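The posterior ranking can be sketched with a single application of Bayes' rule. This is a deliberately simplified stand-in for the Bayesian network or decision tree mentioned above, assuming exactly one step is at fault and that a prior and a likelihood are available for each step; the numbers and the name `rank_root_causes` are illustrative.

```python
from typing import Dict, List, Tuple


def rank_root_causes(evidence: Dict[str, Tuple[float, float]],
                     top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank cleaning steps by posterior probability of having caused the anomaly.

    evidence maps step name -> (prior that the step is faulty,
                                likelihood of the observed anomaly if it is).
    """
    unnormalised = {step: prior * likelihood
                    for step, (prior, likelihood) in evidence.items()}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("no step explains the observed anomaly")
    posterior = {step: value / total for step, value in unnormalised.items()}
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Illustrative priors and likelihoods, e.g. estimated from execution logs.
ranked = rank_root_causes({
    "value_mapping": (0.2, 0.9),
    "type_conversion": (0.3, 0.2),
    "trim_whitespace": (0.5, 0.05),
})
```

Here "value_mapping" ranks first even though its prior is the lowest, because the observed anomaly is far more likely under a faulty value mapping than under the other steps.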

[0072] The following is an example: After data cleaning and transformation are completed, the method initiates a quality assessment step. The system running the method provided in this application automatically performs multi-dimensional checks on the output standard data according to preset rules, such as checking the completeness of fields (e.g., whether the "patient age" field is missing), consistency (e.g., whether the "admission time" is later than the "discharge time"), and value range compliance (e.g., whether the "gender" field value exceeds the range of 'male,' 'female,' or 'unknown'). Suppose the quality assessment model detects an anomaly in the "patient age" field: in a large number of records, the value of this field is greater than 150 years, which does not meet the preset "reasonableness" standard (e.g., age ≤ 120 years). At this point, the assessment result triggers a root cause tracing process. The system running the method provided in this application automatically retrieves and analyzes the execution log of this batch of data cleaning. The log records in detail each cleaning operation, the fields involved, the applied transformation rules, and the data snapshot at that time. Through analysis, the system running the method provided in this application identified the root cause of the problem: in the "value mapping" cleaning step, the rule used to map the code '999' (representing "unknown age") of the source system's AGE field to the standard value 'NULL' was not triggered correctly, causing '999' to be incorrectly retained and used as an actual value in subsequent calculations. Based on this identification, the system running the method provided in this application automatically generates and outputs a clear correction suggestion: "In the cleaning rules for the 'patient age' field, supplement the mapping processing for the source value '999' to convert it to the standard null value (NULL)." 
Based on this suggestion, technicians can quickly make corrections in the cleaning rule library or script template.
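The corrected rule and the reasonableness check from this example can be sketched as follows. This is a minimal illustration, assuming ages arrive as strings and that '999' is the only sentinel code; `clean_age`, `age_is_reasonable`, and the constants are hypothetical names.

```python
from typing import Optional

UNKNOWN_AGE_CODES = {"999"}   # source codes meaning "unknown age"
MAX_REASONABLE_AGE = 120      # preset reasonableness limit from the example


def clean_age(raw: Optional[str]) -> Optional[int]:
    """Map sentinel codes and blanks to None (NULL); otherwise parse the age."""
    if raw is None:
        return None
    text = raw.strip()
    if text == "" or text in UNKNOWN_AGE_CODES:
        return None
    return int(text)


def age_is_reasonable(age: Optional[int]) -> bool:
    """Range check only; missing values are left to the completeness check."""
    return age is None or 0 <= age <= MAX_REASONABLE_AGE
```

With the sentinel mapped to `None`, `age_is_reasonable(clean_age("999"))` passes the range check, whereas the unmapped value 999 fails it.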

[0073] Understandably, this process forms a complete closed loop from identifying quality issues to locating the root cause of the cleansing process, and then to optimizing the guiding rules, thus achieving continuous governance of data quality and self-improvement of the cleansing process.

[0074] Another aspect of this application provides a clinical data adaptive cleaning system, including: The knowledge base management module is used to build and maintain a standardized database, wherein the standardized database has a multi-layer structure and includes at least a clinical data standard library and a business dictionary library. The clinical data standard library is used to store target data models based on clinical data standard specifications, and the business dictionary library is used to store metadata dictionaries from different business systems. The intelligent mapping module is communicatively connected to the knowledge base management module. It is used to access the target business system and automatically generate a mapping relationship table from the metadata of the target business system to the standard data of the target data model based on the metadata dictionary. The script generation and execution module is communicatively connected to the intelligent mapping module and is used to automatically generate corresponding data cleaning scripts based on the mapping relationship table and execute cleaning operations. The quality assessment and closed-loop learning module is communicatively connected to the intelligent mapping module, the script generation and execution module, and the knowledge base management module. It is used to assess the quality of the cleaned standard data and update the newly generated mapping relationship and the newly generated data cleaning script to the standardized database.

[0075] Optionally, in this embodiment, the intelligent mapping module may include a metadata parsing module, a feature extraction module, a rule matching module, and an intelligent mapping model. The metadata parsing module and the feature extraction module are configured to perform business semantic parsing on the metadata fields of the target business system using a business dictionary. The rule matching module is configured to, based on the semantic parsing results and according to preset matching rules, determine whether a mapping relationship exists in the mapping relationship library that matches the metadata fields of the target business system; if such a mapping relationship exists, it is directly adopted as the first candidate mapping relationship. The intelligent mapping model is configured to generate a second candidate mapping relationship.

[0076] Optionally, in this embodiment, the script generation and execution module includes a script generation model configured to perform the following operations based on a pre-built knowledge graph of cleaning operations: Determine if a historical data cleaning script exists that matches the vendor and version information of the target business system; if it exists, execute: The historical cleaning scripts are decompiled into the first knowledge subgraph, and the data patterns of the target business system are modeled into the second knowledge subgraph. Calculate the structural differences between the first knowledge subgraph and the second knowledge subgraph; Based on structural differences and a knowledge graph of cleaning operations, historical cleaning scripts are modified to form data cleaning scripts; If it does not exist, then execute: Transform the mapping relationships in the mapping relationship table into queries on the knowledge graph of cleaning operations; The required data cleaning operators and execution sequences are inferred from the knowledge graph of the cleaning operation. Based on the reasoning path, script templates are selected and assembled from the standard cleaning script library to form a data cleaning script.

[0077] For example, the automated process in the adaptive clinical data cleaning system can be implemented by an intelligent engine built with the following technologies, and this application does not limit the specific implementation of the intelligent engine.

[0078] The data processing and scheduler is responsible for scheduling the entire cleaning process, such as sequentially calling the metadata parsing module, feature extraction module, rule matching module, and intelligent mapping model to ensure data consistency.

[0079] The rule matching module can employ a rule engine, which internally stores predefined rules in the form of "IF-THEN". For example, IF field_name == "PatientID" THEN directly maps to the standard field "Patient Unique Identifier". The rule engine performs pattern matching, a fast logical judgment, and is a form of deterministic automation.

[0080] AI / ML model libraries are probabilistic, data-driven components. In this application, they specifically refer to natural language processing models, such as BERT or ClinicalBERT, used to understand the semantics of field names and annotations and convert them into semantic vectors. This is the foundation for achieving "intelligent parsing." Intelligent mapping models, such as gradient boosting decision trees (XGBoost) or deep neural networks (DNN), are used to integrate features such as semantics, structure, and data distribution to calculate the similarity confidence between fields. This is the core of automatically generating mapping relationships. Script generation models, such as those based on graph databases (e.g., Neo4j) and graph traversal algorithms, infer the required operation sequences from the knowledge graph of cleaning operations. This is the key to automatically generating cleaning scripts. Large language models, such as GPT and DeepSeek, are used to implement human-computer collaborative optimization steps for modifying scripts through natural language input. This application innovatively proposes intelligent mapping models for automatically generating mapping relationships and script generation models for automatically generating cleaning scripts.

[0081] In some embodiments of this application, the intelligent mapping model is implemented using the following field matching model that integrates multimodal features.

[0082] The model inputs are as follows. Semantic features (SemVec): field names and business annotations are converted into high-dimensional semantic vectors using a language model pre-trained on medical texts (such as ClinicalBERT). Structural features (StrVec): the data type (numeric, character, date, etc.), length constraints, and whether a field is a key are encoded into a feature vector. Data distribution features (DistVec): field values are sampled and analyzed to extract statistical characteristics such as value range distribution, enumerated value set, and null value rate, forming a data profile vector. Contextual features (CtxVec): using knowledge graph technology, a vector representing the business context of a field is generated from the data table and business domain to which the field belongs, as well as its relationships with other fields.
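As a rough illustration of how two of these inputs could be encoded, the following Python sketch builds a structural vector and a data distribution vector from field metadata and sampled values. The type vocabulary, the length cap of 255, and the choice of statistics (null rate and distinct-value ratio) are assumptions made for the example; the application does not specify the encoding.

```python
from typing import List, Optional, Sequence

DATA_TYPES = ("numeric", "character", "date")  # assumed type vocabulary


def structural_vector(data_type: str, max_length: int, is_key: bool,
                      length_cap: int = 255) -> List[float]:
    """StrVec sketch: one-hot data type, capped and normalised length, key flag."""
    one_hot = [1.0 if data_type == t else 0.0 for t in DATA_TYPES]
    length = min(max_length, length_cap) / length_cap
    return one_hot + [length, 1.0 if is_key else 0.0]


def distribution_vector(sample: Sequence[Optional[str]]) -> List[float]:
    """DistVec sketch: null rate and distinct-value ratio of the sampled values."""
    if not sample:
        return [0.0, 0.0]
    non_null = [v for v in sample if v is not None and v != ""]
    null_rate = 1.0 - len(non_null) / len(sample)
    distinct_ratio = len(set(non_null)) / len(non_null) if non_null else 0.0
    return [null_rate, distinct_ratio]
```

A low distinct-value ratio hints at an enumerated field (for example a sex or status code), which is one way the enumerated value set mentioned above could influence matching.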

[0083] The calculation process is as follows: for the metadata field $F_s$ and the target standard data field $F_t$, the aforementioned feature vectors are concatenated and input into a lightweight deep neural network for fusion and decision-making.

[0084] The final output of this network is the matching confidence score, $\mathrm{Conf}(F_s, F_t) = N_\theta(\mathrm{SemVec} \oplus \mathrm{StrVec} \oplus \mathrm{DistVec} \oplus \mathrm{CtxVec})$, where $N_\theta$ denotes the trained neural network with parameters $\theta$ and $\oplus$ denotes concatenation of the feature vectors computed for $F_s$ and $F_t$. In one embodiment of this application, the field matching model that integrates multimodal features introduces an active learning strategy. When the confidence level falls in the "fuzzy interval" (medium confidence), the system submits the candidate mapping relationship for manual judgment and uses the judgment result as a new training sample to update the model parameters $\theta$ in real time or periodically, so that the model "becomes more accurate with use".
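The scoring and triage logic of this paragraph can be sketched in plain Python as follows. The network shape (one hidden layer with ReLU and a sigmoid output) and the interval bounds 0.4 and 0.85 are illustrative assumptions; the application only states that a lightweight deep neural network is used and that medium-confidence candidates go to manual judgment.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def confidence(features: Sequence[float],
               hidden_w: Sequence[Sequence[float]],
               hidden_b: Sequence[float],
               out_w: Sequence[float],
               out_b: float) -> float:
    """Forward pass of a one-hidden-layer network: ReLU hidden units, sigmoid output."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(hidden_w, hidden_b)]
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)


def route(score: float, low: float = 0.4, high: float = 0.85) -> str:
    """Triage a candidate mapping; the 0.4 / 0.85 bounds are assumed, not from the text."""
    if score >= high:
        return "auto_accept"
    if score >= low:
        return "manual_review"  # fuzzy interval: the human label becomes a training sample
    return "reject"
```

In use, `features` would be the concatenation of the feature vectors for the source and target fields, and every `manual_review` decision would be stored with its human label for the next parameter update.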

[0085] In some embodiments of this application, the script generation model is implemented using the following knowledge graph-based script logic reasoning and assembly model.

[0086] Construct a knowledge graph for cleaning operations expressed in "entity-relationship" form. Nodes include data objects (such as tables and fields), cleaning operators (such as FormatConvert and ValueMap), and business rules (such as NotNullConstraint and CodeStandard). Edges define the relationships between nodes, such as hasField (a table owns a field), applyTo (an operation is applied to a field), constrainBy (an operation is constrained by a rule), and nextStep (the order in which operations are executed).
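One simple way to hold such a graph in memory is a set of labelled triples with typed nodes, as in the Python sketch below. The `CleaningGraph` class and its method names are hypothetical; the node kinds and edge labels follow the paragraph above, and a production system might instead use a graph database.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

NODE_KINDS = {"data_object", "operator", "rule"}
EDGE_KINDS = {"hasField", "applyTo", "constrainBy", "nextStep"}


@dataclass
class CleaningGraph:
    nodes: Dict[str, str] = field(default_factory=dict)            # name -> kind
    edges: Set[Tuple[str, str, str]] = field(default_factory=set)  # (src, label, dst)

    def add_node(self, name: str, kind: str) -> None:
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[name] = kind

    def add_edge(self, src: str, label: str, dst: str) -> None:
        if label not in EDGE_KINDS:
            raise ValueError(f"unknown edge label: {label}")
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.add((src, label, dst))

    def neighbours(self, src: str, label: str) -> Set[str]:
        """Targets reachable from src over edges carrying the given label."""
        return {d for s, lab, d in self.edges if s == src and lab == label}
```

Validating edge labels against a fixed vocabulary catches misspelled relationship names at insertion time rather than during later graph queries.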

[0087] Script generation path reasoning: once a set of field mapping relationships is determined, the intelligent engine transforms it into a query on the knowledge graph. For example, for the mapping "Source table A. Field X -> Standard table B. Field Y", the system automatically infers the following path in the graph: locate source field X -> find the required operator nodes (such as DataTypeConvert, TerminologyMap) through applyTo edges -> check through constrainBy edges whether these operators are subject to specific rule nodes -> determine the operation execution sequence based on nextStep edges. This reasoning path maps directly to the logical structure of the script.
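The traversal described here can be sketched over a small list of triples. The graph fragment, the `infer_plan` function, and the simplification that nextStep edges form acyclic chains with at most one successor per operator are all assumptions for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (source node, edge label, target node)

# Hypothetical graph fragment for "Source table A. Field X".
TRIPLES: List[Triple] = [
    ("A", "hasField", "A.X"),
    ("TerminologyMap", "applyTo", "A.X"),
    ("DataTypeConvert", "applyTo", "A.X"),
    ("TerminologyMap", "constrainBy", "CodeStandard"),
    ("DataTypeConvert", "nextStep", "TerminologyMap"),
]


def infer_plan(field: str, triples: List[Triple]) -> List[Tuple[str, List[str]]]:
    """Return [(operator, [rules])] for a field, ordered by nextStep edges."""
    # 1. Operators applied to the field (applyTo edges pointing at it).
    operators = {s for s, lab, d in triples if lab == "applyTo" and d == field}
    # 2. Rules constraining each operator (constrainBy edges).
    rules: Dict[str, List[str]] = {
        op: sorted(d for s, lab, d in triples if s == op and lab == "constrainBy")
        for op in operators
    }
    # 3. Execution order: follow nextStep chains among the selected operators.
    successors = {s: d for s, lab, d in triples
                  if lab == "nextStep" and s in operators and d in operators}
    ordered: List[str] = []
    seen: Set[str] = set()
    for start in sorted(operators - set(successors.values())):
        node: Optional[str] = start
        while node is not None and node not in seen:
            seen.add(node)
            ordered.append(node)
            node = successors.get(node)
    return [(op, rules[op]) for op in ordered]
```

Each `(operator, rules)` pair in the result corresponds to one script step, so the returned list can be read as the skeleton of the cleaning script for that field.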

[0088] Script optimization and conflict resolution logic: when reusing a historical script S0, the system performs advanced differencing and adaptation:
G0 = DecompileToGraph(S0) (decompiles the historical script into the first knowledge subgraph)
Gc = CurrentSchemaToGraph (models the current data table structure as the second knowledge subgraph)
Gdiff = GraphDiff(Gc, G0) (calculates the structural difference between the first and second knowledge subgraphs)
Sc′ = AdaptByGraph(S0, Gdiff) (finds equivalent replacements based on the graph and generates the adapted script)
Sc = ResolveConflict(Sc′) (detects and resolves logical conflicts, such as rule contradictions)
This process enables precise, automated script tuning at the level of complex logic.
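A heavily simplified Python sketch of the differencing and adaptation steps is given below, with each subgraph represented as a set of triples. The set-difference definition of the graph diff, the one-removed-one-added rename heuristic, and the text substitution are assumptions standing in for GraphDiff and AdaptByGraph; decompilation and conflict resolution are not shown.

```python
from typing import Dict, FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]  # (source node, edge label, target node)


def graph_diff(current: Set[Triple],
               historical: Set[Triple]) -> Dict[str, FrozenSet[Triple]]:
    """Structural difference: triples added to or removed from the historical graph."""
    return {
        "added": frozenset(current - historical),
        "removed": frozenset(historical - current),
    }


def renamed_fields(diff: Dict[str, FrozenSet[Triple]], table: str) -> Dict[str, str]:
    """Assumed heuristic: one removed plus one added field of a table is a rename."""
    removed = sorted(d for s, lab, d in diff["removed"] if s == table and lab == "hasField")
    added = sorted(d for s, lab, d in diff["added"] if s == table and lab == "hasField")
    if len(removed) == 1 and len(added) == 1:
        return {removed[0]: added[0]}
    return {}


def adapt_script(script: str, renames: Dict[str, str]) -> str:
    """Naive adaptation: substitute renamed field identifiers in the script text."""
    for old, new in renames.items():
        script = script.replace(old, new)
    return script
```

For example, if the historical graph has a field `pat_id` on table `visit` and the current schema has `patient_id` instead, the diff yields the rename `pat_id -> patient_id`, which is then applied to the historical script text.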

[0089] It is understood that the clinical data adaptive cleaning system provided in this application also has the beneficial effects and advantages of the clinical data adaptive cleaning method, which will not be elaborated upon here.

[0090] The above are merely exemplary descriptions of the embodiments of this application. It should be noted that the embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.

[0091] In the foregoing description of this specification, references to terms such as "one embodiment," "another embodiment," or "some embodiments," etc., indicate that a specific feature, structure, material, or characteristic described in connection with an embodiment is included in at least one embodiment of this application. In this specification, illustrative expressions of the above terms do not necessarily refer to the same embodiment. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments.

[0092] Although embodiments of this application have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of this application, the scope of which is defined by the claims and their equivalents.

[0093] The above is merely an exemplary description of the embodiments of this application. This application is not limited to the embodiments described. Those skilled in the art can make equivalent modifications or substitutions without departing from the spirit of this application. All such equivalent modifications or substitutions are included within the scope defined by the claims of this application.

Claims

1. An adaptive cleaning method for clinical data, characterized in that it includes the following steps: Step S1: Construct a standardized database, wherein the standardized database has a multi-layer structure and includes at least a clinical data standard library and a business dictionary library, the clinical data standard library being used to store target data models based on clinical data standard specifications, and the business dictionary library being used to store metadata dictionaries from different business systems; Step S2: Access the target business system and, based on the metadata dictionary, automatically generate a mapping table from the metadata of the target business system to the standard data of the target data model; Step S3: Based on the mapping table, automatically generate the corresponding data cleaning script; Step S4: Execute the data cleaning script to clean and convert the data of the target business system into standard data that conforms to the target data model.

2. The method as described in claim 1, characterized in that, The standardized database also includes a mapping relationship library and a standard cleaning script library; the mapping relationship library is used to store the confirmed mapping relationship pairs from metadata fields to standard data fields, and the standard cleaning script library is used to store standardized script templates corresponding to cleaning operations.

3. The method as described in claim 2, characterized in that, Step S2 includes: Step S21: Automatically extract the metadata of the target business system, and perform business semantic parsing on the metadata fields of the target business system in conjunction with the business dictionary library; Step S22: Based on the semantic parsing results, according to the preset matching rules, determine whether there is a mapping relationship in the mapping relationship library that matches the metadata field of the target business system. If there is, the mapping relationship is directly used as the first candidate mapping relationship. Step S23: If no matching mapping relationship exists, a second candidate mapping relationship is generated through the intelligent mapping model, wherein the intelligent mapping model is an artificial intelligence model that takes the metadata fields of the target business system as input and the mapping relationship between the metadata fields of the target business system and the standard data fields of the target data model as output. Step S24: Generate the mapping table based on the first candidate mapping relationship and / or the second candidate mapping relationship.

4. The method as described in claim 3, characterized in that the intelligent mapping model includes a field matching model that integrates multimodal features, which analyzes at least two of the semantic, structural, data distribution, and contextual features of metadata fields to comprehensively calculate the matching confidence between the metadata fields of the target business system and the standard data fields of the target data model, thereby generating the second candidate mapping relationship; the semantic features are semantic vectors generated based on field names and business annotations; the structural features are encoding vectors of data type and length constraints; the data distribution features are statistical feature vectors obtained based on field value sampling analysis; and the contextual features are context vectors generated based on the data table to which a field belongs and the business domain.

5. The method as described in claim 4, characterized in that, depending on the preset threshold range in which the matching confidence falls, the field matching model that integrates multimodal features either automatically generates and outputs the second candidate mapping relationship, or the generated second candidate mapping relationship is output after manual review; and after step S2, the judgment result of the manual review is used as a new training sample for the field matching model that integrates multimodal features, and the model parameters are updated in real time or periodically.

6. The method as described in claim 2, characterized in that step S3 includes: Step S31: Construct a knowledge graph for cleaning operations, wherein the nodes of the knowledge graph include data objects, cleaning operators, and business rules, and the edges of the knowledge graph define the relationships between nodes; Step S32: Determine whether there is a historical cleaning script that matches the vendor and version information of the target business system; Step S33: If it exists, then perform the following steps: Step S331: Decompile the historical cleaning script into a first knowledge subgraph, and model the data schema of the target business system as a second knowledge subgraph; Step S332: Calculate the structural differences between the first knowledge subgraph and the second knowledge subgraph; Step S333: Based on the structural differences and the knowledge graph of the cleaning operation, modify the historical cleaning script to form the data cleaning script; Step S34: If it does not exist, then perform the following steps: Step S341: Convert the mapping relationships in the mapping table into queries on the knowledge graph of the cleaning operation; Step S342: Infer the required data cleaning operators and execution sequences from the knowledge graph of the cleaning operation; Step S343: Based on the reasoning path, select and assemble script templates from the standard cleaning script library to form the data cleaning script.

7. The method as described in claim 1, characterized in that, The method further includes updating the newly generated mapping relationship to the mapping relationship library; and updating the newly generated data cleaning script to the standard cleaning script library.

8. The method as described in claim 1, characterized in that, Step S3 is followed by a step of optimizing the newly generated data cleaning script: Step S401: Receive script modification instructions input by technicians via natural language; Step S402: Parse the natural language instructions and convert them into specific modification operations for the data cleaning script; Step S403: Update the data cleaning script according to the modification operation.

9. The method as described in claim 1, characterized in that, The step S4 is followed by: performing a multi-dimensional quality assessment on the cleaned standard data; when the assessment results do not meet the preset standards, analyzing the cleaning execution log to locate the cleaning steps that caused the quality problems and generating correction suggestions.

10. A clinical data adaptive cleaning system, characterized in that the system is used to implement the adaptive clinical data cleaning method as described in any one of claims 1 to 9 and includes: a knowledge base management module, used to build and maintain a standardized database, wherein the standardized database has a multi-layer structure and includes at least a clinical data standard library and a business dictionary library, the clinical data standard library being used to store target data models based on clinical data standard specifications, and the business dictionary library being used to store metadata dictionaries from different business systems; an intelligent mapping module, communicatively connected to the knowledge base management module and used to access the target business system and automatically generate a mapping relationship table from the metadata of the target business system to the standard data of the target data model based on the metadata dictionary; a script generation and execution module, communicatively connected to the intelligent mapping module and used to automatically generate corresponding data cleaning scripts based on the mapping relationship table and execute cleaning operations; and a quality assessment and closed-loop learning module, communicatively connected to the intelligent mapping module, the script generation and execution module, and the knowledge base management module, and used to assess the quality of the cleaned standard data and update the newly generated mapping relationship and the newly generated data cleaning script to the standardized database.