Adaptive preprocessing and quality control system and method for multi-source heterogeneous medical data
By combining a data perception and routing engine, a preprocessing pipeline cluster, and a closed-loop quality control module, the problem of adaptive processing of multi-source heterogeneous medical data is solved, achieving efficient and intelligent data cleaning and quality control, and improving data quality and the reliability of clinical decisions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- WESTERN INTELLIGENT CORE (CHONGQING) BIOTECHNOLOGY CO LTD
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This invention relates to the fields of medical information technology and data preprocessing technology, specifically to an adaptive preprocessing and quality control system and method for multi-source heterogeneous medical data.
Background Technology
[0002] With the widespread adoption of Laboratory Information Systems (LIS), Point-of-Care Testing (POCT), and Electronic Medical Records (EMR), clinical laboratory data is growing explosively and comes from diverse sources in heterogeneous formats. Existing technologies (such as prior art document CN120579026A) propose data cleaning and standardization, but their processes are typically static and fixed. Faced with different data sources (such as device-exported messages, manually entered tables, and free-text reports), static processes cannot adapt, leading to the following problems: 1. Poor cleaning effect: Rules effective for device data may disrupt semantic structure when applied to text reports, resulting in information loss or distortion. 2. Low efficiency: Data types must be judged manually in advance and processing pipelines configured by hand, resulting in low automation and high labor costs. 3. Delayed quality control: Data quality issues are often discovered only in subsequent analysis stages, leading to high costs for tracing and remediation and impairing the timeliness of clinical decision-making.
[0003] Traditional preprocessing methods have the following limitations: they lack the ability to intelligently identify multi-source data; they cannot dynamically adjust processing strategies based on data characteristics; quality control is disconnected from the processing flow; and they are unable to cope with the complexity and diversity of clinical laboratory data.
Summary of the Invention
[0004] To overcome the shortcomings of existing technologies, one of the objectives of this invention is to provide an adaptive preprocessing and quality control system for multi-source heterogeneous medical data, thereby optimizing the process from "manual configuration" to "intelligent adaptation" and comprehensively improving the efficiency and quality of medical test data preprocessing.
[0005] The technical solution adopted in this invention is as follows: it includes a data perception and routing engine, a preprocessing pipeline cluster, and a closed-loop quality control module. The data perception and routing engine is used to identify the type and quality of input data and perform dynamic routing accordingly. The preprocessing pipeline cluster is used to process data of different types or qualities. The closed-loop quality control module is used to evaluate the data quality in real time during the preprocessing process of the preprocessing pipeline cluster, obtain a score, and output data or trigger a feedback mechanism based on the score.
[0006] The principle and beneficial effects of the technical solution: Identifying the type and quality of the input data comprises identifying its source type and data structure and assessing its quality profile. Based on the identification results and the assessed quality score, the data is then dynamically allocated to the preprocessing pipeline cluster for type-specific processing. The preprocessing pipeline cluster processes data of different types or qualities, outputting, parsing, or repairing it. The closed-loop quality control module assesses the data quality in real time during processing, derives a score, and outputs the data or triggers a feedback mechanism based on that score, turning post-processing quality control into real-time in-process quality control and thereby improving the quality of the output data. In short, the input data is classified and pre-assessed, processed by the matching pipeline, and evaluated a second time during processing by the closed-loop quality control module, which either outputs the data or triggers feedback. This combination of classification, evaluation, preprocessing, and secondary evaluation significantly improves output quality while greatly reducing manual operations and improving processing efficiency.
[0007] In a preferred embodiment of the present invention, the data perception and routing engine includes a lightweight deep learning classifier, which is used to classify the source type and structure of the data.
[0008] In a preferred embodiment of the present invention, the data perception and routing engine includes a feature extraction submodule, a type identification submodule, a quality assessment submodule, and a routing decision submodule. The feature extraction submodule is used to extract surface features of the received data; the type identification submodule is used to classify the extracted surface features; the quality assessment submodule is used to calculate the preliminary quality score of the data; and the routing decision submodule is used to combine the type identification result and the quality score to perform dynamic routing.
[0009] The principle and beneficial effects of the technical solution: After receiving the raw data, the feature extraction submodule first extracts its surface features, such as file format (.csv, .txt), field separator, presence of a predefined header, proportion of free text, and proportion of missing values; the type recognition submodule uses a lightweight deep learning classifier to classify the extracted surface features and outputs a category such as "standard equipment data", "manually entered form", "free text report", or "image report"; the quality assessment submodule calculates the preliminary quality score of the data based on the missing rate, the proportion of outliers (according to the preset medically reasonable range), etc.; the routing decision submodule combines the type identification result and the quality score to perform dynamic routing.
[0010] In a preferred embodiment of the present invention, the type recognition submodule adopts the TextCNN or BERT-mini model, and the classification categories include standard equipment data, manually entered forms, free text reports, and image reports.
[0011] In a preferred embodiment of the present invention, the preprocessing pipeline cluster includes at least a standard structured data pipeline, a text report parsing pipeline, and a low-quality data augmentation and repair pipeline. The standard structured data pipeline is used to process standard equipment data; the text report parsing pipeline is used to parse unstructured text; and the low-quality data augmentation and repair pipeline is used to process data with high missing rates and high outlier values.
[0012] In a preferred embodiment of the present invention, the text report parsing pipeline uses the BERT model for named entity recognition and combines it with a medical knowledge graph for entity linking and normalization.
[0013] In a preferred embodiment of the present invention, the low-quality data augmentation and repair pipeline uses the KNN algorithm to repair missing values based on similar patient records, or applies a generative model for data filling.
[0014] In a preferred embodiment of the present invention, the closed-loop quality control module includes a quality scoring card and a feedback actuator. The quality scoring card scores data in real time based on the completeness, consistency and reasonableness of the data. The feedback actuator performs strategy switching, process rerouting or manual alarm according to the scoring results.
[0015] Compared with existing technologies, the beneficial effects of this invention are: 1. Intelligent process: It realizes "data-driven" process routing without manual pre-configuration; it solves the adaptation problem of multi-source heterogeneous data processing; and it reduces the workload of data engineers.
[0016] 2. Closed-loop quality control: It transforms post-event quality control into real-time quality control during the process; it can trigger a self-repair process and has self-healing capabilities; it ensures the quality of data input to downstream AI models from the source; and it improves the reliability of the entire clinical decision support system.
[0017] 3. Significantly improved efficiency: Through parallel pipeline clusters and intelligent routing, the "one-size-fits-all" processing method is avoided; the most refined algorithms are used for different data characteristics; preprocessing efficiency is improved by about 60%, and manual intervention is reduced by 80%.
[0018] The second objective of this invention is to provide an adaptive preprocessing and quality control method for multi-source heterogeneous medical data, applied to the system, comprising: S1: inputting multi-source data; S2: identifying the type and quality profile of the input data through a data perception and routing engine; S3: dynamically routing the data to the corresponding sub-pipeline in the preprocessing pipeline cluster based on the identification results; S4: performing real-time quality assessment through a closed-loop quality control module during sub-pipeline processing and obtaining a score; S5: determining whether the real-time quality assessment result meets the standard; if it does, proceed to S6; otherwise, proceed to S7; S6: outputting high-quality standard data; S7: triggering a feedback actuator, which selects process jump or manual alarm based on the score.
[0019] In a preferred embodiment of the present invention, in S7, process jump: trigger process rerouting to direct data to the low-quality data enhancement and repair pipeline in S3; manual alarm: send an alarm to the system administrator, along with a data snapshot and quality report, requesting manual review; strategy switching can be performed simultaneously during process jump.
Description of the Drawings
[0020] Figure 1 is a flowchart of the adaptive preprocessing and quality control system for multi-source heterogeneous medical data according to the present invention. Figure 2 is a flowchart of the adaptive preprocessing and quality control method for multi-source heterogeneous medical data according to the present invention.
Detailed Implementation
[0021] Typical embodiments embodying the features and advantages of the present invention will be specifically described in the following description. It should be understood that the present invention can have various variations in different embodiments without departing from the scope of the present invention, and the descriptions and illustrations herein are for illustrative purposes only and not intended to limit the present invention.
[0022] In the description of this application, the terms "first," "second," "side," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are used only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the structure referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on this application.
[0023] The present invention will now be described in detail with reference to the accompanying drawings and embodiments.
[0024] As shown in Figure 1, this invention provides an adaptive preprocessing and quality control system for multi-source heterogeneous medical data. It includes a data perception and routing engine, which serves as the system's intelligent entry point. This engine automatically identifies the source type and data structure of the input data, assesses its quality profile, and dynamically allocates the data to specific preprocessing sub-pipelines within the preprocessing pipeline cluster based on the identification results and the assessed quality score. Specifically, it includes: 1. Feature extraction submodule: After receiving the raw data, this submodule first extracts its surface features, such as file format (.csv, .txt), field separator, presence of a predefined header, proportion of free text, and proportion of missing values. 2. Type recognition submodule: This submodule uses a lightweight deep learning classifier (such as TextCNN or BERT-mini) to classify the extracted surface features, outputting a category such as "Standard Equipment Data", "Manually Entered Table", "Free Text Report", or "Image Report". 3. Quality assessment submodule: This submodule calculates a preliminary data quality score in parallel, based on factors such as the missing rate and the proportion of outliers (according to a preset medically reasonable range). 4. Routing decision submodule: This submodule combines the type recognition result and the quality score to perform dynamic routing; that is, it dynamically allocates data to specific preprocessing sub-pipelines within the preprocessing pipeline cluster.
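The surface features listed for the feature extraction submodule can be illustrated with a minimal sketch. The function name, the set of recognized header fields, the candidate separators, the missing-value tokens, and the "four or more words counts as free text" heuristic are all assumptions made for illustration; the source does not specify them.

```python
import csv
import io
import os

# Hypothetical header names; a real deployment would configure its own.
KNOWN_HEADER_FIELDS = {"patient_id", "item", "value", "unit"}


def extract_surface_features(filename: str, text: str) -> dict:
    """Compute simple surface features for one raw input file."""
    extension = os.path.splitext(filename)[1].lower()
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # Guess the field separator from the first non-empty line.
    first = lines[0] if lines else ""
    counts = {sep: first.count(sep) for sep in (",", "\t", ";", "|")}
    separator = max(counts, key=counts.get) if any(counts.values()) else None

    if separator is None:
        rows = [[ln] for ln in lines]  # treat each line as one free-text cell
    else:
        rows = list(csv.reader(io.StringIO("\n".join(lines)), delimiter=separator))

    has_header = bool(rows) and bool(
        {cell.strip().lower() for cell in rows[0]} & KNOWN_HEADER_FIELDS
    )
    data_rows = rows[1:] if has_header else rows
    cells = [cell.strip() for row in data_rows for cell in row]
    total = len(cells) or 1  # avoid division by zero on empty input

    missing = sum(1 for c in cells if c == "" or c.upper() in {"NA", "NULL"})
    free_text = sum(1 for c in cells if len(c.split()) >= 4)
    return {
        "extension": extension,
        "separator": separator,
        "has_header": has_header,
        "missing_ratio": missing / total,
        "free_text_ratio": free_text / total,
    }
```

The resulting dictionary is the kind of feature vector that could be passed on to the type recognition and quality assessment submodules.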
[0025] The lightweight deep learning classifier aims to balance accuracy and efficiency to meet the real-time requirements of the system's entry point, and its construction follows these steps: 1. Data preparation and feature engineering: 1) Collect historical multi-source medical test data to form a training dataset.
[0026] 2) Clean and label the data, with labels such as "Standard Equipment Data", "Manually Entered Table", "Free Text Report", and "Image Report".
[0027] 3) For structured / semi-structured data (such as equipment data, tables), extract meta-features, such as: number of fields, field name regularity, proportion of numeric fields, and whether there are predefined delimiters.
[0028] 4) For text and image reports, preprocessing is performed. Text data uses word embeddings (such as Word2Vec) or directly uses sub-word units from pre-trained models; image data is normalized in size.
[0029] 2. Model Selection and Lightweighting: 1) TextCNN is the preferred choice: For mixed data with obvious textual features, TextCNN can efficiently capture local semantic features, has a small number of parameters, and fast inference speed. Construct a network containing convolutional layers, pooling layers, and fully connected layers.
[0030] 2) Alternative BERT-mini: If high semantic understanding is required, a lightweight variant of BERT (such as a 6-layer, 4-head attention mechanism) can be used. Knowledge can be transferred from a large BERT model through knowledge distillation technology, or pre-trained and fine-tuned on medical text corpora to significantly reduce computational resource consumption while maintaining high performance.
[0031] 3. Training and Deployment: 1) The extracted meta-features are fused with the embedded features of the text / image and input into the classifier for training.
[0032] 2) The training objective is to minimize the cross-entropy loss function.
[0033] 3) After model training is completed, the model is packaged as a lightweight service (for example, optimized with ONNX Runtime or TensorRT) and integrated into the data perception and routing engine to achieve low-latency automatic classification of data types.
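The cross-entropy training objective mentioned in the training step can be written out for a single sample as follows. This is only a sketch: the class order and the example logits are assumptions, and a real classifier would compute this loss over batches inside a deep learning framework.

```python
import math

# Assumed class order for illustration; the source names these four categories.
CLASSES = [
    "Standard Equipment Data",
    "Manually Entered Table",
    "Free Text Report",
    "Image Report",
]


def softmax(logits):
    """Convert raw classifier scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target_index])


# The loss shrinks as the score of the true class grows.
loss_uncertain = cross_entropy([0.0, 0.0, 0.0, 0.0], CLASSES.index("Free Text Report"))
loss_confident = cross_entropy([0.0, 0.0, 5.0, 0.0], CLASSES.index("Free Text Report"))
```

Minimizing this quantity over the labeled training set pushes the classifier to assign high probability to the correct data type.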
[0034] In classifying the extracted surface features, the specific classification is based on the meta-features and content features of the data, as shown in Table 1:
[0035] Table 1: Classification Criteria Based on Meta-features and Content Features of Data
The preprocessing pipeline cluster consists of multiple sub-pipelines optimized for specific data types, including at least a standard structured data pipeline, a text report parsing pipeline, and a low-quality data augmentation and repair pipeline.
[0036] Data is assigned to a sub-pipeline by rules, for example: Type = "Standard Equipment Data" & Quality Score > 0.8 → Standard Structured Data Pipeline.
[0037] Type = "Free Text Report" → Text Report Parsing Pipeline.
[0038] Quality score <0.6 → Low-quality data enhancement and repair pipeline.
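The three example routing rules above can be sketched as a small function. The pipeline identifiers are hypothetical names. Two points are assumptions because the source does not state them: the low-quality rule is checked first, and inputs that match none of the three rules return None so that the caller decides what to do.

```python
from typing import Optional

# Hypothetical identifiers for the three sub-pipelines named in the text.
STANDARD = "standard_structured_data_pipeline"
TEXT = "text_report_parsing_pipeline"
REPAIR = "low_quality_data_augmentation_and_repair_pipeline"


def route(data_type: str, quality_score: float) -> Optional[str]:
    """Pick a sub-pipeline from the recognized type and preliminary quality score."""
    # Assumption: the low-quality rule takes precedence over the type rules.
    if quality_score < 0.6:
        return REPAIR
    if data_type == "Free Text Report":
        return TEXT
    if data_type == "Standard Equipment Data" and quality_score > 0.8:
        return STANDARD
    # No example rule in the source covers this case.
    return None
```

For instance, standard equipment data with a score of 0.7 falls between the listed rules, which is why the sketch returns None rather than guessing a destination.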
[0039] Of these sub-pipelines: 1. The standard structured data pipeline performs routine normalization, LOINC code mapping, and outlier removal based on the 3σ / IQR principles. The specific steps are: 1) Data normalization: convert the values of the same test item from different devices to standard units and dimensions.
[0040] 2) LOINC code mapping: Based on the name and unit of the test item, a matching algorithm maps it to the LOINC standard terminology library to achieve semantic standardization.
[0041] 3) Outlier detection and removal: ① Based on statistical distribution (3σ principle): For values that conform to a normal distribution (such as electrolytes), calculate their mean (μ) and standard deviation (σ). Any data point that falls outside the range of [μ-3σ, μ+3σ] will be regarded as an extreme outlier and removed.
[0042] ② Based on the interquartile range (IQR principle): For skewed distribution values (such as tumor markers), calculate the first quartile (Q1), the third quartile (Q3), and the interquartile range (IQR = Q3 - Q1). Any data point falling outside the range of [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] will be considered a moderate outlier and removed.
[0043] ③ Based on medically reasonable range: Regardless of the statistical results, all values must be within the known medically reasonable range (e.g., adult blood glucose concentration is usually between 3.9 and 11.1 mmol/L). Data that exceeds this rigid range will be directly marked as erroneous data.
[0044] Removal criteria: ① Values that simultaneously meet both statistical abnormality and medical incompatibility criteria will be automatically removed first.
[0045] ② Values that are statistically abnormal but fall within the medically reasonable range can be manually reviewed by the quality control module instead of being directly removed, in order to avoid mistakenly deleting valid clinical data.
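The outlier checks and removal criteria above can be combined into one triage function, sketched below. The decision labels ("keep", "review", "flag_error", "remove") and the function names are assumptions for illustration. The population standard deviation is used for the 3σ bounds, and `statistics.quantiles` with its default method is used for the quartiles; the source does not say which estimators are intended.

```python
import statistics


def three_sigma_bounds(values):
    """Return [mu - 3*sigma, mu + 3*sigma] for roughly normal data."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return mu - 3 * sigma, mu + 3 * sigma


def iqr_bounds(values):
    """Return [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for skewed data."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def triage(values, medical_range, skewed=False):
    """Label each value according to the statistical and medical checks."""
    low, high = iqr_bounds(values) if skewed else three_sigma_bounds(values)
    med_low, med_high = medical_range
    decisions = []
    for v in values:
        stat_abnormal = not (low <= v <= high)
        med_abnormal = not (med_low <= v <= med_high)
        if stat_abnormal and med_abnormal:
            decisions.append("remove")      # both criteria met: removed first
        elif stat_abnormal:
            decisions.append("review")      # medically plausible: manual review
        elif med_abnormal:
            decisions.append("flag_error")  # outside the rigid medical range
        else:
            decisions.append("keep")
    return decisions
```

Note that with very small samples a single extreme value inflates σ so much that the 3σ rule cannot flag it, so the sketch is only meaningful on reasonably sized batches.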
[0046] 2. The text report parsing pipeline uses the BERT model for named entity recognition to extract key test items and values, and combines this with a medical knowledge graph for entity linking and normalization. Named Entity Recognition (NER) with the BERT model: 1) Input the text report into a BERT model fine-tuned on a medical corpus.
[0047] 2) The model performs sequence labeling on each token and identifies: ① Test items: such as "white blood cells", "total bilirubin", "troponin I".
[0048] ② Numerical values: such as "12.5", ">1000".
[0049] ③ Units: such as "×10^9/L", "μmol/L".
[0050] ④ Abnormality indicators: such as "↑", "↓", "elevated", "negative".
[0051] Entity Linking and Normalization: 1) Match the identified test item strings against the medical knowledge graph (for example, one containing LOINC and UMLS concepts) and map them to standard concepts.
[0052] 2) Convert numerical values and units to standard units and formats.
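A minimal sketch of the entity linking and normalization steps follows, using a plain dictionary in place of the medical knowledge graph. The concept identifiers are hypothetical placeholders, not real LOINC or UMLS codes, and the synonym lists are illustrative only. The conversion factors are the commonly used ones (glucose: mg/dL divided by about 18.016 gives mmol/L; bilirubin: mg/dL multiplied by about 17.1 gives μmol/L). The NER step itself is not shown.

```python
# Hypothetical synonym table standing in for the knowledge graph lookup.
CONCEPTS = {
    "glucose": "CONCEPT_GLUCOSE",
    "blood sugar": "CONCEPT_GLUCOSE",
    "total bilirubin": "CONCEPT_TBIL",
    "tbil": "CONCEPT_TBIL",
}

# Standard unit per concept, plus factors that convert other units into it.
STANDARD_UNITS = {
    "CONCEPT_GLUCOSE": ("mmol/L", {"mmol/l": 1.0, "mg/dl": 1 / 18.016}),
    "CONCEPT_TBIL": ("μmol/L", {"μmol/l": 1.0, "umol/l": 1.0, "mg/dl": 17.1}),
}


def normalize(item: str, value: float, unit: str):
    """Link a recognized test item to a concept and convert to its standard unit.

    Returns (concept_id, value, unit), or None when linking or conversion fails.
    """
    concept = CONCEPTS.get(item.strip().lower())
    if concept is None:
        return None
    standard_unit, factors = STANDARD_UNITS[concept]
    factor = factors.get(unit.replace(" ", "").lower())
    if factor is None:
        return None
    return concept, round(value * factor, 2), standard_unit
```

Returning None for unknown items or units leaves the decision (for example, a manual alarm) to the surrounding pipeline rather than guessing a mapping.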
[0053] 3. The low-quality data augmentation and repair pipeline uses the KNN algorithm to repair missing values based on similar patient records, or applies generative models to fill in reasonable data.
[0054] KNN algorithm for repairing missing values: 1) Feature construction: For a patient record with missing values, select other complete and representative test indicators as feature vectors.
[0055] 2) Similarity calculation: In the full historical dataset, calculate the Euclidean distance or cosine similarity between the record and all other complete records.
[0056] 3) Neighbor determination: Select the K most similar complete patient records (K value is determined by cross-validation).
[0057] 4) Missing value imputation: ① For continuous values: impute using the weighted average of the corresponding field values of the K nearest neighbors (the weight is proportional to the similarity).
[0058] ② For categorical data: use the mode (most frequent value) of the corresponding field among the K nearest neighbors to fill the data.
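The four KNN steps above can be sketched in a few lines. The function and parameter names are hypothetical. Euclidean distance is used, and inverse distance serves as the similarity weight, which is one possible reading of "weight proportional to similarity". Features are assumed to be already scaled to comparable ranges, and K is fixed here rather than chosen by cross-validation.

```python
import math
from collections import Counter


def knn_impute(record, complete_records, target, features, k=3):
    """Fill record[target] from the k most similar complete records."""

    def distance(other):
        return math.sqrt(sum((record[f] - other[f]) ** 2 for f in features))

    # Neighbor determination: the k complete records closest to this one.
    neighbors = sorted(complete_records, key=distance)[:k]
    values = [n[target] for n in neighbors]

    if all(isinstance(v, (int, float)) for v in values):
        # Continuous field: similarity-weighted average (inverse distance).
        weights = [1.0 / (distance(n) + 1e-9) for n in neighbors]
        return sum(w * v for w, v in zip(weights, values)) / sum(weights)

    # Categorical field: most frequent value among the neighbors.
    return Counter(values).most_common(1)[0][0]
```

A call such as `knn_impute(patient, history, target="alt", features=["ast", "ggt"], k=5)` would return the imputed value without modifying the record; the field names in that call are made up for illustration.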
[0059] Generative model imputation: ① For complex missing patterns, a generative model (such as VAE or GAN) can be trained to learn the joint probability distribution of the complete test data.
[0060] ② When missing values are encountered, the generative model generates the most plausible values consistent with the data distribution, conditioned on the feature vector built in the "feature construction" step above.
[0061] It also includes a closed-loop quality control module, which is embedded in the key nodes of each sub-pipeline to perform real-time quality scoring on the data being processed (such as setting quality scoring cards after "missing value processing" and "outlier removal"). The scoring card calculates the current data quality score according to preset rules (such as completeness, consistency, and uniqueness). If the score is lower than the threshold, a feedback mechanism is triggered to automatically switch the processing algorithm or notify manual intervention.
[0062] The feedback mechanism includes a feedback actuator. If the score is below the threshold, the feedback actuator is triggered. Its executable operations include strategy switching, process jump, and manual alarm.
[0063] Strategy switching: For example, automatically switching from "mean fill" to "KNN fill".
[0064] Process jump (rerouting): For example, rerouting data to the low-quality data augmentation and repair pipeline.
[0065] Manual alerts: Send alerts to the system administrator along with a data snapshot and quality report, requesting manual review.
[0066] The quality scoring mechanism is shown in Table 2 below: The quality scoring card uses a weighted summation method to perform real-time quantitative evaluation of multiple quality dimensions.
[0067] Example of scoring determination: Assume the maximum total score S is 100 points.
[0068]
[0069] Table 2: Example of scoring criteria
Threshold setting and feedback: ① The total score threshold can be set to 85 points.
[0070] ②S≥85: The data quality is excellent and will proceed to the next stage.
[0071] ③60≤S<85: The quality is average. The feedback actuator triggers process rerouting, directing it to the "low-quality data augmentation and repair pipeline".
[0072] ④S<60: Poor quality. The feedback actuator triggers a manual alarm, notifying the data engineer to intervene.
[0073] These thresholds and scoring weights can be dynamically adjusted and optimized based on specific application scenarios and historical data quality performance.
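The weighted-sum scoring card and the threshold-based feedback can be sketched as follows. The thresholds of 85 and 60 come from the text; the three dimensions and their weights (40/30/30) are assumptions chosen only so that the maximum total is 100, since the scoring table itself is not reproduced here. The action names are hypothetical.

```python
# Assumed weights; each dimension score is expected in the range 0.0 to 1.0.
WEIGHTS = {"completeness": 40, "consistency": 30, "reasonableness": 30}

PASS_THRESHOLD = 85     # S >= 85: proceed to the next stage
REROUTE_THRESHOLD = 60  # 60 <= S < 85: reroute; S < 60: manual alarm


def total_score(dimension_scores: dict) -> float:
    """Weighted sum of the quality dimensions, on a 0 to 100 scale."""
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


def feedback_action(score: float) -> str:
    """Map the total score S to the action described in the text."""
    if score >= PASS_THRESHOLD:
        return "pass"
    if score >= REROUTE_THRESHOLD:
        return "reroute_to_repair_pipeline"
    return "manual_alarm"


# Example: a batch that is fairly complete but only moderately reasonable.
example = total_score({"completeness": 0.9, "consistency": 0.8, "reasonableness": 0.7})
action = feedback_action(example)
```

Keeping the weights and thresholds as module-level constants reflects the statement that they can be adjusted per application scenario.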
[0074] This invention also provides an adaptive preprocessing and quality control method for multi-source heterogeneous medical data, applied to the adaptive preprocessing and quality control system for multi-source heterogeneous medical data, comprising: S1: inputting multi-source data; S2: identifying the type and quality profile of the input data through a data perception and routing engine; S3: dynamically routing the data to the corresponding sub-pipeline in the preprocessing pipeline cluster according to the identification results; S4: performing real-time quality assessment through a closed-loop quality control module during sub-pipeline processing and obtaining a score; S5: determining whether the real-time quality assessment result meets the standard; if it does, proceed to S6; if it does not, proceed to S7; S6: outputting high-quality standard data; S7: triggering a feedback actuator, which selects process jump or manual alarm based on the score.
[0075] In S7, process jump: trigger process rerouting, directing data to the low-quality data enhancement and repair pipeline in S3; manual alarm: send an alarm to the system administrator, along with a data snapshot and quality report, requesting manual review; strategy switching can be performed synchronously during process jump, for example, automatically switching from "mean fill" to "KNN fill".
[0076] The above embodiments are merely preferred embodiments of the present invention and should not be construed as limiting the scope of protection of the present invention. Any non-substantial changes and substitutions made by those skilled in the art based on the present invention shall fall within the scope of protection claimed by the present invention.
Claims
1. An adaptive preprocessing and quality control system for multi-source heterogeneous medical data, characterized in that: it includes a data perception and routing engine, a preprocessing pipeline cluster, and a closed-loop quality control module. The data perception and routing engine is used to identify the type and quality of input data and perform dynamic routing accordingly. The preprocessing pipeline cluster is used to process data of different types or qualities; the closed-loop quality control module is used to evaluate the data quality in real time during the preprocessing process of the preprocessing pipeline cluster, obtain a score, and output data or trigger a feedback mechanism based on the score.
2. The adaptive preprocessing and quality control system for multi-source heterogeneous medical data according to claim 1, characterized in that: The data perception and routing engine includes a lightweight deep learning classifier, which is used to classify the source type and structure of the data.
3. The adaptive preprocessing and quality control system for multi-source heterogeneous medical data according to claim 1, characterized in that: The data perception and routing engine includes a feature extraction submodule, a type recognition submodule, a quality assessment submodule, and a routing decision submodule. The feature extraction submodule is used to extract surface features of the received data. The type recognition submodule is used to classify the extracted surface features; The quality assessment submodule is used to calculate the initial quality score of the data; the routing decision submodule is used to combine the type identification results and the quality score to perform dynamic routing.
4. The adaptive preprocessing and quality control system for multi-source heterogeneous medical data according to claim 3, characterized in that: The type recognition submodule uses the TextCNN or BERT-mini model, and the classification categories include standard equipment data, manually entered forms, free text reports, and image reports.
5. The adaptive preprocessing and quality control system for multi-source heterogeneous medical data according to claim 1, characterized in that: The preprocessing pipeline cluster includes at least a standard structured data pipeline, a text report parsing pipeline, and a low-quality data augmentation and repair pipeline. The standard structured data pipeline is used to process standard equipment data; the text report parsing pipeline is used to parse unstructured text; and the low-quality data augmentation and repair pipeline is used to process data with high missing rates and high outlier values.
6. The adaptive preprocessing and quality control system for multi-source heterogeneous medical data according to claim 5, characterized in that: The text report parsing pipeline uses the BERT model for named entity recognition and combines it with a medical knowledge graph for entity linking and normalization.
7. The adaptive preprocessing and quality control system for multi-source heterogeneous medical data according to claim 5, characterized in that: The low-quality data augmentation and repair pipeline uses the KNN algorithm to repair missing values based on similar patient records, or applies a generative model to fill in the data.
8. The adaptive preprocessing and quality control system for multi-source heterogeneous medical data according to claim 1, characterized in that: The closed-loop quality control module includes a quality scoring card and a feedback actuator. The quality scoring card scores data in real time based on the completeness, consistency and reasonableness of the data. The feedback actuator performs strategy switching, process rerouting or manual alarm based on the scoring results.
9. An adaptive preprocessing and quality control method for multi-source heterogeneous medical data, applied to the system described in any one of claims 1-8, characterized in that it comprises: S1: inputting multi-source data; S2: identifying the type and quality profile of the input data through the data perception and routing engine; S3: based on the identification results, dynamically routing the data to the corresponding sub-pipeline in the preprocessing pipeline cluster; S4: during the sub-pipeline processing, performing real-time quality assessment through the closed-loop quality control module and obtaining a score; S5: determining whether the real-time quality assessment result meets the standard; if it does, proceeding to S6; otherwise, proceeding to S7; S6: outputting high-quality standard data; S7: triggering the feedback actuator, which selects process jump or manual alarm based on the score.
10. The adaptive preprocessing and quality control method for multi-source heterogeneous medical data according to claim 9, characterized in that: In S7, process jump: trigger process rerouting, directing data to the low-quality data enhancement and repair pipeline in S3; manual alarm: send an alarm to the system administrator, along with a data snapshot and quality report, requesting manual review; strategy switching can be performed simultaneously during process jump.