Data element-oriented college medical AI model adaptation training optimization method
By constructing a data element governance system, enhancing features, and designing a hierarchical adaptive training network, the problems of multi-source heterogeneity and privacy security of medical data in universities were solved. This improved the training efficiency and adaptability of AI models, achieved standardized and secure governance of medical data in universities, and enhanced the privacy security of cross-university collaborative training and the continuous optimization of models.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGXI YIJIAN SMART INFORMATION TECHNOLOGY CO LTD
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-12
AI Technical Summary
The diverse and heterogeneous nature of medical data from universities, coupled with inconsistent formats, high risks of privacy leaks, and low data quality, leads to low training efficiency and poor adaptability of AI models. Furthermore, the lack of dynamic feedback mechanisms and cross-university collaborative training raises privacy and security concerns.
We will construct a data element governance system, perform standardized labeling and privacy desensitization, enhance features, design a hierarchical adaptive training network, build a dynamic feedback mechanism and a privacy-secure collaborative training framework, conduct scenario-based model verification and iterative optimization, and establish a module for analyzing the correlation between data elements and model performance.
It has achieved standardized and secure governance of medical data from universities, improved data quality and feature representation capabilities, enhanced model training efficiency and adaptability, strengthened privacy and security of cross-university collaborative training and model generalization ability, and ensured continuous model optimization and applicability.
Figure CN122201820A_ABST
Description
Technical Field
[0001] This invention relates to the field of medical artificial intelligence technology, specifically a data-element-oriented method for optimizing the adaptability training of medical AI models in universities.
Background Technology
[0002] Against the backdrop of the rapid development of medical artificial intelligence technology and the deep integration of universities' "medical education and research integration" development strategy, universities, as the core carriers for medical talent training, clinical scientific research innovation, and grassroots medical practice, have accumulated massive amounts of medical data covering various types, including clinical medical records, medical images, laboratory tests, and teaching cases. This data has become a core data element driving the development of medical AI models and adapting them to the specific medical scenarios of universities. However, in the current process of training and applying medical AI models in universities, there are still many industry pain points and technical bottlenecks surrounding the utilization of data elements and the optimization of model adaptability. These severely restrict the release of the value of university medical data elements and the effectiveness of medical AI models in university clinical teaching, scientific research analysis, and assisted diagnosis. Specific problems are reflected in the following aspects:
[0003] (1) University medical data exhibits significant multi-source heterogeneity. Data sources include multiple entities such as affiliated hospitals, teaching laboratories, and research teams within the university. Data formats include unstructured text medical records, semi-structured image data, and structured test indicators. Furthermore, there are issues such as data redundancy, incorrect data entry, and duplicate storage. At the same time, there is a lack of standardized data governance systems adapted to university medical scenarios. Data annotation is not designed specifically to meet the needs of university teaching and research, making it difficult for raw data to be directly used as effective data support for AI model training, and significantly reducing the usability of data elements.
[0004] (2) Medical data from universities generally suffer from problems such as uneven distribution of missing values, incompatibility of cross-modal features, and insufficient expression of low-dimensional features. For example, some vital signs data are missing in clinical teaching medical records, the feature spaces of medical images and diagnostic texts cannot be unified, and the dimensions of small sample data used for scientific research analysis are difficult to meet the needs of model training. Existing data processing methods are mostly general solutions and have not been designed to enhance features according to the characteristics of medical data from universities, resulting in low quality of basic data for model training, which in turn affects the training effect and generalization ability of the model.
[0005] (3) The lack of dynamic feedback mechanism between data elements and model performance during model training makes it impossible to evaluate the quality of data elements in real time during training. Problems such as data labeling errors, low feature discrimination, and poor scene matching are difficult to be detected and corrected in time, forming a vicious cycle of "low-quality data training low-performance model". On the other hand, the hyperparameter adjustment of model training is mostly done manually or statically, and cannot be adaptively adjusted according to the real-time training performance of the model, resulting in low model training efficiency, slow convergence speed, and difficulty in achieving optimal performance.
[0006] To address the aforementioned issues, we propose a data-element-oriented method for optimizing the adaptability training of medical AI models in universities.
Summary of the Invention
[0007] The purpose of this invention is to provide a data-element-oriented method for optimizing the adaptive training of AI models in medical colleges and universities, in order to solve the problems mentioned in the background art.
[0008] To achieve the above objectives, the present invention provides the following technical solution:
[0009] A data-element-oriented method for optimizing the adaptability training of medical AI models in universities, comprising the following steps:
[0010] S1. Data Element Governance and Preprocessing: Construct a data element governance system for medical data in universities, standardize, classify, and de-identify multi-source heterogeneous medical data in universities, generate standardized data element sets, and retain data traceability information;
[0011] S2. Data Feature Enhancement: Establish a data feature enhancement module to perform feature alignment, missing value completion, and dimension enhancement on the standardized data element set, generate a feature-enhanced data element set, and simultaneously establish and update the data element feature index library.
[0012] S3. Construction of hierarchical adaptation training network: Design a hierarchical adaptation training network, adopt an encoder-multi-scale scene decoder structure, receive feature-enhanced data element set, and construct a scenario-based loss function in combination with the needs of clinical teaching, scientific research analysis and auxiliary diagnosis scenarios in colleges and universities, and output the initial version of the scene-adapted medical AI model.
[0013] S4. Dynamic Feedback Optimization of Data Elements: Establish a dynamic feedback mechanism for data elements to evaluate the quality of data elements in real time during model training, generate data element optimization strategies and feed them back to the data element governance and preprocessing stage. At the same time, based on the model performance feedback, the hyperparameters of the hierarchical adaptation training network are dynamically adjusted by the algorithm.
[0014] S5. Privacy-Security Collaborative Training: Construct a privacy-security training framework based on federated learning and differential privacy technology to achieve cross-university collaborative training of medical data elements from multiple universities. While shielding the original information of the data, it integrates the features of cross-university data elements and outputs a preliminary university medical AI adaptability model.
[0015] S6. Model Scenario-based Validation and Iteration: The initial medical AI adaptability model for universities will be standardized and validated in different medical scenarios in universities, generating a model adaptability evaluation report. Based on the report, data elements will be supplemented and the model will be iterated and optimized to output the final medical AI adaptability model for universities.
[0016] S7. Correlation Analysis between Data Elements and Model Performance: Construct a module for correlation analysis between data elements and model performance. Through association rule mining algorithms, analyze the relationship between data element characteristics, quality and model scenario-based performance, and generate data element optimization guidelines to provide a basis for data element selection and optimization for subsequent training of medical AI models in universities.
[0017] Preferably, the specific process of step S1 is as follows:
[0018] S11. Perform data cleaning on multi-source heterogeneous medical data from universities, remove redundant, erroneous, and duplicate data, and complete data format unification and grayscale normalization.
[0019] S12. Combining the ICD-11 disease classification standard with the needs of university medical scenarios, construct a data element labeling system for university medical data, perform structured labeling on the data, and add scenario labels, hierarchical labels and privacy labels to each data element;
[0020] S13. Based on the privacy label, use pseudonymization, data generalization, and local differential privacy technology to perform differentiated privacy desensitization processing, encrypt and store the elements with hierarchical labels as the core data, and retain the traceability index and association relationship of the data elements;
[0021] S14. Classify the desensitized dataset according to the scenario labels, generate standardized data element sets for clinical teaching, scientific research analysis, and auxiliary diagnosis scenarios, and establish a data element governance ledger.
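As a rough illustration of the pseudonymization and data generalization named in step S13, the sketch below replaces a patient identifier with a keyed hash and coarsens an exact age into a range. The function names, the record fields, the 16-character truncation, and the 10-year bucket are all illustrative assumptions; the text does not specify how either operation is parameterized.

```python
import hashlib
import hmac

def pseudonymize(patient_id, secret_key):
    """Replace an identifier with a keyed SHA-256 digest (hypothetical scheme)."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability; an assumption

def generalize_age(age, bucket=10):
    """Coarsen an exact age into a range such as '30-39'."""
    low = (age // bucket) * bucket
    return f"{low}-{low + bucket - 1}"

# Hypothetical record; field names are not taken from the source text.
record = {"patient_id": "P001", "age": 37, "privacy_label": "high"}
key = b"demo-key-held-by-the-data-owner"
desensitized = {
    "patient_id": pseudonymize(record["patient_id"], key),
    "age": generalize_age(record["age"]),
    "privacy_label": record["privacy_label"],
}
print(desensitized)
```

Because the digest is keyed, the same identifier always maps to the same pseudonym, which is one way the traceability index mentioned in S13 could be preserved without storing the original identifier.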
[0022] Preferably, the specific process of step S2 is as follows:
[0023] S21. Based on the data element feature index library, the mutual information maximization method is used to perform cross-modal feature alignment on the standardized data element set to achieve feature space unification of image, text and test data.
[0024] S22. Combining medical domain knowledge with generative adversarial networks to complete missing values in medical data, and using an attention mechanism to enhance the dimensionality of low-dimensional data elements to generate high-dimensional feature vectors.
[0025] S23. Use the random forest feature importance ranking algorithm to filter the data, remove invalid and redundant features, and generate a feature-enhanced data element set.
[0026] S24. Add weight labels to each feature in the feature enhancement data element set. The weight values are determined by the weighted sum of the clinical actual value of the data element and its contribution to model training. Update the feature information to the data element feature index library.
[0027] Preferably, the specific process of step S3 is as follows:
[0028] S31. Normalize the feature-enhanced data element set according to the feature weight labels to form a multi-channel tensor as the unified input of the hierarchical adaptation training network.
[0029] S32. Construct a depthwise separable convolutional encoder structure to extract global features and local detail features of data elements layer by layer, and introduce a skip connection mechanism to retain shallow fine-grained feature information.
[0030] S33. Construct a multi-scale scene decoder structure, design independent decoding branches for clinical teaching, scientific research analysis, and auxiliary diagnosis scenarios, and perform targeted feature fusion for each branch according to scenario requirements;
[0031] S34. Construct a scenario-based loss function consisting of weighted interpretability loss, diversity loss, diagnostic accuracy loss, and cross-scenario consistency loss. This function, along with the outputs of each scenario decoding branch, participates in supervised network training to output an initial version of the scenario-adapted medical AI model.
[0032] Preferably, the specific process of step S4 is as follows:
[0033] S41. Construct a data element quality assessment index system that includes data integrity, annotation accuracy, feature discrimination, and scene matching, and conduct real-time quantitative assessment of data elements during the training process.
[0034] S42. If the data element quality assessment result is lower than the preset threshold, generate optimization strategies such as re-labeling, secondary feature enhancement, and scenario-based data supplementation, and feed them back to step S1 to complete the closed-loop optimization of data elements.
[0035] S43. Construct a model performance evaluation index system that includes scenario-based accuracy, recall, F1 score and cross-scenario adaptability, and monitor the performance of the initial version of the model in real time.
[0036] S44. Based on the model performance evaluation results, the Bayesian optimization algorithm is used to dynamically adjust the hyperparameters of the hierarchical adaptation training network, such as the learning rate, batch size, and number of convolutional kernels.
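The threshold check in steps S41 and S42 can be pictured as a simple gate over the four quality indicators. The following sketch is only an illustration: the indicator keys, the threshold values, and in particular the mapping from each indicator to a strategy are assumptions, since the text lists the strategies without saying which indicator triggers which.

```python
def below_threshold(scores, thresholds):
    """Names of quality indicators whose score falls under the preset threshold."""
    return sorted(name for name, value in scores.items() if value < thresholds[name])

def optimization_strategies(scores, thresholds, strategy_for):
    """Distinct strategies to feed back to step S1 for the failing indicators."""
    return sorted({strategy_for[name] for name in below_threshold(scores, thresholds)})

# Hypothetical indicator-to-strategy mapping (not specified in the source).
STRATEGY_FOR = {
    "data_integrity": "scenario-based data supplementation",
    "annotation_accuracy": "re-labeling",
    "feature_discrimination": "secondary feature enhancement",
    "scene_matching": "scenario-based data supplementation",
}

scores = {
    "data_integrity": 0.95,
    "annotation_accuracy": 0.70,
    "feature_discrimination": 0.90,
    "scene_matching": 0.60,
}
thresholds = {name: 0.80 for name in scores}  # assumed uniform threshold
print(optimization_strategies(scores, thresholds, STRATEGY_FOR))
```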
[0037] Preferably, the specific process of step S5 is as follows:
[0038] S51. Construct a federated learning architecture and set up a collaborative training center for medical data elements in universities. Each university acts as a federated node and only uploads the feature model of its local feature-enhanced data element set to the collaborative training center, while the original data is kept locally.
[0039] S52. During the local training process at each federation node, the privacy budget and noise coefficient are dynamically set according to the data privacy label, and adaptive differential privacy noise is added.
[0040] S53. The Collaborative Training Center performs consistency verification on the feature models uploaded by each federated node, uses the federated averaging algorithm to complete the global aggregation and update of model parameters, and distributes the updated global model parameters to each federated node.
[0041] S54. Each federated node performs local fine-tuning training based on global model parameters, repeating the process of "local training - model upload - global aggregation - parameter distribution" until the model converges, and outputs a preliminary medical AI adaptability model for universities.
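The global aggregation in step S53 can be sketched as follows. The text names the federated averaging algorithm without giving its weighting, so this example assumes the common formulation in which each node's parameters are weighted by its local sample count; parameters are flattened to plain lists for brevity.

```python
def federated_average(node_params, node_sizes):
    """Sample-count-weighted average of per-node parameter vectors."""
    if len(node_params) != len(node_sizes) or not node_params:
        raise ValueError("need one sample count per node")
    total = sum(node_sizes)
    dim = len(node_params[0])
    return [
        sum(params[d] * size for params, size in zip(node_params, node_sizes)) / total
        for d in range(dim)
    ]

# Two hypothetical university nodes with 100 and 300 local samples.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [100, 300])
print(global_params)  # the larger node pulls the average toward its values
```

In each round the center would distribute `global_params` back to the nodes, which fine-tune locally and upload again until convergence, as described in S54.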
[0042] Preferably, the specific process of step S6 is as follows:
[0043] S61. Construct an independent scenario-based validation dataset covering clinical teaching simulation, scientific research data mining, and clinical auxiliary diagnosis, and the dataset is not used for model training;
[0044] S62. Test the preliminary medical AI adaptability model in the validation dataset, calculate the performance indicators of each scenario and the cross-scenario adaptability indicators, and generate a model adaptability evaluation report.
[0045] S63. Based on the evaluation report, identify the shortcomings in model performance, determine the types and quantities of data elements to be supplemented, and feed back to step S1 to complete the targeted data element supplementation.
[0046] S64. Based on the supplemented feature-enhanced data element set, the preliminary model is retrained and fine-tuned to achieve iterative upgrade of the model and output the final medical AI adaptability model for universities.
[0047] Preferably, the specific process of step S7 is as follows:
[0048] S71. Construct a data element and model performance correlation analysis module based on the Apriori algorithm for association rule mining, taking the feature indicators and quality indicators of data elements as antecedents and the scenario-based performance indicators of the model as consequents, and setting minimum support and minimum confidence.
[0049] S72. Input the data element indicator data and model performance indicator data into the correlation analysis module to explore the strong correlation rules between data elements and model performance, and analyze and clarify the key data element characteristics and quality thresholds that affect the performance of the model in various scenarios.
[0050] S73. Based on the results of association rule mining and analysis, generate a guide to optimize data elements for training AI models in medical universities, clarifying the optimal combination of data elements and related technical parameter settings for model training in different scenarios.
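The support and confidence filtering behind steps S71 and S72 can be illustrated with a simplified rule miner. This is not a full Apriori implementation (it enumerates small antecedent sets directly instead of generating candidates level by level), and the item names such as `high_integrity` are invented for the example.

```python
from itertools import combinations

def mine_rules(transactions, antecedent_items, consequent_items,
               min_support, min_confidence, max_len=2):
    """Return (antecedent, consequent, support, confidence) for strong rules."""
    n = len(transactions)

    def support(items):
        return sum(1 for t in transactions if items <= t) / n

    rules = []
    for size in range(1, max_len + 1):
        for ante in combinations(sorted(antecedent_items), size):
            ante_set = frozenset(ante)
            s_ante = support(ante_set)
            if s_ante == 0 or s_ante < min_support:
                continue  # infrequent antecedent: no rule can qualify
            for cons in sorted(consequent_items):
                s_rule = support(ante_set | {cons})
                confidence = s_rule / s_ante
                if s_rule >= min_support and confidence >= min_confidence:
                    rules.append((ante, cons, s_rule, confidence))
    return rules

# Each transaction: discretized data element indicators plus a performance outcome.
runs = [
    {"high_integrity", "high_f1"},
    {"high_integrity", "high_f1"},
    {"high_integrity"},
    {"low_integrity"},
]
print(mine_rules(runs, {"high_integrity", "low_integrity"}, {"high_f1"}, 0.5, 0.6))
```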
[0051] A data-element-oriented system for adaptability training and optimization of medical AI models in universities, comprising:
[0052] The data element governance preprocessing module is used to clean, standardize, label, desensitize, and classify multi-source heterogeneous medical data from universities, generate standardized data element sets, and establish a data element governance ledger.
[0053] The data feature enhancement module communicates with the data element governance preprocessing module to realize cross-modal feature alignment, missing value completion, dimension enhancement and invalid feature filtering of standardized data element sets, generate feature-enhanced data element sets and update the data element feature index library;
[0054] The hierarchical adaptation training network module communicates with the data feature enhancement module and adopts an encoder-multi-scale scene decoder structure to complete the hierarchical scene-based training of the model and output the initial version of the scene-adapted medical AI model.
[0055] The dynamic feedback optimization module communicates with the data element governance and preprocessing module and the hierarchical adaptation training network module respectively. It is used to evaluate the data element quality and model performance in real time, and realize closed-loop optimization of data elements and adaptive adjustment of training network hyperparameters.
[0056] The privacy and security collaborative training module communicates with the hierarchical adaptation training network module to realize cross-university privacy and security collaborative training of medical data elements from multiple universities, and outputs a preliminary medical AI adaptability model for universities.
[0057] The scenario-based verification iteration module communicates with the privacy and security collaborative training module and the data element governance preprocessing module to complete the full-scenario standardized verification of the model and realize data element supplementation and model iterative optimization.
[0058] The data element and model performance correlation analysis module communicates with the above modules to explore the correlation between data elements and model performance and generate data element optimization guidelines.
[0059] The data storage module communicates and connects with all the above modules to store the entire process of data, including datasets, model parameters, evaluation reports, optimization guidelines, etc., enabling secure data storage and efficient retrieval.
[0060] Preferably, the data element governance preprocessing module includes a data cleaning and normalization unit, a standardization and annotation unit, a privacy desensitization and encryption unit, and a scenario-based classification and ledger unit; the hierarchical adaptive training network module includes an input layer unit, a depthwise separable convolutional encoder unit, a multi-scale scene decoder unit, and a scenario-based loss function unit; the privacy and security collaborative training module includes a federated learning architecture unit, a differential privacy noise embedding unit, a global model aggregation unit, and a multi-round collaborative training unit.
[0061] Compared with the prior art, the beneficial effects of the present invention are:
[0062] This invention constructs a dedicated governance system for medical data elements in universities. It combines the ICD-11 standard with the multi-scenario needs of universities to achieve standardized data labeling, uses differentiated privacy protection technology to complete the de-identification process, and retains data traceability information. It solves the problems of inconsistent formats, high risk of privacy leakage, and lack of governance of medical data in universities, and realizes the standardized and secure governance of medical data elements, providing high-quality basic data support for AI model training.
[0063] This invention establishes a full-process mechanism for enhancing data element features. It achieves cross-modal feature alignment by maximizing mutual information, completes missing values accurately by combining domain knowledge and GAN, and enhances dimensionality by using an attention mechanism. This effectively improves the feature expression capability of medical data elements and solves the problems of many missing values, cross-modal incompatibility, and low dimensionality in medical data, providing high-dimensional and high-quality feature data for model training.
[0064] This invention designs a hierarchical adaptive training network of encoder-multi-scale scene decoder. It designs independent decoding branches to meet the differentiated needs of clinical teaching, scientific research analysis and assisted diagnosis in universities, and constructs a fusion-type scene-based loss function to realize scene-based hierarchical training of the model. This solves the problem of poor scene adaptability of existing general models and takes into account the interpretability, diversity and accuracy of the model in different scenarios.
[0065] This invention establishes a two-dimensional dynamic feedback mechanism for data elements and model performance, enabling real-time evaluation and closed-loop optimization of data element quality. Simultaneously, it employs a Bayesian optimization algorithm to adaptively adjust hyperparameters, constructing a dynamic optimization closed loop for model training. This solves the problems of data issues not being detected in real time and the blind adjustment of hyperparameters during model training, thereby improving model training efficiency and performance.
[0066] This invention integrates federated learning and differential privacy technology to construct a privacy-secure collaborative training framework, enabling cross-university collaborative training of data elements from multiple universities. The original data is retained locally, and only the feature model is uploaded. At the same time, adaptive differential privacy noise is embedded, which solves the privacy and security problems of medical data silos in universities and cross-university collaborative training. It fully integrates the advantages of medical data elements from various universities and improves the generalization ability of the model.
[0067] This invention establishes a systematic model scenario-based verification and iteration mechanism. It uses independent real-world scenario datasets that were not used in training for verification, and completes targeted data element supplementation and model iteration optimization based on the evaluation report. This solves the problems of non-objective model verification and directionless iteration, realizes continuous optimization and upgrading of the model, and improves the practicality of the model in multiple scenarios in universities.
[0068] This invention adds a data element and model performance correlation analysis module. By mining association rules, it identifies the key data element characteristics and quality thresholds that affect model performance, and generates a data element optimization guide. This upgrades the model from "data-driven training" to "data optimization-guided training", providing a standardized basis for data element selection and optimization for subsequent medical AI model training in universities, and forming a reusable and scalable technical system.
[0069] This invention also constructs a corresponding training and optimization system, which modularizes and systematizes the entire process method. Each module is interconnected and functionally independent, realizing full-process automation and intelligence of medical AI model training and optimization in universities. The system has strong scalability and can be fine-tuned according to the medical scenario needs of different universities, making it suitable for the construction and optimization of medical AI models in various universities.
Attached Figure Description
[0070] Figure 1 is a schematic diagram of the method steps of the present invention.
Detailed Implementation
[0071] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0072] As shown in Figure 1, the data-element-oriented method for optimizing the adaptability training of medical AI models in universities includes the following steps:
[0073] S1. Data Element Governance and Preprocessing: Construct a data element governance system for medical data in universities, standardize, classify, and de-identify multi-source heterogeneous medical data in universities, generate standardized data element sets, and retain data traceability information;
[0074] S11. Perform data cleaning on the multi-source heterogeneous medical data from universities, removing redundant, erroneous, and duplicate data, and completing data format unification and grayscale normalization. The grayscale normalization calculation formula is as follows:
[0075] $$G' = \frac{G - G_{\min}}{G_{\max} - G_{\min}}$$
[0076] where $G'$ is the normalized grayscale value, $G$ is the original grayscale value, and $G_{\max}$ and $G_{\min}$ are the maximum and minimum grayscale values in the dataset, respectively.
[0077] Numerical normalization uses the Z-score standardization formula:
[0078] $$z = \frac{x - \mu}{\sigma}$$
[0079] where $z$ is the standardized value, $x$ is the original value, $\mu$ is the mean of the dataset, and $\sigma$ is the standard deviation of the dataset;
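A minimal sketch of the two normalizations in step S11 follows. The handling of constant inputs (returning zeros) and the use of the population standard deviation are assumptions made for the example; the text does not say how these cases are treated.

```python
from statistics import fmean, pstdev

def min_max_normalize(gray_values):
    """Scale grayscale values to [0, 1] using the dataset minimum and maximum."""
    g_min, g_max = min(gray_values), max(gray_values)
    if g_max == g_min:
        return [0.0 for _ in gray_values]  # constant input: assumed convention
    return [(g - g_min) / (g_max - g_min) for g in gray_values]

def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std; the text does not say which one
    if sigma == 0:
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

print(min_max_normalize([0, 64, 128, 255]))
print(z_score([60.0, 70.0, 80.0]))
```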
[0080] S12. Combining the ICD-11 disease classification standard with the needs of university medical scenarios, construct a data element labeling system for university medical data, perform structured labeling on the data, and add scenario labels, hierarchical labels and privacy labels to each data element;
[0081] S13. Based on the privacy label, pseudonymization, data generalization, and local differential privacy techniques are used for differentiated privacy desensitization. The formula for adding local differential privacy noise is:
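[0082] $$\tilde{x} = x + \eta, \quad \eta \sim \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right)$$
[0083] where $\tilde{x}$ is the desensitized data, $x$ is the original data, $\eta$ is noise following a Laplace distribution, $\Delta f$ is the function sensitivity, and $\varepsilon$ is the privacy budget;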
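The Laplace perturbation in step S13 can be sketched as below, assuming the usual Laplace mechanism in which the noise scale equals the sensitivity divided by the privacy budget. The function names are illustrative, and the noise is drawn as the difference of two exponential samples, which is one standard way to sample a Laplace variable with the standard library.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def perturb(value, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon to a numeric value."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)  # seeded only so the demo is repeatable
print(perturb(72.0, sensitivity=1.0, epsilon=0.5, rng=rng))
```

A smaller privacy budget gives a larger noise scale, so elements carrying stricter privacy labels would receive a smaller `epsilon` under this scheme.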
[0084] Encrypt and store the elements with hierarchical labels as the core data, while preserving the traceability index and association relationship of the data elements;
[0085] S14. Classify the desensitized dataset according to the scenario labels, generate standardized data element sets for clinical teaching, scientific research analysis, and auxiliary diagnosis scenarios, and establish a data element governance ledger.
[0086] S2. Data Feature Enhancement: Establish a data feature enhancement module to perform feature alignment, missing value completion, and dimension enhancement on the standardized data element set, generate a feature-enhanced data element set, and simultaneously establish and update the data element feature index library.
[0087] S21. Based on the data element feature index library, the mutual information maximization method is used to perform cross-modal feature alignment on the standardized data element set, realizing the unification of feature spaces for image, text, and test data. The mutual information calculation formula is:
[0088] $$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$
[0089] where $I(X;Y)$ is the mutual information between feature sets $X$ and $Y$, $p(x,y)$ is the joint probability distribution, and $p(x)$ and $p(y)$ are the marginal probability distributions;
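To make the mutual information quantity in step S21 concrete, the sketch below evaluates it for a small discrete joint distribution. It only computes the measure; how the alignment step maximizes it (for example, through a learned encoder) is not described in the text and is not attempted here.

```python
import math

def mutual_information(joint):
    """Mutual information in nats for a joint table joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]          # marginal over y
    py = [sum(col) for col in zip(*joint)]    # marginal over x
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # zero-probability cells contribute nothing
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent))  # no shared information
print(mutual_information(dependent))    # fully aligned features
```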
[0090] S22. Combine medical domain knowledge with generative adversarial networks to complete missing values in the medical data, and use an attention mechanism to enhance the dimensionality of low-dimensional data elements. The attention weights are calculated as follows:
[0091] $$\alpha_i = \frac{\exp(e_i)}{\sum_{j} \exp(e_j)}$$
[0092] $$e_i = W_2 \tanh(W_1 x_i + b)$$
[0093] where $\alpha_i$ is the attention weight of the $i$-th feature, $e_i$ is the attention score, $W_1$ and $W_2$ are learnable weight matrices, $b$ is the bias term, and $x_i$ is the $i$-th feature vector;
[0094] High-dimensional feature vectors are generated by weighted fusion based on the attention weights:
[0095] $$h = W_h \sum_{i} \alpha_i x_i$$
[0096] where $h$ is the high-dimensional feature vector and $W_h$ is the dimension mapping matrix;
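The attention-based dimension enhancement in step S22 might look like the following sketch. It assumes an additive (tanh) scoring function followed by a softmax, and then maps the attention-weighted sum of the feature vectors to a higher dimension with a mapping matrix. All matrices here are small hand-picked examples, not learned parameters.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def attention_weights(features, w1, b, w2):
    """Softmax over additive attention scores, one score per feature vector."""
    scores = []
    for x in features:
        hidden = [math.tanh(h + bias) for h, bias in zip(matvec(w1, x), b)]
        scores.append(sum(w * h for w, h in zip(w2, hidden)))
    top = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(features, alphas, w_map):
    """Attention-weighted sum of the features, mapped to a higher dimension."""
    dim = len(features[0])
    pooled = [sum(a * x[d] for a, x in zip(alphas, features)) for d in range(dim)]
    return matvec(w_map, pooled)

features = [[1.0, 0.0], [0.0, 1.0]]
w1, b, w2 = [[0.5, 0.1], [0.2, 0.3]], [0.0, 0.0], [1.0, -1.0]
alphas = attention_weights(features, w1, b, w2)
print(alphas, fuse(features, alphas, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```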
[0097] S23. The random forest feature importance ranking algorithm is used to filter features in the data, removing invalid and redundant features, and generating a feature-enhanced data element set. The formula for calculating feature importance is as follows:
[0098] $$\mathrm{Imp}_j = \frac{1}{T} \sum_{t=1}^{T} \frac{N_t}{N}\, \Delta I_{t,j}$$
[0099] where $\mathrm{Imp}_j$ is the importance of the $j$-th feature, $T$ is the number of decision trees, $N_t$ is the number of samples in the $t$-th decision tree, $N$ is the total number of samples, and $\Delta I_{t,j}$ is the information gain of feature $j$ in the $t$-th decision tree;
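The importance score in step S23 can be read as a sample-weighted average of per-tree information gain. The sketch below follows that reading; the per-tree gains are supplied directly instead of being computed from a trained forest, and the cutoff used to drop features is an assumed value, since the text gives no threshold.

```python
def feature_importance(trees, total_samples):
    """Average over trees of (samples in tree / total samples) * information gain.

    trees: list of (n_samples_in_tree, {feature_name: information_gain}).
    """
    totals = {}
    for n_samples, gains in trees:
        for feature, gain in gains.items():
            weighted = (n_samples / total_samples) * gain
            totals[feature] = totals.get(feature, 0.0) + weighted
    return {feature: value / len(trees) for feature, value in totals.items()}

def select_features(importance, threshold):
    """Keep features whose importance reaches the (assumed) cutoff."""
    return sorted(f for f, score in importance.items() if score >= threshold)

# Hypothetical gains from two trees trained on 80 and 100 of 100 samples.
trees = [
    (80, {"heart_rate": 0.5, "record_id": 0.0}),
    (100, {"heart_rate": 0.3, "record_id": 0.1}),
]
scores = feature_importance(trees, total_samples=100)
print(scores, select_features(scores, threshold=0.1))
```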
[0100] S24. Add weight labels to each feature in the feature-enhanced data element set. The weight value is determined by the weighted sum of the clinical actual value of the data element and its contribution to model training. Update the feature information to the data element feature index library. The formula for calculating the weight value is:
[0101] $$w = \lambda_1 V_c + \lambda_2 V_m$$
[0102] $$\lambda_1 + \lambda_2 = 1$$
[0103] where $w$ is the feature weight, $\lambda_1$ and $\lambda_2$ are the weighting coefficients for clinical practical value and contribution to model training, respectively, $V_c$ is the clinical expert score (normalized to [0, 1]), and $V_m$ is the contribution to model training (normalized to [0, 1]).
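A small worked example of the weight label in step S24 is given below. It assumes the two weighting coefficients sum to one, so only the clinical coefficient is passed in; that constraint and the parameter names are assumptions made for this sketch.

```python
def feature_weight(clinical_score, training_contribution, lam_clinical):
    """Weighted sum of clinical value and training contribution, both in [0, 1]."""
    for value in (clinical_score, training_contribution, lam_clinical):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs are expected to lie in [0, 1]")
    lam_training = 1.0 - lam_clinical  # assumed: the coefficients sum to one
    return lam_clinical * clinical_score + lam_training * training_contribution

# Expert score 0.8, training contribution 0.4, clinical value weighted at 0.75.
print(feature_weight(0.8, 0.4, lam_clinical=0.75))
```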
[0104] S3. Construction of hierarchical adaptation training network: Design a hierarchical adaptation training network, adopt an encoder-multi-scale scene decoder structure, receive feature-enhanced data element set, and construct a scenario-based loss function in combination with the needs of clinical teaching, scientific research analysis and auxiliary diagnosis scenarios in colleges and universities, and output the initial version of the scene-adapted medical AI model.
[0105] S31. Normalize the feature-enhanced data element set according to the feature weight labels to form a multi-channel tensor as the unified input of the hierarchical adaptation training network.
[0106] S32. Construct a depthwise separable convolutional encoder structure. The formula for calculating depthwise separable convolution is:
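[0107] $$Y = \mathrm{PW}(\mathrm{DW}(X))$$
[0108] where $Y$ is the depthwise separable convolution output, $\mathrm{DW}(\cdot)$ is the depthwise convolution, $\mathrm{PW}(\cdot)$ is the pointwise convolution, and $X$ is the input features;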
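The composition of a depthwise convolution followed by a pointwise convolution can be sketched in one dimension as follows. Real encoders would operate on 2D or 3D tensors in a deep learning framework; the 1D pure-Python version here, with no padding and stride 1, is only meant to show the two stages.

```python
def depthwise_conv1d(x, kernels):
    """Apply one kernel per channel. x: [C][L], kernels: [C][K], no padding."""
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        out.append([
            sum(channel[i + j] * kernel[j] for j in range(k))
            for i in range(len(channel) - k + 1)
        ])
    return out

def pointwise_conv1d(x, weights):
    """Mix channels at each position. weights: [C_out][C_in] (a 1x1 convolution)."""
    length = len(x[0])
    return [
        [sum(w * x[c][i] for c, w in enumerate(row)) for i in range(length)]
        for row in weights
    ]

def depthwise_separable_conv1d(x, kernels, weights):
    """Depthwise stage followed by pointwise stage."""
    return pointwise_conv1d(depthwise_conv1d(x, kernels), weights)

x = [[1, 2, 3, 4], [1, 1, 1, 1]]            # 2 channels, length 4
kernels = [[1, 1], [1, -1]]                 # one length-2 kernel per channel
weights = [[1, 1], [2, 0], [0, 1]]          # 2 input channels -> 3 output channels
print(depthwise_separable_conv1d(x, kernels, weights))
```

With C_in input channels, C_out output channels, and kernel size K, this factorization uses C_in*K + C_in*C_out weights instead of the C_in*C_out*K of a standard convolution, which is the usual motivation for choosing it in an encoder.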
[0109] Global and local detail features of data elements are extracted layer by layer. A skip connection mechanism is introduced to retain shallow, fine-grained feature information. The skip connection feature fusion formula is as follows:
[0110]
[0111] where the terms denote the features of the two layers being fused;
[0112] S33. Construct a multi-scale scene decoder structure, design independent decoding branches for clinical teaching, scientific research analysis, and assisted diagnosis scenarios. Each branch performs targeted feature fusion according to scenario requirements. The multi-scale feature fusion formula is as follows:
[0113] $$F_{ms} = \sum_{k} \omega_k \cdot \mathrm{Up}(F_k)$$
[0114] where $F_{ms}$ is the fused multi-scale feature, $\omega_k$ is the fusion weight of the feature at the $k$-th scale, $\mathrm{Up}(\cdot)$ is the upsampling operation, and $F_k$ is the feature at the $k$-th scale;
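As a rough illustration of weighted multi-scale fusion, the sketch below upsamples each scale to a common resolution and accumulates a weighted sum. Nearest-neighbour upsampling, single-channel 2-D features, and the function names are assumptions made for the example; the text does not state which upsampling operator is used.

```python
def upsample_nearest(feature, out_h, out_w):
    """Nearest-neighbour upsampling of a 2-D feature map to out_h x out_w."""
    in_h, in_w = len(feature), len(feature[0])
    return [[feature[i * in_h // out_h][j * in_w // out_w]
             for j in range(out_w)]
            for i in range(out_h)]


def fuse_multiscale(features, weights, out_h, out_w):
    """Weighted sum of feature maps after bringing them to one resolution."""
    fused = [[0.0] * out_w for _ in range(out_h)]
    for feature, weight in zip(features, weights):
        up = upsample_nearest(feature, out_h, out_w)
        for i in range(out_h):
            for j in range(out_w):
                fused[i][j] += weight * up[i][j]
    return fused
```

A scenario branch could use its own weight vector here, for example giving coarse scales more weight where global context matters most.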
[0115] S34. Construct a scenario-based loss function composed of weighted losses of interpretability, diversity, diagnostic accuracy, and cross-scenario consistency. The formula is as follows:
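[0116] $$L_{total} = \lambda_1 L_{int} + \lambda_2 L_{div} + \lambda_3 L_{acc} + \lambda_4 L_{con}$$
[0117]
[0118] where $L_{int}$ is the interpretability loss, $L_{div}$ is the diversity loss, $L_{acc}$ is the diagnostic accuracy loss, $L_{con}$ is the cross-scenario consistency loss, and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the weighting coefficients of the respective loss terms;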
[0119] Interpretability loss is calculated based on Grad-CAM:
[0120]
[0121] where the terms denote the model attention heatmap, the expert-annotated region, and the number of samples;
[0122] Diversity loss is calculated using feature entropy:
[0123]
[0124] where the terms denote the number of feature categories and the probability of each feature class;
[0125] The diagnostic accuracy loss is calculated using cross-entropy loss:
[0126] $$L_{acc} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$
[0130] where $C$ is the number of diagnostic categories, $y_c$ is the true label, and $\hat{y}_c$ is the probability predicted by the model;
[0131] Cross-scenario consistency loss is calculated using mean squared error:
[0132]
[0133] where the two terms denote the features of a given sample in two different scenes;
[0134] The loss function and the output of the decoding branches for each scenario participate in the supervised training of the network, outputting an initial version of the medical AI model adapted to the scenario.
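A minimal sketch of how the weighted scenario loss might be assembled is given below. Only the two terms with standard textbook forms (cross-entropy and mean squared error) are implemented; the interpretability and diversity terms are passed in as precomputed numbers because their exact formulas are not recoverable here. The function names and coefficient values are illustrative assumptions.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))


def mean_squared_error(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def scenario_loss(losses, coefficients):
    """Weighted sum of named loss terms; every term needs a coefficient."""
    return sum(coefficients[name] * value for name, value in losses.items())


# Illustrative values only; the coefficients are assumptions.
terms = {
    "interpretability": 0.30,   # assumed precomputed
    "diversity": 0.10,          # assumed precomputed
    "accuracy": cross_entropy([0, 1, 0], [0.2, 0.5, 0.3]),
    "consistency": mean_squared_error([1.0, 2.0], [1.0, 4.0]),
}
total = scenario_loss(terms, {"interpretability": 0.2, "diversity": 0.2,
                              "accuracy": 0.4, "consistency": 0.2})
```

Keeping the terms in a named dictionary makes it easy to log each component separately when tuning the coefficients per scenario.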
[0135] S4. Dynamic Feedback Optimization of Data Elements: Establish a dynamic feedback mechanism for data elements to evaluate the quality of data elements in real time during model training, generate data element optimization strategies and feed them back to the data element governance and preprocessing stage. At the same time, based on the model performance feedback, the hyperparameters of the hierarchical adaptation training network are dynamically adjusted by the algorithm.
[0136] S41. Construct a data element quality assessment index system that includes data integrity, annotation accuracy, feature discrimination, and scene matching degree. Perform real-time quantitative evaluation of data elements during the training process. The comprehensive quality scoring formula is:
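[0137] $$Q = \omega_1 D + \omega_2 A + \omega_3 F + \omega_4 S$$
[0138]
[0139] where $Q$ is the comprehensive quality score of the data elements; $\omega_1$ to $\omega_4$ are the indicator weights; and $D$ is data integrity, $A$ is annotation accuracy, $F$ is feature discrimination, and $S$ is scene matching degree (all normalized to [0, 1]);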
[0140] The formula for calculating feature discrimination is:
[0141]
[0142] where the terms denote the mean and standard deviation of each feature class and the number of feature categories;
[0143] S42. If the data element quality assessment result is below the preset threshold, an optimization strategy comprising re-labeling, secondary feature enhancement, and scenario-based data supplementation is generated and fed back to step S1 to complete the closed-loop optimization of data elements.
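The comprehensive quality score and the threshold check of S41 and S42 can be sketched as follows. Equal indicator weights are an assumption made for the example; the default threshold of 0.9 is the value quoted later in the embodiment. The function names are hypothetical.

```python
def quality_score(integrity, annotation_accuracy, discrimination, scene_match,
                  weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of four indicators, each already normalized to [0, 1]."""
    indicators = (integrity, annotation_accuracy, discrimination, scene_match)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("indicator weights must sum to 1")
    if any(not 0.0 <= v <= 1.0 for v in indicators):
        raise ValueError("indicators must be normalized to [0, 1]")
    return sum(w * v for w, v in zip(weights, indicators))


def needs_optimization(score, threshold=0.9):
    """True when the score falls below the preset threshold."""
    return score < threshold
```

For example, `quality_score(1.0, 1.0, 0.8, 0.8)` evaluates to approximately 0.9 under equal weights.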
[0144] S43. Construct a model performance evaluation index system that includes scenario-specific accuracy, recall, F1 score, and cross-scenario adaptability. Monitor the performance of the initial version of the model in real time. The calculation formulas for each index are as follows:
[0145]
[0146]
[0147]
[0148]
[0149] where the terms denote accuracy, recall, and the F1 value; the numbers of true positive, true negative, false positive, and false negative samples; the cross-scenario adaptability; the number of scenes; the predicted label of a sample in a given scene; the true label; and the indicator function;
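For reference, the sketch below computes the standard textbook forms of accuracy, precision, recall, and F1 from confusion-matrix counts. It is an assumption that the formulas of S43 match these standard definitions; the cross-scenario adaptability index is not covered.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics from true/false positive/negative counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Running this per scenario (clinical teaching, scientific research analysis, assisted diagnosis) yields the scenario-specific indicators monitored during training.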
[0150] S44. Based on the model performance evaluation results, the learning rate, batch size, number of convolutional kernels, and other hyperparameters of the hierarchical adaptation training network are dynamically adjusted using the Bayesian optimization algorithm. The objective function of the Bayesian optimization is:
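[0151] $$\theta^{*} = \arg\max_{\theta \in \Theta} f(\theta)$$
[0152] where $\theta^{*}$ is the optimal combination of hyperparameters, $\Theta$ is the hyperparameter set, and $f(\theta)$ is the model performance corresponding to hyperparameters $\theta$, with $f$ the performance mapping function.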
[0153] S5. Privacy-Security Collaborative Training: Construct a privacy-security training framework based on federated learning and differential privacy technology to achieve cross-university collaborative training of medical data elements from multiple universities. While shielding the original information of the data, it integrates the features of cross-university data elements and outputs a preliminary university medical AI adaptability model.
[0154] S51. Construct a federated learning architecture and set up a collaborative training center for medical data elements in universities. Each university acts as a federated node and only uploads the feature model of its local feature-enhanced data element set to the collaborative training center, while the original data is kept locally.
[0155] S52. During local training at each federated node, the privacy budget and the noise coefficient are set dynamically based on the data privacy label, and adaptive differential privacy noise is added. The noise variance is calculated using the following formula:
[0156]
[0157] where the terms denote the parameter update magnitude, the step size, and the failure probability.
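To give a feel for how a privacy budget translates into noise, the sketch below uses the classic Gaussian-mechanism calibration, sigma = sensitivity * sqrt(2 ln(1.25 / delta)) / epsilon. This is a swapped-in textbook formula, not necessarily the noise variance formula of S52, and its standard guarantee is stated for epsilon below 1. The function names are hypothetical.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise standard deviation under the classic Gaussian mechanism.

    Assumption: textbook calibration, whose guarantee holds for 0 < epsilon < 1.
    """
    if epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("epsilon must be > 0 and delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def add_gaussian_noise(update, sigma, rng):
    """Add independent zero-mean Gaussian noise to each parameter update."""
    return [u + rng.gauss(0.0, sigma) for u in update]


# A smaller budget (stricter privacy) produces a larger noise scale.
sigma_strict = gaussian_sigma(sensitivity=1.0, epsilon=0.5, delta=1e-5)
sigma_loose = gaussian_sigma(sensitivity=1.0, epsilon=0.9, delta=1e-5)
noisy = add_gaussian_noise([0.1, -0.2, 0.3], sigma_strict, random.Random(0))
```

In practice the update would be clipped to the assumed sensitivity before noise is added, so that the bound actually holds.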
[0158] S53. The collaborative training center performs consistency verification on the feature models uploaded by each federated node, uses the federated averaging algorithm to complete the global aggregation and update of model parameters, and distributes the updated global model parameters to each federated node. The federated average parameter update formula is:
[0159] $$w = \sum_{k=1}^{K} \frac{n_k}{n} w_k$$
[0160] where $w$ denotes the global model parameters, $K$ is the number of federated nodes, $n_k$ is the number of samples at the $k$-th node, $n$ is the total number of samples, and $w_k$ denotes the local model parameters of the $k$-th node;
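Federated averaging weights each node's parameters by its share of the total sample count. A minimal sketch follows, assuming the parameters of each node are flattened into a plain list of floats; the function name is hypothetical.

```python
def federated_average(local_params, sample_counts):
    """Sample-weighted average of per-node parameter vectors."""
    if len(local_params) != len(sample_counts):
        raise ValueError("one sample count is required per node")
    total = sum(sample_counts)
    dim = len(local_params[0])
    return [
        sum(n * params[d] for params, n in zip(local_params, sample_counts))
        / total
        for d in range(dim)
    ]
```

With two nodes holding 1 and 3 samples, `federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])` gives `[2.5, 3.5]`, so the larger node pulls the global parameters toward its own.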
[0161] S54. Each federated node performs local fine-tuning training based on the global model parameters, repeating the process of "local training - model upload - global aggregation - parameter distribution" until the model converges, outputting a preliminary medical AI adaptability model for universities. The convergence criterion is:
[0162] $$\left\| w^{(t)} - w^{(t-1)} \right\| < \xi$$
[0163] where $w^{(t)}$ denotes the global parameters of the $t$-th round and $\xi$ is the convergence threshold.
[0164] S6. Model Scenario-based Validation and Iteration: The initial medical AI adaptability model for universities will be standardized and validated in different medical scenarios in universities, generating a model adaptability evaluation report. Based on the report, data elements will be supplemented and the model will be iterated and optimized to output the final medical AI adaptability model for universities.
[0165] S61. Construct an independent scenario-based validation dataset covering clinical teaching simulation, scientific research data mining, and clinical auxiliary diagnosis, and the dataset is not used for model training;
[0166] S62. Test the preliminary medical AI adaptability model in universities on the validation dataset, calculate the performance indicators for each scenario and the cross-scenario adaptability indicators, and generate a model adaptability evaluation report. The overall model adaptability score formula is:
[0167]
[0168] where the terms denote the overall adaptability of the model; the weight of scene s; the F1 score, precision, and recall for scene s; and the cross-scene adaptation coefficient for scene s;
[0169] S63. Based on the assessment report, identify the model's performance shortcomings, determine the types and quantities of data elements to be supplemented, and feed this information back to step S1 to complete targeted data element supplementation. The formula for calculating the amount of data supplementation is:
[0170]
[0171] where the terms denote the amount of supplementary data for a given scene, the target performance, the current performance, and the basic data volume;
[0172] S64. Based on the supplemented feature-enhanced data element set, the preliminary model is retrained and fine-tuned to achieve iterative upgrade of the model and output the final medical AI adaptability model for universities.
[0173] S7. Correlation Analysis Between Data Elements and Model Performance: Construct a module for correlation analysis between data elements and model performance. Through association rule mining algorithms, analyze the relationship between data element characteristics, quality and model scenario performance, and generate a data element optimization guide to provide a basis for data element selection and optimization for subsequent training of medical AI models in universities.
[0174] S71. Construct a module for analyzing the correlation between data elements and model performance based on the Apriori algorithm for association rule mining. Use the feature indicators and quality indicators of the data elements as antecedents and the scenario-based performance indicators of the model as consequents, and set a minimum support and a minimum confidence;
[0175] S72. Input the data element indicator data and model performance indicator data into the correlation analysis module to discover strong correlation rules between data elements and model performance. Analyze and clarify the key data element characteristics and quality thresholds that affect the performance of the model in various scenarios. The formulas for calculating the support and confidence of the correlation rules are as follows:
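[0176] $$\mathrm{Support}(X \Rightarrow Y) = \frac{\mathrm{count}(X \cup Y)}{|D|}$$
[0177] $$\mathrm{Confidence}(X \Rightarrow Y) = \frac{\mathrm{count}(X \cup Y)}{\mathrm{count}(X)}$$
[0178] where $X$ is the set of data element feature/quality indicators, $Y$ is the set of model performance indicators, $\mathrm{count}(\cdot)$ is the sample count, and $D$ is the total sample set;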
[0179] Strong association rules that satisfy both the minimum support and the minimum confidence are extracted;
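Support and confidence for a single candidate rule can be computed directly from a list of records, as in the sketch below. The item labels and the minimum support and confidence values are illustrative assumptions; a full Apriori implementation would additionally generate candidate itemsets level by level.

```python
def support(records, itemset):
    """Fraction of records that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for record in records if items <= set(record)) / len(records)


def confidence(records, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    joint = support(records, set(antecedent) | set(consequent))
    base = support(records, antecedent)
    return joint / base if base else 0.0


def is_strong_rule(records, antecedent, consequent, min_support, min_confidence):
    """A rule is strong when it meets both minimum thresholds."""
    joint = support(records, set(antecedent) | set(consequent))
    return (joint >= min_support
            and confidence(records, antecedent, consequent) >= min_confidence)


# Hypothetical discretized indicators, one record per training run.
runs = [
    {"annotation_high", "f1_high"},
    {"annotation_high", "f1_high"},
    {"annotation_high"},
    {"annotation_low"},
]
strong = is_strong_rule(runs, {"annotation_high"}, {"f1_high"}, 0.4, 0.6)
```

Here the rule has support 0.5 and confidence 2/3, so it passes the illustrative thresholds of 0.4 and 0.6.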
[0180] S73. Based on the results of association rule mining and analysis, generate a guide to optimize data elements for training AI models in medical colleges and universities, clarifying the optimal combination of data elements and related technical parameter settings for model training in different scenarios;
[0181] Based on strong correlation rules, data element quality thresholds are determined, and a threshold optimization formula is used:
[0182]
[0183] where the terms denote the optimal threshold for data elements and the model performance target value. A data element optimization guide is then generated.
[0184] A data-driven system for optimizing the adaptation of AI models in medical settings in universities includes:
[0185] The data element governance preprocessing module is used to clean, standardize, classify, de-identify, and contextualize heterogeneous multi-source medical data from universities, generate standardized data element sets, and establish a data element governance ledger. This module includes a data cleaning and normalization unit, a standardization and labeling unit, a privacy de-identification and encryption unit, and a contextual classification and ledger unit.
[0186] The data element feature enhancement module communicates with the data element governance preprocessing module to realize cross-modal feature alignment, missing value completion, dimension enhancement and invalid feature filtering of standardized data element sets, generate feature-enhanced data element sets and update the data element feature index library;
[0187] The hierarchical adaptation training network module communicates with the data feature enhancement module and adopts an encoder-multi-scale scene decoder structure to complete the hierarchical scene-based training of the model and output the initial version of the scene-adapted medical AI model. This module includes an input layer unit, a depthwise separable convolutional encoder unit, a multi-scale scene decoder unit, and a scene-based loss function unit.
[0188] The dynamic feedback optimization module communicates with the data element governance and preprocessing module and the hierarchical adaptation training network module respectively. It is used to evaluate the data element quality and model performance in real time, and realize closed-loop optimization of data elements and adaptive adjustment of training network hyperparameters.
[0189] The privacy-preserving collaborative training module communicates with the hierarchical adaptation training network module to achieve cross-university privacy-preserving collaborative training of medical data elements from multiple universities, and outputs a preliminary university medical AI adaptability model; this module includes a federated learning architecture unit, a differential privacy noise embedding unit, a global model aggregation unit, and a multi-round collaborative training unit.
[0190] The scenario-based verification iteration module communicates with the privacy and security collaborative training module and the data element governance preprocessing module to complete the full-scenario standardized verification of the model and realize data element supplementation and model iterative optimization.
[0191] The data element and model performance correlation analysis module communicates with the above modules to explore the correlation between data elements and model performance and generate data element optimization guidelines.
[0192] The data storage module communicates with all the above modules to store data from the entire process, including datasets, model parameters, evaluation reports, and optimization guidelines, enabling secure data storage and efficient retrieval.
[0193] Specific Implementation Example 1:
[0194] This embodiment uses a medical university in China and its affiliated hospitals and teaching laboratories as the main application. This university has accumulated multi-source heterogeneous medical data, including unstructured electronic medical records, semi-structured medical images, structured test data, and teaching case data, with a total data volume of 8TB. The data covers 12 departments, including internal medicine, surgery, obstetrics and gynecology, and pediatrics. Based on this data, the method of this invention carries out adaptive training and optimization of medical AI models for universities, and finally realizes the construction of medical AI models that are adapted to the needs of medical education and research integration of this medical university.
[0195] S1. Data Element Governance and Preprocessing:
[0196] A dedicated data element governance system adapted to university medical scenarios was constructed, and the standardized processing and secure anonymization of raw data were completed to provide a standardized and traceable basic data element set for subsequent model training. The specific implementation process is as follows:
[0197] S11. Data Cleaning and Normalization: First, the heterogeneous multi-source medical data of the medical university was fully cleaned. A data deduplication algorithm removed approximately 150,000 duplicate electronic medical records and image records. Rule-based validation was used to identify and delete incorrectly entered test indicators, vital signs, and other invalid data. Inconsistent data formats were corrected by converting DICOM image data, TXT medical record data, and Excel test data into a standardized data format. Image data was processed using the grayscale normalization formula, mapping pixel values to the [0, 1] interval, while numerical test data was standardized using the Z-score formula to a mean of 0 and a variance of 1, eliminating the impact of differing data units on subsequent model training.
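The two normalizations mentioned in S11 can be sketched with the standard library. Min-max scaling is assumed here as the form of the grayscale normalization, and the population standard deviation is assumed for the Z-score; the function names are hypothetical.

```python
import statistics


def min_max_normalize(values):
    """Map values linearly onto [0, 1]; constant inputs map to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    return [(v - low) / (high - low) for v in values]


def z_score(values):
    """Standardize values to mean 0 and variance 1 (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


pixels = min_max_normalize([0, 128, 255])   # 8-bit grey levels
lab = z_score([2.0, 4.0, 6.0])              # e.g. one test indicator
```

Both functions guard against constant inputs, which would otherwise cause a division by zero on a blank image or an indicator with no variation.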
[0198] S12. Standardized Structured Annotation: Combining the ICD-11 disease classification standard with the clinical teaching, research analysis, and assisted diagnosis needs of the medical university, a dedicated annotation system for university medical data elements was constructed. Three core tags were added to each data element: scenario tags, hierarchical tags, and privacy tags. Scenario tags were divided into three categories: "clinical teaching," "research analysis," and "assisted diagnosis." Hierarchical tags were divided into three categories based on data importance: "core data," "important data," and "general data." Privacy tags were divided into four categories based on privacy level: "high privacy," "medium privacy," "low privacy," and "non-privacy." A labeling team composed of medical experts and AI algorithm engineers performed structured annotation on the cleaned dataset, completing the annotation of approximately 2 million data elements. After annotation, cross-validation was used to ensure an accuracy rate of over 99%.
[0199] S13. Differentiated Privacy Desensitization and Encrypted Storage: Differentiated privacy desensitization is implemented based on the privacy tags of data elements. For "high privacy" data, pseudonymization combined with local differential privacy is used: sensitive information such as patient names, ID numbers, and hospital numbers is replaced with random pseudonyms, and adaptive noise is added using the local differential privacy noise addition formula with preset parameters. For "medium privacy" data, data generalization is used to generalize the patient's specific age and consultation time into age ranges and consultation months. For "low privacy" and "non-privacy" data, only basic anonymization is performed. Meanwhile, elements categorized as "core data," such as images of difficult cases and core research data, are stored encrypted with the national cryptographic algorithm SM4. During the anonymization and encryption process, a unique traceability index is retained for each data element, recording the data's source, processing procedure, and annotation information to ensure the correlation and traceability of data elements.
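The pseudonymization and generalization steps can be illustrated as below. A keyed HMAC-SHA256 digest stands in for the "random pseudonyms" of the text so that the same patient always maps to the same pseudonym; this substitution, the 10-year age bands, the ISO date format, and the function names are all assumptions. SM4 encryption and the differential privacy noise are not shown.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Deterministic keyed pseudonym (assumption: HMAC-SHA256, truncated)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def generalize_age(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_visit_time(iso_date):
    """Reduce an ISO date 'YYYY-MM-DD' to its month, 'YYYY-MM'."""
    return iso_date[:7]


key = b"demo-key-not-for-production"  # hypothetical key for the example
record = {
    "patient": pseudonymize("hospital-no-000123", key),
    "age": generalize_age(37),
    "visit": generalize_visit_time("2025-04-17"),
}
```

A keyed digest keeps records of one patient linkable for the traceability index while preventing anyone without the key from recomputing the mapping.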
[0200] S14. Scenario-based classification and ledger establishment: The desensitized datasets are classified according to scenario labels, generating standardized data element sets for three major scenarios: clinical teaching, scientific research analysis, and auxiliary diagnosis. The clinical teaching data element set mainly consists of teaching cases and typical medical records, the scientific research analysis data element set mainly consists of multi-center cases and laboratory indicator sequences, and the auxiliary diagnosis data element set mainly consists of clinical diagnostic images and real-time vital signs data. At the same time, a data element governance ledger is established to record information such as the scale, category, quality, processing time, and responsible person of each data element set, realizing full-process management of data elements.
[0201] S2. Data Element Feature Enhancement:
[0202] Based on the standardized data element set generated in step S1, the feature representation capability of the data elements is improved through cross-modal feature alignment, missing value completion, dimensionality enhancement, and feature filtering. This generates a feature-enhanced data element set and updates the data element feature index library. The specific implementation process is as follows:
[0203] S21. Cross-modal feature alignment: Based on a pre-built data element feature index library, feature vectors of three types of cross-modal data (image, text, and test) in the standardized data element set are extracted. The mutual information calculation formula is used to calculate the mutual information value between different modal data. With the maximum mutual information value as the optimization target, the cross-modal features are spatially and semantically aligned to achieve feature space unification of visual features of image data, semantic features of text medical records, and numerical features of test data. This solves the problem of cross-modal feature incompatibility and enables different modal data to participate in model training collaboratively. In this embodiment, the average cross-modal feature mutual information value is increased to over 0.85.
[0204] S22. Missing Value Imputation and Dimensionality Enhancement: Combining knowledge from the clinical medical field, the missing values in the dataset are first classified into two categories: random missing values and non-random missing values. For non-random missing values, a medical rule-based imputation method is used, based on clinical practice guidelines and medical common sense. For random missing values, a Generative Adversarial Network (GAN) is used for imputation. Using high-quality complete data as training samples, the GAN model is trained to generate missing values that conform to the data distribution pattern. After imputation, the missing value rate of the data is reduced from the original 18% to below 3%. Addressing the issue of insufficient dimensionality in small sample data used for scientific research analysis, the dimensionality of low-dimensional data elements is enhanced through attention weight calculation formulas and high-dimensional feature generation formulas, increasing the original 64-dimensional feature vector to 512 dimensions, thereby improving the feature representation capability of the data elements.
[0205] S23. Feature Filtering: The Random Forest Feature Importance Ranking Algorithm is adopted. The feature importance of the dataset after completion and enhancement is calculated using the feature importance calculation formula. The features are ranked according to their importance scores. Invalid and redundant features with scores below a preset threshold (0.05) are removed, such as patient visit serial numbers and non-core diagnostic indicators that are not related to model training. A total of about 30 invalid features are removed, which simplifies the feature space and improves the feature discrimination of data elements.
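The ranking and threshold filtering of S23 can be sketched as follows, assuming the importance scores have already been produced by a random forest. The feature names and scores are invented for illustration; the 0.05 threshold is the value given in the embodiment.

```python
def filter_features(importances, threshold=0.05):
    """Rank features by importance and split them at the threshold."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    kept = [name for name, score in ranked if score >= threshold]
    removed = [name for name, score in ranked if score < threshold]
    return kept, removed


# Hypothetical importance scores for three features.
scores = {"lesion_area": 0.40, "visit_serial_no": 0.01, "wbc_count": 0.20}
kept, removed = filter_features(scores)
```

With these scores, the visit serial number falls below the threshold and is removed, which matches the kind of feature the embodiment describes as invalid.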
[0206] S24. Weight Labeling and Index Update: Weight labels are added to each feature in the feature-enhanced data element set using the feature weight calculation formula. The clinical value is scored by medical experts, and the contribution to model training is calculated from the degree of influence of the feature on the model loss function. The feature information, weight labels, and feature importance scores of the feature-enhanced data element set are updated to the data element feature index library, keeping the index library dynamically up to date and providing a basis for feature selection in subsequent model training.
[0207] S3. Construction of hierarchical adaptive training network:
[0208] A hierarchical adaptation training network based on an encoder-multi-scale scene decoder is used to construct a scene-specific loss function in combination with the differentiated needs of three major medical scenarios in universities. This completes the hierarchical scene-specific training of the model and outputs an initial version of the scene-adapted medical AI model. The specific implementation process is as follows:
[0209] S31. Input data preprocessing: The feature-enhanced data element set generated in step S2 is normalized according to the feature weight labels and grouped by feature modality and type to form a multi-channel tensor as the unified input of the hierarchical adaptation training network, ensuring a consistent feature distribution of the input data and improving the stability of network training.
[0210] S32. Construction of a depthwise separable convolutional encoder: A depthwise separable convolutional encoder structure is constructed using the depthwise separable convolution calculation formula. The encoder has a total of 8 convolutional layers, which extract global features and local detail features of data elements layer by layer, such as the global morphology and local texture features of lesions in image data, and the semantic global features and keyword local features in text data. A skip connection mechanism is introduced into the encoder. The fine-grained features extracted by the shallow convolutional layers are directly passed to the deep convolutional layers through the skip connection feature fusion formula, which avoids feature loss during deep training and preserves the detailed feature information of data elements.
[0211] S33. Construction of a Multi-Scale Scene Decoder: A multi-scale scene decoder structure is constructed. Three independent decoding branches are designed to address the differentiated needs of three scenarios: clinical teaching, scientific research analysis, and assisted diagnosis. Each branch has different convolutional kernel sizes and upsampling rates. Targeted feature fusion is achieved through a multi-scale feature fusion formula. Specifically, the clinical teaching decoding branch enhances feature interpretability, highlighting the logical characteristics of diagnostic and treatment decisions; the scientific research analysis decoding branch enhances feature diversity, preserving multi-dimensional feature information of the data; and the assisted diagnosis decoding branch enhances feature discriminability, highlighting the differences between lesions and normal tissues. The decoder gradually restores the spatial resolution of features through multi-scale upsampling and, combined with the feature information transmitted by the encoder, completes feature reconstruction for each scenario.
[0212] S34. Construction of Scenario-Based Loss Function and Network Training: A scenario-based loss function is constructed using the scenario-based total loss formula, as a weighted sum of interpretability loss, diversity loss, diagnostic accuracy loss, and cross-scenario consistency loss. Each loss term is calculated using its corresponding formula. The scenario-based loss function and the outputs of each scenario decoding branch are used together in the supervised training of the network. The training batch size is set to 32, the initial learning rate to 0.001, and the number of training iterations to 1000. When the loss function converges to a stable value, training is stopped, and the initial version of the scenario-adapted medical AI model is output.
[0213] S4. Dynamic feedback optimization of data elements:
[0214] A dual-dimensional dynamic feedback mechanism for data elements and model performance is established to achieve real-time evaluation and closed-loop optimization of data element quality, as well as adaptive adjustment of training network hyperparameters, thereby improving model training efficiency and performance. The specific implementation process is as follows:
[0215] S41. Real-time Data Element Quality Assessment: A data element quality assessment indicator system is constructed that includes four core indicators: data integrity, annotation accuracy, feature discrimination, and scene matching. The system performs real-time quantitative evaluation using the comprehensive quality scoring formula and the feature discrimination formula, with preset thresholds for each indicator (data integrity ≥ 95%, annotation accuracy ≥ 99%, feature discrimination ≥ 0.8, scene matching ≥ 90%, and comprehensive quality Q ≥ 0.9). During model training, a sliding window method is used to evaluate the data elements within the training batch, completing an evaluation every 10 rounds of training and generating a data element quality evaluation report.
[0216] S42. Data Element Closed-Loop Optimization: If the data element quality assessment result Q is lower than the preset threshold of 0.9, a targeted optimization strategy is generated based on the specific low-performing indicators: if data completeness is insufficient, scenario-based data supplementation is performed; if annotation accuracy is insufficient, a re-annotation process is initiated; if feature discrimination is insufficient, secondary feature enhancement is performed; if scenario matching is insufficient, scenario-based classification of data elements is re-performed. The generated optimization strategy is fed back to step S1 to reprocess the data elements, completing the closed-loop optimization of data elements and ensuring that the data elements participating in model training always maintain high quality.
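The mapping from low-performing indicators to optimization strategies can be sketched as a simple lookup. The per-indicator thresholds and the overall threshold of 0.9 are those stated in S41 of this embodiment; the dictionary keys and function name are hypothetical.

```python
# Thresholds from S41 of this embodiment, expressed as fractions.
THRESHOLDS = {
    "integrity": 0.95,
    "annotation_accuracy": 0.99,
    "feature_discrimination": 0.80,
    "scene_matching": 0.90,
}

# Strategy triggered when the corresponding indicator is insufficient.
STRATEGIES = {
    "integrity": "scenario-based data supplementation",
    "annotation_accuracy": "re-annotation",
    "feature_discrimination": "secondary feature enhancement",
    "scene_matching": "scenario-based reclassification",
}


def select_strategies(indicators, q, q_threshold=0.9):
    """Return strategies for every indicator below its threshold.

    Nothing is returned while the comprehensive score q meets q_threshold.
    """
    if q >= q_threshold:
        return []
    return [STRATEGIES[name]
            for name, minimum in THRESHOLDS.items()
            if indicators[name] < minimum]


plan = select_strategies(
    {"integrity": 0.97, "annotation_accuracy": 0.95,
     "feature_discrimination": 0.70, "scene_matching": 0.93},
    q=0.85,
)
```

The resulting plan lists re-annotation and secondary feature enhancement, which would then be fed back to step S1.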
[0217] S43. Real-time Model Performance Monitoring: Construct a model performance evaluation index system that includes four core indicators: scenario-specific accuracy, recall, F1 score, and cross-scenario adaptability. The training performance of the initial version of the model is monitored in real time through the calculation formula of each indicator. A performance test is completed every 20 rounds of training, and the accuracy, recall, F1 score of the model in the three major scenarios and the adaptability of the model in cross-scenario applications are calculated to comprehensively measure the training effect of the model.
[0218] S44. Adaptive Hyperparameter Adjustment: Based on the model performance evaluation results, the hyperparameters of the hierarchical adaptive training network are dynamically adjusted using the Bayesian optimization algorithm. Guided by the Bayesian optimization objective function, the hyperparameters to be optimized include the learning rate, batch size, and number of convolutional kernels. In this embodiment, after Bayesian optimization, the learning rate is adjusted to 0.0008, the batch size is adjusted to 48, and the number of convolutional kernels is optimized to 128. The convergence speed of model training is improved by about 40%, and the F1 score of the model in each scenario is improved by more than 5%.
[0219] S5. Privacy and Security Collaborative Training:
[0220] By integrating federated learning and differential privacy technologies, a privacy-preserving collaborative training framework was constructed. This framework, in collaboration with three partner universities of the medical university, facilitated cross-university collaborative training of medical data elements. While ensuring data privacy and security, the framework incorporated cross-university data element features to enhance the model's generalization ability and output a preliminary university-adaptive AI model for medical applications. The specific implementation process is as follows:
[0221] S51. Federated Learning Architecture Setup: Construct a collaborative training center for medical data elements from universities, deployed on the university's cloud computing platform as the central node for federated learning; the university and three partner universities are each designated as local nodes for federated learning. Each local node only uploads the feature model trained from its local feature-enhanced data element set to the collaborative training center, while the original medical data remains locally at each university, thus preventing data privacy leaks from the source.
[0222] S52. Differential Privacy Noise Embedding: During local model training at each federated node, the privacy budget and the noise coefficient are set dynamically based on the privacy labels of the data elements, and adaptive differential privacy noise is added through the noise variance calculation formula. A low privacy budget (0.5) and a high noise coefficient are set for "high privacy" data, while a higher privacy budget (1.0 / 1.5) and a low noise coefficient are set for "medium privacy" and "low privacy" data, achieving refined protection of data privacy.
[0223] S53. Global Model Aggregation and Update: The collaborative training center performs consistency checks on the feature models uploaded by each federated node, eliminating abnormal models that do not conform to the model parameter distribution pattern. For feature models that pass the check, the model parameters are globally aggregated and updated using the federated average parameter update formula. Weight coefficients are set according to the data scale and data quality of each node; the larger the data scale and the higher the data quality of the node, the higher the weight coefficient, ensuring the rationality of the global model aggregation. After aggregation is completed, the updated global model parameters are distributed to each federated node.
[0224] S54. Multi-round Collaborative Training and Model Convergence: Starting from the global model parameters issued by the collaborative training center, each federated node performs local fine-tuning on its local feature-enhanced data element set. After training, the updated feature model is uploaded to the collaborative training center again, repeating the cycle of "local training - model upload - global aggregation - parameter distribution". In this embodiment, a total of 15 rounds of cross-university collaborative training are carried out. Convergence is judged by two criteria: when the loss function of the global model converges to a stable value and the parameter deviation between each node's local model and the global model is less than the preset threshold ξ, collaborative training is stopped and a preliminary AI adaptation model for university healthcare is output.
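The stopping rule in S54 combines a stable global loss with a bounded deviation between local and global parameters. A minimal sketch follows, assuming "stable" means that consecutive losses differ by at most a small tolerance over a short window and that deviation is measured as an L2 distance. The window, the tolerance and the value passed for ξ are hypothetical; only the 15-round cap comes from the embodiment.

```python
def loss_is_stable(loss_history, window=3, tol=1e-3):
    """True if the last `window` consecutive loss changes are all within tol."""
    if len(loss_history) < window + 1:
        return False
    recent = loss_history[-(window + 1):]
    return all(abs(a - b) <= tol for a, b in zip(recent, recent[1:]))

def max_deviation(local_models, global_params):
    """Largest L2 distance between any node's local parameters and the global ones."""
    return max(
        sum((l - g) ** 2 for l, g in zip(params, global_params)) ** 0.5
        for params in local_models.values()
    )

def should_stop(round_idx, loss_history, local_models, global_params, xi, max_rounds=15):
    """Stop at the round cap, or earlier once both convergence criteria hold."""
    if round_idx >= max_rounds:
        return True
    return (
        loss_is_stable(loss_history)
        and max_deviation(local_models, global_params) < xi
    )
```

Both criteria must hold at once for an early stop: a flat loss alone does not end training while any node still deviates from the global model by ξ or more.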
[0225] S6. Model Scenario-Based Validation and Iteration:
[0226] A systematic model scenario-based verification and iteration mechanism is constructed. The initial model is verified in all scenarios using independent real-world scenario datasets. Based on the verification results, targeted data element supplementation and model iteration optimization are completed, and the final medical AI adaptability model for universities is output. The specific implementation process is as follows:
[0227] S61. Construction of Independent Scenario-Based Validation Datasets: From the latest medical data of the medical university and its affiliated hospitals, datasets that were not used in model training were selected to construct independent validation datasets covering three major scenarios: clinical teaching simulation, scientific research data mining, and clinical auxiliary diagnosis. The validation datasets contain approximately 500,000 data elements, covering different departments and different disease types, to ensure the objectivity and practicality of the validation results.
[0228] S62. Model Scenario-Based Validation and Evaluation Report Generation: The preliminary AI-adaptive model for university medical settings is tested across all scenarios on a validation dataset. Performance metrics (accuracy, recall, F1 score, interpretability, feature diversity, etc.) and cross-scenario adaptability metrics are calculated for each scenario. The overall adaptability of the model is calculated using the overall model adaptability scoring formula. The model performance is compared with the preset requirements for university medical scenarios to identify performance shortcomings, such as insufficient interpretability of the model in clinical teaching scenarios and low accuracy in diagnosing rare diseases in auxiliary diagnostic scenarios. A model adaptability evaluation report is generated based on the test results to clarify the model's strengths and weaknesses, providing a basis for subsequent iterative optimization.
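The per-scenario evaluation in S62 can be sketched for a binary task as follows. Accuracy, recall and F1 use their standard definitions. The overall adaptability scoring formula is not given in this section, so a weighted average of scenario scores stands in for it as an assumption. The scenario names, weights and requirement values are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def overall_adaptability(scores, weights):
    """Weighted average of per-scenario scores (assumed form of the overall score)."""
    total = sum(weights[s] for s in scores)
    return sum(weights[s] * scores[s] for s in scores) / total

def shortcomings(scores, requirements):
    """Scenarios whose score falls below the preset requirement."""
    return sorted(s for s in scores if scores[s] < requirements[s])

scores = {"teaching": 0.80, "research": 0.90, "diagnosis": 1.00}
weights = {"teaching": 1.0, "research": 1.0, "diagnosis": 1.0}
requirements = {"teaching": 0.85, "research": 0.85, "diagnosis": 0.90}

report = {
    "overall": overall_adaptability(scores, weights),
    "shortcomings": shortcomings(scores, requirements),
}
```

A report of this shape gives the supplementation step a concrete list of under-performing scenarios to act on.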
[0229] S63. Targeted Data Element Supplementation: Based on the model adaptability assessment report, determine the type and quantity of supplemented data elements through the data supplementation calculation formula. To address the issue of insufficient interpretability of the model in clinical teaching scenarios, supplement teaching case data with detailed diagnostic and treatment logic. To address the issue of low accuracy in rare disease diagnosis of the model in auxiliary diagnostic scenarios, collaborate with partner universities to supplement clinical diagnosis and treatment data for rare diseases. Feedback the types and quantities of data elements to be supplemented to step S1 to complete targeted data element governance and preprocessing, and generate a supplemented feature-enhanced data element set.
[0230] S64. Model Iteration Optimization and Final Version Output: Based on the supplemented feature-enhanced data set, the initial medical AI adaptability model for universities is retrained and fine-tuned to optimize the model's network structure and parameter settings, focusing on strengthening the feature extraction and decision-making capabilities corresponding to the model's performance shortcomings. After fine-tuning, the model is tested again on the validation dataset until the model's performance indicators in each scenario and cross-scenario adaptability indicators all meet the preset requirements, realizing the model's iterative upgrade and outputting the final medical AI adaptability model for universities.
[0231] S7. Correlation analysis between data elements and model performance:
[0232] A module for analyzing the correlation between data elements and model performance was constructed to explore the relationship between data element characteristics, quality, and model performance in various scenarios. This module generates a data element optimization guide, providing standardized data element selection and optimization criteria for subsequent training of medical AI models in universities. The specific implementation process is as follows:
[0233] S71. Association Analysis Module Construction: Construct a data element and model performance association analysis module based on the association rule mining algorithm (Apriori). Use the feature indicators of data elements (such as feature dimension, feature discrimination, cross-modal fusion degree) and quality indicators (such as data integrity, annotation accuracy, scene matching degree) as antecedents, and use the scene-specific performance indicators of the model (such as accuracy, interpretability, diversity, cross-scene adaptability) as consequents. Set the minimum support min_sup=0.2 and the minimum confidence min_conf=0.8 as the filtering conditions for association rule mining.
[0234] S72. Association Rule Mining and Analysis: Input all data element index data and model performance index data generated in each step of this invention into the association analysis module, and use the Apriori algorithm to mine association rules. Extract strong association rules between data elements and model performance through support and confidence calculation formulas, such as "data integrity ≥ 95% and annotation accuracy ≥ 99% → accuracy of auxiliary diagnostic scenario model ≥ 92%", "cross-modal feature alignment ≥ 0.9 → cross-scenario adaptability ≥ 88%", "data feature diversity in scientific research analysis ≥ 0.85 → F1 value of scientific research analysis scenario model ≥ 90%", etc. Analyze the mined association rules, and use threshold optimization formulas to identify the key data element features and quality thresholds that affect the performance of the model in each scenario, and determine the optimal combination of data elements for training models in different scenarios.
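The rule mining of S71 and S72 can be sketched as a compact Apriori implementation with min_sup=0.2 and min_conf=0.8. The sketch assumes that each training run has already been discretized into a set of threshold items such as "integrity>=95%", and that rules are kept only when the consequent is a model performance item and the antecedent contains only data element items. The item names and the five sample runs are hypothetical and are not results from the embodiment.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Level-wise Apriori: return {itemset: support} for all frequent itemsets."""
    n = len(transactions)
    support = {}
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in sorted(items)]
    k = 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}
        support.update(kept)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = [
            c for c in candidates
            if all(frozenset(s) in kept for s in combinations(c, k - 1))
        ]
    return support

def mine_rules(transactions, consequents, min_sup=0.2, min_conf=0.8):
    """Rules 'data element items -> one performance item' passing both thresholds."""
    support = frequent_itemsets(transactions, min_sup)
    rules = []
    for itemset, sup in support.items():
        for cons in itemset & consequents:
            antecedent = itemset - {cons}
            if not antecedent or antecedent & consequents:
                continue
            conf = sup / support[antecedent]
            if conf >= min_conf:
                rules.append((sorted(antecedent), cons, sup, conf))
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))

runs = [
    {"integrity>=95%", "annotation>=99%", "diag_acc>=92%"},
    {"integrity>=95%", "annotation>=99%", "diag_acc>=92%"},
    {"integrity>=95%", "annotation>=99%", "diag_acc>=92%"},
    {"integrity>=95%"},
    {"annotation>=99%", "alignment>=0.9"},
]
rules = mine_rules(runs, consequents={"diag_acc>=92%"})
```

On this toy input only the combined rule survives: integrity alone and annotation alone each reach a confidence of 0.75, below min_conf, while the two together reach 1.0. This is the kind of joint threshold that S72 describes.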
[0235] S73. Data Element Optimization Guideline Generation: Based on the results of association rule mining and analysis, a data element optimization guideline for training medical AI models in universities is generated. The guideline clarifies the core data element types, feature requirements, quality thresholds, and optimal combination methods for model training in three major scenarios: clinical teaching, scientific research analysis, and assisted diagnosis. It also outlines key technologies and parameter settings for data element governance and feature enhancement. Furthermore, the guideline standardizes cross-university collaborative sharing of data elements and privacy and security protection, forming a reusable and scalable data element optimization system for medical universities. This provides a standardized basis for data element selection and optimization for subsequent medical AI model training at this medical university and other universities.
[0236] This invention also constructs a data element-oriented medical AI model adaptation training optimization system for universities, modularizing and systematizing the above seven steps. The system includes a data element governance and preprocessing module, a data element feature enhancement module, a hierarchical adaptation training network module, a dynamic feedback optimization module, a privacy and security collaborative training module, a scenario-based verification and iteration module, a data element and model performance correlation analysis module, and a data storage module. The modules interact and collaborate with each other through distributed communication technology. The data storage module adopts a hybrid storage method of cloud storage + local storage to achieve secure storage and efficient retrieval of data such as datasets, model parameters, evaluation reports, and optimization guidelines throughout the entire process.
[0237] After being applied in medical universities and partner universities in this embodiment, the system has achieved full automation and intelligence in the training and optimization of medical AI models in universities. The final medical AI adaptability model constructed in universities has an interpretability of over 90% in clinical teaching scenarios, a feature mining diversity of over 88% in scientific research analysis scenarios, an average diagnostic accuracy of over 92% in assisted diagnosis scenarios, and a cross-scenario adaptability of over 89%. Compared with traditional general medical AI models, the performance indicators of each scenario have been improved by more than 10%, effectively meeting the development needs of the integration of medical education and research in universities.
[0238] Furthermore, the optimization method and system of this invention have strong scalability and versatility. The parameters and network structure of each module can be fine-tuned according to the medical scenario needs, data scale and data characteristics of different universities. It is applicable to the construction and optimization of medical AI models in various medical colleges, medical departments of comprehensive universities and their affiliated medical institutions. It provides a standardized technical solution for the research and development and implementation of medical AI technology in universities, and promotes the resource-based and value-based utilization of medical data elements in universities.
[0239] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A data-element-oriented method for optimizing the adaptability training of university medical AI models, characterized in that it includes the following steps: S1. Data Element Governance and Preprocessing: construct a data element governance system for medical data in universities, standardize, classify, and de-identify multi-source heterogeneous medical data in universities, generate standardized data element sets, and retain data traceability information; S2. Data Element Feature Enhancement: establish a data element feature enhancement module to perform feature alignment, missing value completion, and dimension enhancement on the standardized data element set, generate a feature-enhanced data element set, and simultaneously establish and update the data element feature index library; S3. Construction of Hierarchical Adaptation Training Network: design a hierarchical adaptation training network that adopts an encoder-multi-scale scene decoder structure, receives the feature-enhanced data element set, constructs a scenario-based loss function in combination with the needs of clinical teaching, scientific research analysis and auxiliary diagnosis scenarios in universities, and outputs the initial version of the scene-adapted medical AI model; S4. Dynamic Feedback Optimization of Data Elements: establish a dynamic feedback mechanism for data elements to evaluate the quality of data elements in real time during model training, generate data element optimization strategies and feed them back to the data element governance and preprocessing stage, and at the same time dynamically adjust the hyperparameters of the hierarchical adaptation training network by algorithm based on the model performance feedback; S5. Privacy-Security Collaborative Training: construct a privacy-security training framework based on federated learning and differential privacy technology to achieve cross-university collaborative training of medical data elements from multiple universities, which, while shielding the original information of the data, integrates the features of cross-university data elements and outputs a preliminary university medical AI adaptability model; S6. Model Scenario-based Validation and Iteration: perform standardized validation of the preliminary university medical AI adaptability model in different university medical scenarios, generate a model adaptability evaluation report, supplement data elements and iteratively optimize the model based on the report, and output the final medical AI adaptability model for universities; S7. Correlation Analysis between Data Elements and Model Performance: construct a module for correlation analysis between data elements and model performance, analyze the relationship between data element characteristics, quality and model scenario-based performance through association rule mining algorithms, and generate data element optimization guidelines to provide a basis for data element selection and optimization for subsequent training of medical AI models in universities.
2. The data element-oriented method for optimizing the adaptability training of university medical AI models according to claim 1, characterized in that the specific process of step S1 is as follows: S11. Perform data cleaning on multi-source heterogeneous medical data from universities, remove redundant, erroneous, and duplicate data, and complete data format unification and grayscale normalization; S12. Combining the ICD-11 disease classification standard with the needs of university medical scenarios, construct a data element labeling system for university medical data, perform structured labeling on the data, and add scenario labels, hierarchical labels and privacy labels to each data element; S13. Based on the privacy label, use pseudonymization, data generalization, and local differential privacy technology to perform differentiated privacy desensitization processing, encrypt and store the elements with hierarchical labels as the core data, and retain the traceability index and association relationship of the data elements; S14. Classify the desensitized dataset according to the scenario labels, generate standardized data element sets for clinical teaching, scientific research analysis, and auxiliary diagnosis scenarios, and establish a data element governance ledger.
3. The data element-oriented method for optimizing the adaptability training of university medical AI models according to claim 1, characterized in that the specific process of step S2 is as follows: S21. Based on the data element feature index library, use the mutual information maximization method to perform cross-modal feature alignment on the standardized data element set to achieve feature space unification of image, text and test data; S22. Combine medical domain knowledge with generative adversarial networks to complete missing values in medical data, and use an attention mechanism to enhance the dimensionality of low-dimensional data elements to generate high-dimensional feature vectors; S23. Use the random forest feature importance ranking algorithm to filter the data, remove invalid and redundant features, and generate a feature-enhanced data element set; S24. Add weight labels to each feature in the feature-enhanced data element set, the weight values being determined by the weighted sum of the actual clinical value of the data element and its contribution to model training, and update the feature information to the data element feature index library.
4. The data element-oriented method for optimizing the adaptability training of university medical AI models according to claim 1, characterized in that the specific process of step S3 is as follows: S31. Normalize the feature-enhanced data element set according to the feature weight labels to form a multi-channel tensor as the unified input of the hierarchical adaptation training network; S32. Construct a depthwise separable convolutional encoder structure to extract global features and local detail features of data elements layer by layer, and introduce a skip connection mechanism to retain shallow fine-grained feature information; S33. Construct a multi-scale scene decoder structure, design independent decoding branches for clinical teaching, scientific research analysis, and auxiliary diagnosis scenarios, and perform targeted feature fusion for each branch according to scenario requirements; S34. Construct a scenario-based loss function as a weighted combination of interpretability loss, diversity loss, diagnostic accuracy loss, and cross-scenario consistency loss, which, together with the outputs of each scenario decoding branch, supervises network training to output an initial version of the scenario-adapted medical AI model.
5. The data element-oriented method for optimizing the adaptability training of university medical AI models according to claim 1, characterized in that the specific process of step S4 is as follows: S41. Construct a data element quality assessment index system that includes data integrity, annotation accuracy, feature discrimination, and scene matching, and conduct real-time quantitative assessment of data elements during the training process; S42. If the data element quality assessment result is lower than the preset threshold, generate optimization strategies such as re-labeling, secondary feature enhancement, and scenario-based data supplementation, and feed them back to step S1 to complete the closed-loop optimization of data elements; S43. Construct a model performance evaluation index system that includes scenario-based accuracy, recall, F1 score and cross-scenario adaptability, and monitor the performance of the initial version of the model in real time; S44. Based on the model performance evaluation results, use the Bayesian optimization algorithm to dynamically adjust the hyperparameters of the hierarchical adaptation training network, such as the learning rate, batch size, and number of convolutional kernels.
6. The data element-oriented method for optimizing the adaptability training of university medical AI models according to claim 1, characterized in that the specific process of step S5 is as follows: S51. Construct a federated learning architecture and set up a collaborative training center for medical data elements in universities, in which each university acts as a federated node and only uploads the feature model trained on its local feature-enhanced data element set to the collaborative training center, while the original data is kept locally; S52. During the local training process at each federated node, dynamically set the privacy budget and noise coefficient according to the data privacy label, and add adaptive differential privacy noise; S53. The collaborative training center performs consistency verification on the feature models uploaded by each federated node, uses the federated averaging algorithm to complete the global aggregation and update of model parameters, and distributes the updated global model parameters to each federated node; S54. Each federated node performs local fine-tuning training based on the global model parameters, repeating the process of "local training - model upload - global aggregation - parameter distribution" until the model converges, and a preliminary medical AI adaptability model for universities is output.
7. The data element-oriented method for optimizing the adaptability training of university medical AI models according to claim 1, characterized in that the specific process of step S6 is as follows: S61. Construct an independent scenario-based validation dataset covering clinical teaching simulation, scientific research data mining, and clinical auxiliary diagnosis, the dataset not being used for model training; S62. Test the preliminary medical AI adaptability model on the validation dataset, calculate the performance indicators of each scenario and the cross-scenario adaptability indicators, and generate a model adaptability evaluation report; S63. Based on the evaluation report, identify the shortcomings in model performance, determine the type and quantity of data elements to be supplemented, and feed them back to step S1 to complete the targeted data element supplementation; S64. Based on the supplemented feature-enhanced data element set, retrain and fine-tune the preliminary model to achieve iterative upgrade of the model and output the final medical AI adaptability model for universities.
8. The data element-oriented method for optimizing the adaptability training of university medical AI models according to claim 1, characterized in that the specific process of step S7 is as follows: S71. Construct a data element and model performance correlation analysis module based on the Apriori algorithm for association rule mining, taking the feature indicators and quality indicators of data elements as antecedents and the scenario-based performance indicators of the model as consequents, and setting minimum support and minimum confidence; S72. Input the data element indicator data and model performance indicator data into the correlation analysis module to mine the strong association rules between data elements and model performance, and analyze and clarify the key data element characteristics and quality thresholds that affect the performance of the model in various scenarios; S73. Based on the results of association rule mining and analysis, generate a data element optimization guide for training medical AI models in universities, clarifying the optimal combination of data elements and related technical parameter settings for model training in different scenarios.
9. A data-element-oriented adaptive training and optimization system for university medical AI models, employing the data-element-oriented adaptive training and optimization method for university medical AI models as described in any one of claims 1-8, characterized in that the system includes: a data element governance preprocessing module, used to clean, standardize, label, desensitize, and classify multi-source heterogeneous medical data from universities, generate standardized data element sets, and establish a data element governance ledger; a data element feature enhancement module, which communicates with the data element governance preprocessing module to realize cross-modal feature alignment, missing value completion, dimension enhancement and invalid feature filtering of standardized data element sets, generate feature-enhanced data element sets and update the data element feature index library; a hierarchical adaptation training network module, which communicates with the data element feature enhancement module and adopts an encoder-multi-scale scene decoder structure to complete the hierarchical scene-based training of the model and output the initial version of the scene-adapted medical AI model; a dynamic feedback optimization module, which communicates with the data element governance preprocessing module and the hierarchical adaptation training network module respectively, and is used to evaluate the data element quality and model performance in real time and realize closed-loop optimization of data elements and adaptive adjustment of training network hyperparameters; a privacy and security collaborative training module, which communicates with the hierarchical adaptation training network module to realize cross-university privacy and security collaborative training of medical data elements from multiple universities, and outputs a preliminary medical AI adaptability model for universities; a scenario-based verification iteration module, which communicates with the privacy and security collaborative training module and the data element governance preprocessing module to complete the full-scenario standardized verification of the model and realize data element supplementation and model iterative optimization; a data element and model performance correlation analysis module, which communicates with the above modules to explore the correlation between data elements and model performance and generate data element optimization guidelines; and a data storage module, which communicates with all the above modules to store data from the entire process, including datasets, model parameters, evaluation reports, and optimization guidelines, enabling secure data storage and efficient retrieval.
10. The data element-oriented college medical AI model adaptation training optimization system according to claim 9, characterized in that, The data element governance preprocessing module includes a data cleaning and normalization unit, a standardization and annotation unit, a privacy desensitization and encryption unit, and a scenario-based classification and ledger unit; the hierarchical adaptive training network module includes an input layer unit, a depthwise separable convolutional encoder unit, a multi-scale scene decoder unit, and a scenario-based loss function unit; the privacy and security collaborative training module includes a federated learning architecture unit, a differential privacy noise embedding unit, a global model aggregation unit, and a multi-round collaborative training unit.