An intelligent medical assistance method and system based on big data analysis
Multi-source medical data is converted into a unified-dimension feature representation through big data analysis, and clustering algorithms and a prediction model are then used to construct a disease progression prediction path and generate a risk assessment list. This addresses the difficulty of analyzing multi-source medical data in a unified manner and of dynamically mining cross-source feature associations, and enables efficient disease risk prediction in an intelligent medical assistance system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUNAN XIANGYUE HEALTH TECH CO LTD
- Filing Date
- 2026-04-08
- Publication Date
- 2026-06-16
AI Technical Summary
Existing medical support systems struggle to integrate medical information from multiple sources effectively. As a result, doctors spend a significant amount of time on diagnosis and treatment and cannot quickly obtain comprehensive, consistent reference information, which reduces the comprehensiveness and efficiency of diagnosis.
Using big data analytics, the method acquires multi-source medical data, converts it into a unified feature representation, groups the data with clustering algorithms, extracts co-occurrence feature patterns across data sources, constructs a set of associations, builds a disease progression prediction path on that basis, generates a risk assessment list, and dynamically optimizes the results through a data update module.
The method automatically discovers deep correlations in multi-source heterogeneous data, dynamically predicts disease risk, improves the practicality and accuracy of medical assistance systems, and forms a closed-loop intelligent medical assistance system.
Figure CN121983296B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of medical technology, and in particular discloses an intelligent medical assistance method and system based on big data analysis.
Background Technology
[0002] In the medical field, the application of information and intelligent technologies is becoming a key pillar for improving the quality and efficiency of diagnosis and treatment. Many medical support systems are now attempting to use data to support physician decision-making, but these methods often face deep-seated integration challenges, particularly when processing information from complex sources. This makes it difficult to achieve comprehensive collaboration, resulting in physicians still needing to spend a significant amount of time piecing together and interpreting information during actual diagnosis and treatment.
[0003] A significant drawback of existing solutions is their inability to effectively integrate and deeply analyze medical information from diverse sources and formats. For example, electronic medical records, imaging data, and daily patient monitoring data operate independently, lacking a unified processing mechanism. This makes it difficult for doctors to quickly obtain comprehensive and consistent reference information when faced with complex cases. This deficiency not only increases the workload of medical staff but also risks causing patients to miss the optimal treatment window.
[0004] From a technical standpoint, a core challenge in the medical field is integrating information from multiple sources effectively to form meaningful auxiliary judgments. The primary problem is that this information differs greatly in format and content. For example, text records and images have completely different storage and analysis requirements, which makes them difficult to process within the same framework. At a deeper level, these differences also hinder the discovery of connections between pieces of information. For instance, a test result may be closely related to a patient's long-term lifestyle habits, but because cross-domain information connections are lacking, doctors often cannot detect this hidden relationship in a short time, which affects the comprehensiveness of the diagnosis.
[0005] Therefore, how to achieve unified processing of information from multiple sources in terms of technology, and on this basis, uncover hidden connections, has become a key issue in improving the practicality and accuracy of medical assistance systems.
Summary of the Invention
[0006] This invention provides an intelligent medical assistance method and system based on big data analysis, aiming to solve the technical problem that multi-source medical data is difficult to analyze in a unified manner and cannot dynamically mine cross-source feature associations to predict the risk of disease development.
[0007] One aspect of the present invention relates to an intelligent medical assistance method based on big data analysis, comprising the following steps:
[0008] S100. Acquire multi-source medical data and convert the multi-source medical data into a unified dimension feature representation through a data processing module to obtain a standardized medical dataset.
[0009] S200. Use clustering algorithms to group the features in the standardized medical dataset to obtain multiple feature clusters;
[0010] S300: Extract co-occurrence feature patterns across data sources from multiple feature clusters, and determine them as associations when their statistical correlation exceeds a preset threshold to obtain a set of associations;
[0011] S400. Based on the set of relationships, a prediction model is used to construct a disease development prediction path, and trend indicators are obtained by traversing the path nodes to determine the disease development trend sequence.
[0012] S500: Select nodes whose risk indicators exceed the preset level from the disease development trend sequence as high-risk nodes, and integrate multiple risk factors to calculate potential risk values to generate a risk assessment list.
[0013] S600. Generate a comprehensive report based on the risk assessment list, and use a priority sorting method to determine the display order of each reference item in the comprehensive report;
[0014] S700. Output auxiliary judgment results based on the comprehensive report, and incorporate new data through the data update module to dynamically optimize the auxiliary judgment results.
[0015] Further, step S100 includes:
[0016] S110. Acquire multi-source medical data transmitted from heterogeneous medical information terminals. The multi-source medical data includes structured physiological parameters and unstructured clinical text.
[0017] S120. Convert unstructured clinical text into discrete semantic encoded sequences and align them with structured physiological parameters to form a multimodal raw data matrix.
[0018] S130. Project the multimodal original data matrix using a dimension alignment matrix to obtain a unified high-dimensional feature vector;
[0019] S140. Perform normalization encoding on the unified high-dimensional feature vector to generate a standardized feature representation, and construct a standardized medical dataset based on the standardized feature representation.
[0020] Further, step S200 includes:
[0021] S210. Obtain a standardized medical dataset and calculate the Euclidean distance between features in the standardized medical dataset to determine the feature similarity matrix;
[0022] S220. Construct a feature density space based on the feature similarity matrix and determine the initial cluster centers;
[0023] S230. A clustering algorithm is used to iteratively partition the standardized medical dataset based on the initial cluster centers to obtain initial feature groups;
[0024] S240. If the intra-cluster cohesion of the initial feature group is greater than the preset cohesion threshold and the inter-cluster separation of the initial feature group is less than the preset separation threshold, then a merging operation is performed to obtain multiple feature clusters.
[0025] Further, step S300 includes:
[0026] S310. Obtain the source feature index table constructed from multiple feature clusters;
[0027] S320. Generate a cross-source feature co-occurrence matrix based on the source feature index table;
[0028] S330. Perform frequent pattern mining on the cross-source feature co-occurrence matrix to extract co-occurrence feature patterns across data sources;
[0029] S340. Calculate the statistical correlation of co-occurrence feature patterns across data sources. If the statistical correlation exceeds the preset correlation judgment threshold, the co-occurrence feature patterns across data sources are judged as related relationships to obtain a set of related relationships.
[0030] Further, step S400 includes:
[0031] S410. Extract clinical representation nodes based on the set of association relationships;
[0032] S420. If the state transition probability between clinical representation nodes is greater than the preset transition threshold, then connect the clinical representation nodes to construct a disease progression prediction path.
[0033] S430. Traverse the disease progression prediction path to extract disease stage characteristics, and obtain trend indicators based on disease stage characteristics;
[0034] S440. If the time-series evolution trajectory generated based on the trend indicators conforms to the preset deterioration direction, then the disease development trend sequence is determined.
[0035] Further, step S500 includes:
[0036] S510. Extract abnormal physiological features from the disease progression trend sequence to calculate risk indicators;
[0037] S520. If the risk indicator is greater than the preset risk level threshold, then mark the node corresponding to the risk indicator as a high-risk node.
[0038] S530: Obtain the deterioration rate corresponding to high-risk nodes to extract multiple risk factors;
[0039] S540. Multiple risk factors are fused using a weighting matrix to obtain the potential risk value;
[0040] S550. Sort the potential risk values in descending order to generate a risk assessment list.
[0041] Further, step S600 includes:
[0042] S610. Obtain potential risk values and pathological descriptions from the risk assessment list;
[0043] S620. Search the medical knowledge base based on pathological description information to construct a set of reference items to be sorted;
[0044] S630. Quantitatively analyze the set of reference items to be ranked using potential risk values to obtain severity and urgency values.
[0045] S640. Calculate the ranking weights based on the severity and urgency values;
[0046] S650. Sort the set of reference items to be sorted in descending order according to the sorting weight to obtain an ordered sequence of reference items;
[0047] S660. Determine the display order of each reference item according to the ordered reference item sequence and generate a comprehensive report. The comprehensive report is generated by encapsulating the ordered reference item sequence into a structured template.
[0048] Further, step S700 includes:
[0049] S710. Parse the ordered reference item sequence in the comprehensive report and map the ordered reference item sequence to the diagnosis and treatment suggestion library to generate initial auxiliary judgment results;
[0050] S720: Collect real-time monitoring values associated with the initial auxiliary judgment results and convert them into new data feature vectors;
[0051] S730. Based on the deviation values between the new data feature vector and the benchmark feature vector, the initial auxiliary judgment results are reweighted to obtain the corrected reference item sequence;
[0052] S740. Encapsulate the corrected reference item sequence and output the dynamically optimized auxiliary judgment results.
[0053] Another aspect of the present invention relates to an intelligent medical assistance system based on big data analysis, for performing the above-described intelligent medical assistance method based on big data analysis, comprising:
[0054] The standardized medical dataset acquisition module is used to acquire multi-source medical data and then convert the multi-source medical data into a unified dimension feature representation through a data processing module to obtain a standardized medical dataset.
[0055] A multiple feature cluster acquisition module is used to group features in a standardized medical dataset using a clustering algorithm to obtain multiple feature clusters;
[0056] The association set acquisition module is used to extract co-occurrence feature patterns across data sources from multiple feature clusters, and determine them as associations when their statistical correlation exceeds a preset threshold to obtain an association set;
[0057] The disease development trend sequence determination module is used to construct a disease development prediction path based on a set of association relationships and a prediction model, and to determine the disease development trend sequence by traversing the path nodes and obtaining trend indicators.
[0058] The risk assessment list generation module is used to screen nodes whose risk indicators exceed preset levels from the disease development trend sequence as high-risk nodes, and integrate multiple risk factors to calculate potential risk values to generate a risk assessment list.
[0059] The display order determination module is used to generate a comprehensive report based on the risk assessment list and to determine the display order of each reference item in the comprehensive report using a priority sorting method.
[0060] The auxiliary judgment result output and dynamic optimization module is used to output auxiliary judgment results based on the comprehensive report, and to incorporate new data through the data update module to dynamically optimize the auxiliary judgment results.
[0061] The beneficial effects achieved by this invention are as follows:
[0062] 1. The intelligent medical assistance method and system based on big data analysis provided by this invention addresses the problem that multi-source medical data is difficult to analyze uniformly and cannot dynamically mine cross-source feature associations to predict the risk of disease development. By converting multi-source data into a unified dimension feature representation and performing clustering, co-occurrence feature patterns across data sources are extracted and a set of association relationships is constructed. Based on this set of association relationships, a disease development prediction path is constructed and a trend sequence is obtained. High-risk nodes are screened and multi-factor calculations are performed to generate a risk assessment list. Finally, a comprehensive report is generated based on priority and auxiliary judgment results are output. At the same time, the results are dynamically optimized by incorporating new data, achieving the technical effect of automatically discovering deep associations from multi-source heterogeneous data and dynamically predicting the risk of disease development.
[0063] 2. This invention clarifies the core parameters, structure, and training and verification rules of clustering algorithms and LSTM prediction models, supplements the determination basis and scenario-based adjustment rules for various preset thresholds, discloses the full-process implementation details of data processing and data update modules, constructs a standardized and updatable medical knowledge base / treatment suggestion base, and clarifies semantic matching and data association logic, forming a closed-loop intelligent medical assistance system from data collection to assisted diagnosis. This ensures that those skilled in the art can implement the solution of this invention according to the specification, effectively solving the pain points of low efficiency, data fragmentation, and delayed response in traditional medical assisted diagnosis, and is applicable to intelligent assisted diagnosis and treatment scenarios for various diseases.
Attached Figure Description
[0064] Figure 1 is a flowchart illustrating an embodiment of the intelligent medical assistance method based on big data analysis according to the present invention.
[0065] Figure 2 is a functional block diagram of an embodiment of the intelligent medical assistance system based on big data analysis according to the present invention.
[0066] Explanation of reference numerals:
[0067] 10. Standardized medical dataset acquisition module; 20. Multiple feature cluster acquisition module; 30. Association set acquisition module; 40. Disease development trend sequence determination module; 50. Risk assessment list generation module; 60. Display order determination module; 70. Auxiliary judgment result output and dynamic optimization module.
Detailed Implementation
[0068] To better understand the above technical solutions, the following will provide a detailed explanation of the technical solutions in conjunction with the accompanying drawings and specific implementation methods.
[0069] As shown in Figure 1, the first embodiment of the present invention proposes an intelligent medical assistance method based on big data analysis, including the following steps:
[0070] S100. Acquire multi-source medical data and convert it into a unified-dimensional feature representation through a data processing module to obtain a standardized medical dataset.
[0071] This step involves the collection and standardized integration of multi-source medical data. It utilizes multiple channels, including Hospital Information System (HIS), Laboratory Information System (LIS), Picture Archiving and Communication System (PACS), and wearable device monitoring platforms, to acquire multi-source medical data encompassing structured data (such as laboratory indicators, diagnostic codes, and medication records), semi-structured data (such as medical record texts and examination reports), and unstructured data (such as imaging data and physiological signal waveforms).
[0072] Data processing modules (such as ETL (Extract-Transform-Load) tools or large-scale model semantic parsing engines) are used to process the multi-source medical data uniformly. This includes: data cleaning (removing outlier fields with ≥90% missing values and removing duplicate records), data alignment (unifying timestamps, patient IDs, and terminology encoding), and dimensionality normalization (converting data of different dimensions and formats into fixed-dimensional feature vectors, with the feature vector dimension set between 512 and 1024 and dynamically adjusted according to the type of medical data). The final result is a standardized medical dataset with unified format, consistent semantics, and standardized dimensions, providing a high-quality input foundation for subsequent clustering analysis and association mining.
[0073] Standardized medical datasets are datasets with unified format, unified dimensions, and unified semantics obtained by cleaning, aligning, and normalizing multi-source heterogeneous medical data (structured, semi-structured, and unstructured). They form the basis for subsequent data analysis.
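The cleaning and normalization described above can be illustrated with a minimal sketch. It assumes records are flat dictionaries of numeric fields and uses min-max scaling with mean imputation; the specification does not fix the normalization or imputation method, so those choices, like the function names `drop_sparse_fields`, `drop_duplicates`, and `min_max_normalize`, are illustrative assumptions only.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def drop_sparse_fields(records: List[Record], max_missing: float = 0.9) -> List[Record]:
    """Drop fields whose share of missing (None) values is >= max_missing."""
    if not records:
        return []
    fields = sorted({name for rec in records for name in rec})
    kept = [
        f for f in fields
        if sum(rec.get(f) is None for rec in records) / len(records) < max_missing
    ]
    return [{f: rec.get(f) for f in kept} for rec in records]


def drop_duplicates(records: List[Record]) -> List[Record]:
    """Keep only the first occurrence of each identical record."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # field names are unique, so sorting is safe
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def min_max_normalize(records: List[Record]) -> List[Dict[str, float]]:
    """Scale each field to [0, 1]; missing values take the field mean (assumption)."""
    fields = sorted({name for rec in records for name in rec})
    out: List[Dict[str, float]] = [{} for _ in records]
    for f in fields:
        present = [rec[f] for rec in records if rec.get(f) is not None]
        if not present:
            continue  # nothing to scale for a field with no observed values
        lo, hi = min(present), max(present)
        mean = sum(present) / len(present)
        for rec, row in zip(records, out):
            value = rec.get(f)
            value = mean if value is None else value
            row[f] = 0.0 if hi == lo else (value - lo) / (hi - lo)
    return out
```

A typical call chain would be `min_max_normalize(drop_duplicates(drop_sparse_fields(records)))`, after timestamps, patient IDs, and terminology codes have already been aligned.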
[0074] When converting unstructured clinical text into discrete semantic encoding sequences, a BERT model fine-tuned on a medical domain-specific corpus is used. Clinical semantic understanding is improved through dual-task fine-tuning with masked language modeling (MLM) and named entity recognition (NER). During word segmentation, a medical professional lexicon is loaded to avoid splitting professional terms. The dimension of the encoding sequence is aligned with the structured physiological parameters, and temporal alignment uses timestamps as the key to achieve one-to-one matching, ensuring the effective fusion of multimodal data.
[0075] S200. Use a clustering algorithm to group the features in the standardized medical dataset to obtain multiple feature clusters.
[0076] This step involves clustering and grouping medical features. Based on a standardized medical dataset, the density-based spatial clustering algorithm (DBSCAN) is used to perform unsupervised grouping of high-dimensional feature vectors. The algorithm parameters are set according to the distribution of medical data features and clinical statistical patterns.
[0077] 1. Neighborhood radius (ε): 0.5~0.8 (controls the density range of feature clusters), 0.5~0.6 for structured data of physiological parameters, and 0.7~0.8 for unstructured data of clinical text;
[0078] 2. Minimum number of samples (MinPts): 5~15 (to ensure the statistical significance of the feature clusters), 10~15 for common disease feature clusters, and 5~9 for rare disease feature clusters.
[0079] The algorithm iteratively calculates the similarity between features (e.g., cosine similarity ≥ 0.7), aggregating medical features with high similarity and strong business relevance (e.g., "history of hypertension," "blood pressure fluctuation value," "frequency of antihypertensive drug use") into a feature cluster. Ultimately, 5-20 logically independent feature clusters are generated, each corresponding to a set of medical features (e.g., cardiovascular feature cluster, metabolic feature cluster, respiratory system feature cluster), achieving structured classification of medical features. If the DBSCAN algorithm is unsuitable because of low data density or high noise, an improved K-Means++ algorithm is used as an alternative. The elbow method is used to determine the optimal number of clusters, and clinical feature weights are introduced to prevent meaningless features from dominating the clustering results.
[0080] Feature clusters are sets formed by aggregating highly similar medical features using clustering algorithms. Each feature cluster represents a type of medical feature with business relevance (such as cardiovascular feature clusters, metabolic feature clusters). The preset threshold for intra-cluster cohesion is 0.6, and the preset threshold for inter-cluster separation is 0.3. These thresholds are adjusted according to the data type and disease type. If the initial feature grouping satisfies the condition that intra-cluster cohesion is greater than the preset threshold and inter-cluster separation is less than the preset threshold, a merging operation is performed to obtain the final feature clusters.
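As an illustration of the density-based grouping described in this step, the following is a minimal, dependency-free sketch of DBSCAN using Euclidean distance. It is not the patented implementation: the function names are hypothetical, the neighborhood search is a naive O(n²) scan, and a production system would more likely rely on an optimized library. The `eps` and `min_pts` arguments correspond to the neighborhood radius ε and MinPts parameters above.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1  # label for points that belong to no cluster


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def region_query(points: List[Sequence[float]], i: int, eps: float) -> List[int]:
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, p in enumerate(points) if euclidean(points[i], p) <= eps]


def dbscan(points: List[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Return one cluster label per point; NOISE marks outliers."""
    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return [NOISE if label is None else label for label in labels]
```

With `eps=0.5` and `min_pts=5` (the lower ends of the ranges given above), two tight groups of five feature vectors each would receive labels 0 and 1, and an isolated vector would be labeled as noise.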
[0081] S300: Extract co-occurrence feature patterns across data sources from multiple feature clusters, and determine them as associations when their statistical correlation exceeds a preset threshold to obtain a set of associations.
[0082] This step identifies cross-source feature associations. For the multiple feature clusters generated in step S200, the Apriori association rule mining algorithm and a mutual information entropy calculation are used to extract co-occurrence feature patterns across data sources (such as the co-occurrence of "history of diabetes" and "foot ulcer", and the co-occurrence of "dyslipidemia" and "coronary heart disease diagnosis").
[0083] The preset statistical correlation thresholds are set in conjunction with clinical practice guidelines, statistical analysis of the dataset, and clinical expert consensus. Specifically:
[0084] 1. Mutual information entropy threshold: 0.3~0.5 (representing the strength of the association between features), 0.4~0.5 for common diseases, and 0.3~0.4 for rare diseases;
[0085] 2. Support threshold: 5%~20% (representing the frequency of co-occurrence of features in the dataset), 10%~20% for large-scale sample data, and 5%~9% for small-scale sample data;
[0086] 3. Confidence threshold: 0.6~0.9 (representing the confidence level of the association rule), 0.8~0.9 for diagnostic feature association and 0.6~0.7 for prognostic feature association.
[0087] When the statistical correlation (support × confidence) of co-occurring feature patterns exceeds a preset threshold, a correlation is determined to exist between the two features. All correlations are integrated to generate a correlation set (the set contains 10-50 correlations), clarifying the intrinsic relationships between different medical features and different data sources, and providing core correlation evidence for disease prediction.
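The three quantities named above (support, confidence, and mutual information) can be computed directly from per-patient feature sets. The sketch below is an assumption-laden illustration rather than the Apriori algorithm itself: it only evaluates feature pairs, it identifies a feature's data source by a hypothetical `source:feature` naming convention, and it applies the three thresholds independently while also reporting the support × confidence score.

```python
import math
from itertools import combinations
from typing import List, Set, Tuple


def support(records: List[Set[str]], items: Set[str]) -> float:
    """Fraction of records that contain every feature in `items`."""
    return sum(1 for r in records if items <= r) / len(records)


def confidence(records: List[Set[str]], antecedent: str, consequent: str) -> float:
    """Estimated P(consequent | antecedent)."""
    base = support(records, {antecedent})
    return support(records, {antecedent, consequent}) / base if base else 0.0


def mutual_information(records: List[Set[str]], a: str, b: str) -> float:
    """Mutual information (bits) between the presence indicators of a and b."""
    n = len(records)
    mi = 0.0
    for va in (True, False):
        for vb in (True, False):
            p_ab = sum(1 for r in records if (a in r) == va and (b in r) == vb) / n
            p_a = sum(1 for r in records if (a in r) == va) / n
            p_b = sum(1 for r in records if (b in r) == vb) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def mine_associations(
    records: List[Set[str]],
    min_support: float = 0.05,
    min_confidence: float = 0.6,
    min_mi: float = 0.3,
) -> List[Tuple[str, str, float, float]]:
    """Return (a, b, support, support * confidence) for cross-source pairs that pass."""
    features = sorted({f for r in records for f in r})
    found = []
    for a, b in combinations(features, 2):
        if a.split(":")[0] == b.split(":")[0]:
            continue  # same data source: not a cross-source pattern
        s = support(records, {a, b})
        c = confidence(records, a, b)
        if s >= min_support and c >= min_confidence and mutual_information(records, a, b) >= min_mi:
            found.append((a, b, s, s * c))
    return found
```

The default thresholds are the lower bounds of the ranges listed above; in practice they would be tuned per disease type and sample size as the specification describes.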
[0088] Co-occurrence feature patterns refer to two or more features that appear simultaneously in a medical dataset, reflecting the coexistence patterns between different medical features and serving as the core basis for identifying associations.
[0089] S400. Based on the set of relationships, a prediction model is used to construct a disease development prediction path, and trend indicators are obtained by traversing the path nodes to determine the disease development trend sequence.
[0090] This step involves constructing the disease progression prediction path and analyzing trends. Based on the set of associations, a disease progression prediction model is constructed using a 3-layer stacked Long Short-Term Memory (LSTM) network. Taking a standardized medical dataset as input, the model learns the evolution of medical features over time, constructing a disease progression prediction path in the form of a Directed Acyclic Graph (DAG). The LSTM model's input layer dimension is consistent with the standardized feature vectors. The hidden layers employ a 3-layer stacked structure with Dropout and BatchNorm layers added to prevent overfitting. An attention mechanism is introduced to enhance the capture of high-risk features. The output layer outputs the state transition probabilities of clinical representation nodes.
[0091] The model training dataset requires a single disease sample size of ≥5000 cases, including complete longitudinal follow-up data, and is divided into training, validation, and test sets in a 7:2:1 ratio. The training process uses the Adam optimizer, cross-entropy loss function combined with L2 regularization, and early stopping to avoid overfitting. Model validation is achieved through both algorithm accuracy and clinical effectiveness, ensuring that the prediction results meet both algorithmic requirements and actual clinical needs.
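The 7:2:1 partition and the early-stopping rule mentioned above can be sketched independently of any deep learning framework. The helper names `split_dataset` and `should_stop`, the fixed shuffle seed, and the patience value are assumptions introduced for illustration; the specification does not state how early stopping is parameterized.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    samples: Sequence, ratios: Tuple[int, int, int] = (7, 2, 1), seed: int = 0
) -> Tuple[List, List, List]:
    """Shuffle and split samples into training, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def should_stop(val_losses: Sequence[float], patience: int = 3) -> bool:
    """True if validation loss has not improved during the last `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before
```

For longitudinal follow-up data, splitting by patient rather than by individual visit is usually preferable so that the same patient does not appear in both the training and test sets.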
[0092] In the disease progression prediction path, each node represents a disease state (e.g., "normal blood pressure," "mildly elevated blood pressure," "severely elevated blood pressure"), and directed edges between nodes represent state transition relationships. By traversing all nodes in the disease progression prediction path, trend indicators (e.g., rate of change, fluctuation amplitude, cumulative effect value) corresponding to each node are extracted. Combined with time series analysis methods (e.g., moving average method, exponential smoothing method), an ordered disease progression trend sequence is calculated. This disease progression trend sequence quantitatively reflects the dynamic evolution of the disease from stable to progressive to deteriorating, providing data support for risk assessment.
[0093] When the path is constructed, two clinical representation nodes are connected only if the state transition probability between them is greater than a preset threshold of 0.6 (0.5~0.6 for acute diseases, 0.6~0.7 for chronic diseases). The connected nodes form the disease progression prediction path from which the trend indicators and the disease progression trend sequence are derived.
[0094] The disease progression prediction path is a directed graph structure built upon a set of relationships, reflecting the evolution of the disease from a normal state to different risk states. Path nodes represent disease states, and edges represent state transitions. Trend indicators are used to quantitatively describe the trend of disease progression, including indicator change rate, fluctuation amplitude, and cumulative effect value, and are the core parameters for judging the trend of disease progression.
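The path construction and trend analysis can be sketched without the LSTM itself by treating the state transition probabilities as given inputs. In the sketch below the transition table, the greedy "most probable next state" traversal, and the definition of the cumulative effect value as the summed deviation from the first observation are all assumptions made for illustration; the moving average and exponential smoothing functions follow their standard definitions.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]


def build_path(transitions: Dict[Tuple[str, str], float], threshold: float = 0.6) -> Graph:
    """Keep only directed edges whose transition probability exceeds the threshold."""
    graph: Graph = {}
    for (src, dst), prob in transitions.items():
        if prob > threshold:
            graph.setdefault(src, []).append((dst, prob))
    return graph


def traverse(graph: Graph, start: str) -> List[str]:
    """Follow the most probable outgoing edge until a terminal or repeated state."""
    path, seen, node = [start], {start}, start
    while node in graph:
        nxt = max(graph[node], key=lambda edge: edge[1])[0]
        if nxt in seen:
            break  # guard against cycles
        path.append(nxt)
        seen.add(nxt)
        node = nxt
    return path


def moving_average(values: List[float], window: int) -> List[float]:
    return [sum(values[i - window + 1:i + 1]) / window for i in range(window - 1, len(values))]


def exponential_smoothing(values: List[float], alpha: float) -> List[float]:
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


def trend_indicators(values: List[float]) -> Dict[str, float]:
    """Mean change per step, fluctuation amplitude, and cumulative deviation."""
    changes = [b - a for a, b in zip(values, values[1:])]
    return {
        "change_rate": sum(changes) / len(changes),
        "fluctuation": max(values) - min(values),
        "cumulative": sum(v - values[0] for v in values),  # assumed definition
    }
```

Applied to a blood pressure example, edges such as "normal → mild" with probability 0.7 and "mild → severe" with probability 0.65 survive the 0.6 threshold, and the traversal yields the sequence normal, mild, severe.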
[0095] S500: Select nodes whose risk indicators exceed the preset level from the disease development trend sequence as high-risk nodes, and integrate multiple risk factors to calculate potential risk values to generate a risk assessment list.
[0096] This step involves high-risk node screening and risk value assessment. Based on the disease progression trend sequence, preset risk level thresholds are established from clinical normal reference ranges and statistical patterns in medical time-series data, as follows:
[0097] 1. Indicator deviation threshold: ±15%~±30% (referring to the deviation of the current indicator value from the normal benchmark value), ±15%~±20% for critical care medicine scenarios, and ±25%~±30% for routine physical examination scenarios;
[0098] 2. Trend acceleration threshold: 0.05~0.1 (referring to the degree of increase in the rate of change of the indicator), 0.05~0.07 for malignant tumors and cardiovascular emergencies, and 0.08~0.1 for the stable period of chronic diseases.
[0099] Nodes with risk indicators exceeding preset levels are identified and marked as high-risk nodes. Subsequently, multiple risk factors (such as the rate of disease deterioration, the number of comorbidities, and the severity of risk indicators) are integrated, and the Analytic Hierarchy Process (AHP) is used to calculate the weight of each risk factor. The potential risk value is then calculated using a weighted summation formula (the risk value ranges from 0 to 100 points, with higher scores indicating higher risk).
[0100] Based on the potential risk value, the risk is divided into 5 levels (Level I: 0~20 points, Level II: 21~40 points, Level III: 41~60 points, Level IV: 61~80 points, Level V: 81~100 points), and a risk assessment list is generated that includes the patient ID, high-risk node, risk level, and potential risk value. The length of the list is the same as the number of patients.
[0101] High-risk nodes are those points in the disease progression prediction path where risk indicators exceed preset levels, representing a high-risk disease state that the patient currently faces or is about to face. The potential risk value is a numerical value that quantifies the degree of risk of a patient's condition, calculated by integrating multiple risk factors, and ranges from 0 to 100 points, with higher scores indicating higher risk.
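The screening and scoring logic of this step can be sketched as follows. The AHP weights are approximated with the row geometric-mean method, which is one common way to derive priorities from a pairwise comparison matrix; the specification does not say which AHP variant is used. The function names, the assumption that each risk factor is pre-scaled to 0~100, and the treatment of fractional scores between bands (assigned to the higher band) are illustrative choices.

```python
import math
from typing import List, Sequence, Tuple


def exceeds_deviation(current: float, baseline: float, threshold: float) -> bool:
    """True if the relative deviation from the normal baseline exceeds the threshold."""
    return abs(current - baseline) / abs(baseline) > threshold


def ahp_weights(pairwise: Sequence[Sequence[float]]) -> List[float]:
    """Approximate AHP priority weights using normalized row geometric means."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def potential_risk(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of factor scores (each assumed to lie in 0~100)."""
    return sum(s * w for s, w in zip(scores, weights))


def risk_level(value: float) -> str:
    """Map a 0~100 risk value onto Levels I to V."""
    for upper, level in ((20, "I"), (40, "II"), (60, "III"), (80, "IV"), (100, "V")):
        if value <= upper:
            return level
    raise ValueError("risk value must lie between 0 and 100")


def build_risk_list(patients: Sequence[Tuple[str, float]]) -> List[Tuple[str, float, str]]:
    """Sort (patient_id, risk_value) pairs in descending order and attach the level."""
    ranked = sorted(patients, key=lambda p: p[1], reverse=True)
    return [(pid, value, risk_level(value)) for pid, value in ranked]
```

For example, a pairwise matrix that rates deterioration rate twice as important as comorbidity count and four times as important as indicator severity yields weights of roughly 0.57, 0.29, and 0.14, which then feed the weighted sum.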
[0102] S600. Generate a comprehensive report based on the risk assessment list, and use a priority sorting method to determine the display order of each reference item in the comprehensive report.
[0103] This step involves generating a comprehensive report and prioritizing its contents. Based on the risk assessment list, it integrates the disease progression trend sequence, correlation set, and medical feature cluster information, and generates a comprehensive report according to the standard medical industry report template. The comprehensive report includes core modules such as basic patient information, disease summary, risk assessment results, and intervention recommendations.
[0104] When constructing a set of reference items to be ranked based on pathological description information, the medical knowledge base is built using ontology and a structured database. The data source is authoritative and traceable medical resources, and it is updated through a dual mechanism of automatic and manual updates. The treatment suggestion base is based on the medical knowledge base and associates treatment suggestions according to disease-risk level. The matching between pathological descriptions and the knowledge base adopts a two-layer semantic matching algorithm of medical word vector similarity and ontology relation matching. The matching results are screened in combination with individual patient characteristics to ensure the clinical applicability of the reference items.
[0105] A priority ranking method (such as a weighted ranking method based on information entropy or a clinical importance scoring method) is used to sort the display order of the reference items in the comprehensive report (such as "high-risk node alerts", "potential risk values", and "intervention recommendation priorities"). The ranking rules are as follows:
[0106] 1. Core risk items have the highest priority (ranking weight: 0.3~0.4);
[0107] 2. Disease trend items have the second highest priority (ranking weight: 0.25~0.35);
[0108] 3. Intervention recommendations have the lowest priority (ranking weight: 0.2~0.3).
[0109] This ensures that the information presented in the comprehensive report conforms to the logic of clinical diagnosis and treatment, and that the content most critical to diagnosis and treatment is presented first.
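The ranking rules can be illustrated with a simple weighted sort. This is only a sketch: the concrete weights are picked from inside the stated ranges, and the category keys, the per-item clinical score, and the function name `order_reference_items` are assumptions introduced for illustration.

```python
# Illustrative weights chosen inside the ranges stated in the text (assumption).
CATEGORY_WEIGHT = {
    "core_risk": 0.40,      # range 0.3~0.4
    "disease_trend": 0.30,  # range 0.25~0.35
    "intervention": 0.25,   # range 0.2~0.3
}


def order_reference_items(items):
    """items: list of (title, category, clinical_score in [0, 1]).

    Items are sorted by category weight first and by the hypothetical
    clinical score second, both in descending order.
    """
    return sorted(
        items,
        key=lambda item: (CATEGORY_WEIGHT[item[1]], item[2]),
        reverse=True,
    )
```

Because the weight ranges overlap, any concrete choice of weights has to preserve the order core risk > disease trend > intervention for the rules to hold.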
[0110] S700. Output auxiliary judgment results based on the comprehensive report, and incorporate new data through the data update module to dynamically optimize the auxiliary judgment results.
[0111] This step involves auxiliary judgment output and dynamic optimization. The comprehensive report is transformed into visual auxiliary judgment results (such as risk warning pop-ups, disease trend graphs, and intervention suggestion lists) and output to doctors' terminals, hospital management platforms, or patient-side apps to help clinicians quickly grasp the risks associated with patients' conditions.
[0112] New medical data is collected in real time through a data update module. The collection frequency is set according to the clinical scenario. After standardized cleaning, the new data is used to determine whether to execute a full-process iteration based on a triple trigger mechanism of data volume, time, and deviation. When iteration is triggered, the new data is mixed with 10% of the initial training data for incremental model training. The bottom feature extraction layer is frozen, and only the top network is trained. If new features appear, the bottom neurons are lightly unfrozen. After training, the model parameters, related relationships, prediction paths, and thresholds are updated to achieve real-time dynamic optimization of the auxiliary judgment results. The deviation threshold between the new data feature vector and the baseline feature vector is preset to 0.2 and is adjusted according to the clinical scenario. If the deviation exceeds the threshold, the initial auxiliary judgment results are reweighted to obtain a corrected reference item sequence.
[0113] The data update module is responsible for real-time collection of new medical data, updating of the standardized dataset, and triggering full-process iterative optimization. It is the core of dynamic optimization of the auxiliary judgment results. The S100-S600 full-process iteration is triggered when any one of three thresholds (data volume, time, or deviation) is met. The data volume threshold is reached when the newly added, cleaned data reaches 10% of the sample size of the initial standardized medical dataset, or when the number of newly added valid data points for a single patient reaches 20 per iteration. The time threshold is 24 hours per iteration in ICU scenarios, 72 hours per iteration in general wards, and 30 days per iteration in outpatient follow-ups. The deviation threshold is reached when the deviation between the feature vector of a single patient's new data and the baseline feature vector is ≥0.2. The incremental training method is as follows: the newly added standardized medical data is mixed with 10% of the initial training data to form the incremental training dataset. The bottom feature extraction layers of the LSTM model and clustering algorithm are frozen, and only the top fully connected layer and attention layer are trained. If new clinical representation nodes or feature clusters are included, 20% of the neurons in the bottom feature extraction layer are unfrozen for lightweight training. The incremental training learning rate is 1/10 of the initial training learning rate, the batch size remains the same as in the initial training, and the number of training rounds is 10-20. Early stopping is used (training stops if the validation set loss does not decrease for 3 consecutive rounds). After training, the model parameters, the set of association relationships, the disease progression prediction path, and the risk assessment threshold are updated.
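The triple trigger mechanism can be sketched as a single check that reports which thresholds fired. The threshold values come from the text; the `UpdateState` record, its field names, and the function `fired_triggers` are hypothetical, and the deviation value is assumed to be computed elsewhere because the text does not fix the deviation metric.

```python
from dataclasses import dataclass

# Time thresholds in hours per clinical scenario, as stated in the text.
TIME_THRESHOLD_H = {"icu": 24, "general_ward": 72, "outpatient": 30 * 24}


@dataclass
class UpdateState:
    scenario: str                       # "icu", "general_ward" or "outpatient"
    initial_dataset_size: int           # samples in the initial standardized dataset
    new_cleaned_samples: int            # newly added samples after cleaning
    max_new_points_single_patient: int  # largest per-patient count of new valid points
    hours_since_last_iteration: float
    max_feature_deviation: float        # deviation from baseline, metric left open


def fired_triggers(s: UpdateState, deviation_threshold: float = 0.2) -> list:
    """Return the names of the triggers that fired.

    A non-empty result means the S100-S600 iteration should run,
    since meeting any one threshold is sufficient.
    """
    fired = []
    # Integer arithmetic avoids rounding issues at the 10% boundary.
    if (10 * s.new_cleaned_samples >= s.initial_dataset_size
            or s.max_new_points_single_patient >= 20):
        fired.append("data_volume")
    if s.hours_since_last_iteration >= TIME_THRESHOLD_H[s.scenario]:
        fired.append("time")
    if s.max_feature_deviation >= deviation_threshold:
        fired.append("deviation")
    return fired
```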
[0114] Furthermore, in the intelligent medical assistance method based on big data analysis provided in this embodiment, step S100 includes:
[0115] S110. Acquire multi-source medical data transmitted from heterogeneous medical information terminals. The multi-source medical data includes structured physiological parameters and unstructured clinical text.
[0116] Multi-source medical data is obtained from heterogeneous medical information terminals such as electrocardiogram monitors and electronic medical record systems. For example, structured physiological parameters include values such as heart rate and blood pressure, while unstructured clinical texts are symptoms described in doctors' notes.
[0117] S120. Convert unstructured clinical text into discrete semantic encoded sequences and align them with structured physiological parameters to form a multimodal raw data matrix.
[0118] The original multimodal data matrix is formed using the following formula:
[0119] $X_{\mathrm{raw}} = \left[\, P \;\Vert\; \phi(T) \,\right]$ (1)
[0120] In formula (1), $X_{\mathrm{raw}}$ is the original multimodal data matrix, a unified matrix obtained after aligning the structured parameters with the text embeddings; $P$ is the structured physiological parameter matrix, which stores numerical medical data such as heart rate and blood pressure; $T$ is the collection of unstructured clinical texts, which stores text data such as doctors' notes and symptom descriptions; $\phi(\cdot)$ is the text embedding function, which maps text to a fixed-dimensional vector; and $\Vert$ denotes concatenation of the time-aligned rows. The control logic of formula (1) is the modal fusion and structured expression logic of multi-source heterogeneous medical data. Its core values are: 1. Modal unification: transforming heterogeneous numerical and textual data into a single matrix form to eliminate data type differences; 2. Information integrity: preserving both the quantitative information of physiological parameters and the semantic information of clinical texts; 3. Downstream adaptation: directly serving as input for models such as neural networks and clustering algorithms, improving the compatibility and efficiency of medical data analysis.
[0121] The process of converting unstructured clinical text into discrete semantic encoded sequences is achieved through natural language processing models such as BERT. First, the unstructured clinical text is segmented and embedded to generate vector sequences. Then, it is aligned with structured physiological parameters. For example, the phrase "accelerated heartbeat" mentioned in the unstructured clinical text is matched with heart rate values on the time axis to form a multimodal raw data matrix. In this matrix, rows represent time points and columns represent different modal features, thus ensuring data synchronization.
[0122] Specifically, a BERT model fine-tuned for the medical domain is used for encoding. Its fine-tuning corpus consists of the "Chinese Guidelines for the Diagnosis and Treatment of Clinical Diseases," the "Medical Subject Headings (MeSH)," medical literature from core journals, and electronic medical records from tertiary hospitals (total corpus size ≥ 100G, with clinical semantic annotation completed). Fine-tuning uses a dual-task approach: Masked Language Model (MLM) + Named Entity Recognition (NER). The masking ratio for the MLM task is 15%, and the fine-tuning hyperparameters are: learning rate 5e-5, batch size 16, 10 training epochs, AdamW optimizer, and weight decay coefficient 1e-4. Text segmentation uses Jieba together with a medical professional thesaurus (containing ≥ 50,000 clinical professional terms), in the following steps: ① preprocess the clinical text to remove punctuation, meaningless words, and redundant characters; ② load the medical professional lexicon for Jieba word segmentation; ③ filter out meaningless words with a length <2; ④ tag parts of speech and retain only nouns and verbs as clinically valid words. The dimension of the discrete semantic encoded sequence is aligned with that of the structured physiological parameters (512 / 1024 dimensions) by extracting the output vector of the BERT model and mapping it to the target dimension through a linear mapping layer. Temporal alignment of the text encoding sequences with the structured physiological parameters is based on the patient's clinical examination timestamps, with a unified time granularity of one hour (acute disease) or one day (chronic disease), so that text encodings and physiological parameters with the same timestamp are matched one to one. Missing feature dimensions are filled with zero vectors, and finally the rows are concatenated in chronological order by timestamp to form the multimodal original data matrix.
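The timestamp alignment and zero filling can be sketched as follows. This is a simplified illustration that assumes the text embeddings have already been computed and keyed by timestamp at the chosen granularity; the function name `build_multimodal_matrix` and the dictionary-based input layout are assumptions.

```python
def build_multimodal_matrix(physio, text_vecs, physio_dim, text_dim):
    """Align two modalities on shared timestamps.

    physio / text_vecs: dicts mapping a timestamp to a list of floats.
    Returns (timestamps, matrix) with one row per timestamp, ordered from
    early to late. A modality missing at a timestamp is filled with zeros.
    """
    timestamps = sorted(set(physio) | set(text_vecs))
    matrix = []
    for ts in timestamps:
        p = physio.get(ts, [0.0] * physio_dim)
        e = text_vecs.get(ts, [0.0] * text_dim)
        # Each row is the physiological part followed by the text part.
        matrix.append(list(p) + list(e))
    return timestamps, matrix


# Toy usage with hourly timestamps: heart rate and systolic pressure,
# plus 3-dimensional placeholder text embeddings.
physio = {"2024-01-01T08": [72.0, 118.0], "2024-01-01T09": [95.0, 130.0]}
text_vecs = {"2024-01-01T09": [0.2, -0.1, 0.7], "2024-01-01T10": [0.0, 0.3, 0.1]}
timestamps, X_raw = build_multimodal_matrix(physio, text_vecs, 2, 3)
```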
[0123] S130. Project the multimodal original data matrix onto the dimension alignment matrix to obtain a unified high-dimensional feature vector.
[0124] A unified high-dimensional feature vector is obtained through the following formula:
[0125] $Z = X_{\mathrm{raw}} W_{a} + b_{a}$ (2)
[0126] In formula (2), $Z$ is the unified high-dimensional feature vector, $X_{\mathrm{raw}}$ is the multimodal original data matrix, $W_{a}$ is the dimension alignment matrix, and $b_{a}$ is the dimension alignment bias term. The control logic of formula (2) is the linear projection and dimension alignment logic of multimodal medical data. Its core value is: 1. Modal fusion: mapping heterogeneous numerical and textual features to the same space to achieve effective fusion of multimodal information; 2. Downstream adaptation: the generated unified high-dimensional features can be directly used for downstream tasks such as similarity calculation, clustering, and classification; 3. Expression enhancement: optimizing feature distribution through learnable parameters to improve the distinguishability and modeling effect of medical data.
[0127] A dimension alignment matrix is used to project the original multimodal data matrix to obtain a unified high-dimensional feature vector. Specifically, the dimension alignment matrix can be constructed based on principal component analysis, mapping the semantic dimension of the text sequence and the numerical dimension of the physiological parameters to a common space. For example, assuming the text vector is 300-dimensional and the physiological parameters are 50-dimensional, the concatenated 350-dimensional vector is linearly transformed by a 350×512 dimension alignment matrix to obtain a 512-dimensional unified vector. This helps to fuse multimodal information and reduce information loss, thereby improving the accuracy of subsequent analysis.
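The 350-to-512 projection in this example can be sketched in plain Python. The random initialization below is only a placeholder for a matrix that would in practice be learned or derived from principal component analysis; `make_alignment` and `project` are hypothetical names.

```python
import random


def make_alignment(in_dim, out_dim, seed=0):
    """Placeholder alignment parameters: random W (in_dim x out_dim), zero bias."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    W = [[rng.gauss(0.0, scale) for _ in range(out_dim)] for _ in range(in_dim)]
    b = [0.0] * out_dim
    return W, b


def project(X, W, b):
    """Row-wise linear projection Z = X W + b.

    X has shape T x in_dim, W has shape in_dim x out_dim, b has length out_dim.
    """
    columns = list(zip(*W))  # columns of W, one per output dimension
    return [
        [sum(x * w for x, w in zip(row, col)) + bias
         for col, bias in zip(columns, b)]
        for row in X
    ]


# Two time steps of 350-dimensional input (300 text + 50 physiological).
W, b = make_alignment(350, 512)
Z = project([[0.0] * 350, [1.0] * 350], W, b)
```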
[0128] S140. Perform normalization encoding on the unified high-dimensional feature vector to generate a standardized feature representation, and construct a standardized medical dataset based on the standardized feature representation.
[0129] Normalization encoding is performed using the following formula:
[0130] $\hat{Z} = \dfrac{Z - \mu_{Z}}{\sigma_{Z}}$ (3)
[0131] In formula (3), $\hat{Z}$ is the standardized feature representation, $Z$ is the unified high-dimensional feature vector, $\mu_{Z}$ is the mean of $Z$, and $\sigma_{Z}$ is the standard deviation of $Z$. The control logic of formula (3) is the Z-score standardization processing logic of unified high-dimensional features. Its core value is: 1. Scale normalization: mapping multimodal features to the same numerical distribution interval, ensuring fair weighting of each feature dimension in subsequent analysis; 2. Numerical stability: the mean of the standardized features is 0 and the variance is 1, which improves the numerical stability of model training and similarity calculation; 3. Downstream compatibility: the standardized features can be directly used for Euclidean distance, cosine similarity and other calculations, providing reliable input for clustering and risk assessment.
[0132] The process of performing normalization encoding on a unified high-dimensional feature vector to generate a standardized feature representation includes L2 norm normalization and Z-score normalization. First, the mean and standard deviation are calculated; for example, for each dimension $j$, the component $z_j$ is standardized as $(z_j - \mu_j)/\sigma_j$. Then, based on these representations, a standardized medical dataset is constructed, for example, by storing the processed vectors as dataset entries for training predictive models. In this way, standardized medical datasets can support disease diagnosis applications and improve data utilization efficiency.
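A per-dimension Z-score step can be sketched with the standard library. The use of the population standard deviation and the treatment of constant columns (left at zero) are assumptions, since the text does not specify either; `zscore_columns` is a hypothetical name.

```python
import statistics


def zscore_columns(Z, eps=1e-12):
    """Standardize each feature dimension (column) to zero mean and unit variance.

    Z is a list of equal-length rows. Columns whose standard deviation is
    effectively zero are only centred, which leaves them at 0.
    """
    columns = list(zip(*Z))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(v - m) / (s if s > eps else 1.0) for v, m, s in zip(row, means, stds)]
        for row in Z
    ]
```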
[0133] Preferably, the intelligent medical assistance method based on big data analysis provided in this embodiment includes step S200 as follows:
[0134] S210. Obtain a standardized medical dataset and calculate the Euclidean distance between features in the standardized medical dataset to determine the feature similarity matrix.
[0135] The feature similarity matrix is determined using the following formula:
[0136] $S = \left[ s_{ij} \right]_{N \times N}$ (4)
[0137] In formula (4), $S$ is the feature similarity matrix, $s_{ij}$ is the Euclidean-distance-derived cosine similarity between the $i$-th sample and the $j$-th sample, and $N$ is the total number of samples in the dataset. The control logic of formula (4) is the matrix-based expression logic from standardized medical features to sample similarity. Its core values are: 1. Structured storage: The similarity between pairs of samples is organized into a square matrix, which facilitates efficient retrieval and calculation; 2. Interpretable similarity: The matrix elements directly correspond to the degree of feature similarity between samples, which facilitates the clinical tracing of similar cases; 3. Algorithm compatibility: It can be directly used as input for algorithms such as density clustering and spectral clustering, which improves the efficiency of medical data analysis.
[0138] The similarity $s_{ij}$ between the $i$-th sample and the $j$-th sample is derived from the following formula:
[0139] $s_{ij} = \dfrac{\hat{z}_i^{\top} \hat{z}_j}{\lVert \hat{z}_i \rVert_2 \, \lVert \hat{z}_j \rVert_2}$ (5)
[0140] In formula (5), $\hat{z}_i$ is the standardized feature vector of the $i$-th sample, $\hat{z}_j$ is the standardized feature vector of the $j$-th sample, and $\hat{z}_i^{\top}$ is the transpose of the standardized feature vector of the $i$-th sample. The control logic of formula (5) is the cosine similarity calculation logic between standardized medical feature vectors. Its core value is: 1. Direction priority: Unlike Euclidean distance, cosine similarity focuses more on the similarity of feature distribution patterns rather than numerical differences; 2. Normalization guarantee: Through L2 norm normalization, the similarity is limited to the interval [-1, 1], which is convenient for threshold determination and matrix construction; 3. Strong interpretability: The numerical value intuitively reflects the degree of closeness of feature patterns between samples, which is convenient for clinical understanding of the association between similar cases.
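The cosine similarity of formula (5) and the square matrix of formula (4) can be computed as in the following sketch. The function name `cosine_similarity_matrix` is hypothetical, and assigning a similarity of 0 to all-zero vectors is an assumption made to avoid division by zero.

```python
import math


def cosine_similarity_matrix(Z):
    """Return the N x N matrix of pairwise cosine similarities between rows of Z."""
    norms = [math.sqrt(sum(v * v for v in z)) for z in Z]
    n = len(Z)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            denom = norms[i] * norms[j]
            dot = sum(a * b for a, b in zip(Z[i], Z[j]))
            # Zero vectors have no direction; treat their similarity as 0.
            s = dot / denom if denom else 0.0
            S[i][j] = S[j][i] = s  # the matrix is symmetric
    return S
```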
[0141] For example, a standardized medical dataset may contain gene expression profiles from different patients, with each sample consisting of the expression levels of thousands of genes. Calculating the Euclidean distance between features requires addressing the issue of high-dimensional sparsity. For example, calculating the Euclidean distance between two gene feature vectors A and B requires traversing all dimensions, but many genes have zero expression in specific samples, which may bias the distance calculation towards features with more non-zero values. Therefore, preprocessing can be performed before constructing the feature similarity matrix, such as filtering out genes with extremely low variance across all samples to reduce noise interference. Specifically, each element $s_{ij}$ of the similarity matrix indicates the similarity between gene $i$ and gene $j$, which can be obtained by a formula that maps the Euclidean distance $d_{ij}$ to the range of 0 to 1, yielding a symmetric matrix.
[0142] S220. Construct a feature density space based on the feature similarity matrix and determine the initial cluster centers.
[0143] The initial cluster centers are determined using the following formula:
[0144] $c_k = \arg\min_{c} \sum_{\hat{z}_i \in G_k} \lVert \hat{z}_i - c \rVert_2^2$ (6)
[0145] In formula (6), $c_k$ is the $k$-th initial cluster center, $G_k$ is the sample set of the $k$-th initial feature group, $\hat{z}_i$ is a standardized feature vector in that set, $c$ is the candidate cluster center vector, $\arg\min$ denotes the argument that minimizes the expression, and $\lVert \cdot \rVert_2^2$ is the squared L2 norm. The control logic of formula (6) is the centroid solution logic from the initial feature grouping to the cluster center. Its core value is: 1. Intra-cluster compactness: Solving for the center that minimizes the total dispersion within the cluster, ensuring that the initial cluster center can best represent the feature distribution of the cluster; 2. Algorithm compatibility: It is completely consistent with the center update rules of clustering algorithms such as K-Means, providing a reliable starting point for subsequent iterative clustering; 3. Medical scenario adaptation: Based on standardized medical feature calculation, the center vector can be interpreted as the "typical feature pattern" of this type of case, which is convenient for clinical interpretation.
[0146] The core of constructing a feature density space based on this similarity matrix is to evaluate the density of data points for each feature within its local neighborhood. One implementation is to use the K-nearest neighbor algorithm. For each gene feature, the algorithm identifies the K most similar other genes and calculates the average similarity within that neighborhood as a local density estimate. Initial cluster centers are selected from features whose local density is significantly higher than their neighboring features and which maintain a large distance from even higher-density features. This helps identify gene modules located at the core of the feature space.
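The neighborhood-density idea described above can be sketched as follows. It is a simplified, density-peak-style selection and not necessarily the patented procedure: it assumes similarities lie in [0, 1], uses `1 - similarity` as the distance, and scores each point by density times distance to the nearest denser point. The names `local_density` and `pick_centers` are hypothetical.

```python
def local_density(S, k):
    """Mean similarity of each point to its k most similar other points."""
    densities = []
    for i, row in enumerate(S):
        others = sorted((s for j, s in enumerate(row) if j != i), reverse=True)
        top = others[:k]
        densities.append(sum(top) / len(top))
    return densities


def pick_centers(S, k, n_centers):
    """Pick points that are locally dense and far from any denser point."""
    dens = local_density(S, k)
    n = len(S)
    scores = []
    for i in range(n):
        to_denser = [1.0 - S[i][j] for j in range(n) if dens[j] > dens[i]]
        # The densest point has no denser neighbour; use its largest distance.
        delta = min(to_denser) if to_denser else max(1.0 - S[i][j] for j in range(n))
        scores.append((dens[i] * delta, i))
    best = sorted(scores, reverse=True)[:n_centers]
    return sorted(i for _, i in best)
```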
[0147] S230. A clustering algorithm is used to iteratively divide the standardized medical dataset based on the initial cluster centers to obtain initial feature groups.
[0148] The initial feature grouping is obtained using the following formula:
[0149] $G_k^{(t)} = \left\{ \hat{z}_i \;:\; \lVert \hat{z}_i - c_k^{(t)} \rVert_2 \le \lVert \hat{z}_i - c_l^{(t)} \rVert_2,\ \forall\, l \ne k \right\}$ (7)
[0150] In formula (7), $G_k^{(t)}$ is the $k$-th feature group at the $t$-th iteration, $c_k^{(t)}$ is the $k$-th cluster center at the $t$-th iteration, $c_l^{(t)}$ is the $l$-th cluster center at the $t$-th iteration, $\forall\, l \ne k$ is the exclusion condition, and $\lVert \cdot \rVert_2$ is the L2 norm. The control logic of formula (7) is the iterative clustering allocation logic of standardized medical features. Its core values are: 1. Proximity principle: ensuring that each sample belongs to the cluster with the most similar features, improving the homogeneity within the cluster; 2. Iterative optimization: through multiple iterations of "allocating samples → updating centers", it gradually converges to a stable clustering result; 3. Medical scenario adaptation: the distance is calculated on standardized features to avoid interference from differing units of measure and to accurately divide case groups with similar pathological / physiological features.
[0151] The $k$-th cluster center at the next iteration is derived from the following formula:
[0152] $c_k^{(t+1)} = \dfrac{1}{n_k^{(t)}} \sum_{\hat{z}_i \in G_k^{(t)}} \hat{z}_i$ (8)
[0153] In formula (8), $c_k^{(t+1)}$ is the $k$-th cluster center at iteration $t+1$, and $n_k^{(t)}$ is the number of samples in the $k$-th feature group $G_k^{(t)}$ at the $t$-th iteration. The control logic of formula (8) is the update logic of the cluster centroid in iterative clustering. Its core value is: 1. Intra-cluster compactness: the cluster center is defined as the mean of the samples in the cluster, ensuring that the center can best represent the feature distribution of the cluster and minimize the intra-cluster dispersion; 2. Iterative convergence: each update makes the total squared distance within the cluster monotonically decrease, ensuring that the algorithm eventually converges to a local optimum; 3. Adaptation to medical scenarios: the mean is calculated based on standardized features, and the center vector can be interpreted as the "typical feature pattern" of this type of case, which is convenient for clinical interpretation of the clustering results.
[0154] When using clustering algorithms for iterative partitioning, such as the improved density peak clustering algorithm, this algorithm assigns each remaining feature point to the cluster of its nearest neighbor with higher density, based on the determined initial cluster centers, thus forming initial feature groups, i.e., preliminary gene co-expression modules.
[0155] Specifically, the clustering algorithm used is the DBSCAN algorithm, and its core parameters are set as follows: the neighborhood radius ε is 0.5~0.8, determined based on the cosine similarity of medical features, 0.5~0.6 for structured data of physiological parameters, and 0.7~0.8 for unstructured data of clinical text; the minimum number of samples MinPts is 5~15, referring to the statistical significance requirements of clinical research, 10~15 for clustering common disease features, and 5~9 for clustering rare disease features; if the DBSCAN algorithm is not suitable due to the low density / high noise of medical data, the improved K-Means++ algorithm is used as an alternative, which is suitable for clustering scenarios with convex set characteristics of feature distribution, noise ratio <10%, and sample size ≥100. The number of clusters K is 5~20, determined by the elbow rule, the number of iterations is 100, the convergence threshold is 1e-4, and clinical feature weights (0.7 for pathological indicators and 0.3 for basic physiological features) are introduced based on the initial selection of cluster centers.
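The "allocate samples, then update centers" loop of formulas (7) and (8) can be sketched as a plain K-Means-style iteration. The sketch uses the iteration cap of 100 and the convergence threshold of 1e-4 mentioned for the K-Means++ alternative; it does not implement DBSCAN, the clinical feature weights, or K-Means++ seeding, and the function names are hypothetical.

```python
def assign(points, centers):
    """Put each point into the group of its nearest center (squared L2 distance)."""
    groups = [[] for _ in centers]
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        groups[dists.index(min(dists))].append(p)
    return groups


def update(groups, centers):
    """Replace each center with the mean of its group; keep it if the group is empty."""
    new_centers = []
    for group, center in zip(groups, centers):
        if group:
            new_centers.append([sum(col) / len(group) for col in zip(*group)])
        else:
            new_centers.append(list(center))
    return new_centers


def iterate(points, centers, max_iter=100, tol=1e-4):
    """Alternate assignment and update until the largest center shift is below tol."""
    groups = assign(points, centers)
    for _ in range(max_iter):
        groups = assign(points, centers)
        new_centers = update(groups, centers)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if shift < tol:
            break
    return groups, centers
```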
[0156] S240. If the intra-cluster cohesion of the initial feature group is greater than the preset cohesion threshold and the inter-cluster separation of the initial feature group is less than the preset separation threshold, then a merging operation is performed to obtain multiple feature clusters.
[0157] The intra-cluster cohesion of the initial feature group is obtained by the following formula:
[0158] $C_k = \dfrac{1}{n_k} \sum_{\hat{z}_i \in G_k} \lVert \hat{z}_i - c_k \rVert_2^2$ (9)
[0159] In formula (9), $C_k$ is the intra-cluster cohesion of the $k$-th cluster, $n_k$ is the number of samples in the $k$-th feature group, $G_k$ is the sample set of the $k$-th feature group, $c_k$ is the $k$-th cluster center, and $\lVert \cdot \rVert_2$ is the L2 norm. The control logic of formula (9) is the quantitative evaluation logic of the compactness of samples within the cluster. Its core values are: 1. Compactness measurement: It intuitively reflects the degree of clustering of samples within the cluster through the average squared distance, which is the core indicator of clustering quality; 2. Merging decision basis: When $C_k$ exceeds the preset threshold, it indicates that the samples within the cluster are too scattered and the cluster can be considered for merging with other similar clusters; 3. Medical scenario adaptation: the calculation is based on standardized features, which avoids interference from differing units of measure and accurately assesses the consistency of feature patterns within the case group.
[0160] The inter-cluster separation of the initial feature grouping is obtained by the following formula:
[0161] $D_{kl} = \lVert c_k - c_l \rVert_2^2$ (10)
[0162] In formula (10), $D_{kl}$ is the inter-cluster separation between the $k$-th cluster and the $l$-th cluster, $c_k$ is the $k$-th cluster center, and $c_l$ is the $l$-th cluster center. The control logic of formula (10) is the quantitative evaluation logic of inter-cluster discrimination. Its core value is: 1. Separability measurement: The square of the Euclidean distance between cluster centers intuitively reflects the clarity of the inter-cluster boundary, which is the key basis for judging whether clusters need to be merged; 2. Combination with intra-cluster cohesion: together with $C_k$ (intra-cluster cohesion), it constitutes the "tight inside, loose outside" evaluation system for cluster quality: a loose cluster ($C_k$ large) + low inter-cluster separation ($D_{kl}$ small) → trigger cluster merging; 3. Medical scenario adaptation: the calculation is based on standardized features, which avoids interference from differing units of measure and accurately assesses the differences in feature patterns between different case groups.
[0163] The set of cluster pairs to be merged is obtained using the following formula:
[0164] $\mathcal{M} = \left\{ (k, l) \;:\; C_k > \theta_C,\ D_{kl} < \theta_D,\ k \ne l \right\}$ (11)
[0165] In formula (11), $\mathcal{M}$ is the set of cluster pairs to be merged; $C_k$ is the intra-cluster cohesion of the $k$-th cluster, computed over the $n_k$ samples of the $k$-th feature group; $D_{kl}$ is the inter-cluster separation between the $k$-th and $l$-th clusters; $\theta_C$ is the preset agglomeration threshold, used to determine whether a cluster is loose; and $\theta_D$ is the preset separation threshold, used to determine whether clusters are too close. The control logic of formula (11) is the cluster pair selection logic for optimizing clustering results. Its core value is: 1. Quantitative merging basis: through the agglomeration threshold $\theta_C$ and the separation threshold $\theta_D$, the subjective judgment of "whether to merge" is transformed into an executable quantitative rule; 2. Improve clustering quality: only merge cluster pairs that are "loose inside and close outside" to avoid over-merging or omission of merging, and ensure that the final clustering result is "tight inside and loose outside"; 3. Adapt to medical scenarios: accurately identify case groups with ambiguous pathological features, and form more clinically significant feature clusters after merging.
[0166] Intra-cluster cohesion measures the consistency of gene expression patterns within the same group and can be quantified by calculating the average similarity between all gene pairs within the group. Inter-cluster separation assesses the degree of difference between different groups, such as calculating the average distance between the centroids of different clusters. Preset thresholds need to be set empirically or statistically based on the specific data distribution. If an initial group has excessively high cohesion and excessively low separation, it indicates that these groups originate from the same biological process. Therefore, a merging operation is performed to combine overly similar small clusters into larger feature clusters, ultimately resulting in a stable gene set with more clearly defined biological significance for subsequent pathway analysis or biomarker discovery.
[0167] The pre-set threshold for intra-cluster cohesion is 0.6, with 0.7~0.8 for structured data clustering and 0.6~0.7 for unstructured text data clustering; the pre-set threshold for inter-cluster separation is 0.3, with 0.2~0.3 for common disease feature clustering and 0.3~0.4 for rare disease feature clustering. All thresholds are determined based on the statistical regularity of medical feature associations and clinical expert consensus.
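The merge decision of step S240 can be sketched as below, using the mean squared distance to the center as cohesion and the squared distance between centers as separation. Treating a pair as mergeable when either of its two clusters exceeds the cohesion threshold is an assumption about how unordered pairs are handled, and the function names are hypothetical.

```python
def cohesion(group, center):
    """Mean squared distance of the samples in a group to their center."""
    total = sum(sum((a - b) ** 2 for a, b in zip(x, center)) for x in group)
    return total / len(group)


def separation(center_k, center_l):
    """Squared Euclidean distance between two cluster centers."""
    return sum((a - b) ** 2 for a, b in zip(center_k, center_l))


def merge_pairs(groups, centers, theta_c, theta_d):
    """Return index pairs (k, l), k < l, that satisfy the merge condition.

    A pair qualifies when the centers are closer than theta_d and at least
    one of the two clusters is looser than theta_c (assumption, see lead-in).
    """
    coh = [cohesion(g, c) for g, c in zip(groups, centers)]
    pairs = []
    for k in range(len(groups)):
        for l in range(k + 1, len(groups)):
            close = separation(centers[k], centers[l]) < theta_d
            loose = coh[k] > theta_c or coh[l] > theta_c
            if close and loose:
                pairs.append((k, l))
    return pairs
```

The thresholds passed to `merge_pairs` must be on the same scale as the features; the 0.6 and 0.3 defaults quoted above presuppose standardized features.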
[0168] Furthermore, in the intelligent medical assistance method based on big data analysis provided in this embodiment, step S300 includes:
[0169] S310. Obtain the source feature index table constructed from multiple feature clusters.
[0170] The source feature index table records feature cluster information from different medical data sources. For example, feature cluster A comes from genomics data and contains gene expression features; feature cluster B comes from radiomics data and contains texture features. This source feature index table maps each feature to its corresponding data source and feature cluster.
[0171] S320. Generate a cross-source feature co-occurrence matrix based on the source feature index table.
[0172] The cross-source feature co-occurrence matrix is generated using the following formula:
[0173] $A = \left[ a_{pq} \right]_{m \times m}$ (12)
[0174] In formula (12), $A$ is the cross-source feature co-occurrence matrix, $a_{pq}$ is the co-occurrence frequency of feature $f_p$ and feature $f_q$, and $m$ is the total number of features. The control logic of formula (12) is the matrix expression logic of cross-source medical feature co-occurrence relationship. Its core value is: 1. Structured storage: Organize the complex cross-source feature co-occurrence relationship into a square matrix, which is convenient for efficient retrieval and calculation; 2. Quantifiable association strength: The higher the co-occurrence frequency, the greater the probability that the two features will appear at the same time in the clinical scenario, and the closer the association; 3. Downstream task input: It can be directly used for downstream medical analysis tasks such as feature weighting, association rule mining, and graph neural networks.
[0175] The co-occurrence frequency $a_{pq}$ of feature $f_p$ and feature $f_q$ is derived from the following formula:
[0176] $a_{pq} = \dfrac{n_{pq}}{n_p + n_q - n_{pq}}$ (13)
[0177] In formula (13), $n_p$ is the number of samples in which feature $f_p$ appears, $n_q$ is the number of samples in which feature $f_q$ appears, and $n_{pq}$ is the number of samples in which feature $f_p$ and feature $f_q$ co-occur. The control logic of formula (13) is the quantitative evaluation logic of the co-occurrence relationship of cross-source medical features. Its core value is: 1. Normalized co-occurrence: The influence of the frequency of occurrence of the feature itself is eliminated by the Jaccard coefficient, avoiding the dominance of high-frequency features in the co-occurrence results; 2. Interpretable association strength: It intuitively reflects the "probability of two features occurring at the same time" in the clinical scenario, which is convenient for medical personnel to understand; 3. Cross-source adaptation: It can be applied to both structured physiological parameters and unstructured text features at the same time, and uniformly quantifies the association between multi-source features.
[0178] The construction of the cross-source feature co-occurrence matrix is based on patient sample IDs. For each patient, the features they possess across different data sources are iterated through, and if feature $f_p$ and feature $f_q$ both appear in the same patient's sample, the corresponding position in the cross-source feature co-occurrence matrix is incremented by one. By statistically analyzing all patient samples, a symmetric matrix is ultimately formed, where each element represents the frequency with which features from different sources co-occur in the same individual.
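The per-patient counting followed by Jaccard normalization can be sketched as below. The input layout (a dictionary from patient ID to the set of features observed across all sources), the function name `cooccurrence_matrix`, and setting the diagonal to 1.0 are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_matrix(patient_features):
    """Build a Jaccard-normalized co-occurrence matrix.

    patient_features: dict mapping patient_id to a set of feature names.
    Returns (features, A) where A[p][q] = n_pq / (n_p + n_q - n_pq).
    """
    single, pair = Counter(), Counter()
    for feats in patient_features.values():
        single.update(feats)                         # n_p: samples containing each feature
        pair.update(combinations(sorted(feats), 2))  # n_pq: +1 per co-occurring pair
    features = sorted(single)
    index = {f: i for i, f in enumerate(features)}
    A = [[0.0] * len(features) for _ in features]
    for (f, g), n_fg in pair.items():
        value = n_fg / (single[f] + single[g] - n_fg)
        A[index[f]][index[g]] = A[index[g]][index[f]] = value
    for f in features:
        A[index[f]][index[f]] = 1.0  # a feature always co-occurs with itself
    return features, A
```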
[0179] S330. Perform frequent pattern mining on the cross-source feature co-occurrence matrix to extract co-occurrence feature patterns across data sources.
[0180] The set of co-occurring feature patterns across data sources is derived using the following formula:
[0181] $\mathcal{P} = \left\{ Q \;:\; \operatorname{supp}(Q) \ge s_{\min} \right\}$ (14)
[0182] In formula (14), $\mathcal{P}$ is the set of cross-source co-occurrence feature patterns, $\operatorname{supp}(Q)$ is the support of feature combination $Q$, $s_{\min}$ is the minimum support threshold, and $Q$ is a feature combination. The control logic of formula (14) is the mining and screening logic of frequent patterns of cross-source medical features. Its core value is: 1. Frequent pattern extraction: Automatically identify the feature combinations that frequently appear in clinical practice from the massive cross-source feature co-occurrence; 2. Cross-source knowledge discovery: Link the features of different data sources to mine medical knowledge that is difficult to discover by traditional single-source analysis; 3. Configurable threshold: By adjusting the minimum support threshold, balance the "frequency" and "quantity" of the pattern to adapt to different scenario requirements.
[0183] The support $\operatorname{supp}(Q)$ of feature combination $Q$ is derived from the following formula:
[0184] $\operatorname{supp}(Q) = \dfrac{n(f_1, \dots, f_r)}{N}$ (15)
[0185] In formula (15), $n(f_1, \dots, f_r)$ is the number of samples in which features $f_1$ to $f_r$ of the combination $Q = \{f_1, \dots, f_r\}$ appear together, and $N$ is the total number of samples in the dataset. The control logic of formula (15) is the quantitative evaluation logic of the frequency of cross-source feature combinations. Its core values are: 1. Normalized frequency: It converts the number of co-occurring samples into a proportion, eliminates the influence of the dataset size on the results, and facilitates the comparison between different datasets; 2. Frequency measurement: It directly reflects the probability of feature combinations appearing in clinical practice and is the key basis for judging whether they have medical value; 3. Cross-source compatibility: It can be applied to multiple source features such as physiological parameters and clinical texts at the same time, and uniformly quantifies the co-occurrence frequency of feature combinations of arbitrary length.
[0186] Frequent pattern mining is performed on the cross-source feature co-occurrence matrix to discover feature combinations that frequently co-occur in patient samples across data sources. One implementation is to use the FP.Growth (Frequent Pattern Growth) algorithm. This algorithm does not require generating candidate itemsets; instead, it compresses and stores co-occurrence information by constructing an FP (Frequent Pattern) tree and mines frequent itemsets that meet a minimum support threshold. These frequent itemsets are the cross-data-source co-occurrence feature patterns.
[0187] Cross-source co-occurrence feature patterns refer to feature combinations that frequently and stably co-occur in the same batch of samples (such as patient samples) from multiple data sources.
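The support calculation and threshold screening described for formulas (14) and (15) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it enumerates combinations by brute force as a stand-in for FP-Growth (no FP tree is built), it assumes a pattern is kept when its support is greater than or equal to the threshold, and the function names, the `source:feature` labels, and the 0.5 threshold are hypothetical.

```python
from itertools import combinations

def support(pattern, samples):
    """Fraction of samples that contain every feature in `pattern`."""
    pattern = frozenset(pattern)
    return sum(1 for s in samples if pattern <= s) / len(samples)

def frequent_patterns(samples, min_support, max_size=3):
    """Brute-force stand-in for FP-Growth (assumption: not the FP-tree algorithm).

    Keeps feature combinations that span at least two data sources and whose
    support is >= min_support.
    """
    features = sorted(set().union(*samples))
    result = {}
    for size in range(2, max_size + 1):
        for combo in combinations(features, size):
            # Hypothetical labelling: the text before ':' names the data source.
            if len({f.split(":")[0] for f in combo}) < 2:
                continue
            s = support(combo, samples)
            if s >= min_support:
                result[combo] = s
    return result

# Hypothetical patient samples, each a set of observed features.
samples = [
    frozenset({"gene:P_mutation", "image:ground_glass_nodule", "lab:marker_R_high"}),
    frozenset({"gene:P_mutation", "image:ground_glass_nodule"}),
    frozenset({"image:ground_glass_nodule", "lab:marker_R_high"}),
    frozenset({"gene:P_mutation", "lab:marker_R_high"}),
]
patterns = frequent_patterns(samples, min_support=0.5)
```

With these four samples, each pair of features co-occurs in two samples and the full triple in only one, so only the pairs reach the threshold. On real data an FP-Growth implementation would replace the exhaustive enumeration.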
[0188] S340. Calculate the statistical correlation of the co-occurrence feature patterns across data sources. If the statistical correlation of a pattern exceeds the preset correlation judgment threshold, the pattern is judged to be an association relationship; the patterns so judged form the set of association relationships.
[0189] The set of association relationships is derived using the following formula:
[0190] (16)
[0191] In formula (16), the terms denote the set of association relationships, the statistical correlation, and the preset correlation judgment threshold. Formula (16) expresses the logic for determining the validity of cross-source feature patterns. Its core value is: 1. Denoising and screening: starting from the frequent patterns, randomly co-occurring noise patterns are filtered out by statistical correlation, which improves the reliability of the results; 2. Quantification of correlation strength: the dependency between features is quantified by statistical indicators, making the "correlation" judgment more objective and reproducible; 3. Enhancement of clinical value: only feature patterns that are both common and statistically significant are retained, and these are more likely to correspond to real clinical pathological associations.
[0192] The statistical correlation can be derived from the following formula:
[0193] (17)
[0194] In formula (17), the terms denote the covariance between the two features and the standard deviation of each of the two features. Formula (17) expresses the quantitative evaluation logic for the strength of linear correlation between cross-source medical features. Its core value is: 1. Normalized correlation: dividing by the standard deviations eliminates the influence of feature dimension and scale, so that correlations between different features can be compared directly; 2. Linear correlation measurement: it captures the linear co-variation trend between features, which suits the correlation analysis of continuous medical features such as physiological indicators; 3. Cross-source compatibility: it applies both to structured physiological parameters (such as blood pressure and heart rate) and to numerical clinical text features, and quantifies cross-source correlation uniformly.
[0195] The calculation of statistical correlation aims to assess the strength of the associations within the mined feature patterns, rather than simply their frequency of co-occurrence. For example, for a co-occurrence pattern containing genomic feature X and imaging feature Y, the pointwise mutual information or chi-square test value is calculated. This calculation considers the marginal probabilities of features X and Y occurring individually, as well as their joint probability, and thereby quantifies whether their co-occurrence is significantly higher than expected by chance. The preset correlation threshold is set according to the specific business context. When the calculated correlation metric exceeds this threshold, the co-occurrence feature pattern across data sources is judged to have a statistically significant association and is included in the final association set. This helps to reveal cross-modal biological connections.
[0196] The determination basis and adjustment rules for the preset correlation judgment thresholds are as follows: mutual information entropy threshold: 0.3~0.5, based on the statistical analysis of the correlation strength of medical features and the standard of effective correlation entropy value of variables in "Medical Statistics", with 0.4~0.5 for common diseases and 0.3~0.4 for rare diseases; support threshold: 5%~20%, referring to the frequency requirements of clinical effective patterns, with 10%~20% for large-scale sample data (≥10,000 cases) and 5%~9% for small sample data (<1,000 cases); confidence threshold: 0.6~0.9, referring to the credibility requirements of clinical diagnosis, with 0.8~0.9 for diagnostic feature correlation and 0.6~0.7 for prognostic feature correlation; all thresholds are formulated in combination with clinical diagnosis and treatment guidelines, statistical analysis of datasets and clinical expert consensus.
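The correlation screening of step S340 can be sketched as follows. This is a minimal sketch assuming that formula (17) is the Pearson correlation coefficient (covariance divided by the product of the standard deviations) and that only two-feature patterns are scored; the function names, the sample values, and the 0.4 threshold are hypothetical, and a pointwise mutual information or chi-square measure could be substituted for categorical features.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    if sx == 0 or sy == 0:
        return 0.0  # a constant feature carries no linear association
    return cov / (sx * sy)

def association_set(patterns, columns, threshold):
    """Keep two-feature patterns whose correlation exceeds the threshold."""
    kept = {}
    for first, second in patterns:
        r = pearson(columns[first], columns[second])
        if r > threshold:
            kept[(first, second)] = r
    return kept

# Hypothetical per-patient values for three features.
columns = {
    "blood_pressure": [120, 130, 140, 150],
    "marker_R": [1.0, 1.5, 2.1, 2.4],
    "unrelated": [3, 1, 4, 1],
}
candidate_patterns = [("blood_pressure", "marker_R"), ("blood_pressure", "unrelated")]
associations = association_set(candidate_patterns, columns, threshold=0.4)
```

Here the first pattern rises together almost linearly and is retained, while the second has a negative correlation and is filtered out as a chance co-occurrence.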
[0197] Preferably, in the intelligent medical assistance method based on big data analysis provided in this embodiment, step S400 includes:
[0198] S410. Extract clinical representation nodes based on the set of association relationships.
[0199] The association set includes cross-modal association pairs such as "gene mutation P and radiographic ground-glass nodule feature Q" and "serum biomarker R and electrocardiogram ST segment changes S". Clinical representation nodes are observational indicators with clear clinical significance abstracted from these association pairs. For example, from the association between "gene mutation P and radiographic ground-glass nodule feature Q", "ground-glass nodules carrying the P mutation" can be extracted as a clinical representation node, which integrates molecular and imaging information and represents a specific disease subtype.
[0200] S420. If the state transition probability between clinical representation nodes is greater than the preset transition threshold, then connect the clinical representation nodes to construct a disease progression prediction path.
[0201] The following formula is used to construct a disease progression prediction path:
[0202] (18)
[0203] In formula (18), the terms denote the disease progression prediction path, the state transition probability from one clinical representation node to another, and the preset transition threshold. Formula (18) expresses the time-series evolution of clinical representations as a path. Its core value is: 1. High-probability path extraction: it automatically identifies high-frequency disease evolution patterns from massive clinical time-series data and filters out low-probability noise; 2. Strong interpretability: the path is composed directly of clinical representation nodes, so doctors can intuitively understand the key links in the development of the disease; 3. Predictive assistance: it provides a forward-looking basis for clinical decision-making, allowing intervention in high-risk evolution paths in advance.
[0204] The state transition probability from one clinical representation node to another can be derived from the following formula:
[0205] (19)
[0206] In formula (19), the numerator is the number of samples that move from one clinical representation node to the other, and the denominator is the total number of samples that pass through the starting clinical representation node. Formula (19) expresses the quantitative assessment logic for the probability of the temporal evolution of clinical representations. Its core values are: 1. Conditional probability modeling: it expresses temporal transitions as conditional probabilities to depict the evolutionary pattern of "current representation → subsequent representation"; 2. Frequency estimation: it estimates probabilities directly through sample counting, which is computationally efficient and highly interpretable; 3. Cross-source compatibility: it applies to clinical representations from multiple sources, such as physiological indicators and clinical texts, and uniformly quantifies the strength of temporal transitions.
[0207] The calculation of state transition probability relies on the analysis of longitudinal follow-up data of patients. The intelligent medical assistance system tracks the clinical manifestation node states of a large number of patients at different time points. For example, the proportion of patients whose state "ground-glass nodules carrying P mutations" changes to "elevated serum marker R with increased solid components" in the next follow-up cycle is defined as the state transition probability between these two nodes. If this state transition probability exceeds a preset transition threshold, such as 0.6, the disease evolution path is considered highly typical, and the two nodes are connected to construct a disease development prediction path. In this embodiment, the preset transition threshold is 0.6, determined based on the statistical results of clinical follow-up data; for acute diseases, it is 0.5~0.6, and for chronic diseases, it is 0.6~0.7.
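The counting estimate of formula (19) and the threshold rule of formula (18) can be sketched as follows. This is a minimal sketch under stated assumptions: each patient is represented by a hypothetical ordered list of node labels from successive follow-ups, a patient is counted at most once per node and per transition, and the function names are illustrative.

```python
from collections import Counter

def transition_probabilities(sequences):
    """P(i -> j) = patients moving from i to j / patients passing through i."""
    passed = Counter()   # number of patients whose sequence contains node i
    moved = Counter()    # number of patients with i immediately followed by j
    for seq in sequences:
        for node in set(seq):
            passed[node] += 1
        for pair in set(zip(seq, seq[1:])):
            moved[pair] += 1
    return {(i, j): count / passed[i] for (i, j), count in moved.items()}

def build_path_edges(probabilities, threshold=0.6):
    """Connect two nodes only when the transition probability exceeds the threshold."""
    return sorted(edge for edge, p in probabilities.items() if p > threshold)

A = "ground-glass nodules carrying P mutation"
B = "elevated serum marker R with increased solid components"
C = "enlarged lymph nodes"

# Hypothetical follow-up sequences for five patients.
sequences = [[A, B, C], [A, B], [A, B, C], [A, B], [A, C]]
probabilities = transition_probabilities(sequences)
edges = build_path_edges(probabilities, threshold=0.6)
```

In this toy cohort four of the five patients passing through A move to B (0.8), so only the A to B edge exceeds the 0.6 threshold and enters the prediction path.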
[0208] S430. Traverse the disease progression prediction path to extract disease stage characteristics, and obtain trend indicators based on the disease stage characteristics.
[0209] Trend indicators are derived using the following formula:
[0210] (20)
[0211] In formula (20), the terms denote the trend indicator, the weighting coefficients, the change in the disease stage characteristics, and the duration of the disease stage. Formula (20) expresses the smoothed trend assessment logic for the time-series changes of the disease stage characteristics. Its core value is: 1. Smoothing filtering: the trend indicator smooths out fluctuations of a single feature, avoiding trend misjudgment caused by noise; 2. Adjustable weights: the weighting coefficients can be tuned to the scenario, for example weighted toward capturing abrupt changes in emergency scenarios and adjusted downward for chronic disease management; 3. Temporal continuity: it conforms to the gradual pattern of disease evolution and is better suited to helping doctors judge the long-term course of the disease.
[0212] By traversing a complete disease progression prediction path, the system extracts disease stage characteristics that characterize the disease's evolution. For example, a disease progression prediction path might sequentially pass through nodes such as "ground-glass nodules," "increased solid components," and "enlarged lymph nodes." The intelligent medical assistance system analyzes the evolution rate, order, and severity gradient of the features represented by these nodes, thereby obtaining quantitative indicators such as the "invasive progression trend index." This trend index is used to synthesize a virtual time-series evolution trajectory, the direction of which is correlated with severity.
[0213] S440. If the time-series evolution trajectory generated based on the trend indicators conforms to the preset deterioration direction, then the disease development trend sequence is determined.
[0214] The disease progression trend sequence can be determined using the following formula:
[0215] (21)
[0216] In formula (21), the terms denote the disease progression trend sequence and the direction of change of the trend indicator. Formula (21) expresses the directional screening logic for high-risk disease evolution paths. Its core values are: 1. Risk focus: it extracts the paths that point toward deterioration from all predicted paths and avoids interference from irrelevant information; 2. Early warning guidance: it directly serves clinical risk early warning and helps doctors intervene in high-risk disease courses in advance; 3. Strong interpretability: the paths and trend directions in the sequence correspond one-to-one, which makes them easy for doctors to understand and verify.
[0217] The intelligent medical assistance system pre-sets deterioration direction templates such as "from local to spread" and "from inert to invasive". If the generated trajectory has a high degree of matching with the pre-set deterioration direction template in terms of morphology, for example, if the angle between its principal component analysis direction and the pre-set deterioration vector direction is less than 10 degrees, then the path is determined to constitute an effective disease development trend sequence for prognostic early warning.
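The 10-degree matching rule above can be sketched as follows. This is a minimal sketch, assuming the trajectory is a list of points in a hypothetical two-dimensional trend-indicator space and that the principal component is oriented along the direction of travel (first point to last point); the power-iteration routine and all names are illustrative choices, not taken from the patent.

```python
import math

def principal_direction(points, iterations=100):
    """Leading principal component of a trajectory, found by power iteration."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    centered = [[p[k] - mean[k] for k in range(d)] for p in points]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)] for a in range(d)]
    # Assumption: start from the net displacement so the sign follows the direction of travel.
    v = [points[-1][k] - points[0][k] for k in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def angle_degrees(u, v):
    """Angle between two vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (nu * nv)))
    return math.degrees(math.acos(cosine))

def matches_deterioration(points, deterioration_vector, max_angle=10.0):
    """True when the trajectory's principal direction lies within max_angle of the template."""
    return angle_degrees(principal_direction(points), deterioration_vector) < max_angle

# Hypothetical trajectory in a two-indicator space and a hypothetical deterioration template.
worsening = [(0.0, 0.0), (1.0, 0.9), (2.0, 2.1), (3.0, 3.0)]
template = (1.0, 1.0)
is_trend_sequence = matches_deterioration(worsening, template)
```

A plain principal component has no sign, so without the orientation step an improving trajectory along the same axis would also match; orienting by net displacement is one way to keep the direction meaningful.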
[0218] Furthermore, in the intelligent medical assistance method based on big data analysis provided in this embodiment, step S500 includes:
[0219] S510. Extract abnormal physiological features from the disease progression trend sequence to calculate risk indicators.
[0220] The risk indicator is derived using the following formula:
[0221] (22)
[0222] In formula (22), the terms denote the risk indicator, the weighting coefficients, the abnormal physiological feature vector, the baseline normal physiological feature vector, and the rate of change of the abnormal features. Formula (22) expresses the comprehensive risk quantification logic for abnormal physiological features. Its core values are: 1. Multi-dimensional integration: it considers both "static deviation" and "dynamic change" to avoid risk misjudgment caused by a single dimension; 2. Adjustable weights: the balance between the two dimensions is set through the weighting coefficients; 3. Clinically interpretable: the risk indicator corresponds directly to the severity and rate of deterioration of the abnormal features, making it easier for doctors to quickly determine the urgency of the condition.
[0223] The extraction of abnormal physiological features relies on in-depth analysis of the pathological states represented by nodes in a disease progression sequence. For example, a disease progression sequence might describe the evolution from "mild interstitial lung thickening" to "extensive honeycomb-like changes" and then to "significantly elevated pulmonary artery pressure." The intelligent medical assistance system analyzes the corresponding clinical data for each node in the sequence, extracting quantifiable abnormal physiological parameters such as "the proportion of honeycomb-like changes" and "the specific value of pulmonary artery systolic pressure." These abnormal physiological parameters collectively form the basis for calculating risk indicators. Specifically, the calculation of risk indicators integrates the deviations of multiple abnormal physiological features. For example, the intelligent medical assistance system presets a baseline range of physiological parameters for a healthy population. For the extracted "proportion of honeycomb-like changes" and "pulmonary artery systolic pressure," it calculates the multiple or standard deviation multiple exceeding the upper limit of the baseline range, respectively. Subsequently, a weighted summation formula is used to merge the deviations of these two dimensions into a comprehensive risk indicator.
[0224] S520. If the risk indicator is greater than the preset risk level threshold, the node corresponding to the risk indicator will be marked as a high-risk node.
[0225] The set of high-risk nodes is derived using the following formula:
[0226] (23)
[0227] In formula (23), the terms denote the set of high-risk nodes, a node, the preset risk level threshold, and the risk indicator corresponding to the node. Formula (23) expresses the screening logic for high-risk clinical nodes. Its core values are: 1. Risk focus: it extracts high-risk objects from all nodes so that low-risk information does not interfere with clinical decision-making; 2. Threshold controllability: the strictness of screening is controlled by adjusting the preset risk level threshold; 3. Clinically operable: the output set of high-risk nodes can be converted directly into visual early warnings or intervention suggestions to improve diagnosis and treatment efficiency.
[0228] If the risk indicator exceeds the preset risk level threshold, for example, exceeding the healthy baseline by 2.5 standard deviations, the current node is determined to be a high-risk node, such as "extensive honeycomb changes". In this embodiment, the determination basis and adjustment rules of the preset risk level threshold are as follows: the indicator deviation threshold is ±15% to ±30%, based on the normal reference range of physiological parameters in "Clinical Laboratory Basics", with ±15% to ±20% for critical care scenarios and ±25% to ±30% for routine physical examination scenarios; the trend acceleration threshold is 0.05 to 0.1, based on the statistical change rate of medical time series data and the prediction error range of LSTM model, with 0.05 to 0.07 for malignant tumors and cardiovascular emergencies, and 0.08 to 0.1 for the stable period of chronic diseases; all thresholds are formulated in combination with clinical diagnosis and treatment guidelines, dataset statistics and clinical expert consensus.
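Steps S510 and S520 can be sketched as follows. The exact form of formula (22) is not recoverable here, so this sketch assumes one plausible reading: the static deviation is the mean absolute deviation from the healthy baseline in standard-deviation units, combined with the rate of change by a weighted sum. The weights 0.7 and 0.3, the baseline values, and the node measurements are all hypothetical and are not clinical reference values; only the 2.5 threshold comes from the example in the text.

```python
def risk_indicator(abnormal, baseline_mean, baseline_std, change_rate,
                   w_deviation=0.7, w_rate=0.3):
    """Weighted sum of static deviation (in standard deviations) and rate of change."""
    z_scores = [abs(a - m) / s for a, m, s in zip(abnormal, baseline_mean, baseline_std)]
    deviation = sum(z_scores) / len(z_scores)
    return w_deviation * deviation + w_rate * change_rate

def high_risk_nodes(indicators, threshold=2.5):
    """Nodes whose risk indicator is greater than the preset risk level threshold."""
    return {node for node, value in indicators.items() if value > threshold}

# Hypothetical baseline: [honeycomb-like change proportion (%), pulmonary artery systolic pressure (mmHg)].
BASELINE_MEAN = [2.0, 20.0]
BASELINE_STD = [2.0, 5.0]

indicators = {
    "mild interstitial lung thickening":
        risk_indicator([6.0, 28.0], BASELINE_MEAN, BASELINE_STD, change_rate=0.5),
    "extensive honeycomb-like changes":
        risk_indicator([12.0, 40.0], BASELINE_MEAN, BASELINE_STD, change_rate=1.0),
}
flagged = high_risk_nodes(indicators, threshold=2.5)
```

Under these assumed inputs only the second node exceeds the threshold and is marked as high-risk.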
[0229] S530. Obtain the deterioration rate corresponding to high-risk nodes to extract multiple risk factors.
[0230] The rate of deterioration is obtained by analyzing the occurrence time and evolution duration of the high-risk node in the historical trend sequence. For example, the intelligent medical assistance system retrospectively finds that the evolution from "mild interstitial lung thickening" to the high-risk node "extensive honeycomb-like changes" took an average of 24 months in most patients, but only 6 months in a few rapidly progressing patients. The system then identifies potential factors leading to rapid progression, such as "coexistence of specific autoantibodies," "baseline carbon monoxide diffusion capacity below 40% of the predicted value," and "high-resolution CT showing lesions predominantly distributed in the subpleural region," and extracts them as the multiple risk factors.
[0231] S540. Multiple risk factors are fused using a weighting matrix to obtain the potential risk value.
[0232] The potential risk value is calculated using the following formula:
[0233] (24)
[0234] In formula (24), the terms denote the potential risk value, the weighting matrix, the risk factor vector, the weight of each risk factor, the value of each risk factor, and the total number of risk factors. Formula (24) expresses the screening and judgment logic for high-risk clinical nodes. Its core values are: 1. Risk focus: it extracts high-risk objects from all nodes and avoids interference from low-risk information; 2. Threshold controllability: adjusting the threshold allows flexible control of the severity of early warnings, adapting to different departments and disease scenarios; 3. Clinically operable: high-risk node sets can be converted directly into visual early warnings or intervention suggestions, improving diagnostic and treatment efficiency.
[0235] A weighted matrix is used to quantify the contribution of different risk factors to the rate of deterioration. This weighted matrix is constructed based on multiple regression analysis of a large patient cohort. For example, the analysis results show that "positive for a specific autoantibody" contributes 0.5 weight to shortening the progression time, "low baseline lung function" contributes 0.3 weight, and "subpleural distribution" contributes 0.2 weight. The intelligent medical assistance system uses this matrix to weight and fuse the aforementioned risk factors present in the current patient. If the patient possesses all three factors, their potential risk value is calculated as 0.5 + 0.3 + 0.2 = 1.0. If they only possess the first two factors, the potential risk value is 0.8.
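The weighted fusion in this example reduces to a weighted sum over the risk factors present in the patient. A minimal sketch, using the three weights given above and assuming each factor is encoded as 1 when present and 0 when absent (the dictionary keys and function name are hypothetical):

```python
# Weights from the worked example in the text.
WEIGHTS = {
    "specific_autoantibody_positive": 0.5,
    "low_baseline_lung_function": 0.3,
    "subpleural_distribution": 0.2,
}

def potential_risk(factors, weights=WEIGHTS):
    """Weighted sum of risk factor values; factors missing from the input count as 0."""
    return sum(weight * factors.get(name, 0.0) for name, weight in weights.items())

all_three = potential_risk({
    "specific_autoantibody_positive": 1,
    "low_baseline_lung_function": 1,
    "subpleural_distribution": 1,
})
first_two = potential_risk({
    "specific_autoantibody_positive": 1,
    "low_baseline_lung_function": 1,
})
```

The two calls reproduce the values 1.0 and 0.8 from the example; graded factor values between 0 and 1 would work the same way.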
[0236] S550. Sort the potential risk values in descending order to generate a risk assessment list.
[0237] The risk assessment list is derived using the following formula:
[0238] (25)
[0239] In formula (25), the terms denote the risk assessment list and the descending sorting function based on potential risk values. Formula (25) expresses the priority sorting logic for high-risk nodes. Its core values are: 1. Risk stratification: it quantifies and sorts high-risk nodes by their degree of danger, clarifying the priority of diagnosis and treatment; 2. Clinical orientation: descending order directly serves the clinical need of "treating the highest risk first" and improves the efficiency of treatment; 3. Strong interpretability: the list order corresponds one-to-one with the potential risk values, which makes it easy for doctors to understand and verify.
[0240] Ultimately, the intelligent medical assistance system sorts all high-risk nodes in descending order based on the calculated potential risk values, generating a clear risk assessment list to guide the direction of clinical priority interventions.
[0241] Preferably, in the intelligent medical assistance method based on big data analysis provided in this embodiment, step S600 includes:
[0242] S610. Obtain potential risk values and pathological description information from the risk assessment list.
[0243] The intelligent medical assistance system first extracts potential risk values and pathological descriptions from the risk assessment list. For example, for a patient with pulmonary fibrosis, the risk list can be used to obtain potential risk values such as 0.7 and pathological descriptions such as "honeycomb-like changes with pulmonary hypertension".
[0244] S620. Retrieve medical knowledge base based on pathological description information to construct a set of reference items to be sorted.
[0245] Based on the pathological description information, a medical knowledge base is retrieved to construct a set of reference items to be ranked. Specifically, the medical knowledge base includes standardized clinical guidelines and literature databases. The intelligent medical assistance system retrieves relevant entries through keyword matching, such as "honeycomb changes" and "pulmonary hypertension," thereby forming a set of reference items to be ranked. This set includes treatment options such as "anti-fibrotic drugs," monitoring recommendations such as "regular pulmonary function testing," and prognostic indicators such as "survival rate estimation." The retrieval process involves natural language processing technology, where the intelligent medical assistance system parses the semantic structure of the pathological description. For example, it maps "honeycomb changes" to pathological classification nodes in the knowledge base and expands the retrieval to related concepts such as "interstitial lung disease" to ensure the comprehensiveness of the set. It should be noted that the construction of this set of reference items to be ranked also considers patient-specific factors, such as age and past medical history, and filters irrelevant items to exclude them, thereby obtaining approximately 5-10 reference items to be ranked.
[0246] In this embodiment, the medical knowledge base is constructed as follows. It uses an ontology plus structured database approach: a medical ontology model is built on OWL (Web Ontology Language) to define core ontology concepts such as diseases, symptoms, and laboratory indicators, as well as causal, diagnostic, and treatment relationships. Knowledge is extracted from authoritative sources using NLP (Natural Language Processing) technology together with manual review. After review by at least two senior clinical physicians, the knowledge is stored in a dual database of MySQL and Neo4j: MySQL stores structured attribute knowledge, and Neo4j stores knowledge relationships. The data sources for the medical knowledge base include national and industry clinical practice guidelines, the 9th edition of undergraduate medical textbooks and classic monographs, core journal and SCI papers published within the last 5 years with an impact factor ≥ 3.0, and other authoritative, traceable medical resources such as national drug and testing standards. The treatment suggestion database is built on the medical knowledge base and organized in a hierarchy of disease, risk level, and treatment suggestion. Treatment-related suggestions are extracted from the medical knowledge base and categorized into prevention, diagnosis, treatment, rehabilitation, and follow-up. After being associated with the risk levels (Levels I to V) of this invention and verified by a clinical expert committee, they are stored in a Redis database and indexed by patient ID, disease type, and risk level. The treatment suggestion database depends one-way on the medical knowledge base: the medical knowledge base provides its theoretical basis, and updates to the medical knowledge base synchronously trigger updates to the treatment suggestion database. The knowledge base is updated through a dual mechanism of automatic updates of new content and manual review. For monthly updates, web crawlers scrape the latest knowledge from authoritative official websites and store it in a review database. Quarterly manual reviews by an expert committee approve formal inclusion. Emergency updates are completed within 7 working days of the release of new national clinical guidelines or significant medical research findings. Semantic matching between pathological descriptions and the knowledge base uses a two-layer algorithm: medical word vector similarity and ontology relation matching. The first layer encodes the pathological descriptions and knowledge base entries with a fine-tuned BERT model, calculates cosine similarity, and keeps candidate entries with similarity ≥ 0.7. The second layer extracts the core ontology concepts and relationships of both sides and matches them structurally. The comprehensive matching score is calculated as ontology relationship matching degree × 0.6 + word vector similarity × 0.4, and items with a score ≥ 0.8 are selected as the final result. The matching results are sorted by source priority: clinical guidelines > classic monographs > core literature. Items with the same priority are sorted in descending order of comprehensive score. The knowledge base and the standardized medical data are associated through clinical feature keywords plus individual patient characteristics: core clinical feature keywords are extracted from the medical data and semantically matched, and the matching results are then filtered by individual characteristics such as patient age, gender, complications, and allergy history to obtain the set of reference items to be sorted.
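The two-layer matching and ordering rules can be sketched as follows. This is a minimal sketch that assumes both similarity values are already computed (the fine-tuned BERT encoder and the ontology matcher are outside its scope); the `Candidate` class, the entry titles, and all scores are hypothetical, while the 0.7 and 0.8 cut-offs, the 0.6 and 0.4 weights, and the source priority come from the description.

```python
from dataclasses import dataclass

# Lower number means higher priority, per the stated ordering of sources.
SOURCE_PRIORITY = {"clinical guideline": 0, "classic monograph": 1, "core literature": 2}

@dataclass
class Candidate:
    title: str
    source: str
    vector_similarity: float  # layer 1: cosine similarity (assumed precomputed)
    ontology_match: float     # layer 2: ontology relation matching degree (assumed precomputed)

def comprehensive_score(c):
    """Ontology matching degree x 0.6 + word vector similarity x 0.4."""
    return 0.6 * c.ontology_match + 0.4 * c.vector_similarity

def match_entries(candidates, layer1_min=0.7, final_min=0.8):
    """Apply both filters, then sort by source priority and descending score."""
    kept = [c for c in candidates if c.vector_similarity >= layer1_min]
    kept = [c for c in kept if comprehensive_score(c) >= final_min]
    return sorted(kept, key=lambda c: (SOURCE_PRIORITY[c.source], -comprehensive_score(c)))

# Hypothetical knowledge base entries with hypothetical scores.
candidates = [
    Candidate("Antifibrotic therapy guideline", "clinical guideline", 0.85, 0.90),
    Candidate("Interstitial lung disease monograph chapter", "classic monograph", 0.90, 0.95),
    Candidate("Pulmonary hypertension review", "core literature", 0.65, 0.95),
    Candidate("Oxygen therapy guideline", "clinical guideline", 0.75, 0.80),
    Candidate("Pulmonary function monitoring guideline", "clinical guideline", 0.80, 0.85),
]
matched_titles = [c.title for c in match_entries(candidates)]
```

The review entry is dropped at the first layer and the oxygen therapy entry at the final cut-off (score 0.78). The monograph has the highest score but still sorts after both guidelines, because source priority is applied before score.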
[0247] S630. Quantitatively analyze the set of reference items to be ranked using potential risk values to obtain severity and urgency values.
[0248] The severity value is calculated using the following formula:
[0249] (26)
[0250] In formula (26), the terms denote the severity value, the weighting coefficients, and the urgency coefficient of the pathological description. Formula (26) expresses a multi-dimensional fusion assessment of the severity of the illness. Its core values are: 1. Dual-dimensional fusion: it combines the quantitative risk value with clinical semantic urgency to avoid the bias of single-dimensional assessment; 2. Adjustable weights: the balance between the two dimensions is set through the weighting coefficients; 3. Clinically interpretable: the severity value corresponds directly to "risk + urgency", which is easy for doctors to understand and act on.
[0251] The urgency level is calculated using the following formula:
[0252] (27)
[0253] In formula (27), the terms denote the urgency value, which describes how soon the event or task needs to be handled; the weighting coefficient, which adjusts the relative importance of severity and urgency; the optimal treatment window duration; and the probability of complications. Formula (27) obtains the urgency value by weighted fusion of the optimal treatment window duration and the probability of complications using the weighting coefficient. It fuses the treatment time window with the complication risk and quantifies the urgency of the condition objectively, providing a basis for the subsequent priority ranking of diagnosis and treatment.
[0254] The set of reference items to be ranked is quantitatively analyzed using potential risk values to obtain severity and urgency values. In one embodiment, the severity value is calculated by multiplying the potential risk value by the inherent severity coefficient of the reference item. For example, for the reference item "antifibrotic drugs," the severity coefficient is preset to 0.8, so the severity value is 0.7 × 0.8 = 0.56. The urgency value is based on the square root of the risk value combined with a time-sensitive factor. For example, urgency = √0.7 × 1.2 ≈ 1.00, where the time-sensitive factor reflects the rate of pathological progression.
[0255] S640. Calculate the ranking weight based on the severity value and the urgency value.
[0256] The ranking weights are derived using the following formula:
[0257] (28)
[0258] In formula (28), the terms denote the ranking weight and the weighting coefficient. Formula (28) converts the two independent dimensions of "inherent harm of the event (severity)" and "time urgency (urgency)" into a single comparable ranking index through adjustable linear weighting, thereby enabling automatic priority ranking of multiple events or tasks. 1. It retains the independent information of the two core dimensions, and the weighting coefficient allows flexible adaptation to business scenarios; 2. It is simple to calculate, highly interpretable, and easy to implement in engineering and to verify manually.
[0259] The ranking weight is calculated based on the severity and urgency values. Specifically, the ranking weight uses a weighted average formula, such as ranking weight = 0.6 × severity + 0.4 × urgency. For the aforementioned values, this would be 0.6 × 0.56 + 0.4 × 1.00 = 0.736.
[0260] S650. Sort the set of reference items to be sorted in descending order according to the sorting weight to obtain an ordered sequence of reference items.
[0261] The ordered reference sequence is derived using the following formula:
[0262] (29)
[0263] In formula (29), the terms denote the ordered reference item sequence and the set of reference items to be sorted. Formula (29) converts the ranking weights obtained from the multi-dimensional calculation into a directly executable linear processing sequence. 1. It maps a "quantified priority score" to an "operable processing order," ensuring that high-priority reference items are processed first. 2. The descending order rule keeps the result consistent with the business objective: objects with higher overall risk or urgency receive processing resources earlier. 3. It forms a closed loop with the preceding weight calculation formula, supporting the entire decision-making process of "severity - urgency - overall weight - priority ranking."
[0264] The set of reference items to be sorted is arranged in descending order according to the sorting weight to obtain an ordered sequence of reference items. For example, the "anti-fibrotic drugs" with the highest weight is placed first, followed by "regular lung function tests".
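Steps S640 and S650 can be sketched as follows. This is a minimal sketch using the example weights 0.6 and 0.4 from the weighted average formula; the function names are illustrative, and the severity and urgency values assigned to each reference item are hypothetical inputs chosen only to show the ordering.

```python
def ranking_weight(severity, urgency, w_severity=0.6, w_urgency=0.4):
    """Linear weighting of severity and urgency into one comparable score."""
    return w_severity * severity + w_urgency * urgency

def order_reference_items(items):
    """Return item names sorted by ranking weight, highest first.

    `items` maps each reference item name to a (severity, urgency) pair.
    """
    return sorted(items, key=lambda name: ranking_weight(*items[name]), reverse=True)

# Hypothetical (severity, urgency) values for three reference items.
reference_items = {
    "regular lung function tests": (0.40, 0.60),
    "survival rate estimation": (0.30, 0.20),
    "anti-fibrotic drugs": (0.56, 0.90),
}
ordered_sequence = order_reference_items(reference_items)
```

Python's `sorted` is stable, so items with equal ranking weight keep their input order; a secondary key could be added if a deterministic tie-break by another attribute is needed.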
[0265] S660. Determine the display order of each reference item according to the ordered reference item sequence and generate a comprehensive report. The comprehensive report is generated by encapsulating the ordered reference item sequence into a structured template.
[0266] Generate a comprehensive report using the following formula:
[0267] (30)
[0268] In formula (30), the terms denote the comprehensive report, the template encapsulation function, and the structured report template. Formula (30) converts the "ordered priority sequence" into a "standardized, deliverable structured report", completing the closed loop from decision data to business output: 1. It visualizes and standardizes the sorting results, which facilitates manual review, auditing, and cross-departmental transmission. 2. Templated encapsulation keeps the report format consistent and reusable, avoiding the errors and inefficiency of manual typesetting. 3. It forms a complete chain with the preceding steps: severity / urgency → comprehensive weight → ordered sequence → structured report, supporting fully automated decision-making and delivery.
[0269] The system determines the display order of each reference item according to the ordered reference item sequence and generates a comprehensive report. This comprehensive report is generated by encapsulating the ordered reference item sequence into a structured template, which includes, for example, a title, a priority list, and a detailed description, thus forming a printable clinical guidance document. Through this process, the system achieves an ordered presentation of risks.
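As a rough sketch of the template encapsulation, the following Python function fills a three-part template (title, priority list, detailed description) from an ordered sequence. The function name `build_report`, the default title, the plain-text layout, and the placeholder descriptions are all assumptions made for illustration.

```python
def build_report(ordered_items, title="Clinical guidance report"):
    """Encapsulate an ordered (name, detail) sequence into a fixed template."""
    priority_lines = [
        f"{rank}. {name}"
        for rank, (name, _detail) in enumerate(ordered_items, start=1)
    ]
    detail_lines = [f"- {name}: {detail}" for name, detail in ordered_items]
    return "\n".join(
        [title, "", "Priority list", *priority_lines,
         "", "Detailed description", *detail_lines]
    )


report = build_report([
    ("anti-fibrotic drugs", "placeholder description"),
    ("regular lung function tests", "placeholder description"),
])
print(report)
```

The display order is taken directly from the position of each item in the input sequence, so the report never re-sorts what the ranking step produced.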
[0270] Furthermore, in the intelligent medical assistance method based on big data analysis provided in this embodiment, step S700 includes:
[0271] S710. Parse the ordered reference item sequence in the comprehensive report and map the ordered reference item sequence to the diagnosis and treatment suggestion library to generate initial auxiliary judgment results.
[0272] The initial auxiliary judgment result is generated using the following formula:
[0273] (31)
[0274] In formula (31), the output is the initial auxiliary judgment result, obtained by applying the reference item-treatment suggestion mapping function to the ordered sequence of reference items and the treatment suggestion library. The control logic of formula (31) combines a structured, ordered priority sequence with a domain knowledge base to generate auxiliary judgment results that can directly guide business operations: 1. It transforms the "priority ranking" into "business actionable suggestions," allowing the ranking results to directly serve clinical decision-making / business handling. 2. It maintains order consistency, ensuring that treatment suggestions for high-priority objects are given priority attention, in line with the business principle of "urgent before serious, high before low." 3. It forms a complete closed loop with the preceding steps: severity / urgency → comprehensive weight → ordered sequence → structured report → auxiliary judgment results, supporting fully automated decision-making.
[0275] The intelligent medical assistance system first parses the ordered reference item sequence from the comprehensive report. For example, in a risk report for a patient with lung disease, the extracted sequence places "anti-fibrosis treatment" first, followed by "oxygen therapy support." The initial auxiliary judgment results are then generated by mapping the ordered reference item sequence to the treatment suggestion library, a database containing clinical guidelines and expert consensus. The system uses a semantic matching algorithm to map each sequence item to relevant suggestions in the library; for example, "anti-fibrosis treatment" is mapped to "pirfenidone drug application guidelines," forming an initial judgment result that includes preliminary treatment details such as drug dosage and monitoring frequency. This mapping process involves keyword extraction and similarity calculation. The system parses the semantic structure of each sequence item; for example, it decomposes "oxygen therapy support" into the sub-items "oxygen supply" and "respiratory monitoring" and retrieves matching entries in the library so that the results are comprehensive and accurate. Retrieval also takes the patient's historical data into account and uses a filtering mechanism to exclude inapplicable suggestions. For example, if a patient has a history of heart disease, high-risk treatment options are not mapped. The result is an initial set containing 3-5 judgment results. By dynamically linking reference items with standardized knowledge, this approach makes the judgments more reliable and provides initial guidance for clinical decision-making.
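The mapping step (keyword extraction, similarity calculation, and history-based filtering) can be illustrated with a small Python sketch. It substitutes token-overlap (Jaccard) similarity for the unspecified semantic matching algorithm, and every library entry other than the pirfenidone title, along with the `excluded_for` field and the 0.2 cut-off, is a hypothetical placeholder with no clinical meaning.

```python
def tokens(text):
    """Crude keyword extraction: lower-case words, hyphens split."""
    return set(text.lower().replace("-", " ").split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def map_to_suggestions(ordered_items, library, patient_history, min_similarity=0.2):
    """Map each reference item to its best-matching admissible library entry."""
    results = []
    for item in ordered_items:  # iteration order preserves the priority order
        item_tokens = tokens(item)
        best, best_score = None, 0.0
        for entry in library:
            if entry["excluded_for"] & patient_history:
                continue  # filter out suggestions flagged for this history
            score = jaccard(item_tokens, tokens(entry["keywords"]))
            if score > best_score:
                best, best_score = entry, score
        if best is not None and best_score >= min_similarity:
            results.append((item, best["title"]))
    return results


library = [
    {"title": "pirfenidone drug application guidelines",
     "keywords": "anti fibrosis treatment pirfenidone",
     "excluded_for": set()},
    {"title": "hypothetical option A",  # placeholder entry
     "keywords": "oxygen therapy support",
     "excluded_for": {"heart disease"}},
    {"title": "hypothetical option B",  # placeholder entry
     "keywords": "oxygen therapy support respiratory monitoring",
     "excluded_for": set()},
]
initial_results = map_to_suggestions(
    ["anti-fibrosis treatment", "oxygen therapy support"],
    library,
    patient_history={"heart disease"},
)
print(initial_results)
```

In this example, option A would be the closest textual match for "oxygen therapy support", but it is skipped because of the patient's history, so option B is returned instead.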
[0276] S720: Collect real-time monitoring values associated with the initial auxiliary judgment results and convert them into new data feature vectors.
[0277] Real-time monitoring values associated with the initial auxiliary judgment results are collected and converted into new data feature vectors. In one embodiment, the intelligent medical assistance system acquires data from wearable devices or hospital monitors, such as the patient's real-time blood oxygen saturation and heart rate, for example a blood oxygen saturation of 95% and a heart rate of 80 beats / min, and converts these values into a feature vector such as [0.95, 80] to quantify the physiological state.
[0278] The collection frequency of the new data is set according to the clinical scenario: once every 5-15 minutes in the intensive care unit (ICU), once every 1-2 hours in general wards, and once every 1-7 days in outpatient follow-up. The collection scope is consistent with the initial multi-source medical data types, including real-time structured physiological parameters and newly added unstructured clinical text. The cleaning rules for new data are as follows: ① When the proportion of missing values is <30%, the missing values are filled by interpolation; when it is ≥30%, the affected data is removed. ② Outliers are screened using the clinical normal reference range and the 3σ principle; after confirmation by clinicians, genuine readings are retained and monitoring errors are removed / corrected. ③ Duplicates are removed by timestamp + patient ID. ④ The format, units, and coding rules are kept consistent with the initial standardized medical dataset.
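Cleaning rules ① to ③ can be sketched with the Python standard library as follows. The record layout (`timestamp`, `patient_id` keys), the choice of linear interpolation, and the 40-180 reference range in the example are assumptions; outliers are only flagged, since the text leaves the final decision to clinicians.

```python
from statistics import mean, pstdev


def deduplicate(records):
    """Rule 3: keep the first record for each (timestamp, patient_id) pair."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["timestamp"], rec["patient_id"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def fill_missing(values, max_missing_ratio=0.30):
    """Rule 1: interpolate when < 30% is missing, otherwise return None."""
    missing = sum(v is None for v in values)
    if not values or missing / len(values) >= max_missing_ratio:
        return None  # too sparse: the series is dropped
    known = [(i, v) for i, v in enumerate(values) if v is not None]
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = [(j, x) for j, x in known if j < i]
        right = [(j, x) for j, x in known if j > i]
        if left and right:  # linear interpolation between nearest neighbours
            (j0, x0), (j1, x1) = left[-1], right[0]
            filled[i] = x0 + (x1 - x0) * (i - j0) / (j1 - j0)
        else:  # gap at either end: copy the nearest known value
            filled[i] = (left[-1] if left else right[0])[1]
    return filled


def flag_outliers(values, low, high, n_sigma=3.0):
    """Rule 2: indices outside the reference range or beyond n_sigma."""
    mu, sigma = mean(values), pstdev(values)
    return [
        i for i, v in enumerate(values)
        if not (low <= v <= high) or (sigma > 0 and abs(v - mu) > n_sigma * sigma)
    ]


heart_rate = fill_missing([80, None, 84, 82, 81])
print(heart_rate)
print(flag_outliers([80, 82, 84, 82, 250], low=40, high=180))  # assumed range
```

Returning indices rather than deleting values keeps the raw readings available for the clinician confirmation step.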
[0279] S730. Based on the deviation values between the new data feature vector and the benchmark feature vector, the initial auxiliary judgment results are reweighted to obtain the corrected reference item sequence.
[0280] The deviation between the new data feature vector and the baseline feature vector is obtained by the following formula:
[0281] (32)
[0282] In formula (32), the deviation value represents the degree of deviation between the new data and the baseline data; the larger the value, the more significant the deviation. The new data feature vector is the real-time feature representation of the reference item currently being processed. The baseline feature vector is the representation of the baseline features in the history / standard / knowledge base (such as typical cases or normal physiological state vectors). The control logic of formula (32) is to quantify the degree of deviation between the new data and the baseline features through the relative distance in the vector space, providing an objective numerical basis for subsequently correcting the auxiliary judgment results and reordering reference item priorities: 1. It transforms the "feature vector difference" into an "interpretable deviation degree", avoiding reliance on subjective human judgment of the deviation. 2. The normalization design keeps the deviation degree comparable across different scenarios and features of different dimensions. 3. It provides quantitative input for dynamic correction of the initial judgment results, improving the robustness and adaptability of the auxiliary decision system.
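One reading of a "relative distance in the vector space" with a "normalization design" is the Euclidean distance between the two vectors divided by the norm of the baseline vector. The sketch below assumes that form; the function name `deviation` and the error handling are illustrative, and the exact expression of formula (32) may differ.

```python
import math


def deviation(new_vec, base_vec):
    """Relative Euclidean distance: ||new - base|| / ||base|| (assumed form)."""
    if len(new_vec) != len(base_vec):
        raise ValueError("vectors must have the same dimension")
    base_norm = math.sqrt(sum(b * b for b in base_vec))
    if base_norm == 0:
        raise ValueError("baseline vector must be non-zero")
    diff_norm = math.sqrt(sum((n - b) ** 2 for n, b in zip(new_vec, base_vec)))
    return diff_norm / base_norm


# Vectors of the form [blood oxygen saturation, heart rate] used in this embodiment.
d = deviation([0.95, 80], [0.98, 70])
print(round(d, 2))
```

With unscaled inputs like these, the heart-rate dimension dominates the result, which is one reason to scale each feature to the normalized range [0, 1] before computing the deviation.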
[0283] The revised reference term sequence is obtained using the following formula:
[0284] (33)
[0285] In formula (33), The final output is a revised sequence of treatment recommendations / priorities, which serves as the revised reference item sequence. The deviation influence coefficient is used to adjust the correction strength of the deviation degree to the initial result, and is preset by the business scenario. The control logic of formula (33) is to achieve adaptive weighted correction of the initial auxiliary judgment result through a linear combination of the deviation degree and the influence coefficient: 1. It quantifies the impact of the "abnormality of new data" on the "credibility of initial judgment", avoiding decision-making errors due to data deviation from the benchmark. 2. It retains the business value of the initial result, only reasonably attenuates abnormal data, and balances system stability and dynamic adaptability. 3. It provides a quantitative basis for subsequent reordering of reference item sequences, ensuring that reference items with high credibility and low deviation receive priority in processing resources.
[0286] The initial auxiliary judgment result is reweighted based on the deviation between the new data feature vector and the baseline feature vector to obtain a corrected reference item sequence. Specifically, the baseline feature vector is preset based on historical health data, such as the ideal vector [0.98, 70] for lung disease patients. The system calculates the deviation, such as the Euclidean distance value of 0.15, and then uses this value to adjust the weights. For example, the weight of "anti-fibrosis treatment" in the initial result is reduced from 0.8 to 0.68, thereby reordering the sequence and giving higher priority to more urgent items such as "oxygen therapy support". This reweighting process involves deviation analysis, in which the intelligent medical assistance system evaluates the dimensional differences of each vector, such as blood oxygen deviation leading to treatment adjustments, and combines clinical thresholds to filter outliers to achieve dynamic correction of the sequence. Through this mechanism, the intelligent medical assistance system ensures that the judgment result adapts to real-time changes, thereby optimizing the accuracy of decision-making in medical practice.
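The worked numbers above (a weight of 0.8 reduced to 0.68 at a deviation of 0.15) are consistent with multiplying the initial weight by (1 − coefficient × deviation) with a coefficient of 1. The following sketch assumes that linear form and a per-item deviation; the function name, the clamp at zero, and the values given for "oxygen therapy support" are assumptions for illustration.

```python
def reweight(items, influence=1.0):
    """Attenuate each weight by its deviation, then re-sort in descending order.

    items: iterable of (name, initial_weight, deviation).
    Assumed form: corrected = initial_weight * (1 - influence * deviation).
    """
    corrected = [
        (name, max(0.0, weight * (1.0 - influence * dev)))  # never below zero
        for name, weight, dev in items
    ]
    return sorted(corrected, key=lambda pair: pair[1], reverse=True)


initial = [
    ("anti-fibrosis treatment", 0.80, 0.15),  # values from the worked example
    ("oxygen therapy support", 0.70, 0.00),   # assumed values
]
revised = reweight(initial)
for name, weight in revised:
    print(f"{name}: {weight:.2f}")
# oxygen therapy support: 0.70
# anti-fibrosis treatment: 0.68
```

A smaller influence coefficient makes the sequence more stable against noisy monitoring data, while a larger one lets real-time changes reorder the items more aggressively.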
[0287] The preset threshold for the deviation between the new data feature vector and the baseline feature vector is 0.2, which is determined based on the normalized range [0, 1] of the standardized medical feature vector. For real-time monitoring scenarios, the threshold is 0.1~0.2, and for routine follow-up scenarios, the threshold is 0.2~0.3.
[0288] S740. Encapsulate the corrected reference item sequence and output the dynamically optimized auxiliary judgment results.
[0289] The auxiliary judgment result of dynamic optimization is obtained through the following formula:
[0290] (34)
[0291] In formula (34), the output is the dynamically optimized auxiliary judgment result, obtained by applying the encapsulation function to the corrected reference item sequence. The control logic of formula (34) is to transform the corrected dynamic priority sequence into a standardized and deliverable final auxiliary judgment result, completing the final closed loop from "decision data" to "business output": 1. It visualizes and standardizes the corrected result, which is convenient for manual review, auditing and cross-departmental transmission. 2. The templated encapsulation keeps the report format consistent and reusable, avoiding the errors and inefficiencies of manual typesetting. 3. It forms a complete link with the preceding steps: severity / urgency → comprehensive weight → ordered sequence → initial judgment → deviation correction → final encapsulation, supporting fully automated decision-making and delivery.
[0292] The corrected reference item sequence is encapsulated and output as dynamically optimized auxiliary judgment results, such as a visualized report, for physician reference. Through this process, the intelligent medical assistance system achieves real-time optimization of treatment recommendations.
[0293] The result generated after encapsulation is shown in Table 1:
[0294] Table 1
[0295]
[0296] Please see Figure 2. This embodiment provides an intelligent medical assistance system based on big data analysis, used to execute the aforementioned intelligent medical assistance method based on big data analysis. It includes a standardized medical dataset acquisition module 10, a multiple feature cluster acquisition module 20, an association set acquisition module 30, a disease development trend sequence determination module 40, a risk assessment list generation module 50, a display order determination module 60, and an auxiliary judgment result output and dynamic optimization module 70. The standardized medical dataset acquisition module 10 acquires multi-source medical data and converts it into a unified-dimensional feature representation through a data processing module to obtain a standardized medical dataset. The multiple feature cluster acquisition module 20 uses a clustering algorithm to group the features in the standardized medical dataset to obtain multiple feature clusters. The association set acquisition module 30 extracts co-occurrence feature patterns across data sources from the multiple feature clusters and determines them as association relationships when their statistical correlation exceeds a preset threshold, obtaining a set of association relationships. The disease development trend sequence determination module 40 uses a prediction model to construct a disease progression prediction path based on the set of association relationships, and obtains trend indicators by traversing the path nodes to determine the disease development trend sequence. The risk assessment list generation module 50 selects nodes from the disease development trend sequence whose risk indicators exceed a preset level as high-risk nodes, and integrates multiple risk factors to calculate potential risk values and generate a risk assessment list. The display order determination module 60 generates a comprehensive report based on the risk assessment list, and uses a priority sorting method to determine the display order of each reference item in the comprehensive report. The auxiliary judgment result output and dynamic optimization module 70 outputs auxiliary judgment results based on the comprehensive report, and incorporates new data through a data update module to dynamically optimize the auxiliary judgment results.
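The order in which modules 10 to 70 hand data to one another can be pictured as a simple chain. The sketch below only shows that data flow: the `run_pipeline` helper and the `tag` placeholder stages are hypothetical stand-ins, and none of the actual processing inside each module is modelled.

```python
from typing import Any, Callable, List, Tuple

Stage = Tuple[int, str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], data: Any) -> Tuple[Any, List[str]]:
    """Pass data through each module in numeric order and record the trace."""
    trace = []
    for number, name, fn in sorted(stages, key=lambda stage: stage[0]):
        data = fn(data)
        trace.append(f"module {number}: {name}")
    return data, trace


def tag(label):
    """Placeholder stage: appends a label so the hand-off order is visible."""
    return lambda payload: payload + [label]


stages = [
    (10, "standardized medical dataset acquisition", tag("dataset")),
    (20, "multiple feature cluster acquisition", tag("clusters")),
    (30, "association set acquisition", tag("associations")),
    (40, "disease development trend sequence determination", tag("trend sequence")),
    (50, "risk assessment list generation", tag("risk list")),
    (60, "display order determination", tag("report")),
    (70, "auxiliary judgment result output and dynamic optimization", tag("judgment")),
]
result, trace = run_pipeline(stages, [])
print(result)
```

Module 70 additionally feeds new monitoring data back into its own output, which a fuller model would represent as a loop around the final stage rather than a single pass.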
[0297] The beneficial effects of the intelligent medical assistance method and system based on big data analysis provided in this embodiment are as follows:
[0298] 1. Through "multi-source data standardization → feature clustering → correlation mining → disease prediction path construction → risk assessment → report generation → dynamic optimization", a closed-loop intelligent medical assistance system has been formed, covering the entire process from data collection to assisted diagnosis.
[0299] 2. It achieves unified processing of multi-source data (adapting to the heterogeneity of medical data), accurate identification of cross-source feature associations (defining association strength through thresholds), and dynamic risk assessment and report optimization (achieving timeliness and accuracy in assisted diagnosis).
[0300] 3. The entire process uses clearly defined numerical ranges and thresholds (such as feature vector dimensions of 512~1024 dimensions and risk values of 0~100 points) to ensure the engineering feasibility of the method, effectively solving the pain points of low efficiency, data fragmentation, and slow response in traditional medical auxiliary diagnosis, and is suitable for intelligent auxiliary diagnosis and treatment scenarios for various diseases.
[0301] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention. Clearly, those skilled in the art can make various alterations and variations to the invention without departing from its spirit and scope. Thus, if these modifications and variations of the invention fall within the scope of the claims and their equivalents, the invention is also intended to include them.
Claims
1. An intelligent medical assistance method based on big data analysis, characterized in that it includes the following steps: S100. Acquire multi-source medical data and convert the multi-source medical data into a unified dimension feature representation through a data processing module to obtain a standardized medical dataset; S200. Use a clustering algorithm to group the features in the standardized medical dataset to obtain multiple feature clusters; S300. Extract co-occurrence feature patterns across data sources from the multiple feature clusters, and determine them as association relationships when their statistical correlation exceeds a preset threshold to obtain a set of association relationships; S400. Based on the set of association relationships, use a prediction model to construct a disease progression prediction path, and obtain trend indicators by traversing the path nodes to determine the disease development trend sequence; S500. Select nodes whose risk indicators exceed a preset level from the disease development trend sequence as high-risk nodes, and integrate multiple risk factors to calculate potential risk values to generate a risk assessment list; S600. Generate a comprehensive report based on the risk assessment list, and determine the display order of each reference item in the comprehensive report using a priority sorting method; S700. Output the auxiliary judgment result based on the comprehensive report, and incorporate new data through the data update module to dynamically optimize the auxiliary judgment result; Step S400 includes: S410. Extract clinical representation nodes based on the set of association relationships; S420. If the state transition probability between the clinical representation nodes is greater than a preset transition threshold, connect the clinical representation nodes to construct a disease progression prediction path; S430. Traverse the disease progression prediction path to extract disease stage features, and obtain trend indicators based on the disease stage features; S440. If the time-series evolution trajectory generated based on the trend indicators conforms to the preset deterioration direction, determine the disease development trend sequence; Step S600 includes: S610. Obtain potential risk values and pathological description information from the risk assessment list; S620. Retrieve the medical knowledge base based on the pathological description information to construct a set of reference items to be sorted; S630. Quantitatively analyze the set of reference items to be sorted using the potential risk values to obtain severity values and urgency values; S640. Calculate the sorting weight based on the severity value and the urgency value; S650. Arrange the set of reference items to be sorted in descending order according to the sorting weight to obtain an ordered reference item sequence; S660. Determine the display order of each reference item according to the ordered reference item sequence and generate a comprehensive report, wherein the comprehensive report is generated by encapsulating the ordered reference item sequence into a structured template.
2. The intelligent medical assistance method based on big data analysis according to claim 1, characterized in that step S100 includes: S110. Acquire multi-source medical data transmitted from heterogeneous medical information terminals, wherein the multi-source medical data includes structured physiological parameters and unstructured clinical text; S120. Convert the unstructured clinical text into a discrete semantic encoding sequence and align it with the structured physiological parameters to form a multimodal raw data matrix; S130. Project the multimodal raw data matrix using a dimension alignment matrix to obtain a unified high-dimensional feature vector; S140. Perform normalization encoding on the unified high-dimensional feature vector to generate a standardized feature representation, and construct a standardized medical dataset based on the standardized feature representation.
3. The intelligent medical assistance method based on big data analysis according to claim 1, characterized in that step S200 includes: S210. Obtain a standardized medical dataset and calculate the Euclidean distance between features in the standardized medical dataset to determine the feature similarity matrix; S220. Construct a feature density space based on the feature similarity matrix and determine the initial cluster centers; S230. Using a clustering algorithm, iteratively partition the standardized medical dataset based on the initial cluster centers to obtain initial feature groups; S240. If the intra-cluster cohesion of the initial feature groups is greater than a preset cohesion threshold and the inter-cluster separation of the initial feature groups is less than a preset separation threshold, perform a merging operation to obtain multiple feature clusters.
4. The intelligent medical assistance method based on big data analysis according to claim 3, characterized in that step S300 includes: S310. Obtain the source feature index table constructed from the multiple feature clusters; S320. Generate a cross-source feature co-occurrence matrix based on the source feature index table; S330. Perform frequent pattern mining on the cross-source feature co-occurrence matrix to extract cross-data-source co-occurrence feature patterns; S340. Calculate the statistical correlation of the cross-data-source co-occurrence feature patterns, and if the statistical correlation exceeds a preset correlation determination threshold, determine the cross-data-source co-occurrence feature patterns as association relationships to obtain a set of association relationships.
5. The intelligent medical assistance method based on big data analysis according to claim 1, characterized in that step S500 includes: S510. Extract abnormal physiological features from the disease development trend sequence to calculate risk indicators; S520. If a risk indicator is greater than the preset risk level threshold, mark the node corresponding to the risk indicator as a high-risk node; S530. Obtain the deterioration rate corresponding to the high-risk node to extract multiple risk factors; S540. Fuse the multiple risk factors using a weighting matrix to obtain the potential risk value; S550. Sort the potential risk values in descending order to generate a risk assessment list.
6. The intelligent medical assistance method based on big data analysis according to claim 1, characterized in that step S700 includes: S710. Parse the ordered reference item sequence in the comprehensive report and map the ordered reference item sequence to the diagnosis and treatment suggestion library to generate an initial auxiliary judgment result, wherein the initial auxiliary judgment result is generated using formula (31), in which the initial auxiliary judgment result is obtained by applying the reference item-treatment suggestion mapping function to the ordered reference item sequence and the diagnosis and treatment suggestion library; S720. Collect real-time monitoring values associated with the initial auxiliary judgment result and convert them into new data feature vectors; S730. Reweight the initial auxiliary judgment result according to the deviation value between the new data feature vector and the benchmark feature vector to obtain the corrected reference item sequence; S740. Encapsulate the corrected reference item sequence and output the dynamically optimized auxiliary judgment result.
7. The intelligent medical assistance method based on big data analysis according to claim 6, characterized in that, in step S730, the deviation value between the new data feature vector and the benchmark feature vector is obtained by formula (32), whose terms are the deviation value, the new data feature vector, and the benchmark feature vector; and the corrected reference item sequence is obtained by formula (33), whose terms include the corrected reference item sequence and the deviation influence coefficient.
8. An intelligent medical assistance system based on big data analysis, used to execute the intelligent medical assistance method based on big data analysis according to any one of claims 1 to 7, characterized in that it includes: a standardized medical dataset acquisition module (10), used to acquire multi-source medical data and convert the multi-source medical data into a unified dimension feature representation through a data processing module to obtain a standardized medical dataset; a multiple feature cluster acquisition module (20), used to group the features in the standardized medical dataset using a clustering algorithm to obtain multiple feature clusters; an association set acquisition module (30), used to extract co-occurrence feature patterns across data sources from the multiple feature clusters, and determine them as association relationships when their statistical correlation exceeds a preset threshold to obtain a set of association relationships; a disease development trend sequence determination module (40), used to construct a disease progression prediction path based on the set of association relationships and obtain trend indicators by traversing the path nodes to determine the disease development trend sequence; a risk assessment list generation module (50), used to select nodes whose risk indicators exceed a preset level from the disease development trend sequence as high-risk nodes, and to integrate multiple risk factors to calculate potential risk values to generate a risk assessment list; a display order determination module (60), used to generate a comprehensive report based on the risk assessment list and determine the display order of each reference item in the comprehensive report using a priority sorting method; and an auxiliary judgment result output and dynamic optimization module (70), used to output the auxiliary judgment result based on the comprehensive report, and to incorporate new data through the data update module to dynamically optimize the auxiliary judgment result.