Early warning method for gastric precancerous lesion risk based on multi-source heterogeneous data clustering
By using a multi-source heterogeneous data clustering method, we specifically process structured clinical data, endoscopic imaging data, and omics data. Combined with attention mechanism networks and multi-round resampling clustering algorithms, we construct a risk warning model for precancerous lesions of the stomach. This solves the problems of data noise interference and low clustering accuracy in existing technologies, and achieves accurate capture and efficient warning of early risk signals.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- THE FIRST AFFILIATED HOSPITAL OF GUANGZHOU UNIV OF CHINESE MEDICINE
- Filing Date
- 2026-02-26
- Publication Date
- 2026-06-12
Abstract
Description
Technical Field
[0001] This invention belongs to the interdisciplinary field of medical testing and artificial intelligence, and specifically relates to an early warning method for gastric precancerous lesion risk based on multi-source heterogeneous data clustering.
Background Technology
[0002] The basic pathogenesis of precancerous lesions of the stomach is characterized by deficiency of the root and excess of the branch. The deficiency of the root is mainly due to spleen and stomach deficiency, while the excess of the branch is mainly due to blood stasis. Clinically, it often manifests as spleen and stomach weakness syndrome, gastric blood stasis syndrome, and liver and stomach qi stagnation syndrome. In Western medicine, precancerous lesions of the stomach mainly include two pathological states: dysplasia and intestinal metaplasia. These are important stages in the transformation of normal gastric mucosal tissue into gastric cancer, and their early warning is of great significance.
[0003] Current methods for early warning of gastric precancerous lesions rely mainly on simple integration of multi-source data, including structured clinical data (such as medical history and examination results), unstructured imaging data (such as endoscopic images), and omics data (such as metabolomics and genomic data). Data from different sources are typically concatenated directly and fed into a traditional clustering algorithm (such as K-means). These existing technologies have three significant limitations. First, data processing is crude: missing values in structured data are filled with the mean, only basic grayscale features are extracted from endoscopic images, and omics data receive no targeted dimensionality reduction. Second, data integration lacks rigor: direct concatenation ignores the differing characteristics of each data type and introduces substantial noise. Third, the clustering algorithms are limited: traditional algorithms treat all data equally instead of weighting features by quality, which lowers clustering accuracy and makes early risk signals hard to capture. The result is delayed warnings that fail to meet the clinical need for accurate early warning of gastric precancerous lesions.
[0004] Therefore, there is an urgent need for an early warning method for gastric precancerous lesion risk based on multi-source heterogeneous data clustering.
Summary of the Invention
[0005] The purpose of this invention is to provide an early warning method for the risk of precancerous lesions of the stomach based on multi-source heterogeneous data clustering. This method addresses the technical problems in existing technologies, such as insufficient integration of multi-source heterogeneous data, ineffective handling of feature differences between different types of data, large data noise interference due to simple splicing, and low clustering accuracy and difficulty in capturing early risk signals, which leads to delayed warnings.
[0006] To achieve the above objective, the present invention adopts the following technical solution. The early warning method for gastric precancerous lesion risk based on multi-source heterogeneous data clustering includes:
Step 1: Acquisition and preprocessing of multi-source heterogeneous data. Structured clinical data, endoscopic imaging data, and omics data (metabolomics data and gut microbiota data) of patients are collected and subjected to feature screening, quality control, and standardization to obtain three types of preliminarily processed data sources.
Step 2: Differentiated data preprocessing and cluster analysis. Targeted preprocessing is performed on the three types of data sources to generate the corresponding feature vectors. An attention mechanism network assigns weights to the three types of feature vectors, and the fused features are obtained after weighted concatenation and normalization. Based on the fused features, a multi-round resampling clustering algorithm is used for clustering, and the optimal clustering result is selected by combining cohesion, separation, and stability indicators.
Step 3: Construction and application of the risk early warning model. Using the clinical pathology gold standard and follow-up data, the optimal clustering results are labeled with risk levels. A multi-class early warning model is constructed and validated with the fused features as input and the risk level as label. New patient data are processed and input into the model, which outputs the risk level, key risk factors, and a clinical intervention plan. A mechanism for regular and emergency model updates is established to dynamically optimize early warning performance.
[0007] Furthermore, the collection and processing of structured clinical data specifically involves: Complete clinical data of patients within a time window T1 are collected. The optimal value of T1 is determined through preliminary experiments, with reference to the clinical progression cycle of gastric precancerous lesions and the update frequency of clinical data. N features are extracted from the clinical data. Candidate features are screened based on clinical guidelines, literature research, and clinical relevance. Features significantly related to the progression of gastric precancerous lesions are then selected, using Pearson correlation analysis for continuous features and the chi-square test for categorical features. Features with a missing rate > θ are removed based on the availability of clinical data, which fixes the final value of N. θ is set according to historical data and actual needs.
[0008] Furthermore, the acquisition of endoscopic image data specifically involves: High-definition white light endoscopes were used to cover key lesion-prone areas of the gastric antrum, gastric body, and gastric angle. At least a specified number of clear images were collected for each area to ensure complete image coverage of each key area. The total number of images collected, Mt, was determined through an image quality assessment experiment.
[0009] Furthermore, the collection and preliminary processing of omics data specifically involves: Metabolomics data were acquired using a liquid chromatography-tandem mass spectrometry platform. Chromatographic and mass spectrometric conditions were optimized for the target metabolites before detection to ensure that the resolution of each chromatographic peak and the quantitative precision of repeated injections met the actual requirements. Based on metabolomic data from the precancerous lesion group and the healthy control group, metabolites with significant differences in expression between the groups were screened, and their association with gastric mucosal lesions was verified in conjunction with clinical literature. Finally, a set of core metabolites was selected to form a detection set. After quality control, the concentration of each metabolite was standardized by Z-score. Gut microbiota data were obtained through high-throughput sequencing of 16S rRNA genes. Genera with an average relative abundance ≥ a1 were retained in all samples. Genera with significant differences between the lesion group and the control group were selected using differential analysis methods such as LEfSe to form a set of marker genera associated with gastric mucosal lesions. The relative abundance of the selected genera was logarithmically transformed to reduce the influence of extreme values and approximate a normal distribution. a1 was a preset threshold set according to historical data and actual needs.
[0010] Furthermore, targeted preprocessing is performed on the three types of data sources. The specific methods are as follows: Structured data is processed using the IQR method to remove outliers and the MICE method to fill in missing values, and features are encoded to generate vector C. Endoscopic images were scored using the Brenner gradient function to remove low-quality images, and depth features were extracted based on ResNet50 to generate vector I. After correcting extreme values and removing redundant features from omics data, dimensionality reduction is performed using t-SNE or PCA algorithms, retaining the core dimensions that meet the variance explanation rate to generate vector O.
[0011] Furthermore, weights are assigned to the three types of feature vectors through an attention mechanism network, and the fused features are obtained after weighted concatenation and normalization. The specific method is as follows: Min-Max normalization is performed on the three types of feature vectors, mapping them to the interval [0, 1]. A two-layer fully connected network is constructed as the attention mechanism network, with the correlation between clustered feature clusters and lesion progression as the loss target. The loss function is the cross-entropy loss:
$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
where $L_{CE}$ represents the cross-entropy loss value, N represents the total number of samples participating in training, i represents the sample index, $y_i$ represents the true label of the i-th sample, and $\hat{y}_i$ represents the predicted probability of the i-th sample output by the MLP network, calculated by the two-layer fully connected network. The input is a single-view feature vector, which passes through ReLU and Sigmoid activations; the output directly reflects the view's ability to predict the sample's progression risk. Training yields a scalar weight for each feature vector, and a truncation function keeps the weights within a preset range. Each feature vector is multiplied by its weight, and the weighted vectors are concatenated to obtain the fused feature F. F is then L2 normalized so that the feature vector magnitude is 1, reducing the impact of scale differences.
[0012] Furthermore, based on the fused features, a multi-round resampling clustering algorithm is used for clustering. The specific method is as follows: Using the L2-normalized fused features as input, the ConsensusClusterPlus algorithm combined with K-means is applied with a preset range of cluster numbers and a preset number of resampling rounds, and Euclidean distance is used to quantify sample differences. After multiple rounds of resampling, the frequency with which each pair of samples falls in the same cluster is counted to construct a consensus matrix. The optimal number of clusters is selected by combining three indicators (the average silhouette coefficient, the Davies-Bouldin index, and the consensus matrix density) with clinical suitability. The consistency of clustering results across different resampled subsets is verified using the adjusted Rand index.
[0013] Furthermore, the consistency of clustering results across different resampled subsets is quantified using the adjusted Rand index (ARI), with the specific formula as follows:
$$ARI = \frac{2\,(TP \cdot TN - FP \cdot FN)}{(TP + FN)(FN + TN) + (TP + FP)(FP + TN)}$$
where TP represents the number of sample pairs assigned to the same cluster in both clustering results; TN represents the number of sample pairs assigned to different clusters in both clustering results; FP represents the number of false same-cluster sample pairs, that is, pairs that belong to different clusters in the reference clustering result but are incorrectly assigned to the same cluster in the target clustering result; and FN represents the number of false different-cluster sample pairs, that is, pairs that belong to the same cluster in the reference clustering result but are incorrectly assigned to different clusters in the target clustering result.
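As an illustration of the pair-counting idea, the following minimal Python sketch counts the four kinds of sample pairs and combines them with the standard pair-counting form of the adjusted Rand index. The function names and the toy label lists are hypothetical, and the formula used here is the common textbook form, which is assumed rather than confirmed to match the formula of the invention.

```python
from itertools import combinations

def pair_counts(reference, target):
    """Count sample pairs by agreement between two clusterings."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_tgt = target[i] == target[j]
        if same_ref and same_tgt:
            tp += 1          # same cluster in both results
        elif not same_ref and not same_tgt:
            tn += 1          # different clusters in both results
        elif same_tgt:
            fp += 1          # split in reference, merged in target
        else:
            fn += 1          # merged in reference, split in target
    return tp, tn, fp, fn

def adjusted_rand_index(reference, target):
    """Pair-counting form of the ARI (assumed standard definition)."""
    tp, tn, fp, fn = pair_counts(reference, target)
    denom = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    if denom == 0:           # degenerate case, e.g. a single cluster in both
        return 1.0
    return 2.0 * (tp * tn - fp * fn) / denom

# Hypothetical labels from two resampled clustering runs.
run_a = [0, 0, 1, 1]
run_b = [1, 1, 0, 0]   # same partition, labels permuted
run_c = [0, 1, 0, 1]   # a conflicting partition
ari_same = adjusted_rand_index(run_a, run_b)
ari_conflict = adjusted_rand_index(run_a, run_c)
```

Because the index depends only on which pairs are grouped together, permuting the cluster labels leaves it unchanged, which is what makes it suitable for comparing independent resampling rounds.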
[0014] Furthermore, the specific methods for constructing and applying risk warning models are as follows: Combining pathological gold standards and follow-up data, risk levels are categorized based on optimal clustering results. Using L2 normalized fusion features as input and risk levels as labels, a LightGBM multi-classification model is constructed. After training with 5-fold cross-validation and hyperparameter optimization, the model is validated using an independent test set. The optimal warning threshold is determined through DCA. New patient data is preprocessed and input into the model, which outputs risk levels, key risk factors, and corresponding clinical intervention plans. A regular update mechanism is established to supplement new samples, adjust data range and feature rules, and optimize warning performance.
[0015] In summary, due to the adoption of the above technical solution, the beneficial effects of the present invention are:
1. This invention implements a differentiated preprocessing strategy for multi-source heterogeneous data. For structured clinical data, it applies missing value imputation and outlier removal to ensure data integrity and logical consistency. For endoscopic imaging data, it extracts deep features through deep neural networks to better capture lesion-related information. For omics data, it performs targeted dimensionality reduction and redundant feature removal, retaining the core correlation information. It also introduces an attention mechanism network that dynamically allocates weights based on the correlation between each feature view and lesion progression, avoiding the noise interference caused by traditional simple concatenation. The fused features therefore reflect the underlying patterns of the lesions more accurately, which improves the sample cohesion, inter-cluster separation, and overall stability of the clustering results and lays a solid foundation for subsequent risk stratification.
2. This invention uses an attention mechanism network to dynamically assign weights to the three types of feature views. Personalized weights are generated based on the correlation between each view's features and lesion progression, avoiding the problem of traditional methods that treat all features equally, which weakens important information and lets secondary information interfere. At the same time, constraining the weight range prevents a single view from dominating the fusion result and prevents secondary views from being ignored. After weighted concatenation and normalization, the fused features highlight the core information related to the lesion, improving the relevance and accuracy of the feature representation and providing a high-quality feature basis for subsequent clustering.
3. This invention employs a multi-round resampling clustering algorithm. Repeated sampling and verification reduce the impact of random factors in any single clustering run. A three-dimensional evaluation system of cohesion, separation, and stability is used to select the optimal number of clusters, avoiding the limitations of a single evaluation indicator. This process ensures good intra-cluster similarity and inter-cluster difference, and it aligns the number of clusters with clinical risk stratification needs: it avoids both overly coarse stratification, which blurs risk differentiation, and overly fine stratification, which complicates clinical application. The clustering results are therefore stable and reliable, directly serve subsequent risk level determination, and connect with actual clinical application scenarios.
Attached Figure Description
[0016] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0017] Figure 1 shows the steps of the early warning method for gastric precancerous lesion risk based on multi-source heterogeneous data clustering of the present invention; Figure 2 shows the steps of differentiated data preprocessing and cluster analysis; Figure 3 shows the steps of constructing and applying the risk early warning model.
Detailed Implementation
[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0019] As shown in Figure 1, Figure 2, and Figure 3, the method for early warning of gastric precancerous lesion risk based on multi-source heterogeneous data clustering specifically includes the following steps: Step 1: Acquisition and preprocessing of multi-source heterogeneous data; Structured clinical data: Complete clinical data of patients within a time window T1 are collected through the hospital's electronic medical record system (EMR), laboratory information system (LIS), and physical examination management system. The data must be obtained with the patient's informed consent and comply with relevant medical data security regulations. The optimal value of T1 is determined with reference to the clinical progression cycle of gastric precancerous lesions and the update frequency of clinical data, and is verified through preliminary experiments (e.g., a T1 of 6-18 months balances the timeliness and completeness of the data). N features are extracted from the clinical data. Candidate features are screened based on clinical guidelines, literature research, and clinical relevance, and include, but are not limited to, age, gender, history of Helicobacter pylori infection, history of gastric disease, smoking history, alcohol consumption history, family history of gastric cancer, body mass index (BMI), pepsinogen-related indicators, gastrin-17 (G-17), routine blood test indicators, blood glucose, and blood pressure. Features significantly associated with the progression of gastric precancerous lesions are then selected, using Pearson correlation analysis for continuous features and the chi-square test for categorical features. Considering the availability of clinical data, features with a missing rate > θ (e.g., θ = 15%) are removed, and the final value of N is determined (usually 10 ≤ N ≤ 20).
Endoscopic imaging data: Using a high-definition white light endoscope (equipped with a high-definition camera and image processing module), key lesion-prone sites (such as the gastric antrum (multiple wall surfaces), gastric body (upper / middle / lower parts), and gastric angle (anterior and posterior walls)) are covered. At least M clear images are acquired for each key lesion-prone site, and the total number of images acquired is Mt images per case. The specific values of M and Mt are based on clinical endoscopy examination standards to ensure complete image coverage of each key site, and are determined through image quality assessment experiments. Omics data: Metabolomics data were acquired using a liquid chromatography-tandem mass spectrometry (LC-MS / MS) platform. Chromatographic and mass spectrometric conditions were optimized prior to detection for target metabolites (such as pepsinogen I / II, gastrin-17, and differentially expressed metabolites associated with precancerous gastric lesions) to ensure peak resolution (R) ≥ 1.5 and quantitative precision (coefficient of variation, CV) ≤ 10% for repeated injections. Typical chromatographic conditions included: C18 reversed-phase column, column temperature 40℃, gradient elution of aqueous phase (containing 0.1% formic acid) and organic phase (acetonitrile / methanol), flow rate 0.3 mL / min, and injection volume 5 μL.
[0020] The set of metabolites to be tested was determined through the following process: Based on metabolomic data from the precancerous lesion group and the healthy control group, metabolites with significantly different expression between the groups (p<0.05 and fold change ≥1.5) were screened, and their association with gastric mucosal lesions was verified in conjunction with clinical literature. Finally, a core set of metabolites (e.g., 15–30) was selected to constitute the test set. Quantification was performed using the internal standard method to ensure accuracy. After quality control (removing batches or samples with CV>15%), the concentrations of each metabolite were Z-score standardized.
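The quality control and standardization described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and sample values are hypothetical, and the use of the sample (n-1) standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics

def coefficient_of_variation(values):
    """CV as a fraction: sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.fmean(values)

def z_score(values):
    """Standardize values to zero mean and unit sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical repeated QC injections of one metabolite (arbitrary units).
qc_injections = [9.8, 10.0, 10.2]
passes_qc = coefficient_of_variation(qc_injections) <= 0.15  # CV > 15% is rejected

# Hypothetical concentrations of that metabolite across five patients.
concentrations = [1.0, 2.0, 3.0, 4.0, 5.0]
standardized = z_score(concentrations)
```

After standardization each metabolite has mean 0 and unit standard deviation, so metabolites measured on very different concentration scales contribute comparably to later steps.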
[0021] Gut microbiota data were obtained through high-throughput sequencing of the 16S rRNA gene. Samples were transported under refrigeration within 2 hours of collection and frozen until DNA extraction to maintain the stability of the microbiota structure. The V3-V4 region of the 16S rRNA gene was amplified and paired-end sequenced using a second-generation sequencing platform. After quality control, the raw sequences were clustered at 97% similarity to generate an OTU table. Further processing was performed to retain genera with an average relative abundance ≥0.01% in all samples. Genera showing significant differences between the lesion group and the control group (LDA score >2.0 and p<0.05) were selected using differential analysis methods such as LEfSe to construct a set of marker genera associated with gastric mucosal lesions (usually 10–30 genera). The relative abundance of the selected genera was logarithmically transformed to reduce the influence of extreme values and approximate a normal distribution.
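The abundance filter and logarithmic transformation described above might look as follows in Python. The genus table is made up, and the pseudocount added before taking the logarithm is an assumption introduced here to handle zero abundances; the text does not specify how zeros are treated or which logarithm base is used.

```python
import math

def filter_and_log_transform(abundance, min_mean=1e-4, pseudocount=1e-6):
    """Keep genera whose mean relative abundance is >= min_mean, then log-transform.

    abundance: dict mapping genus name -> list of relative abundances
               (fractions of 1, one value per sample).
    min_mean:  0.01% expressed as a fraction.
    pseudocount: hypothetical small constant so that log10(0) never occurs.
    """
    kept = {}
    for genus, values in abundance.items():
        mean_abundance = sum(values) / len(values)
        if mean_abundance >= min_mean:
            kept[genus] = [math.log10(v + pseudocount) for v in values]
    return kept

# Hypothetical genus-level table for three samples.
table = {
    "Helicobacter": [0.12, 0.30, 0.05],
    "RareGenus": [0.00001, 0.0, 0.00002],
}
transformed = filter_and_log_transform(table)
```

The differential analysis step (LEfSe) is not reproduced here; the sketch covers only the abundance threshold and the transformation applied to the retained genera.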
[0022] Step 2: Differentiated data preprocessing and cluster analysis; Structured clinical data preprocessing: Extreme outliers are removed using the IQR method, and logically conflicting records (such as "no history of gastric disease but records of atrophic gastritis") are removed at the same time. Missing values are filled using multiple imputation by chained equations (MICE), ensuring that the overall missing rate after imputation is less than θ. Binary features are encoded as 0/1, multi-class features are one-hot encoded, and continuous features retain their original values, generating a standardized feature vector C. The dimension of C is determined by clinical relevance screening and data quality verification: potentially relevant features are screened based on clinical guidelines, features significantly related to disease progression are retained through Pearson correlation analysis (continuous features) and the chi-square test (categorical features), and features with a missing rate >5% are removed. Finally, a core clinical features are selected, so the dimension of feature vector C is a.
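A minimal sketch of the IQR filter and the feature encoding is given below. The multiplier k = 1.5 is the conventional Tukey fence and is an assumption, since the text only says that extreme outliers are removed; the feature names and values are hypothetical, and MICE imputation is not shown.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences; k = 1.5 is an assumed conventional choice."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

def one_hot(value, categories):
    """One-hot encode a multi-class feature against a fixed category order."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical BMI values; 58.0 lies far outside the fences.
bmi = [21.0, 22.5, 23.1, 24.0, 25.2, 26.0, 58.0]
cleaned = remove_outliers(bmi)

# Hypothetical patient record -> part of feature vector C.
smoker = 1                                   # binary feature, 0/1
stage = one_hot("moderate", ["mild", "moderate", "severe"])
vector_c = [cleaned[0], smoker] + stage      # continuous value kept as-is
```

Keeping the category order fixed across patients matters here, because each position in vector C must mean the same thing for every sample.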
[0023] Endoscopic image data preprocessing: Low-quality images affected by blur or reflections are eliminated by Brenner gradient function scoring, and at least one valid image covering each key area is retained for each patient. The RGB image is converted to grayscale, histogram equalization is used to enhance contrast, and Gaussian filtering is used to remove random noise. Deep features are extracted with a pre-trained CNN (such as the ResNet series): the fully connected layers are removed, and the image features from each site are averaged to generate an image feature vector I. The dimension of I is determined by the structure of the pre-trained CNN. ResNet50 is selected as the backbone network; after the fully connected layers are removed, the global average pooling layer has a fixed output dimension of 2048, which is the feature representation learned by the network during ImageNet pre-training.
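The Brenner gradient score used for image quality screening can be illustrated with a small pure-Python sketch. It uses the common definition (sum of squared differences between pixels two columns apart); the toy images and the threshold value are hypothetical, and a real pipeline would operate on full-resolution grayscale arrays.

```python
def brenner_score(gray):
    """Brenner gradient: sum of squared differences between pixels two columns apart.

    gray: 2D list of grayscale intensities (rows of equal length).
    Higher scores indicate sharper images.
    """
    total = 0
    for row in gray:
        for x in range(len(row) - 2):
            diff = row[x + 2] - row[x]
            total += diff * diff
    return total

def keep_sharp(images, threshold):
    """Keep images whose score reaches a (hypothetical) threshold."""
    return [name for name, gray in images.items() if brenner_score(gray) >= threshold]

# Toy 4x6 images: a hard edge versus a smooth ramp over the same intensity range.
sharp = [[0, 0, 0, 255, 255, 255] for _ in range(4)]
blurry = [[0, 51, 102, 153, 204, 255] for _ in range(4)]
flat = [[10] * 6 for _ in range(3)]

kept = keep_sharp({"sharp": sharp, "blurry": blurry, "flat": flat}, threshold=200000)
```

A blurred edge spreads the same intensity change over more pixels, so its squared differences are smaller, which is why the ramp scores lower than the hard edge.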
[0024] Omics data preprocessing: For the standardized metabolomics data, the Z-score method was used to correct extreme values (replaced with ±3 when |Z|>3) and remove redundant features (features with stronger clinical relevance were retained when the correlation coefficient |r|≥0.8). Based on the relative abundance of marker bacterial genera after logarithmic transformation, outlier samples were removed, and the data after metabolomics and gut microbiota processing were merged. Dimensionality reduction was performed using t-SNE or PCA algorithms, retaining core dimensions with a variance explanation rate ≥85%, and generating an omics dimensionality reduction feature vector O. The dimension was determined by the dimensionality reduction algorithm constraints. Based on the original omics feature matrix dimensions (e.g., 12 metabolic indicators + 5 microbiota indicators representing 17 dimensions), when using t-SNE dimensionality reduction, the core constraint was a variance explanation rate ≥85%, while balancing feature redundancy and computational efficiency, ultimately determining the dimensionality after dimensionality reduction.
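The extreme-value correction and redundancy removal described above can be sketched as follows. The feature names, values, and priority order are hypothetical; "clinical relevance" is represented here simply as a caller-supplied ordering, which is an assumption about how that judgment would be encoded. The dimensionality reduction step is not shown.

```python
import math

def clip_z(z_values, limit=3.0):
    """Replace Z-scores beyond +/-limit with +/-limit."""
    return [max(-limit, min(limit, z)) for z in z_values]

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_redundant(features, priority, threshold=0.8):
    """Greedy filter: visit features from most to least clinically relevant and
    keep one only if |r| < threshold against every feature already kept."""
    kept = []
    for name in priority:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

clipped = clip_z([-4.2, 0.5, 3.7])

# Hypothetical standardized metabolite features for four samples.
features = {
    "m1": [1.0, 2.0, 3.0, 4.0],
    "m2": [2.0, 4.0, 6.0, 8.1],    # nearly collinear with m1
    "m3": [1.0, -1.0, 1.0, -1.0],
}
selected = drop_redundant(features, priority=["m1", "m2", "m3"])
```

Visiting features in priority order means that, within a highly correlated pair, the one judged more clinically relevant is always the one retained.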
[0025] The feature vectors of the three views (C, I, O) are Min-Max normalized (mapped to the [0, 1] interval). An attention mechanism network is then trained, built as a two-layer fully connected network (MLP). The input is a single-view feature vector. The first layer has 32 hidden units (activation function = ReLU) and the second layer has 1 unit (activation function = Sigmoid); the output is a scalar weight value. The correlation between the clustered feature clusters and disease progression is used as the loss target, and the loss function is the cross-entropy loss. Training iterates until the loss converges (number of iterations ≤ 500, learning rate = 0.001, batch size = 64). The specific loss function is:
$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
where $L_{CE}$ represents the cross-entropy loss value; the closer the value is to 0, the higher the consistency between the predicted probability output by the MLP and the true label of the sample. N represents the total number of samples participating in training, and i represents the sample index. $y_i$ represents the true label of the i-th sample. $\hat{y}_i$ represents the predicted probability of the i-th sample output by the two-layer fully connected network (MLP): the input is a single-view feature vector (C, I, or O), and the output after ReLU and Sigmoid activation directly reflects the view's ability to predict the sample's progression risk.
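To make the network structure and the loss concrete, here is a minimal forward-pass sketch in plain Python. It assumes binary labels, which matches the single Sigmoid output described above; the weights are placeholders rather than trained values, the hidden layer is shown with two units instead of 32 for brevity, and training itself (gradient updates) is omitted.

```python
import math

def mlp_forward(x, w1, b1, w2, b2):
    """Two-layer MLP: ReLU hidden layer followed by a single Sigmoid unit."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over N samples; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Placeholder parameters: 3 input features, 2 hidden units (the text uses 32).
w1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b1 = [0.0, 0.0]
w2 = [0.0, 0.0]
b2 = 0.0
untrained_output = mlp_forward([0.2, 0.7, 0.1], w1, b1, w2, b2)

# Hypothetical labels and predictions for two samples.
good_loss = binary_cross_entropy([1, 0], [0.9, 0.1])
bad_loss = binary_cross_entropy([1, 0], [0.2, 0.8])
```

The loss falls toward 0 as predictions approach the true labels, which is the convergence criterion the training loop would monitor.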
[0026] Furthermore, the weights of the three feature view vectors (C, I, O) are calculated as follows:
$$w_v = \frac{f_{MLP}(x_v)}{\sum_{u \in \{C, I, O\}} f_{MLP}(x_u) + \varepsilon}$$
where $x_v$ represents the v-th feature view vector; $f_{MLP}(\cdot)$ represents the mapping function of the two-layer fully connected network (multilayer perceptron), which takes the v-th feature view vector as input and outputs a scalar quantifying the importance of that view; and $\varepsilon$ is a very small positive number.
[0027] The normalized weights are controlled within a variable range [a, b] by using a truncation function, where a is the lower limit of the weights, ranging from [0.05, 0.2], and its core function is to prevent secondary views (such as omics data views in small sample scenarios) from being ignored due to excessively low weights. b is the upper limit of the weights, ranging from [0.6, 0.9], and its core function is to prevent single views (such as endoscopic images with sufficient data) from having excessively high weights that dominate the fusion result.
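The normalization and truncation of the view weights can be sketched as below. The importance scores, the choice a = 0.1 and b = 0.8 (picked from within the stated ranges), and the function name are all hypothetical; normalizing by the sum of the scores plus a small constant is an assumption about the weight formula.

```python
def view_weights(scores, a=0.1, b=0.8, eps=1e-8):
    """Normalize per-view importance scores, then clip each weight to [a, b].

    scores: dict mapping view name -> scalar importance from the attention MLP.
    """
    total = sum(scores.values()) + eps
    return {view: min(b, max(a, s / total)) for view, s in scores.items()}

# Hypothetical MLP outputs for the clinical (C), image (I), and omics (O) views.
weights = view_weights({"C": 0.30, "I": 0.95, "O": 0.02})
```

In this toy case the omics view would receive a normalized weight of about 0.016 and is raised to the lower limit 0.1, so it still contributes to the fused feature. Note that clipped weights no longer sum exactly to 1; the text does not state whether they are renormalized afterwards.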
[0028] The fused features are constructed by weighted feature concatenation, as shown in the following formula:
$$F = (w_C \cdot C) \oplus (w_I \cdot I) \oplus (w_O \cdot O)$$
where $\oplus$ represents vector concatenation; the dimension of the fused feature is the sum of the dimensions of the three feature view vectors. After fusion, the features are L2 normalized:
$$F_{norm} = \frac{F}{\sqrt{\sum_{k} F_k^2} + \varepsilon}$$
where $F_k$ represents the k-th component of the fused feature vector F and $\varepsilon$ represents a very small positive number. This ensures that the feature vector magnitude is 1, reducing the impact of scale differences on cluster distance calculation.
The ConsensusClusterPlus algorithm is then applied to the L2-normalized fused features. Multiple rounds of resampling clustering are performed to verify the stability of the clustering results through repeated sampling, avoiding the randomness of a single clustering run. The specific process and parameters are as follows: A range for the number of clusters is set based on the routine needs of clinical risk stratification, leaving room for further subdivided risk gradients (such as low-to-medium and medium-to-high), so that the clustering results remain compatible with clinical practice. The number of resampling rounds is 500, and the sampling ratio is 0.8 in each round. Euclidean distance is used to quantify the feature differences between samples: it directly reflects the geometric distance between samples in the fused feature space and is suited to similarity evaluation of continuous features.
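The weighted concatenation and L2 normalization can be illustrated with a short Python sketch. The view vectors and weights below are toy values with hypothetical dimensions; in the method itself the vectors would be the Min-Max normalized C, I, and O vectors and the weights would come from the attention network.

```python
import math

def fuse(views, weights, eps=1e-12):
    """Scale each view by its weight, concatenate in C, I, O order, L2-normalize."""
    fused = []
    for name in ("C", "I", "O"):
        fused.extend(weights[name] * x for x in views[name])
    norm = math.sqrt(sum(x * x for x in fused)) + eps
    return [x / norm for x in fused]

# Toy feature views (already scaled to [0, 1]) and hypothetical weights.
views = {"C": [0.2, 0.9], "I": [0.5, 0.1, 0.7], "O": [0.3]}
weights = {"C": 0.25, "I": 0.65, "O": 0.10}
fused = fuse(views, weights)
```

Because every fused vector ends up with (almost exactly) unit length, Euclidean distances between samples in the clustering step depend on the direction of the feature vectors rather than on their overall scale.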
[0029] The K-means algorithm was selected, with the following core parameter settings: number of iterations = 100, initialization method = K-means++, and clustering objective = minimizing the sum of squares (SSE) within clusters. ,in This represents the center vector of the i-th cluster (the mean vector of all sample features within the cluster). The essential properties of the cluster are the fused features after L2 normalization, obtained through ConsensusClusterPlus algorithm + K-means clustering. The results of grouping (the value of i matches the number of clusters K. In this scheme, K=4, so i=1, 2, 3, 4, corresponding to cluster 1, cluster 2, cluster 3, and cluster 4). Let SSE represent the sample set of the i-th cluster. The smaller the SSE value, the higher the similarity of the samples within the cluster. For the clustering results of 500 resampling operations, the frequency of each pair of samples (i, j) being assigned to the same cluster is counted, and a consensus matrix M is constructed. The matrix elements are defined as M(i, j) ∈ [0, 1], where M(i, j) is the frequency of samples (i, j) being assigned to the same cluster. A three-dimensional evaluation system of cohesion, separation, and stability is constructed. The optimal number of clusters is determined comprehensively through these three quantitative indicators, avoiding the limitations of a single indicator. The silhouette coefficient measures the cluster cohesion (similarity between samples within the same cluster) and separation (difference between samples in different clusters), using the formula... Calculate the profile coefficient, where This represents the average Euclidean distance between sample i and other samples in the same cluster. The average Euclidean distance between sample i and all samples in the nearest heterogeneous cluster is represented by the silhouette coefficients of all samples. 
The global average silhouette coefficient is obtained by calculating the silhouette coefficients of all samples and taking their mean. The larger the average silhouette coefficient, the better the clustering effect.
The Davies-Bouldin index (DBI) measures the balance between inter-cluster similarity and intra-cluster dispersion: $\mathrm{DBI} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$, where $\sigma_i$ is the average intra-cluster distance of the i-th cluster, reflecting the intra-cluster dispersion, and $d(c_i, c_j)$ is the Euclidean distance between the centers of cluster i and cluster j, reflecting the inter-cluster separation. The smaller the DBI, the more compact each cluster is and the better separated the clusters are.
The stability of the clustering results is measured by the consensus matrix density (CMD): $\mathrm{CMD} = \frac{2}{N(N-1)} \sum_{i<j} M(i, j)$, where N is the number of samples; only the mean of the upper triangular part of the consensus matrix (excluding diagonal elements) is calculated. A preset consensus matrix density threshold Ca is used: CMD ≥ Ca indicates that the clustering results are highly consistent across different resampled subsets, whereas CMD < Ca indicates low consistency. The following table shows the indicators corresponding to different numbers of clusters in this embodiment. The optimal solution is selected through horizontal comparison: Table 1
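The three indicators described in this paragraph can be computed as in the following sketch. It is a plain-Python illustration under stated assumptions: the intra-cluster dispersion in the Davies-Bouldin index is taken as the mean distance of members to their cluster center (the standard definition), singleton clusters are given a silhouette of 0 by convention, and the toy points, labels, and consensus matrix are invented for the example.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all samples."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        def avg_to(c):
            d = [dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
            return sum(d) / len(d) if d else 0.0
        if labels.count(labels[i]) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = avg_to(labels[i])
        b = min(avg_to(c) for c in clusters if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

def davies_bouldin(points, labels):
    """Mean over clusters of the worst (sigma_i + sigma_j) / d(c_i, c_j) ratio."""
    clusters = sorted(set(labels))
    centers, spread = {}, {}
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centers[c] = [sum(col) / len(members) for col in zip(*members)]
        spread[c] = sum(dist(p, centers[c]) for p in members) / len(members)
    worst = [max((spread[i] + spread[j]) / dist(centers[i], centers[j])
                 for j in clusters if j != i) for i in clusters]
    return sum(worst) / len(worst)

def consensus_density(m):
    """Mean of the upper-triangular entries of the consensus matrix (diagonal excluded)."""
    n = len(m)
    upper = [m[i][j] for i in range(n) for j in range(i + 1, n)]
    return sum(upper) / len(upper)

# Toy example: two tight pairs of points and a matching consensus matrix.
pts = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
labs = [0, 0, 1, 1]
cons = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
sil = mean_silhouette(pts, labs)
dbi = davies_bouldin(pts, labs)
cmd = consensus_density(cons)
print(round(sil, 4), round(dbi, 4), round(cmd, 4))
```

In practice the three values would be computed for every candidate number of clusters and compared side by side, as the paragraph describes for Table 1.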
[0030] The number of clusters is selected by prioritizing the highest average silhouette coefficient, the lowest DBI, a CMD at or above the preset threshold Ca, and the best clinical fit. The adjusted Rand index (ARI) is used to quantify the consistency of clustering results across different resampled subsets: $\mathrm{ARI} = \frac{2\,(TP \cdot TN - FP \cdot FN)}{(TP + FP)(FP + TN) + (TP + FN)(FN + TN)}$, where TP is the number of sample pairs assigned to the same cluster in both clustering results; TN is the number of sample pairs assigned to different clusters in both clustering results; FP is the number of false co-clustered sample pairs, that is, pairs that belong to different clusters in the reference clustering result but are assigned to the same cluster in the target clustering result; and FN is the number of false hetero-clustered sample pairs, that is, pairs that belong to the same cluster in the reference clustering result but are assigned to different clusters in the target clustering result.
Step 3: Construction and application of the risk early warning model.
Based on clinical pathology gold standards (such as pathological biopsy results and lesion follow-up progress records), the determined optimal clustering result (K = 4 in this example) is labeled with risk levels. Pathological diagnostic results (degree of dysplasia, grade of intestinal metaplasia) and follow-up data over a period T2 (e.g., 12-24 months) are collected for all clustered samples. The incidence and progression rate of high-risk lesions (moderate to severe dysplasia, severe intestinal metaplasia) within each cluster are statistically analyzed, for example:
Cluster 1: high-risk lesion incidence <5%, no progression cases, classified as low risk;
Cluster 2: high-risk lesion incidence 5%~20%, progression rate <10% / year, classified as low-medium risk;
Cluster 3: high-risk lesion incidence 20%~50%, progression rate 10%~30% / year, classified as medium-high risk;
Cluster 4: high-risk lesion incidence >50%, progression rate >30% / year, classified as high risk.
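The adjusted Rand index introduced at the start of this paragraph can be computed directly from the four kinds of sample pairs. The sketch below uses the pair-counting form of the ARI, which is a standard equivalent of the contingency-table form. The variable names `tp`, `tn`, `fp`, `fn` and the toy label lists are assumptions made for illustration.

```python
from itertools import combinations

def pair_counts(reference, target):
    """Count sample pairs by agreement between a reference and a target clustering."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_tgt = target[i] == target[j]
        if same_ref and same_tgt:
            tp += 1  # together in both results
        elif not same_ref and not same_tgt:
            tn += 1  # apart in both results
        elif same_tgt:
            fp += 1  # apart in the reference, together in the target
        else:
            fn += 1  # together in the reference, apart in the target
    return tp, tn, fp, fn

def adjusted_rand_index(reference, target):
    tp, tn, fp, fn = pair_counts(reference, target)
    if fp == 0 and fn == 0:
        return 1.0  # identical partitions (also covers the zero-denominator case)
    denom = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return 2.0 * (tp * tn - fn * fp) / denom

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])     # same grouping, renamed labels
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # grouping that cuts across the reference
print(same, crossed)
```

Because the index depends only on which pairs are grouped together, relabeling the clusters does not change its value, which makes it suitable for comparing clusterings from different resampled subsets.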
[0031] Establish a one-to-one correspondence between clusters and risk levels to form a risk level mapping dictionary, providing a basis for subsequent early warning output.
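A risk level mapping dictionary of this kind could be built as in the sketch below. It is a simplified illustration: it uses only the high-risk lesion incidence bands from the example above and ignores the progression rate, and the treatment of values exactly on a band boundary is an assumption, since the example does not specify it. The function names and the per-cluster incidence values are hypothetical.

```python
RISK_LEVELS = ["low", "low-medium", "medium-high", "high"]

def level_from_incidence(rate):
    """Map a cluster's high-risk lesion incidence (0..1) to an ordinal level 0..3."""
    if rate < 0.05:
        return 0
    if rate < 0.20:
        return 1
    if rate <= 0.50:  # boundary handling is an assumption
        return 2
    return 3

def build_risk_map(cluster_incidence):
    """Build {cluster id: risk level name} and check that the mapping is one-to-one."""
    mapping = {c: RISK_LEVELS[level_from_incidence(r)] for c, r in cluster_incidence.items()}
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("two clusters fell into the same risk level; review the labeling")
    return mapping

# Hypothetical incidence values for the four clusters.
risk_map = build_risk_map({1: 0.02, 2: 0.12, 3: 0.35, 4: 0.64})
print(risk_map)
```

The explicit one-to-one check reflects the requirement stated above; if two clusters land in the same band, the labeling needs clinical review rather than an automatic tie-break.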
[0032] A multi-class early warning model is constructed using the generated L2-normalized fusion features as input and the labeled risk levels as labels (ordinal encoding: low risk = 0, low-medium risk = 1, medium-high risk = 2, high risk = 3). The model architecture is a Light Gradient Boosting Machine (LightGBM) with the following core parameter settings, chosen to avoid overfitting: learning rate = 0.01, number of decision trees = 200, maximum tree depth = 6, and number of leaf nodes = 32. 5-fold cross-validation is used to divide the data into training and validation sets (ratio 8:2 in each fold). The multi-class log loss is used as the optimization objective, and training iterates until the validation set loss converges. An early stopping mechanism is introduced (training stops if the validation set loss does not decrease for 10 consecutive rounds). Hyperparameters are optimized through grid search to ensure the model's generalization ability.
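The training setup described above might be expressed as in the following sketch. The parameter names follow the LightGBM native Python API; the helper names `five_fold_indices` and `train_one_fold`, the unstratified fold split, and the omission of grid search are assumptions made to keep the example short. The `lightgbm` import is placed inside the training function so that the configuration and the fold splitter can be inspected without the library installed.

```python
LEVEL_CODES = {"low": 0, "low-medium": 1, "medium-high": 2, "high": 3}

LGBM_PARAMS = {
    "objective": "multiclass",   # four ordinal-coded risk levels treated as classes
    "num_class": 4,
    "metric": "multi_logloss",   # multi-class log loss on the validation fold
    "learning_rate": 0.01,
    "max_depth": 6,
    "num_leaves": 32,
    "verbosity": -1,
}
NUM_TREES = 200
EARLY_STOPPING_ROUNDS = 10

def five_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, valid_idx); each fold holds out about 1/n_folds of the samples."""
    idx = list(range(n_samples))
    for f in range(n_folds):
        valid = idx[f::n_folds]
        held_out = set(valid)
        train = [t for t in idx if t not in held_out]
        yield train, valid

def train_one_fold(x_train, y_train, x_valid, y_valid):
    """Train one LightGBM booster with early stopping on the validation fold."""
    import lightgbm as lgb  # lazy import: only needed when training is actually run
    train_set = lgb.Dataset(x_train, label=y_train)
    valid_set = lgb.Dataset(x_valid, label=y_valid, reference=train_set)
    return lgb.train(
        LGBM_PARAMS,
        train_set,
        num_boost_round=NUM_TREES,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(stopping_rounds=EARLY_STOPPING_ROUNDS)],
    )

folds = list(five_fold_indices(10))
print([(len(tr), len(va)) for tr, va in folds])
```

With five folds, each round trains on roughly 80% of the samples and validates on the remaining 20%, which matches the 8:2 ratio stated above. A stratified split would be a natural refinement when the risk levels are imbalanced.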
[0033] The model performance is validated using an independent test set (sample size ≥ 20% of the total samples, with no overlap with the training set). Key evaluation metrics include: Classification performance: overall accuracy, and precision, recall, and F1 score for each risk level, with a recall of ≥90% required for the high-risk level (to reduce missed diagnoses); Prediction reliability: multi-class AUC (calculated using a one-vs-one strategy), requiring AUC ≥ 0.85; Clinical applicability: decision curve analysis (DCA) is performed to determine the optimal warning threshold, maximizing the net benefit of the model within a clinically acceptable false positive rate (≤15%).
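Two of the checks above, recall for the high-risk level and the net benefit used in decision curve analysis, can be sketched as follows. The net benefit formula is the standard one, TP/N − (FP/N) · pt/(1 − pt) at threshold probability pt. Reducing the four-level problem to "high risk versus the rest" for the decision curve is an assumption of this example, and all labels and probabilities are invented.

```python
def recall_for_class(y_true, y_pred, cls):
    """Fraction of samples that truly belong to cls and are predicted as cls."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    actual = sum(1 for t in y_true if t == cls)
    return hits / actual if actual else 0.0

def net_benefit(y_true_binary, prob_positive, threshold):
    """Net benefit at one threshold: TP/N - FP/N * pt / (1 - pt)."""
    n = len(y_true_binary)
    tp = sum(1 for y, p in zip(y_true_binary, prob_positive) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true_binary, prob_positive) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

# Toy test-set labels (0..3) and predictions.
y_true = [3, 3, 3, 0, 1, 2, 3, 0, 1, 2]
y_pred = [3, 3, 3, 0, 1, 2, 2, 0, 1, 3]
high_recall = recall_for_class(y_true, y_pred, cls=3)

# Toy predicted probabilities for the high-risk level.
is_high = [1 if t == 3 else 0 for t in y_true]
p_high = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.6]
nb = net_benefit(is_high, p_high, threshold=0.5)
print(high_recall, nb, high_recall >= 0.90)
```

Evaluating `net_benefit` over a grid of thresholds gives the decision curve, from which a warning threshold can be chosen subject to the false positive constraint.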
[0034] The model threshold is adjusted based on the validation results. When the predicted probability of a new sample's risk level exceeds the corresponding threshold, a warning for the corresponding level is triggered.
[0035] For newly admitted patients or patients undergoing physical examination, the data collection, preprocessing, and feature fusion of steps one and two are completed, and the resulting data is input into the trained and optimized early warning model, which outputs the following results:
Risk level assessment: a direct conclusion of low / low-medium / medium-high / high risk;
Risk contribution analysis: based on SHAP (SHapley Additive exPlanations) values, the contribution of each feature view (clinical C, imaging I, omics O) and of core features (such as history of Helicobacter pylori infection, deep features of endoscopic images, and differential metabolite concentration) to the risk level is quantified, and key risk factors are output;
Clinical recommendations: generated based on risk levels and standardized intervention protocols. For low-risk patients, routine follow-up once every 1-2 years is recommended. For low-medium risk patients, follow-up every 6 months to 1 year and lifestyle intervention are recommended. For medium-high risk patients, follow-up every 3-6 months and endoscopic re-examination are recommended. For high-risk patients, urgent endoscopic examination and pathological biopsy are recommended, with follow-up every 1-3 months.
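The threshold-triggered warning and the recommendation lookup could be combined as in the sketch below. It is an illustration only: the rule of reporting the highest level whose probability exceeds its threshold, the fallback to the most probable level, and the threshold values are assumptions, since the description does not fix them. The recommendation texts paraphrase the paragraph above.

```python
LEVELS = ["low", "low-medium", "medium-high", "high"]

RECOMMENDATIONS = {
    "low": "routine follow-up once every 1-2 years",
    "low-medium": "follow-up every 6 months to 1 year plus lifestyle intervention",
    "medium-high": "follow-up every 3-6 months plus endoscopic re-examination",
    "high": "urgent endoscopy and pathological biopsy, follow-up every 1-3 months",
}

def assess(probabilities, thresholds):
    """Pick a risk level from per-level probabilities and attach the recommendation."""
    # Levels whose predicted probability exceeds the corresponding threshold
    triggered = [lvl for lvl in LEVELS if probabilities[lvl] > thresholds[lvl]]
    if triggered:
        level = triggered[-1]  # assumption: report the most severe triggered level
    else:
        level = max(LEVELS, key=lambda lvl: probabilities[lvl])
    return {"risk_level": level, "recommendation": RECOMMENDATIONS[level]}

# Hypothetical thresholds and model outputs for two patients.
thresholds = {lvl: 0.35 for lvl in LEVELS}
patient_a = assess({"low": 0.1, "low-medium": 0.2, "medium-high": 0.3, "high": 0.4}, thresholds)
patient_b = assess({"low": 0.30, "low-medium": 0.28, "medium-high": 0.22, "high": 0.20}, thresholds)
print(patient_a["risk_level"], patient_b["risk_level"])
```

Preferring the most severe triggered level errs on the side of caution, which is in line with the stated aim of reducing missed diagnoses.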
[0036] A model iterative update mechanism is established. New clinical samples (including complete data and follow-up results) are added to the training set at regular intervals (every 6-12 months), and the feature fusion in step two and the model training in step three are re-executed to optimize the risk level mapping relationship. The model's performance indicators in actual clinical application (such as missed diagnosis rate and misdiagnosis rate) are monitored, and an emergency update is triggered when an indicator exceeds its preset threshold (missed diagnosis rate > 10% or misdiagnosis rate > 20%). The scope of data collection and the feature screening rules are dynamically adjusted in line with the latest clinical guidelines and newly added biomarkers (such as novel metabolomics indicators and microbial biomarkers), continuously improving the timeliness and accuracy of early warnings.
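The emergency update trigger can be written as a small check, sketched below. The definitions used here, missed diagnosis rate = FN / (TP + FN) and misdiagnosis rate = FP / (FP + TN), are assumptions, since the description names the indicators without defining them. The counts in the example are invented.

```python
def missed_diagnosis_rate(tp, fn):
    """Share of true high-risk cases that the model failed to flag (assumed definition)."""
    return fn / (tp + fn) if (tp + fn) else 0.0

def misdiagnosis_rate(fp, tn):
    """Share of non-high-risk cases that the model flagged (assumed definition)."""
    return fp / (fp + tn) if (fp + tn) else 0.0

def needs_emergency_update(tp, fn, fp, tn, max_missed=0.10, max_misdiagnosis=0.20):
    """True when either monitored indicator exceeds its preset threshold."""
    return (missed_diagnosis_rate(tp, fn) > max_missed
            or misdiagnosis_rate(fp, tn) > max_misdiagnosis)

# Hypothetical monitoring windows.
drifted = needs_emergency_update(tp=40, fn=10, fp=5, tn=45)  # missed rate 20%
healthy = needs_emergency_update(tp=48, fn=2, fp=5, tn=45)   # missed 4%, misdiagnosis 10%
print(drifted, healthy)
```

The regular 6-12 month retraining runs independently of this check; the trigger only adds an earlier retraining when monitored performance degrades.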
[0037] The above formulas are all dimensionless calculations, and the preset parameters in the formulas should be set by those skilled in the art according to the actual situation.
[0038] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
[0039] The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to specific implementations. Clearly, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, thereby enabling those skilled in the art to better understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.
Claims
1. An early warning method for the risk of precancerous lesions of gastric cancer based on multi-source heterogeneous data clustering, characterized in that it comprises:
Step 1: Acquisition and preprocessing of multi-source heterogeneous data; structured clinical data, endoscopic imaging data, metabolomics data, and gut microbiota data of patients are collected and subjected to feature screening, quality control, and standardization to obtain three types of data sources after preliminary processing.
Step 2: Differentiated data preprocessing and cluster analysis; targeted preprocessing is performed on the three types of data sources to generate corresponding feature vectors; weights are assigned to the three types of feature vectors through an attention mechanism network, and the fused features are obtained after weighted concatenation and normalization; based on the fused features, a multi-round resampling clustering algorithm is used for clustering, and the optimal clustering result is selected by combining cohesion, separation, and stability indicators.
Step 3: Construction and application of the risk early warning model; the optimal clustering result is labeled with different risk levels by combining the clinical pathology gold standard and follow-up data; a multi-class early warning model is constructed and validated using the fused features as input and the risk levels as labels; new patient data is processed and input into the model, which outputs the risk level, key risk factors, and a clinical intervention plan; and a mechanism for regular and emergency model updates is established to dynamically optimize early warning performance.
2. The method for early warning of gastric precancerous lesion risk based on multi-source heterogeneous data clustering according to claim 1, characterized in that the collection and processing of structured clinical data specifically comprises: complete clinical data of patients within a time frame T1 is collected, the optimal value of T1 being determined through preliminary experiments with reference to the clinical progression cycle of precancerous lesions of the stomach and the update frequency of clinical data; N specific features are extracted from the clinical data; the features are screened based on clinical guidelines, literature research, and clinical relevance; features significantly related to the progression of precancerous lesions of the stomach are selected, using Pearson correlation analysis for continuous features and the chi-square test for categorical features; features with a missing rate > θ are removed based on the availability of clinical data; and the value of N is finally determined, θ being set according to historical data and actual needs.
3. The method for early warning of gastric precancerous lesion risk based on multi-source heterogeneous data clustering according to claim 1, characterized in that the acquisition of endoscopic image data is specifically as follows: a high-definition white light endoscope is used to cover the key lesion-prone areas of the gastric antrum, gastric body, and gastric angle; at least a specified number of clear images is collected for each area to ensure complete image coverage of each key area; and the total number of images collected, Mt, is determined through an image quality assessment experiment.
4. The method for early warning of gastric precancerous lesion risk based on multi-source heterogeneous data clustering according to claim 1, characterized in that the collection and preliminary processing of omics data are as follows: metabolomics data is acquired using a liquid chromatography-tandem mass spectrometry platform; chromatographic and mass spectrometric conditions are optimized for the target metabolites before detection to ensure that the resolution of each chromatographic peak and the quantitative precision of repeated injections meet the actual requirements; based on metabolomic data from the precancerous lesion group and the healthy control group, metabolites with significantly different levels between the groups are screened, and their association with gastric mucosal lesions is verified against the clinical literature; a set of core metabolites is finally selected to form a detection set; after quality control, the concentration of each metabolite is standardized by Z-score. Gut microbiota data is obtained through high-throughput sequencing of 16S rRNA genes; genera with an average relative abundance ≥ a1 across all samples are retained; genera with significant differences between the lesion group and the control group are selected using differential analysis methods such as LEfSe to form a set of marker genera associated with gastric mucosal lesions; the relative abundance of the selected genera is logarithmically transformed to reduce the influence of extreme values and to approximate a normal distribution; a1 is a preset threshold set according to historical data and actual needs.
5. The method for early warning of gastric precancerous lesion risk based on multi-source heterogeneous data clustering according to claim 1, characterized in that targeted preprocessing is performed on the three types of data sources, specifically as follows: for structured data, outliers are removed using the IQR method, missing values are filled using the MICE method, and features are encoded to generate vector C; for endoscopic images, images are scored using the Brenner gradient function to remove low-quality images, and deep features are extracted based on ResNet50 to generate vector I; for omics data, after extreme values are corrected and redundant features are removed, dimensionality reduction is performed using the t-SNE or PCA algorithm, retaining the core dimensions that meet the explained variance ratio, to generate vector O.
6. The method for early warning of gastric precancerous lesion risk based on multi-source heterogeneous data clustering according to claim 1, characterized in that weights are assigned to the three types of feature vectors using an attention mechanism network, and the fused features are obtained after weighted concatenation and normalization, specifically as follows: Min-Max normalization is performed on the three types of feature vectors, mapping them to the interval [0, 1]. A two-layer fully connected network is constructed as the attention mechanism network, with the correlation between clustered feature clusters and lesion progression as the loss target. The loss function is the cross-entropy loss $L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$, where $L_{CE}$ is the cross-entropy loss value, N is the total number of samples participating in training, i is the sample index, $y_i$ is the true label of the i-th sample, and $\hat{y}_i$ is the predicted probability of the i-th sample output by the MLP network, calculated by the two-layer fully connected network. The network takes a single-view feature vector as input, applies ReLU and Sigmoid activations, and produces an output that directly reflects that view's ability to predict the sample's risk of progression. Training yields a scalar weight for each feature vector, and a truncation function keeps the weights within a preset range. Each feature vector is multiplied by its weight, and the weighted vectors are concatenated to obtain the fused feature F. F is then L2-normalized so that the feature vector magnitude is 1, reducing the impact of scale differences.
7. The method for early warning of gastric precancerous lesion risk based on multi-source heterogeneous data clustering according to claim 1, characterized in that, based on the fused features, a multi-round resampling clustering algorithm is used for clustering, specifically as follows: using the L2-normalized fusion features as input, the ConsensusClusterPlus algorithm combined with K-means is applied, the range of cluster numbers and the number of resampling rounds are set, and Euclidean distance is used to quantify sample differences; after multiple rounds of resampling, the frequency with which samples fall in the same cluster is counted to construct a consensus matrix; the optimal number of clusters is selected by combining three indicators, namely the average silhouette coefficient, the Davies-Bouldin index, and the consensus matrix density, with clinical suitability; and the consistency of clustering results across different resampled subsets is verified using the adjusted Rand index.
8. The method for early warning of gastric precancerous lesion risk based on multi-source heterogeneous data clustering according to claim 7, characterized in that the consistency of clustering results across different resampled subsets is verified using the adjusted Rand index (ARI), specifically as follows: the consistency is quantified as $\mathrm{ARI} = \frac{2\,(TP \cdot TN - FP \cdot FN)}{(TP + FP)(FP + TN) + (TP + FN)(FN + TN)}$, where TP is the number of sample pairs assigned to the same cluster in both clustering results; TN is the number of sample pairs assigned to different clusters in both clustering results; FP is the number of false co-clustered sample pairs, that is, pairs that belong to different clusters in the reference clustering result but are assigned to the same cluster in the target clustering result; and FN is the number of false hetero-clustered sample pairs, that is, pairs that belong to the same cluster in the reference clustering result but are assigned to different clusters in the target clustering result.
9. The method for early warning of gastric precancerous lesion risk based on multi-source heterogeneous data clustering according to claim 1, characterized in that the risk early warning model is constructed and applied specifically as follows: risk levels are assigned to the optimal clustering result by combining the pathological gold standard and follow-up data; a LightGBM multi-class model is constructed using the L2-normalized fusion features as input and the risk levels as labels; after training with 5-fold cross-validation and hyperparameter optimization, the model is validated using an independent test set; the optimal warning threshold is determined through DCA; new patient data is preprocessed and input into the model, which outputs the risk level, key risk factors, and the corresponding clinical intervention plan; and a regular update mechanism is established to add new samples, adjust the data range and feature rules, and optimize warning performance.