Cancer subtype identification method and system based on adaptive denoising and contrastive learning

By employing adaptive denoising and contrastive learning methods, and utilizing a pre-trained Transformer encoder to extract cancer gene features and perform attribution analysis, the problem of identifying complex interactive relationships in high-dimensional data was solved, enabling accurate subtyping and interpretable diagnosis of cancer subtypes.

CN122245697A · Pending · Publication Date: 2026-06-19 · FOSHAN UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
FOSHAN UNIVERSITY
Filing Date
2026-02-02
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively capture the complex interactions in cancer gene expression data when processing high-dimensional, nonlinear, multi-omics data. Furthermore, traditional methods rely on large amounts of labeled data or lack clinical information, resulting in insufficient classification quality and interpretability for cancer subtype identification.

Method used

Adaptive denoising and contrastive learning are employed: a pre-trained denoising Transformer encoder extracts deep features used to identify the cancer subtype, and attribution analysis based on the integrated gradients technique outputs a list of key genes.

Benefits of technology

It enables precise identification of cancer subtypes, provides interpretable classification conclusions and potential therapeutic targets, improves the efficiency, accuracy and clinical relevance of molecular subtyping, and provides data-driven support for personalized medicine.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method and system for cancer subtype identification based on adaptive denoising and contrastive learning. The method includes: acquiring gene expression data of a target patient and performing standardized preprocessing to obtain initial data; extracting features from the initial data using a pre-trained denoising Transformer encoder to obtain a low-dimensional feature embedding vector; calculating the similarity between the low-dimensional feature embedding vector and each cluster center in a preset set of cluster centers, and determining the target cancer subtype label corresponding to the target patient based on the mapping relationship corresponding to the cluster center with the highest similarity; and performing attribution analysis, based on the integrated gradients technique, on the decision-making process of classifying the target patient into the target cancer subtype label, identifying and ranking the genes that contribute the most to the decision, and obtaining a list of key biomarkers. This invention can therefore identify cancer subtypes from patient gene data and simultaneously output a list of key genes driving the subtype classification conclusion.

Description

Technical Field

[0001] This invention relates to the field of cancer subtype identification technology at the intersection of artificial intelligence and bioinformatics, and in particular to a cancer subtype identification method and system based on adaptive denoising and contrastive learning.

Background Technology

[0002] Malignant tumors have become a major threat to human health worldwide, and their incidence rate is increasing year by year. Among them, pancreatic cancer, lung cancer, and breast cancer, due to their high heterogeneity and complex molecular characteristics, lead to significant differences in patient prognosis. One of the main reasons for the difficulty in treating heterogeneous tumor patients is the huge difference in sensitivity to treatment regimens among different molecular subtypes. The same treatment method can lead to different survival outcomes for patients with different subtypes, requiring targeted treatment for different subtypes.

[0003] With the rapid development of artificial intelligence technology, it has become possible to identify tumor subtypes by comprehensively analyzing multi-omics data and using machine learning methods to describe the molecular characteristics of cancer patients from multiple perspectives. Accurately identifying tumor subtypes by combining machine learning methods with multi-omics data can help optimize treatment strategies, improve treatment outcomes, and increase patient survival rates.

[0004] However, current machine learning-based tumor subtype identification methods still face many challenges when processing high-dimensional, nonlinear, multi-omics data:

[0005] 1. Traditional methods (such as principal component analysis combined with K-Means) struggle to capture the complex interactions between features of high-dimensional, nonlinear gene expression data, resulting in limited typing quality.

[0006] 2. Although supervised learning-based typing methods have good performance, they rely heavily on a large amount of high-quality labeled data, which is costly to obtain in real clinical scenarios.

[0007] 3. Existing unsupervised learning methods typically employ a two-stage process of independent feature extraction and clustering, lacking collaborative optimization and often ignoring key clinical indicators such as patient survival information, resulting in weak correlation between classification results and prognosis.

[0008] 4. The "black box" nature of deep learning models leads to poor interpretability and difficulty in identifying key biomarkers, which limits their application in clinical decision-making and targeted therapy development.

Summary of the Invention

[0009] The technical problem to be solved by this invention is to provide a cancer subtype identification method and system based on adaptive denoising and contrastive learning, which can accurately identify cancer subtypes from patient gene data and simultaneously output a list of key genes driving the subtyping conclusion, providing clinicians with integrated and interpretable decision support for molecular subtyping and therapeutic target discovery.

[0010] To address the aforementioned technical problems, the first aspect of this invention discloses a cancer subtype identification method based on adaptive denoising and contrastive learning, the method comprising the following steps: obtaining gene expression data of the target patient and performing standardized preprocessing on the gene expression data to obtain initial data; extracting features from the initial data using a pre-trained denoising Transformer encoder to obtain a low-dimensional feature embedding vector; calculating the similarity between the low-dimensional feature embedding vector and each cluster center in the preset cluster center set, and determining the target cancer subtype label corresponding to the target patient based on the mapping relationship corresponding to the cluster center with the highest similarity; and performing attribution analysis, based on the integrated gradients technique, on the decision-making process of classifying the target patient into the target cancer subtype label, identifying and ranking the genes that contribute the most to the decision, and obtaining a list of key biomarkers.

[0011] As an optional implementation, in the first aspect of the present invention, the standardized preprocessing of the gene expression data to obtain initial data includes the following steps: converting the gene expression data to a numerical type and imputing missing values using the median to obtain processed data; aligning the processed data with a preset standardized gene panel using gene identifiers to obtain aligned data; and standardizing the aligned data to obtain the initial data.

[0012] As an optional implementation, in the first aspect of the present invention, the step of extracting features from the initial data using a pre-trained denoising Transformer encoder to obtain a low-dimensional feature embedding vector includes the following steps: The initial data is mapped to the model dimension through an input linear projection layer to obtain projected features; Add position encoding information to the projection feature to obtain the projection feature with added position encoding information; The projected features with added position encoding information are input into at least one Transformer encoding block for processing to obtain the processed features; The processed features are vector normalized to obtain low-dimensional feature embedding vectors.

[0013] As an optional implementation, in a first aspect of the invention, the Transformer coding block includes a multi-head self-attention mechanism and a feedforward neural network.

[0014] As an optional implementation, in the first aspect of the present invention, the step of calculating the similarity between the low-dimensional feature embedding vector and each cluster center in a preset set of cluster centers, and determining the target cancer subtype label corresponding to the target patient based on the mapping relationship corresponding to the cluster center with the highest similarity, includes the following steps: Calculate the similarity between the low-dimensional feature embedding vector and each cluster center in the preset cluster center set to obtain a set of similarity values; The maximum value is determined from the set of similarity values, and the cluster center corresponding to the maximum value is determined as the matching cluster center; Query the preset mapping table between cluster centers and cancer subtype labels, obtain the cancer subtype label that uniquely corresponds to the matching cluster center, and use it as the target cancer subtype label.

[0015] As an optional implementation, in the first aspect of the present invention, the mapping table is established during the model training phase, and each cluster center is assigned a corresponding cancer subtype label based on the clustering results of historical patient data and clinical information analysis.

[0016] As an optional implementation, in the first aspect of the present invention, performing attribution analysis, based on the integrated gradients technique, on the decision process of classifying the target patient into the target cancer subtype label, identifying and ranking several genes that contribute the most to the decision, and obtaining a list of key biomarkers includes the following steps: constructing a linear interpolation path between the initial data and a baseline input, and sampling multiple interpolation points along the path; for each interpolation point, calculating the gradient of the attribution objective function with respect to the input features, where the attribution objective function is defined in terms of the denoising Transformer encoder, x is the initial data, and centroid is the cluster center with the highest similarity; integrating all gradients calculated through the attribution objective function along the interpolation path to obtain the integrated gradient value corresponding to each gene feature; and sorting the genes according to the absolute value of the integrated gradient corresponding to each gene feature, and selecting the top-ranked genes to form the list of key biomarkers.

[0017] A second aspect of this invention discloses a cancer subtype identification system based on adaptive denoising and contrastive learning, the system comprising: an acquisition module, used to acquire gene expression data of the target patient and perform standardized preprocessing on the gene expression data to obtain initial data; an extraction module, used to extract features from the initial data using a pre-trained denoising Transformer encoder to obtain a low-dimensional feature embedding vector; a determination module, used to calculate the similarity between the low-dimensional feature embedding vector and each cluster center in the preset cluster center set, and to determine the target cancer subtype label corresponding to the target patient based on the mapping relationship corresponding to the cluster center with the highest similarity; and an attribution analysis module, used to perform attribution analysis, based on the integrated gradients technique, on the decision-making process of classifying the target patient into the target cancer subtype label, identify and rank the genes that contribute the most to the decision, and obtain a list of key biomarkers.

[0018] As an optional implementation, in a second aspect of the present invention, the acquisition module performs standardized preprocessing on the gene expression data to obtain initial data, including the following steps: converting the gene expression data to a numerical type and imputing missing values using the median to obtain processed data; aligning the processed data with a preset standardized gene panel using gene identifiers to obtain aligned data; and standardizing the aligned data to obtain the initial data.

[0019] As an optional implementation, in a second aspect of the invention, the extraction module extracts features from the initial data using a pre-trained denoising Transformer encoder to obtain a low-dimensional feature embedding vector, comprising the following steps: The initial data is mapped to the model dimension through an input linear projection layer to obtain projected features; Add position encoding information to the projection feature to obtain the projection feature with added position encoding information; The projected features with added position encoding information are input into at least one Transformer encoding block for processing to obtain the processed features; The processed features are vector normalized to obtain low-dimensional feature embedding vectors.

[0020] As an optional implementation, in a second aspect of the invention, the Transformer coding block includes a multi-head self-attention mechanism and a feedforward neural network.

[0021] As an optional implementation, in a second aspect of the present invention, the determining module calculates the similarity between the low-dimensional feature embedding vector and each cluster center in a preset set of cluster centers, and determines the target cancer subtype label corresponding to the target patient based on the mapping relationship corresponding to the cluster center with the highest similarity, including the following steps: Calculate the similarity between the low-dimensional feature embedding vector and each cluster center in the preset cluster center set to obtain a set of similarity values; The maximum value is determined from the set of similarity values, and the cluster center corresponding to the maximum value is determined as the matching cluster center; Query the preset mapping table between cluster centers and cancer subtype labels, obtain the cancer subtype label that uniquely corresponds to the matching cluster center, and use it as the target cancer subtype label.

[0022] As an optional implementation, in the second aspect of the present invention, the mapping table is established during the model training phase, and each cluster center is assigned a corresponding cancer subtype label based on the clustering results of historical patient data and clinical information analysis.

[0023] As an optional implementation, in a second aspect of the invention, the attribution analysis module performs attribution analysis, based on the integrated gradients technique, on the decision-making process of classifying the target patient into the target cancer subtype label, identifies and ranks several genes that contribute the most to the decision, and obtains a list of key biomarkers, including the following steps: constructing a linear interpolation path between the initial data and a baseline input, and sampling multiple interpolation points along the path; for each interpolation point, calculating the gradient of the attribution objective function with respect to the input features, where the attribution objective function is defined in terms of the denoising Transformer encoder, x is the initial data, and centroid is the cluster center with the highest similarity; integrating all gradients calculated through the attribution objective function along the interpolation path to obtain the integrated gradient value corresponding to each gene feature; and sorting the genes according to the absolute value of the integrated gradient corresponding to each gene feature, and selecting the top-ranked genes to form the list of key biomarkers.

[0024] A third aspect of this invention discloses a cancer subtype identification device based on adaptive denoising and contrastive learning, the device comprising: Memory containing executable program code; A processor coupled to the memory; The processor calls the executable program code stored in the memory to execute some or all of the steps in the cancer subtype identification method based on adaptive denoising and contrastive learning disclosed in the first aspect of the present invention.

[0025] The fourth aspect of the present invention discloses a computer storage medium storing computer instructions, which, when invoked, are used to execute some or all of the steps in the cancer subtype identification method based on adaptive denoising and contrastive learning disclosed in the first aspect of the present invention.

[0026] Compared with existing technologies, the embodiments of the present invention have the following beneficial effects: standardized preprocessing ensures data quality, a pre-trained denoising Transformer encoder robustly extracts deep molecular features, and reliable cancer subtype identification is achieved based on accurate matching with known subtype templates. Finally, with the help of interpretable attribution analysis technology, not only is a subtype diagnostic conclusion output, but a list of key genes driving the diagnosis is also provided. Thus, a single analysis completes accurate subtyping, basic judgment of prognostic association, and discovery of potential therapeutic targets, improving the efficiency, accuracy, clinical relevance, and decision interpretability of cancer molecular subtyping, and providing data-driven support for personalized medicine.

Attached Figure Description

[0027] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0028] Figure 1 is a flowchart illustrating a cancer subtype identification method based on adaptive denoising and contrastive learning disclosed in an embodiment of the present invention;
Figure 2 shows survival curves for the MESO dataset clustered into two classes;
Figure 3 shows survival curves for the MESO dataset clustered into three classes;
Figure 4 shows survival curves for the READ dataset clustered into three classes;
Figure 5 shows survival curves for the UCEC dataset clustered into four classes;
Figure 6 is a schematic diagram showing the top ten genes by feature importance, illustrating the interpretability of the deep learning model in this invention;
Figure 7 is a schematic diagram of a cancer subtype identification system based on adaptive denoising and contrastive learning disclosed in an embodiment of the present invention;
Figure 8 is a schematic diagram of the structure of a cancer subtype identification device based on adaptive denoising and contrastive learning disclosed in an embodiment of the present invention.

Detailed Implementation

To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0029] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this invention are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, apparatus, product, or device that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or devices.

[0030] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of the invention. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0031] This invention discloses a cancer subtype identification method and system based on adaptive denoising and contrastive learning. Standardized preprocessing ensures data quality, a pre-trained denoising Transformer encoder robustly extracts deep molecular features, and reliable cancer subtype identification is achieved based on precise matching with known subtype templates. Finally, interpretable attribution analysis technology is used to not only output subtype diagnostic conclusions but also simultaneously provide a list of key genes driving the diagnosis. Thus, in a single analysis, accurate subtyping, prognostic correlation assessment, and potential therapeutic target discovery are all completed simultaneously, significantly improving the efficiency, accuracy, clinical relevance, and decision interpretability of cancer molecular subtyping, providing powerful data-driven support for personalized medicine. Detailed explanations follow.

[0032] Example 1

Please refer to Figure 1, which is a flowchart illustrating a cancer subtype identification method based on adaptive denoising and contrastive learning disclosed in an embodiment of the present invention. The method described in Figure 1 is applied to a cancer subtype identification device based on adaptive denoising and contrastive learning. This identification device can be a corresponding identification terminal, identification equipment, or server, and the server can be a local server or a cloud server; the embodiments of this invention are not limited thereto. As shown in Figure 1, this cancer subtype identification method based on adaptive denoising and contrastive learning may include the following operations:

Step S101: Obtain gene expression data of the target patient and perform standardized preprocessing on the gene expression data to obtain initial data.

[0033] Step S102: Extract features from the initial data using the pre-trained denoising Transformer encoder to obtain a low-dimensional feature embedding vector.

[0034] Step S103: Calculate the similarity between the low-dimensional feature embedding vector and each cluster center in the preset cluster center set, and determine the target cancer subtype label corresponding to the target patient based on the mapping relationship corresponding to the cluster center with the highest similarity.

[0035] Step S104: Based on the integrated gradients technique, perform attribution analysis on the decision-making process of classifying the target patient into the target cancer subtype label, identify and rank the genes that contribute the most to the decision, and obtain a list of key biomarkers.

[0036] As can be seen, the method described in the embodiments of the present invention can ensure data quality through standardized preprocessing, robustly extract deep molecular features using a pre-trained denoising Transformer encoder, and achieve reliable cancer subtype identification based on accurate matching with known subtype templates. Finally, with the help of interpretable attribution analysis technology, it not only outputs subtype diagnostic conclusions but also simultaneously provides a list of key genes driving the diagnosis. Thus, in a single analysis, it simultaneously completes accurate subtyping, basic judgment of prognostic associations, and the discovery of potential therapeutic targets, significantly improving the efficiency, accuracy, clinical relevance, and decision interpretability of cancer molecular subtyping, and providing strong data-driven support for personalized medicine.

[0037] In an optional implementation, the standardized preprocessing of the gene expression data in step S101 to obtain initial data includes the following steps: converting the gene expression data to a numerical type and imputing missing values using the median to obtain processed data; aligning the processed data with the preset standardized gene panel using gene identifiers to obtain aligned data; and standardizing the aligned data to obtain the initial data.

[0038] In this embodiment of the invention, the gene expression data obtained from the target patient is converted into a numerical type, so that original gene expression data that may contain non-numerical characters becomes a uniform floating-point type. Missing values are imputed with the median expression level of the corresponding gene across all samples. The imputed gene expression data is then aligned by gene identifier with a preset standardized gene panel, which is the panel used for independent validation during training. Finally, the aligned gene expression data is Z-score standardized using sklearn's StandardScaler: for each feature, the mean is subtracted and the result is divided by the standard deviation, so that the data distribution has a mean of 0 and a standard deviation of 1, thus obtaining the initial data.
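The preprocessing chain described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the patent's implementation: the function names, the toy gene identifiers, and the choice to fill panel genes absent from the input with 0 are all assumptions, and the Z-score step mirrors what sklearn's StandardScaler computes (population standard deviation).

```python
import numpy as np

def to_numeric(raw):
    """Convert raw table entries to float; unparseable entries become NaN."""
    def parse(v):
        try:
            return float(v)
        except (TypeError, ValueError):
            return np.nan
    return np.array([[parse(v) for v in row] for row in raw], dtype=np.float64)

def impute_median(X):
    """Replace NaN with the per-gene (column) median over all samples."""
    med = np.array([np.median(col[~np.isnan(col)]) if (~np.isnan(col)).any() else 0.0
                    for col in X.T])
    return np.where(np.isnan(X), med, X)

def align_to_panel(X, genes, panel, fill=0.0):
    """Reorder columns to match the panel; absent genes get `fill` (assumption)."""
    index = {g: j for j, g in enumerate(genes)}
    out = np.full((X.shape[0], len(panel)), fill, dtype=np.float64)
    for j, g in enumerate(panel):
        if g in index:
            out[:, j] = X[:, index[g]]
    return out

def zscore(X):
    """Per-gene standardization to mean 0 and standard deviation 1."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # constant columns are left centered only
    return (X - mu) / sd

# Toy input: 3 samples x 3 genes, with non-numeric entries (identifiers are illustrative).
raw = [["1.0", "NA", "3.0"], ["2.0", "4.0", "x"], ["3.0", "6.0", "9.0"]]
genes = ["TP53", "KRAS", "EGFR"]
panel = ["KRAS", "TP53"]

X_num = to_numeric(raw)
X_imp = impute_median(X_num)
X_aligned = align_to_panel(X_imp, genes, panel)
X_std = zscore(X_aligned)
```

In a deployed setting the medians, means, and standard deviations would presumably be estimated on the training cohort and reused for a single target patient; the sketch computes them on the toy matrix only for brevity.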

[0039] In an optional implementation, step S102, where the pre-trained denoising Transformer encoder extracts features from the initial data to obtain a low-dimensional feature embedding vector, includes the following steps: The initial data is mapped to the model dimension through the input linear projection layer to obtain projected features; Add positional encoding information to the projected features to obtain projected features with added positional encoding information; The projected features with added position encoding information are input into at least one Transformer encoding block for processing to obtain the processed features; The processed features are vector normalized to obtain low-dimensional feature embedding vectors.

[0040] In this embodiment of the invention, the denoising Transformer encoder includes an input linear projection layer, a position coding layer, a multi-layer Transformer coding block, and an embedding normalization layer.

[0041] The input linear projection layer projects the initial data $X$ from the feature dimension $d_{in}$ onto the model dimension $d_{model}$ to obtain the projected features $H_0$. The calculation process is as follows:

$$H_0 = X W_p + b_p$$

Here $H_0$ is the feature matrix after transformation by the linear projection layer, $W_p \in \mathbb{R}^{d_{in} \times d_{model}}$ is the weight matrix of the linear projection layer, which realizes a linear transformation from the original feature space to the model feature space, and $b_p \in \mathbb{R}^{d_{model}}$ is the bias vector of the linear projection layer. This process maps the input features to the model dimension through a learnable linear projection layer, laying the foundation for subsequent sequence modeling.

[0042] The positional coding layer uses sine and cosine functions to generate positional coding information, which is then added to the projected features to preserve sequence order information, resulting in projected features with added positional coding information. The calculation formula is as follows:
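One formulation consistent with this description is the standard sinusoidal encoding of the original Transformer architecture. The sketch below assumes that scheme, including its base constant of 10000, and assumes that the rows of the projected feature matrix act as sequence positions; neither point is confirmed by the text here, and all names are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model, base=10000.0):
    """Standard sine/cosine encoding: sin on even columns, cos on odd columns."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model)[None, :]                # (1, d_model)
    angle = pos / np.power(base, (2 * (i // 2)) / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

# Toy linear projection followed by the additive positional encoding.
rng = np.random.default_rng(0)
n_pos, d_in, d_model = 5, 8, 16
X = rng.normal(size=(n_pos, d_in))                 # stand-in for the initial data
W_p = rng.normal(scale=0.1, size=(d_in, d_model))  # hypothetical projection weights
b_p = np.zeros(d_model)                            # hypothetical projection bias

H0 = X @ W_p + b_p                                 # projected features
pe = sinusoidal_positional_encoding(n_pos, d_model)
H_pe = H0 + pe                                     # projected features with positions
```

Because the encoding is added rather than concatenated, the feature width stays at the model dimension, which is what lets the Transformer blocks that follow operate on a fixed size.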

[0043] The projected features with added positional encoding information are input into multiple stacked Transformer encoding blocks to obtain the processed features. Each encoding block contains a multi-head self-attention layer and a feedforward neural network layer, and uses residual connections and layer normalization for stable training. The self-attention is calculated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension. The embedding normalization layer performs L2 normalization on the processed features $h$, that is, $z = h / \lVert h \rVert_2$, which yields the low-dimensional feature embedding vector $z$.
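To make the two operations in this paragraph concrete, the following NumPy sketch computes single-head scaled dot-product self-attention and then L2-normalizes the result. It is a simplified illustration under stated assumptions: one head instead of several, no residual connection, layer normalization, or feedforward layer, and randomly initialized weight matrices with hypothetical names.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the rows of H."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scale by sqrt(d_k)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

def l2_normalize(h, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    norm = np.linalg.norm(h, axis=-1, keepdims=True)
    return h / np.maximum(norm, eps)

rng = np.random.default_rng(0)
n_pos, d_model, d_k = 5, 16, 8
H = rng.normal(size=(n_pos, d_model))             # stand-in for encoded features
Wq = rng.normal(scale=0.1, size=(d_model, d_k))   # hypothetical query weights
Wk = rng.normal(scale=0.1, size=(d_model, d_k))   # hypothetical key weights
Wv = rng.normal(scale=0.1, size=(d_model, d_k))   # hypothetical value weights

attended, attn_weights = self_attention(H, Wq, Wk, Wv)
Z = l2_normalize(attended)
```

Unit-norm embeddings make the cosine similarity used for cluster matching equal to a plain dot product, which is one practical reason to normalize at the end of the encoder.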

[0044] In an optional implementation, step S103, which calculates the similarity between the low-dimensional feature embedding vector and each cluster center in a preset set of cluster centers, and determines the target cancer subtype label corresponding to the target patient based on the mapping relationship corresponding to the cluster center with the highest similarity, includes the following steps: Calculate the similarity between the low-dimensional feature embedding vector and each cluster center in the preset cluster center set to obtain a set of similarity values; The maximum value is determined from a set of similarity values, and the cluster center corresponding to the maximum value is determined as the matching cluster center; Query the predefined mapping table between cluster centers and cancer subtype labels, obtain the cancer subtype label that uniquely corresponds to the matching cluster center, and use it as the target cancer subtype label.

[0045] In this embodiment of the invention, the cosine similarity between the low-dimensional feature embedding vector $z$ and each cluster center $c_k$ in the preset set of cluster centers is calculated:

$$s_k = \frac{z \cdot c_k}{\lVert z \rVert_2 \, \lVert c_k \rVert_2}$$

This yields a set of similarity values. The mapping table is established during the model training phase: by analyzing the clinical characteristics of each cluster, and based on the clustering results of historical patient data and clinical information analysis, each cluster center is assigned a corresponding cancer subtype label.
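The matching step amounts to a nearest-centroid lookup under cosine similarity followed by a table query. A minimal sketch, with toy cluster centers and hypothetical subtype labels standing in for the values that would be fixed during training:

```python
import numpy as np

def assign_subtype(z, centers, label_map):
    """Return the label of the cluster center most cosine-similar to z."""
    z_unit = z / np.linalg.norm(z)
    c_unit = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sims = c_unit @ z_unit                 # one cosine similarity per center
    k = int(np.argmax(sims))               # index of the matching cluster center
    return label_map[k], k, sims

# Toy cluster centers and mapping table (both illustrative, not from the patent).
centers = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
label_map = {0: "subtype_A", 1: "subtype_B", 2: "subtype_C"}

z = np.array([0.2, 0.9])                   # stand-in for a patient embedding
label, k, sims = assign_subtype(z, centers, label_map)
```

Here the embedding points mostly along the second axis, so the second center wins and the lookup returns its label.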

[0046] In an optional implementation, step S104, in which attribution analysis based on the integrated gradients technique is performed on the decision-making process of classifying the target patient into the target cancer subtype label, several genes that contribute the most to the decision are identified and ranked, and a list of key biomarkers is obtained, includes the following steps: constructing a linear interpolation path between the initial data and a baseline input, and sampling multiple interpolation points along the path; for each interpolation point, calculating the gradient of the attribution objective function with respect to the input features, where the attribution objective function is defined in terms of the denoising Transformer encoder, x is the initial data, and centroid is the cluster center with the highest similarity; integrating all gradients calculated by the attribution objective function along the interpolation path to obtain the integrated gradient value corresponding to each gene feature; and sorting the genes according to the absolute value of the integrated gradient corresponding to each gene feature, and selecting the top-ranked genes to form a list of key biomarkers.

[0047] In this embodiment of the invention, in the initial data With a baseline vector Construct a linear interpolation path between (such as the zero vector or the median vector of all gene expression) and the points on the path are generated by linear interpolation: The proportionality coefficient The value range is from 0 to 1; for each interpolation point Using attribution objective function Perform one model forward propagation and compute The gradient of the input feature at each interpolation point is calculated through backpropagation, and the gradient value is accumulated.

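[0048] The average gradient along the path is calculated and multiplied by the difference between the initial data $x$ and the baseline vector $x'$, which yields the integral gradient value of each feature: $\mathrm{IG}_i(x) = (x_i - x'_i) \cdot \frac{1}{N} \sum_{k=1}^{N} \frac{\partial F\left(x' + \frac{k}{N}(x - x')\right)}{\partial x_i}$, where N is the number of interpolation steps.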

[0049] The absolute attribution value of each feature is calculated by taking the absolute value of its integral gradient and averaging the results. This average attribution value measures the influence of each feature on the model's clustering output. The Top 10 features are then visualized using a bar chart, as shown in Figure 6.
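The attribution procedure can be sketched as follows. This is only an illustration under stated assumptions: the trained denoising Transformer encoder is replaced by a fixed linear map `W` so that the gradient can be written analytically without an autodiff library, and the names `objective`, `objective_grad` and `integrated_gradients` are hypothetical. The path average uses interpolation coefficients k/N for k = 1..N and a zero-vector baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, d_embed, n_steps = 50, 8, 1000

# Toy stand-in for the encoder (assumption): a fixed linear map f(x) = W x.
W = rng.normal(size=(d_embed, n_genes)) / np.sqrt(n_genes)
centroid = rng.normal(size=d_embed)

def objective(x):
    """Attribution objective: negative squared distance to the matched centroid."""
    diff = W @ x - centroid
    return -float(diff @ diff)

def objective_grad(x):
    """Analytic gradient of the objective for the linear toy encoder."""
    return -2.0 * W.T @ (W @ x - centroid)

def integrated_gradients(x, baseline, n_steps):
    """Average the gradient over alpha = k/N, k = 1..N, then scale by (x - baseline)."""
    total = np.zeros_like(x)
    for k in range(1, n_steps + 1):
        alpha = k / n_steps
        total += objective_grad(baseline + alpha * (x - baseline))
    return (x - baseline) * total / n_steps

x = rng.normal(size=n_genes)          # standardized expression of one patient
baseline = np.zeros(n_genes)          # zero-vector baseline
ig = integrated_gradients(x, baseline, n_steps)

# Rank genes by absolute attribution and keep the ten largest.
top10 = np.argsort(-np.abs(ig))[:10]

# Completeness check: attributions should sum to roughly F(x) - F(baseline).
completeness_gap = ig.sum() - (objective(x) - objective(baseline))
print(top10, completeness_gap)
```

The completeness gap shrinks as the number of interpolation steps grows, which gives a simple way to judge whether N is large enough.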

[0050] Example 2 This embodiment is the training method for Embodiment 1 and may include the following steps: Step S1: Obtain cancer gene expression data from public databases, perform preprocessing on the data, including transposition, numerical cleaning, and standardization, and determine the number of target clusters. This includes the following steps: Step S11: Obtain the basic data, which includes gene expression data from different omics of all cancer cells in the experiment; Step S12: Perform numeric type conversion and missing value imputation on the data; Step S13: Fill in the remaining missing values using the median; Step S14: Perform gene alignment between the processed data from the different omics and the data used for independent validation; Step S15: Standardize the cell gene expression data to obtain the initial data; Step S16: Use a WSS (within-cluster sum of squares) elbow plot to determine the number of clusters for the initial data.
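Step S16 can be illustrated with a short sketch that computes the WSS curve for a range of cluster counts. It assumes a plain Lloyd-style K-Means written in NumPy and synthetic data with three groups; the function name `kmeans_wss` and all parameter values are hypothetical.

```python
import numpy as np

def kmeans_wss(X, k, n_iter=50, seed=0):
    """Run a basic K-Means and return the within-cluster sum of squares (WSS)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Synthetic standardized data with three well-separated groups (illustration only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 5)) for c in (-3.0, 0.0, 3.0)])

wss = [kmeans_wss(X, k) for k in range(1, 7)]
for k, value in zip(range(1, 7), wss):
    print(k, round(value, 2))
```

The number of clusters is read off at the point where the WSS curve stops dropping sharply; that choice is made by inspecting the plot rather than by a fixed rule.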

[0051] Step S2: Extract features from the initial data obtained in Step S1 using a denoising Transformer encoder, construct contrastive learning views using adaptive noise addition and masking techniques, and collaboratively optimize the contrastive learning loss and the deep embedding clustering loss using an adaptive weighting mechanism, specifically: Step S21: Construct an adaptive noise-adding data augmentation module to generate two independent augmented views as a positive sample pair for each input gene expression data sample; each view is augmented by applying Gaussian noise and random masking, where the standard deviation σ of the Gaussian noise ranges from 0.05 to 0.2, the random masking ratio r ranges from 5% to 15%, and the feature values at the masked positions are set to zero.

[0052] The adaptive noise-adding data augmentation module constructed in step S21 generates multi-view inputs with discriminative differences in order to improve the robustness and feature discrimination ability of the model. Specifically, it includes the following steps: Step S211: For the input gene expression matrix $X$, two augmented views $\tilde{X}^{(1)}$ and $\tilde{X}^{(2)}$ are generated through two independent samplings; Step S212: During each sampling, a random Gaussian noise perturbation is applied to the original features: $\tilde{X} = X + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, with the noise standard deviation $\sigma \in [0.05, 0.2]$. To enhance the model's tolerance to weak feature perturbations, a random masking strategy is also employed to mask some feature information, i.e., for each sample a proportion $r \in [5\%, 15\%]$ of the feature dimensions is randomly selected and set to zero: $\tilde{X} \leftarrow \tilde{X} \odot M$, where $M$ denotes the binary mask; Step S213: The two generated augmented views $\tilde{X}^{(1)}$ and $\tilde{X}^{(2)}$ are input to the subsequent encoder as a positive sample pair, so that the model is constrained to capture feature invariance through contrastive learning.
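A minimal sketch of this augmentation follows. The text gives only the ranges for the noise standard deviation and the masking ratio, so drawing both uniformly from those ranges for each view is an assumption made here; the function name `augment_view` is hypothetical.

```python
import numpy as np

def augment_view(X, rng, sigma_range=(0.05, 0.2), mask_range=(0.05, 0.15)):
    """Return one augmented view: Gaussian noise plus per-sample random masking."""
    n_samples, n_features = X.shape
    sigma = rng.uniform(*sigma_range)      # assumption: sigma drawn uniformly per view
    r = rng.uniform(*mask_range)           # assumption: mask ratio drawn uniformly per view
    view = X + rng.normal(0.0, sigma, size=X.shape)
    n_mask = int(round(r * n_features))
    for i in range(n_samples):
        idx = rng.choice(n_features, size=n_mask, replace=False)
        view[i, idx] = 0.0                 # masked feature values are set to zero
    return view, sigma, n_mask

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 200))              # 4 samples, 200 standardized gene features

# Two independent samplings give the positive pair for contrastive learning.
view1, sigma1, n_mask1 = augment_view(X, rng)
view2, sigma2, n_mask2 = augment_view(X, rng)
print(sigma1, n_mask1, sigma2, n_mask2)
```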

[0053] Step S22: Construct a denoising Transformer encoder module, which includes an input linear projection layer, a positional encoding layer, a multi-layer Transformer encoding block, and an embedding normalization layer; the input linear projection layer projects the high-dimensional gene features from the feature dimension F to the model dimension $d_{model}$ to obtain the projection features $H$; the positional encoding layer uses sine and cosine positional encoding to add positional information to the input sequence; each layer of the multi-layer Transformer encoding block contains a multi-head self-attention mechanism and a feedforward neural network, and uses residual connections and layer normalization to ensure training stability; the embedding normalization layer performs L2 normalization on the final output features to obtain a robust embedding representation of the sample. The denoising Transformer encoder module constructed in step S22 aims to map high-dimensional gene features to a low-dimensional embedding space and to extract robust and semantically rich gene expression feature representations. Specifically, it includes the following structural components and computational processes: Step S221: The input linear projection layer projects the input from the feature dimension $F$ to the model dimension $d_{model}$ via a linear mapping: $H = X W_p + b_p$, where $W_p$ and $b_p$ denote the weight and bias of the projection layer; Step S222: To preserve sequence order information, the positional encoding layer adds sine-cosine positional encoding to the input: $PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$, $PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$; Step S223: Each layer of the multi-layer Transformer encoding block consists of a multi-head self-attention mechanism and a feedforward neural network, and training stability is improved through residual connections and layer normalization. Its self-attention is calculated as follows: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$; Step S224: The embedding normalization layer performs L2 normalization on the embedding vector $h$ output by the Transformer: $z = h / \lVert h \rVert_2$. The robust embedding representation of the samples is thus obtained and used for subsequent contrastive learning and clustering optimization.
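The building blocks of the encoder can be sketched in NumPy. This is a simplified illustration, not the trained model: it uses a single attention head, random weights, no feedforward sublayer or layer normalization, and it treats the rows of the input as sequence positions, which is an assumption. All function and variable names are hypothetical.

```python
import numpy as np

def sincos_positional_encoding(n_positions, d_model):
    """Standard sine-cosine positional encoding (d_model must be even)."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def self_attention(H, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

def l2_normalize(Z):
    """Scale every row to unit Euclidean length."""
    return Z / np.linalg.norm(Z, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, n_features, d_model = 6, 20, 8
X = rng.normal(size=(n_tokens, n_features))

# Linear projection to the model dimension, then add positional encoding.
W_in = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
pe = sincos_positional_encoding(n_tokens, d_model)
H = X @ W_in + pe

Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3))
out, weights = self_attention(H, Wq, Wk, Wv)
Z = l2_normalize(H + out)     # residual connection followed by L2 normalization
print(Z.shape)
```

A practical implementation would normally rely on a deep learning framework so that the weights can be trained; the sketch only shows the order of the operations.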

[0054] Step S23: Construct a contrastive learning loss function module, using normalized temperature-scaled cross-entropy loss. For each sample in a batch, calculate the cosine similarity between its two augmented view embedding vectors and form negative sample pairs with the embedding vectors of other samples.

[0055] The constructed contrastive learning loss function module is used to build similarity constraints between samples, ensuring that different augmented views of the same sample remain close in the embedding space, while different samples remain separate. For a batch of $N$ input samples, each with two augmented view embedding vectors $z_i^{(1)}$ and $z_i^{(2)}$, the cosine similarity between any two vectors is calculated as $\mathrm{sim}(u, v) = \frac{u^\top v}{\lVert u \rVert \lVert v \rVert}$, and the normalized temperature-scaled cross-entropy (NT-Xent) loss function is used: $\mathcal{L}_{contrast} = -\frac{1}{2N} \sum_{i=1}^{2N} \log \frac{\exp\left(\mathrm{sim}(z_i, z_{p(i)})/\tau\right)}{\sum_{k=1, k \neq i}^{2N} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$

[0056] where $z_1, \dots, z_{2N}$ denote the $2N$ view embeddings of the batch, $p(i)$ is the index of the other augmented view of the same sample as $z_i$, $\tau$ is the temperature hyperparameter, and N is the batch size. By using the contrastive learning loss, the model learns to distinguish the embedding differences between different samples while enhancing the feature consistency of the same sample under different perturbations, thereby improving the discriminativeness and stability of the feature representation.
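The NT-Xent loss can be written compactly in NumPy. The sketch below assumes the two views are stacked into one matrix of 2N embeddings and that every other embedding in the batch acts as a negative; the function name `nt_xent` and the temperature value are hypothetical choices for illustration.

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss for two views z1, z2 of shape (N, d)."""
    n = z1.shape[0]
    z = np.vstack([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                        # cosine similarity scaled by temperature
    np.fill_diagonal(sim, -np.inf)             # a vector is never its own negative
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    row_max = sim.max(axis=1, keepdims=True)   # log-sum-exp with max subtraction
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denom - sim[np.arange(2 * n), pos]))

z1 = np.eye(4)                                 # four mutually orthogonal embeddings
loss_aligned = nt_xent(z1, z1.copy(), tau=0.5)             # views agree
loss_shuffled = nt_xent(z1, z1[[1, 2, 3, 0]], tau=0.5)     # views mismatched
print(loss_aligned, loss_shuffled)
```

The aligned pair gives the lower loss, which is the behaviour the contrastive objective rewards.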

[0057] Step S24: Construct a deep embedding clustering loss function module. During training, perform online clustering on the sample embedding representations of each batch using the K-Means algorithm, and calculate the distance from each sample embedding to its corresponding cluster centroid. The clustering loss uses the mean squared error loss $\mathcal{L}_{cluster} = \frac{1}{B} \sum_{i=1}^{B} \lVert \bar{z}_i - c_{k_i} \rVert_2^2$, where $\bar{z}_i$ is the mean of the embedding vectors of the two views of sample i, $c_{k_i}$ is the centroid vector of the cluster to which sample i belongs, and B is the batch size.

[0058] The constructed deep embedding clustering loss function module guides samples to aggregate around the optimal cluster centers in the embedding space, achieving integrated optimization of feature learning and clustering structure. During training, online K-Means clustering is performed on the embedding vectors of the current batch of samples to obtain the set of cluster centroids. For each sample i in the batch, the mean representation of its two view embeddings, $\bar{z}_i = \frac{1}{2}\left(z_i^{(1)} + z_i^{(2)}\right)$, is calculated. The mean squared error (MSE) loss of each sample relative to its cluster centroid is defined and averaged over the batch as the clustering loss: $\mathcal{L}_{cluster} = \frac{1}{B} \sum_{i=1}^{B} \lVert \bar{z}_i - c_{k_i} \rVert_2^2$

[0059] where $c_{k_i}$ is the centroid vector of the cluster to which sample i belongs, and B is the batch size. This module embeds the clustering objective into the encoder training, ensuring that the embedding satisfies both contrastive invariance and cluster structure consistency, thus improving the stability and interpretability of downstream clustering and classification. Furthermore, online centroid updates adapt to the evolution of the embedding distribution in real time, preventing the training from becoming disconnected from the clustering objective.
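The clustering loss can be sketched as follows, assuming the centroids of the current batch have already been produced by the online K-Means step. The function name `clustering_loss` and the toy values are hypothetical.

```python
import numpy as np

def clustering_loss(z1, z2, centroids):
    """Mean squared distance between each sample's mean embedding and its nearest centroid."""
    z_bar = 0.5 * (z1 + z2)                                   # mean of the two views
    d2 = ((z_bar[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)                                # cluster of each sample
    loss = float(d2[np.arange(len(z_bar)), labels].mean())    # average over the batch
    return loss, labels

z1 = np.array([[0.0, 0.0], [2.0, 0.0]])
z2 = np.array([[0.0, 0.0], [2.0, 0.0]])
centroids = np.array([[0.0, 0.0], [3.0, 0.0]])

loss, labels = clustering_loss(z1, z2, centroids)
print(loss, labels)
```

In this toy batch the first sample sits on its centroid and the second is one unit away from its centroid, so the batch loss is the average of 0 and 1.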

[0060] Step S25: Construct an adaptive weighted joint optimization module, which combines the contrastive learning loss and the deep embedding clustering loss using adaptive weight coefficients. The total loss function is $\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{contrast} + \lambda_2 \mathcal{L}_{cluster}$, where $\lambda_1$ and $\lambda_2$ range from 0.1 to 2.0 and are used to dynamically balance the relative contributions of the two loss terms to the model parameter updates.

[0061] The adaptive weighted joint optimization module combines the contrastive learning loss and the deep embedding clustering loss using dynamic weighting coefficients to achieve synergistic optimization of feature consistency learning and clustering structure constraints. In this module, the contrastive learning loss is denoted $\mathcal{L}_{contrast}$, the deep clustering loss is denoted $\mathcal{L}_{cluster}$, and the total loss function is defined as: $\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{contrast} + \lambda_2 \mathcal{L}_{cluster}$

[0062] where $\lambda_1$ and $\lambda_2$ are adaptive weighting coefficients, ranging from 0.1 to 2.0, used to dynamically balance the relative impact of the two loss terms on the model training process. During training, the system automatically adjusts the weighting coefficients based on the gradient trends and convergence rates of the two loss terms. This allows the model to focus on the feature consistency constraints of contrastive learning in the early stages, which stabilizes the distribution of the embedding space, and to gradually increase the weight of the clustering constraints in the later stages, which enhances intra-class compactness and inter-class separability.
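The text does not give the exact rule by which the weights follow the gradient trends and convergence rates. As a stand-in, the sketch below uses a simple linear schedule over training progress that reproduces the described behaviour (contrastive term dominant early, clustering term dominant late) while keeping both weights inside the stated range of 0.1 to 2.0. The schedule itself and the names `loss_weights` and `total_loss` are assumptions, not the patented mechanism.

```python
def loss_weights(progress, low=0.1, high=2.0):
    """Illustrative schedule: progress runs from 0.0 (start) to 1.0 (end of training)."""
    p = min(max(progress, 0.0), 1.0)
    lam_contrast = high - (high - low) * p   # starts high, decays toward `low`
    lam_cluster = low + (high - low) * p     # starts low, grows toward `high`
    return lam_contrast, lam_cluster

def total_loss(l_contrast, l_cluster, progress):
    """Weighted sum of the contrastive and clustering losses."""
    lam1, lam2 = loss_weights(progress)
    return lam1 * l_contrast + lam2 * l_cluster

for progress in (0.0, 0.5, 1.0):
    print(progress, loss_weights(progress), total_loss(1.0, 1.0, progress))
```

A rule driven by measured gradient norms or loss slopes could replace `loss_weights` without changing how the total loss is assembled.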

[0063] Through this adaptive weighting mechanism, the model can effectively improve clustering quality and feature discriminativeness while maintaining the semantic expressive power of the embedding space, achieving the joint optimality of feature representation and structural distribution.

[0064] Step S26: Optimize the joint loss function end-to-end using the backpropagation algorithm, while updating the parameters of the denoising Transformer encoder and the online clustering centers to achieve collaborative optimization of feature learning and clustering objectives.

[0065] Step S3: Based on the model obtained in Step S2, attribution analysis is performed on the clustering results using integral gradient technology to quantify the contribution of each gene feature to the classification of the sample into a specific subtype, thereby identifying and ranking the key biomarkers corresponding to each subtype. This specifically includes the following steps: Step S31: Based on the difference between the input sample $x$ and the baseline sample $x'$, construct a linear interpolation path from the baseline to the target input, and for each interpolation scaling factor $\alpha$ (with values ranging from 0 to 1) generate the intermediate sample $x_\alpha = x' + \alpha (x - x')$; Step S32: For each interpolation point, input the sample into the model, obtain its embedding representation vector, and define a scalar output based on the negative squared distance, $F(x_\alpha) = -\lVert f_\theta(x_\alpha) - \mathrm{centroid} \rVert_2^2$; the gradient with respect to the input features at each interpolation point is calculated through backpropagation, and the gradient values are accumulated; Step S33: Calculate the average gradient along the path and multiply it by the difference between the input sample $x$ and the baseline sample $x'$ to obtain the integral gradient value of each feature: $\mathrm{IG}_i(x) = (x_i - x'_i) \cdot \frac{1}{N} \sum_{k=1}^{N} \frac{\partial F\left(x' + \frac{k}{N}(x - x')\right)}{\partial x_i}$, where N is the number of interpolation steps; Step S34: Take the absolute value of the integral gradient result for each sample and average over the samples to obtain the average absolute attribution value of each feature, which is used to measure the influence of the feature on the clustering output of the model. Then, display the Top 10 features in a bar chart.

[0066] The KM survival curves plotted from the clustering labels obtained using this method are shown in Figures 2, 3, 4 and 5.

[0067] The subtype labels obtained using this method are combined with patient survival data to validate the clinical significance of the subtypes. Kaplan-Meier survival analysis is used to assess the correlation between the subtyping results and prognosis. A survival difference significance test module is constructed, and the Log-rank test is used to assess the differences in survival distribution among subtypes. The test statistic is calculated as follows: $\chi^2 = \frac{\left(\sum_j (O_{1j} - E_{1j})\right)^2}{\sum_j V_j}$, where $O_{1j}$ is the number of deaths in the first group at the j-th time point, $E_{1j}$ is the expected number of deaths at the j-th time point, and $V_j$ is the variance at the j-th time point. Survival curves are plotted for each cancer molecular subtype, and the p-value of the Log-rank test is checked against the 0.05 threshold. The results on several cancer datasets are as follows: the p-value for two clusters on the MESO dataset is 0.0008756, for three clusters on the MESO dataset is 0.0004235, for three clusters on the READ dataset is 0.0068342, and for four clusters on the UCEC dataset is 0.0086832. The p-values of all KM curves are below 0.05 (the largest being 0.0086832), indicating statistically significant differences in survival prognosis among the cluster subtypes. This verifies that the patient subtypes obtained by this method through clustering multi-omics features exhibit survival heterogeneity, and that the method can identify patient subtypes with significant survival differences, providing a basis for clinical prognostic stratification.
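The Log-rank statistic can be sketched for the two-group case. The sketch assumes right-censored data given as follow-up times with event indicators (1 for death, 0 for censored), uses the usual hypergeometric variance at each event time, and converts the statistic to a p-value with the chi-square distribution with one degree of freedom. The function name `logrank_two_groups` and the toy data are hypothetical; comparing more than two subtypes requires the multi-group form of the test.

```python
import math
import numpy as np

def logrank_two_groups(time1, event1, time2, event2):
    """Two-group Log-rank test; returns (chi-square statistic, p-value)."""
    time = np.concatenate([time1, time2]).astype(float)
    event = np.concatenate([event1, event2]).astype(bool)
    in_g1 = np.concatenate([np.ones(len(time1), bool), np.zeros(len(time2), bool)])
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[event]):           # loop over distinct death times
        at_risk = time >= t
        n = at_risk.sum()                      # patients at risk in both groups
        n1 = (at_risk & in_g1).sum()           # patients at risk in group 1
        died = (time == t) & event
        d = died.sum()                         # deaths at this time, both groups
        d1 = (died & in_g1).sum()              # observed deaths in group 1
        o_minus_e += d1 - d * n1 / n           # observed minus expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 0.0, 1.0
    chi2 = float(o_minus_e ** 2 / var)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, 1 d.o.f.
    return chi2, p_value

# Toy data: the first group dies early, the second group dies late.
chi2, p_value = logrank_two_groups([1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
                                   [6, 7, 8, 9, 10], [1, 1, 1, 1, 1])
print(chi2, p_value)
```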

[0068] Therefore, it can be seen that using this method has at least the following advantages: First, this method innovatively constructs an unsupervised learning framework by co-training contrastive learning and deep embedding clustering through an adaptive weighting mechanism. This framework completely changes the traditional two-stage model where "feature extraction" and "cluster analysis" are separated. It makes the learning process of feature representation directly guided by the final clustering goal, which fundamentally improves clustering performance and overall model efficiency. Second, this method incorporates clinical survival information in both the data processing and results validation stages. In the feature engineering stage, survival data is used to screen key genes highly correlated with prognosis, ensuring the clinical relevance of features from the outset. In the analysis stage, rigorous Kaplan-Meier survival analysis and Log-rank tests are used to validate the subtyping results. This ensures that the identified molecular subtypes are not merely mathematical clusters, but clinical entities with significant prognostic differences, providing a direct and reliable basis for patient risk stratification and treatment planning. Third, by introducing a noisy data augmentation and denoising Transformer encoder, the model is forced to learn robust features in the data, exhibiting strong tolerance to common real-world data noise, batch effects, and missing values. Simultaneously, the Transformer architecture effectively captures complex nonlinear relationships and long-range dependencies between high-dimensional gene features, with representation capabilities far exceeding linear methods such as PCA, thus enabling the extraction of highly discriminative molecular features from massive amounts of gene data. 
Fourth, this method effectively addresses the black-box characteristics by incorporating integral gradient technology, enabling a clear quantification of the contribution of each gene to the typing decision. It not only provides clustering labels but also outputs key biomarkers that are both global and subtype-specific.

[0069] Example 3 Please refer to Figure 7, which is a schematic diagram of a cancer subtype identification system based on adaptive denoising and contrastive learning, as disclosed in an embodiment of the present invention. The system described in Figure 7 can be applied to corresponding identification terminals, identification devices, or servers, and the server can be a local server or a cloud server; this embodiment of the invention does not limit this. As shown in Figure 7, the system may include: the acquisition module 100, used to acquire gene expression data of the target patient and perform standardized preprocessing on the gene expression data to obtain initial data; the extraction module 200, used to extract features from the initial data through a pre-trained denoising Transformer encoder to obtain a low-dimensional feature embedding vector; the determination module 300, used to calculate the similarity between the low-dimensional feature embedding vector and each cluster center in the preset cluster center set, and to determine the target cancer subtype label corresponding to the target patient based on the mapping relationship corresponding to the cluster center with the highest similarity; and the attribution analysis module 400, used to perform attribution analysis on the decision-making process of classifying the target patient into the target cancer subtype label based on integral gradient technology, to identify and rank the genes that contribute the most to the decision, and to obtain a list of key biomarkers.

[0070] As can be seen, the method described in the embodiments of the present invention can ensure data quality through standardized preprocessing, robustly extract deep molecular features using a pre-trained denoising Transformer encoder, and achieve reliable cancer subtype identification based on accurate matching with known subtype templates. Finally, with the help of interpretable attribution analysis technology, it not only outputs subtype diagnostic conclusions but also simultaneously provides a list of key genes driving the diagnosis. Thus, in a single analysis, it simultaneously completes accurate subtyping, basic judgment of prognostic associations, and the discovery of potential therapeutic targets, significantly improving the efficiency, accuracy, clinical relevance, and decision interpretability of cancer molecular subtyping, and providing strong data-driven support for personalized medicine.

[0071] In an optional implementation, the acquisition module 100 performs standardized preprocessing on the gene expression data to obtain initial data, including the following steps: the gene expression data is converted to a numerical type and missing values are filled using the median to obtain processed data; the processed data is aligned with the preset standardized gene panel using gene identifiers to obtain aligned data; the aligned data is standardized to obtain the initial data.

[0072] In this embodiment of the invention, the gene expression data obtained from the target patient is converted into a numerical type, converting the original gene expression data, which may contain non-numerical characters, into a uniform floating-point type, and filling missing values ​​with the median of the gene expression level in all samples; for the filled gene expression data, gene label alignment is performed using a preset standardized gene panel, which is the gene panel used for independent validation during training; finally, the aligned gene expression data is standardized to obtain initial data.
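The preprocessing chain can be sketched as follows. Several details are assumptions made for illustration: non-numeric entries are treated as missing, panel genes absent from the input are filled with zero, standardization is a per-gene z-score over the supplied samples, and the gene names and the function name `preprocess` are hypothetical.

```python
import numpy as np

def to_float(value):
    """Convert one raw entry to float; anything non-numeric becomes NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan

def preprocess(raw, genes, panel):
    """Numeric conversion, median imputation, panel alignment, z-score standardization."""
    X = np.array([[to_float(v) for v in row] for row in raw], dtype=float)
    medians = np.nanmedian(X, axis=0)              # per-gene median over all samples
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]                  # fill missing values with the median
    column_of = {gene: j for j, gene in enumerate(genes)}
    aligned = np.zeros((X.shape[0], len(panel)))   # assumption: absent panel genes stay 0
    for k, gene in enumerate(panel):
        if gene in column_of:
            aligned[:, k] = X[:, column_of[gene]]
    mean, std = aligned.mean(axis=0), aligned.std(axis=0)
    std[std == 0] = 1.0                            # avoid division by zero
    return (aligned - mean) / std

raw = [["1.0", "2.0", "x"],
       ["3.0", "NA", "5"],
       ["5.0", "4.0", "7"]]
genes = ["TP53", "BRCA1", "EGFR"]                  # hypothetical input gene order
panel = ["EGFR", "TP53", "MYC"]                    # hypothetical standardized gene panel

Z = preprocess(raw, genes, panel)
print(Z)
```

In deployment the mean and standard deviation would more plausibly come from the training cohort rather than from the incoming samples; the text does not specify this, so the sketch keeps the simplest form.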

[0073] In an optional implementation, the extraction module 200 extracts features from the initial data using a pre-trained denoising Transformer encoder to obtain a low-dimensional feature embedding vector, including the following steps: The initial data is mapped to the model dimension through the input linear projection layer to obtain projected features; Add positional encoding information to the projected features to obtain projected features with added positional encoding information; The projected features with added position encoding information are input into at least one Transformer encoding block for processing to obtain the processed features; The processed features are vector normalized to obtain low-dimensional feature embedding vectors.

[0074] In this embodiment of the invention, the denoising Transformer encoder includes an input linear projection layer, a position coding layer, a multi-layer Transformer coding block, and an embedding normalization layer.

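[0075] The input linear projection layer projects the initial data $X$ from the feature dimension $F$ to the model dimension $d_{model}$ to obtain the projection features $H$. The calculation process is as follows: $H = X W_p + b_p$, where $W_p$ and $b_p$ denote the weight and bias of the projection layer; the positional coding layer uses sine and cosine functions to generate positional coding information, which is then added to the projected features to preserve sequence order information, resulting in projected features with added positional coding information. The calculation formula is as follows: $PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$, $PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$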

[0076] The projected features, with added positional encoding information, are input into multiple concatenated Transformer encoding blocks to obtain processed features. Each encoding block contains a multi-head self-attention layer and a feedforward neural network layer, and residual connections and layer normalization are used for stable training. The self-attention is calculated as follows: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$; the embedding normalization layer performs L2 normalization on the processed features $h$: $z = h / \lVert h \rVert_2$. This yields the low-dimensional feature embedding vector.

[0077] In an optional implementation, the determining module 300 calculates the similarity between the low-dimensional feature embedding vector and each cluster center in a preset set of cluster centers, and determines the target cancer subtype label corresponding to the target patient based on the mapping relationship corresponding to the cluster center with the highest similarity, including the following steps: Calculate the similarity between the low-dimensional feature embedding vector and each cluster center in the preset cluster center set to obtain a set of similarity values; The maximum value is determined from a set of similarity values, and the cluster center corresponding to the maximum value is determined as the matching cluster center; Query the predefined mapping table between cluster centers and cancer subtype labels, obtain the cancer subtype label that uniquely corresponds to the matching cluster center, and use it as the target cancer subtype label.

[0078] In this embodiment of the invention, the cosine similarity between the low-dimensional feature embedding vector and each cluster center in the preset set of cluster centers is calculated, which yields a set of similarity values. The mapping table is established during the model training phase by analyzing the clinical characteristics of each cluster: based on the clustering results of historical patient data and the analysis of clinical information, each cluster center is assigned a corresponding cancer subtype label.

[0079] In an optional implementation, the attribution analysis module 400, which performs attribution analysis, based on integral gradient technology, on the decision to classify the target patient under the target cancer subtype label, and which identifies and ranks the genes that contribute the most to the decision to obtain a list of key biomarkers, carries out the following steps: a linear interpolation path is constructed between the initial data and a baseline input, and multiple interpolation points are sampled along the path; for each interpolation point, the gradient of the attribution objective function $F(x) = -\lVert f_\theta(x) - \mathrm{centroid} \rVert_2^2$ with respect to its input features is calculated, where $f_\theta$ is the denoising Transformer encoder, x is the initial data, and centroid is the cluster center with the highest similarity; all gradients calculated from the attribution objective function are integrated along the interpolation path to obtain the integral gradient value corresponding to each gene feature; the genes are sorted according to the absolute value of the integral gradient corresponding to each gene feature, and the top-ranked genes are selected to form the list of key biomarkers.

[0080] In this embodiment of the invention, a linear interpolation path is constructed between the initial data $x$ and a baseline vector $x'$ (such as the zero vector or the median vector of all gene expression), and the points on the path are generated by linear interpolation: $x_\alpha = x' + \alpha (x - x')$, where the proportionality coefficient $\alpha$ ranges from 0 to 1. For each interpolation point $x_\alpha$, one forward propagation of the model is performed with the attribution objective function $F$ to compute $F(x_\alpha)$; the gradient with respect to the input features at each interpolation point is then calculated through backpropagation, and the gradient values are accumulated.

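[0081] The average gradient along the path is calculated and multiplied by the difference between the initial data $x$ and the baseline vector $x'$, which yields the integral gradient value of each feature: $\mathrm{IG}_i(x) = (x_i - x'_i) \cdot \frac{1}{N} \sum_{k=1}^{N} \frac{\partial F\left(x' + \frac{k}{N}(x - x')\right)}{\partial x_i}$, where N is the number of interpolation steps.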

[0082] The absolute attribution value of each feature is calculated by taking the absolute value of its integral gradient and averaging the results. This average attribution value measures the influence of each feature on the model's clustering output. The Top 10 features are then visualized using a bar chart, as shown in Figure 6.

[0083] Example 4 Please refer to Figure 8, which is a schematic diagram of the structure of a cancer subtype identification device based on adaptive denoising and contrastive learning, as disclosed in an embodiment of the present invention. As shown in Figure 8, the device may include: a memory 501 storing executable program code; and a processor 502 coupled to the memory 501. The processor 502 calls the executable program code stored in the memory 501 to execute some or all of the steps in the cancer subtype identification method based on adaptive denoising and contrastive learning disclosed in Embodiment 1 or Embodiment 2 of the present invention.

[0084] Example 5 This invention discloses a computer storage medium storing computer instructions. When these computer instructions are invoked, they are used to execute some or all of the steps in the cancer subtype identification method based on adaptive denoising and contrastive learning disclosed in Embodiment 1 or Embodiment 2 of this invention.

[0085] The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0086] Through the detailed description of the above embodiments, those skilled in the art can clearly understand that each implementation method can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, including read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically-Erasable Programmable Read-Only Memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.

[0087] Finally, it should be noted that the cancer subtype identification method and system based on adaptive denoising and contrastive learning disclosed in the embodiments of the present invention are merely preferred embodiments of the present invention and are only used to illustrate the technical solutions of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A cancer subtype identification method based on adaptive denoising and contrastive learning, characterized in that, The method includes the following steps: Gene expression data of the target patient is obtained, and the gene expression data is standardized and preprocessed to obtain initial data; The initial data is processed by a pre-trained denoising Transformer encoder to extract features and obtain a low-dimensional feature embedding vector. Calculate the similarity between the low-dimensional feature embedding vector and each cluster center in the preset cluster center set, and determine the target cancer subtype label corresponding to the target patient based on the mapping relationship corresponding to the cluster center with the highest similarity. Based on integral gradient technology, attribution analysis is performed on the decision-making process of classifying the target patient into the target cancer subtype label, and the genes that contribute the most to the decision are identified and ranked to obtain a list of key biomarkers.

2. The cancer subtype identification method based on adaptive denoising and contrastive learning according to claim 1, characterized in that, The standardization preprocessing of the gene expression data to obtain initial data includes the following steps: The gene expression data is converted to a numerical type and missing values are imputed using the median to obtain processed data; The processed data is aligned with a preset standardized gene panel using gene identifiers to obtain aligned data; The aligned data is standardized to obtain initial data.

3. The cancer subtype identification method based on adaptive denoising and contrastive learning according to claim 1, characterized in that, The step of extracting features from the initial data using a pre-trained denoising Transformer encoder to obtain a low-dimensional feature embedding vector includes the following steps: The initial data is mapped to the model dimension through an input linear projection layer to obtain projected features; Add position encoding information to the projection feature to obtain the projection feature with added position encoding information; The projected features with added position encoding information are input into at least one Transformer encoding block for processing to obtain the processed features; The processed features are vector normalized to obtain low-dimensional feature embedding vectors.

4. The cancer subtype identification method based on adaptive denoising and contrastive learning according to claim 3, characterized in that, The Transformer coding block includes a multi-head self-attention mechanism and a feedforward neural network.

5. The cancer subtype identification method based on adaptive denoising and contrastive learning according to claim 1, characterized in that, The process of calculating the similarity between the low-dimensional feature embedding vector and each cluster center in a preset set of cluster centers, and determining the target cancer subtype label corresponding to the target patient based on the mapping relationship corresponding to the cluster center with the highest similarity, includes the following steps: Calculate the similarity between the low-dimensional feature embedding vector and each cluster center in the preset cluster center set to obtain a set of similarity values; The maximum value is determined from the set of similarity values, and the cluster center corresponding to the maximum value is determined as the matching cluster center; Query the preset mapping table between cluster centers and cancer subtype labels, obtain the cancer subtype label that uniquely corresponds to the matching cluster center, and use it as the target cancer subtype label.

6. The cancer subtype identification method based on adaptive denoising and contrastive learning according to claim 5, characterized in that the mapping table is established during the model training phase, in which each cluster center is assigned a corresponding cancer subtype label based on the clustering results of historical patient data and clinical information analysis.

7. The cancer subtype identification method based on adaptive denoising and contrastive learning according to claim 1, characterized in that the step of performing attribution analysis, based on integral gradient technology, on the decision-making process of classifying the target patient into the target cancer subtype label, identifying and ranking the genes that contribute the most to the decision, and obtaining a list of key biomarkers includes the following steps: constructing a linear interpolation path between the initial data and a baseline input, and sampling multiple interpolation points along the path; for each interpolation point, calculating the gradient of the attribution objective function with respect to its input features, where the attribution objective function is defined in terms of the denoising Transformer encoder, the initial data x, and centroid, the cluster center with the highest similarity; integrating the gradients calculated through the attribution objective function along the interpolation path to obtain the integral gradient value corresponding to each gene feature; and sorting the genes according to the absolute value of the integral gradient value corresponding to each gene feature, and selecting the top-ranked genes to form the list of key biomarkers.
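
The attribution steps of claim 7 follow the pattern usually called integrated gradients, and can be sketched as below. This is an illustration under stated assumptions, not the claimed implementation: `toy_encoder` is a hypothetical one-layer stand-in for the denoising Transformer encoder, the attribution objective is assumed to be the cosine similarity between the embedding and the matched cluster center, the baseline is assumed to be an all-zero input, and gradients are taken by central finite differences only to keep the sketch free of an autodiff dependency.

```python
import numpy as np

def toy_encoder(x, w, b):
    # Hypothetical stand-in for the pre-trained denoising Transformer encoder.
    return np.tanh(w @ x + b)

def objective(x, w, b, centroid):
    # Assumed attribution objective: cosine similarity of embedding and centroid.
    z = toy_encoder(x, w, b)
    return float(z @ centroid / (np.linalg.norm(z) * np.linalg.norm(centroid)))

def numerical_grad(fn, x, h=1e-5):
    # Central finite differences; an autodiff framework would replace this in practice.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (fn(x + e) - fn(x - e)) / (2 * h)
    return g

def integrated_gradients(fn, x, baseline, steps=64):
    # Sample interpolation points on the straight path from baseline to x
    # (midpoint rule) and average the gradients along the path.
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += numerical_grad(fn, baseline + a * (x - baseline))
    return (x - baseline) * total / steps

def top_genes(attributions, gene_names, k):
    # Rank genes by the absolute value of their attribution.
    order = np.argsort(-np.abs(attributions))[:k]
    return [(gene_names[i], float(attributions[i])) for i in order]

rng = np.random.default_rng(0)
w, b = rng.normal(size=(4, 6)), rng.normal(size=4)
centroid = rng.normal(size=4)          # stands in for the matched cluster center
x = rng.normal(size=6)                 # synthetic standardized expression values
baseline = np.zeros(6)                 # assumed baseline input
fn = lambda v: objective(v, w, b, centroid)

attr = integrated_gradients(fn, x, baseline)
gene_names = [f"gene_{i}" for i in range(6)]
ranked = top_genes(attr, gene_names, k=3)
print(ranked)
# Completeness check: attributions should sum to roughly fn(x) - fn(baseline).
print(abs(attr.sum() - (fn(x) - fn(baseline))))
```

The completeness check at the end is a useful sanity test for the number of interpolation points: if the gap is large, more steps are needed.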

8. A cancer subtype identification system based on adaptive denoising and contrastive learning, characterized in that the system includes: an acquisition module, used to acquire gene expression data of the target patient and perform standardized preprocessing on the gene expression data to obtain initial data; an extraction module, used to extract features from the initial data using a pre-trained denoising Transformer encoder to obtain a low-dimensional feature embedding vector; a determination module, used to calculate the similarity between the low-dimensional feature embedding vector and each cluster center in the preset set of cluster centers, and to determine the target cancer subtype label corresponding to the target patient based on the mapping relationship corresponding to the cluster center with the highest similarity; and an attribution analysis module, used to perform attribution analysis, based on integral gradient technology, on the decision-making process of classifying the target patient into the target cancer subtype label, to identify and rank the genes that contribute the most to the decision, and to obtain a list of key biomarkers.

9. A cancer subtype identification device based on adaptive denoising and contrastive learning, characterized in that the device includes: a memory storing executable program code; and a processor coupled to the memory; wherein the processor calls the executable program code stored in the memory to execute the cancer subtype identification method based on adaptive denoising and contrastive learning as described in any one of claims 1-7.

10. A computer storage medium, characterized in that the computer storage medium stores computer instructions which, when invoked, are used to execute the cancer subtype identification method based on adaptive denoising and contrastive learning as described in any one of claims 1-7.