An incomplete multi-omics cancer subtype identification method, system, device and medium

By constructing a granular feature module and a module-level skeleton representation, and combining it with a view availability mask for cross-omics consensus alignment and adaptive weighting, the problem of structural bias and missing pattern influence in incomplete multi-omics data is solved, thereby improving the stability and biological consistency of cancer subtype identification.

CN122266484A · Pending · Publication Date: 2026-06-23 · JIANGNAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JIANGNAN UNIV
Filing Date
2026-05-27
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing multi-omics cancer subtype identification methods are prone to structural bias in scenarios with incomplete data, insufficient preservation of internal omics structure, and susceptibility of cross-omics fusion to missing patterns, resulting in insufficient stability and biological consistency of subtype results.

Method used

By constructing granular-sphere feature modules, extracting module-level skeleton representations, recovering entry-level missing values based on the skeleton structure, and combining view availability masks for cross-omics consensus alignment and adaptive view weighting, robust integration both within and across omics is achieved.

Benefits of technology

Prioritizing the preservation of the intrinsic structural consistency of multi-omics data under missing and noisy conditions improves the robustness and biological consistency of cancer subtype identification results, thereby enhancing their clinical application value.

✦ Generated by Eureka AI based on patent content.

Figure CN122266484A_ABST

Abstract

The application discloses an incomplete multi-omics cancer subtype identification method, system, device and medium, and belongs to the technical field of bioinformatics and artificial intelligence. The method comprises the following steps: acquiring and preprocessing incomplete multi-omics data; constructing an entry-level observation mask matrix and a view availability mask matrix; constructing feature modules within each omics based on granular-sphere partitioning; extracting a module-level skeleton representation; recovering entry-level missing values based on the skeleton structure; constructing a multi-expert skeleton recovery ensemble result; constructing a central feature matrix of the feature modules; constructing a cross-omics consensus structure space; performing mask-aware skeleton consensus alignment; performing adaptive view weighting based on structural reliability; performing sample-level mask-aware fusion; and outputting a cancer subtype clustering result. The application provides reliable technical support for cancer typing research, patient stratification analysis, prognosis evaluation and precision medicine auxiliary decision-making.

Description

Technical Field

[0001] This invention relates to the fields of bioinformatics, artificial intelligence, and multi-omics data analysis, specifically to a method, system, device, and medium for incomplete multi-omics cancer subtype identification.

Background Technology

[0002] Cancer is a complex disease characterized by high molecular heterogeneity. Even when originating from the same tissue or organ, different patients may exhibit significant differences in gene expression, DNA methylation, miRNA regulation, tumor microenvironment, and clinical prognosis. Identifying molecular subtypes in cancer patients can reveal the underlying mechanisms of tumor development and progression, providing crucial information for patient stratification, prognostic assessment, and personalized treatment. With the advancement of high-throughput sequencing technology, multi-omics data, such as mRNA expression, DNA methylation, and miRNA expression data, can be simultaneously obtained from the same batch of patient samples, thus providing an important foundation for multidimensional characterization of cancer status.

[0003] Existing multi-omics cancer subtype identification methods generally fall into three categories: early integration, late integration, and intermediate integration. Early integration methods typically concatenate features from different omics and then perform clustering or dimensionality reduction; the process is simple but susceptible to high-dimensional noise, feature redundancy, and scale differences. Late integration methods usually model each omics independently before fusing the results; they retain some omics specificity, but their capacity for joint cross-omics modeling is limited. Intermediate integration methods learn a shared structure through latent spaces, similarity networks, or matrix factorization, and perform well when the multi-omics data are complete. However, most of these methods assume complete data or model only mild missingness, so they adapt poorly to the complex missing-data patterns of real-world clinical scenarios.

[0004] In real-world clinical studies, multi-omics data are often incomplete. On the one hand, due to sequencing costs, sample quality, differences in detection platforms, or limitations of experimental batches, some patients may lack measurements of a particular omics category, resulting in block missing data at the omics level. On the other hand, even if a particular omics has been measured, due to insufficient detection sensitivity, experimental noise, quantification instability, or filtering operations during data processing, some molecular features may still be missing within the omics, resulting in entry missing data. These two types of missing data usually coexist in the same dataset and jointly affect the similarity between samples, the stability of the internal structure of omics, and cross-omics consistency, thereby increasing the difficulty of cancer subtype identification.

[0005] For incomplete multi-omics data, existing methods typically take one of two approaches. The first follows an "impute, then fuse" strategy: missing values are first imputed using means, matrix completion, or model prediction, and the imputed omics data are then clustered. The second handles partially missing views directly, integrating multiple omics through shared latent representations, similarity network enhancement, or incomplete multi-view learning. These methods mitigate the impact of missing data to some extent but still have significant shortcomings. First, simple numerical imputation focuses only on the accuracy of individual recovered values and cannot guarantee that the recovered features still conform to the original omics structure, so it easily introduces spurious correlations and disrupts the intrinsic consistency between real feature modules. Second, fusion methods for partially incomplete data focus mainly on how the available views combine at the sample level and lack explicit modeling of the internal feature structure of each omics, so the learned representations are unstable when high-dimensional noise and missing entries coexist. Third, in the cross-omics fusion stage, if no sample-level constraint accounts for the number of omics available for each patient sample, samples with different missing patterns are prone to scale bias and fusion bias, which in turn affects the stability and clinical interpretability of the clustering results.

[0006] Furthermore, existing incomplete multi-omics or incomplete multi-view clustering methods mostly focus on sample-level relationship recovery, graph structure completion, or shared representation learning, and rarely induce stable structural representations explicitly at the feature level. In multi-omics data, high-dimensional features are not fully independent; they often exhibit modularity, co-variation, and locally homogeneous structure. If this structure is ignored during missing data recovery and each missing value is estimated independently, the key structural patterns on which subsequent fusion and clustering depend may be weakened. Especially when entry-level missing and block missing conditions coexist, recovering values without prioritizing structure makes it difficult to guarantee unbiased cross-omics fusion and reliable subtype classification.

[0007] Therefore, there is an urgent need for an incomplete multi-omics cancer subtype identification method that preferentially maintains structural stability when both types of missing data are present. Such a method should induce robust module-level structural representations at the feature level within each omics and complete entry-level missing data recovery under this structural constraint. At the cross-omics level, it should use view availability masks for skeleton consensus alignment, perform adaptive view weighting based on structural reliability, and perform sample-level mask-aware fusion, thereby handling both entry-level and block missing patterns and improving the robustness, biological consistency, and clinical application value of cancer subtype classification results.

Summary of the Invention

[0008] This invention aims to provide a method, system, device, and medium for incomplete multi-omics cancer subtype identification, assisting in cancer molecular subtyping, patient stratification analysis, and prognostic assessment. Existing multi-omics cancer subtype identification methods have several problems in incomplete data scenarios: numerical imputation easily introduces structural bias, the internal structure of each omics is insufficiently preserved, cross-omics fusion is susceptible to missing patterns, and the stability of subtype results is limited. To address these problems, this invention proposes a structure-first incomplete multi-omics cancer subtype identification framework. First, this invention constructs granular-sphere feature modules within each omics and extracts the module mean skeleton, the module's first principal component skeleton, and the standardized skeleton coordinates to obtain a low-dimensional, robust module-level skeleton representation. Then, using the standardized skeleton coordinates as structural explanatory variables, entry-level missing data recovery is performed based on the skeleton structure, and the stability of the recovery results is improved through multi-expert weighted integration. In the cross-omics fusion stage, this invention further constructs a cross-omics consensus structure space and combines view availability masks to perform mask-aware skeleton consensus alignment, adaptive view weighting based on structural reliability, and sample-level mask-aware fusion, thereby achieving unbiased integration of samples in which an entire omics is missing. In this way, the present invention handles two types of missing patterns within the same framework: missing entries within an omics and whole-omics (block) missingness at the omics level. Under missing and noisy conditions, it prioritizes maintaining the intrinsic structural consistency of multi-omics data, thereby improving the robustness, stability, biological consistency, and clinical application value of cancer subtype identification results.

[0009] The technical solution of the present invention:

[0010] An incomplete multi-omics cancer subtype identification method includes the following steps:

[0011] Step 1: Acquisition and preprocessing of incomplete multi-omics data: Acquire multi-omics data from the same batch of patient samples to form a matrix set of incomplete multi-omics data, and perform patient sample ID alignment, abnormal sample removal, missing marker identification, column-level normalization, and scale unification on each omics data to obtain a matrix set of preprocessed incomplete multi-omics data.

[0012] Step 2: Constructing the entry-level observation mask matrix and the view availability mask matrix: An entry-level observation mask matrix is constructed for each omics and is used to determine the set of observed samples and the missing positions in the subsequent entry-level missing value recovery. An entry-level missing value is a value that is missing for some patient samples on certain features within an omics; its position is given by the entry-level observation mask matrix. A view availability mask matrix is also constructed to mark whether each omics is available for each patient sample, and is used for the subsequent mask-aware skeleton consensus alignment and sample-level mask-aware fusion.

[0013] Step 3: Construct feature modules within omics based on granular-sphere partitioning: For each omics, temporarily fill in missing entries with the median of the observed values of each feature column to obtain a temporary completion matrix without missing values; treat each feature as a vector along the sample dimension, construct a feature embedding vector, and perform recursive granular-sphere partitioning based on the directional consistency between features to obtain the feature modules within the omics.

[0014] Step 4: Extract module-level skeleton representation: For each feature module in each omics, extract sample-level skeleton variables within the feature module to obtain the module-level skeleton representation. The module-level skeleton representation includes the module mean skeleton, the module's first principal component skeleton, and the standardized skeleton coordinates.

[0015] Step 5: Entry-level missing value recovery based on skeleton structure: For each feature module, the set of observed samples and the entry-level missing positions of each feature in the module are determined from the entry-level observation mask matrix constructed in Step 2. The standardized skeleton coordinates of the feature module are used as structural explanatory variables. A structural regression design matrix is established for each feature in the module, and the values at the entry-level missing positions are recovered to obtain the candidate recovery matrix.

[0016] Step 6: Construct the multi-expert skeleton restoration ensemble result: Construct multiple skeleton restoration experts, each of which independently performs granular-sphere partitioning, skeleton induction, and structure-constrained restoration to obtain a candidate completion matrix; evaluate the restoration errors of the experts on a pseudo-missing validation set, and combine the candidate completion matrices in a weighted ensemble according to these errors to obtain the final completion matrix of each omics.

[0017] Step 7: Construct the central feature matrix of the feature module: For each feature module, calculate the centroid of the feature module in the feature embedding space, and select the feature closest to the centroid of the feature module as the central feature of the feature module; according to the index of the central feature, extract the corresponding central feature from the final completion matrix of each omics obtained in Step 6 to obtain the central feature matrix.

[0018] Step 8: Construct the cross-omics consensus structure space: Concatenate the module-level skeleton representations of all omics to construct the cross-omics original structure matrix, and perform dimensionality reduction on the cross-omics original structure matrix to obtain the cross-omics consensus structure space.

[0019] Step 9: Perform mask-aware skeleton consensus alignment: Initialize the view soft representation matrix based on the central feature matrix of each omics, construct the sample-level availability diagonal matrix according to the view availability mask matrix, reconstruct the view soft representation matrix of each omics to the cross-omics consensus structure space under mask-aware conditions, and update the soft representation of each omics by minimizing the structure alignment loss.

[0020] Step 10: Adaptive view weighting based on structural reliability: Calculate the structural complexity of each omics and convert the structural complexity into an intermediate score of view reliability. Update the view weights based on the intermediate score of view reliability to obtain the global view weights.

[0021] Step 11: Perform sample-level mask-aware fusion: For each patient sample, the global view weights are renormalized at the sample level according to the available omics set, and the updated soft representations of different omics are weighted and concatenated to obtain the final fusion embedding matrix.

[0022] Step 12: Output cancer subtype clustering results: Perform clustering on the final fusion embedding matrix to obtain cancer subtype labels, and perform survival analysis, clinical label enrichment test and visualization analysis based on the cancer subtype labels.
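The two masks of Step 2 can be illustrated with a minimal sketch. It assumes, purely for illustration, that missing entries are encoded as NaN and that an omics counts as unavailable for a patient when the whole row is missing; the function name `build_masks` and both conventions are assumptions, not taken from the patent text.

```python
import numpy as np

def build_masks(omics):
    """omics: list of (n_samples, d_v) float arrays, NaN = missing (assumed encoding).

    Returns one boolean entry-level observation mask per omics and a boolean
    (n_samples, n_omics) view availability mask.
    """
    # Entry-level mask: True where a value was actually observed.
    entry_masks = [~np.isnan(X) for X in omics]
    # View availability (assumed rule): an omics is available for a sample
    # if at least one of its entries is observed.
    view_mask = np.column_stack([m.any(axis=1) for m in entry_masks])
    return entry_masks, view_mask

# Toy data: sample 1 lacks the first omics entirely, sample 2 lacks the second.
mrna = np.array([[1.0, np.nan], [np.nan, np.nan], [0.5, 2.0]])
meth = np.array([[0.2], [0.4], [np.nan]])
entry_masks, view_mask = build_masks([mrna, meth])
```

The entry-level masks drive the later per-feature recovery, while the view availability mask drives the alignment and fusion stages.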

[0023] Furthermore, the specific steps for preprocessing the incomplete multi-omics data in step 1 are as follows:

[0024] Step 1.1, Patient Sample ID Alignment and Abnormal Sample Removal: Align each omics data according to the patient sample ID, and retain only the patient samples that can be matched at the sample level.

[0025] Step 1.2, Missing Item Identification: Identify missing items in each omics and mark them as missing values.

[0026] Step 1.3, Column-level normalization and scale unification: For non-negative feature columns with a scale greater than a preset threshold, perform a logarithmic transformation:

[0027]

[0028] Calculate the mean and standard deviation of each feature column based on the observed values, and perform column-level standardization:

[0029] $$z_{ij}^{(v)} = \frac{\tilde{x}_{ij}^{(v)} - \mu_j^{(v)}}{\sigma_j^{(v)} + \epsilon}$$

[0030] where $x_{ij}^{(v)}$ denotes the original observation of the $j$-th feature of the $i$-th patient sample in the $v$-th omics, $\tilde{x}_{ij}^{(v)}$ denotes the feature value after logarithmic transformation, $z_{ij}^{(v)}$ denotes the standardized feature value, $\mu_j^{(v)}$ denotes the mean of the $j$-th feature of the $v$-th omics over the observed samples, $\sigma_j^{(v)}$ denotes the standard deviation of the $j$-th feature of the $v$-th omics over the observed samples, and $\epsilon$ denotes a very small positive number that protects against division by zero.
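Step 1.3 can be sketched as follows. The sketch assumes NaN-coded missing entries, takes "scale" to mean the largest observed value in a column, and uses `log(1 + x)` as the logarithmic transform; the function name `preprocess_columns`, the threshold value, and these three choices are illustrative assumptions, since the surviving text does not pin them down.

```python
import numpy as np

def preprocess_columns(X, scale_threshold=50.0, eps=1e-8):
    """Column-wise log transform and standardization using observed values only.

    X: (n_samples, n_features) array, NaN = missing (assumed encoding).
    """
    X = np.array(X, dtype=float)
    observed = ~np.isnan(X)
    out = X.copy()  # missing entries stay NaN
    for j in range(X.shape[1]):
        col = X[observed[:, j], j]
        if col.size == 0:
            continue  # nothing observed in this column
        # Assumed rule: log-transform non-negative, large-scale columns.
        if col.min() >= 0 and col.max() > scale_threshold:
            col = np.log1p(col)
        mu, sigma = col.mean(), col.std()
        out[observed[:, j], j] = (col - mu) / (sigma + eps)
    return out

X_raw = np.array([[0.0, 1.0], [100.0, np.nan], [1000.0, 3.0]])
Z = preprocess_columns(X_raw)
```

Statistics are computed on observed entries only, so the missing positions do not bias the column mean or standard deviation.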

[0031] Furthermore, step 3 specifically includes the following steps:

[0032] Step 3.1: For each omics, temporarily fill in the missing entries. For the $j$-th feature in the $v$-th omics, denote its vector along the sample dimension as the sample vector:

[0033] $$f_j^{(v)} = \left[ X_{\mathrm{tmp}}^{(v)} \right]_{:,j} \in \mathbb{R}^{n}$$

[0034] where the temporary completion matrix $X_{\mathrm{tmp}}^{(v)}$ denotes the matrix obtained for the $v$-th omics in the granular-sphere partitioning stage by temporarily imputing missing entries with the median of the observed values; it is used only for feature embedding construction and granular-sphere partitioning and is not the final missing-entry recovery result. $[X_{\mathrm{tmp}}^{(v)}]_{:,j}$ denotes the $j$-th feature column of the temporary completion matrix of the $v$-th omics, $f_j^{(v)}$ denotes the sample vector of the $j$-th feature in the $v$-th omics, and $n$ denotes the number of patient samples.

[0035] Step 3.2: Center the feature vectors and apply $\ell_2$ normalization to obtain the feature embedding vectors:

[0036] $$e_j^{(v)} = \frac{f_j^{(v)} - \bar{f}_j^{(v)} \mathbf{1}_n}{\left\| f_j^{(v)} - \bar{f}_j^{(v)} \mathbf{1}_n \right\|_2}$$

[0037] where $e_j^{(v)}$ denotes the normalized feature embedding vector of the $j$-th feature in the $v$-th omics, $\bar{f}_j^{(v)}$ denotes the mean of the sample vector $f_j^{(v)}$ across the sample dimension, $\mathbf{1}_n$ denotes the all-ones vector of length $n$, and $\|\cdot\|_2$ denotes the $\ell_2$ norm of a vector.

[0038] Step 3.3: For any candidate feature set $S$ to be partitioned, the internal mean cosine similarity is defined as:

[0039] $$\bar{s}(S) = \frac{2}{|S|\,(|S|-1)} \sum_{\substack{j,k \in S \\ j < k}} \left( e_j^{(v)} \right)^{\top} e_k^{(v)}$$

[0040] where $|S|$ denotes the number of features in the candidate feature set; $\bar{s}(S)$ reflects the consistency of feature directions within the set; $d_v$ denotes the feature dimension of the $v$-th omics; $j$ and $k$ denote different features in the candidate feature set $S$; $e_j^{(v)}$ and $e_k^{(v)}$ denote the normalized feature embedding vectors of the $j$-th and $k$-th features in the $v$-th omics; and $S$ is a subset of the complete feature set $\{1, \dots, d_v\}$ of the $v$-th omics.

[0041] Step 3.4: To avoid the computational cost of explicitly calculating all pairwise cosine similarities, let:

[0042] $$g_S = \sum_{j \in S} e_j^{(v)}$$

[0043] where $g_S$ denotes the sum of the feature embedding vectors in the candidate feature set. Because every embedding vector has unit norm, the mean cosine similarity is equivalently calculated as:

[0044] $$\bar{s}(S) = \frac{\| g_S \|_2^2 - |S|}{|S|\,(|S|-1)}$$

[0045] Step 3.5: Define the mean cosine distance of the candidate feature set $S$ as:

[0046] $$\bar{d}(S) = 1 - \bar{s}(S)$$

[0047] where the smaller $\bar{d}(S)$ is, the more consistent the feature directions within the candidate feature set and the stronger its structural homogeneity.
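The shortcut of Steps 3.2 to 3.5 relies on the embeddings being centered and unit-norm, in which case the squared norm of their sum equals the number of features plus twice the sum of pairwise cosines. The sketch below checks that equivalence numerically; the function names are hypothetical, and constant feature columns (zero norm after centering) are assumed to have been filtered out beforehand.

```python
import numpy as np

def embed_features(X_filled):
    """Center each feature column and scale it to unit L2 norm.

    X_filled: (n_samples, n_features) matrix without missing values.
    Returns E with one unit-norm row per feature, shape (n_features, n_samples).
    """
    centered = X_filled - X_filled.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=0, keepdims=True)
    return (centered / np.maximum(norms, 1e-12)).T

def mean_cosine_pairwise(E):
    """Explicit average over all pairs j != k (quadratic in the set size)."""
    m = E.shape[0]
    G = E @ E.T
    return (G.sum() - np.trace(G)) / (m * (m - 1))

def mean_cosine_fast(E):
    """Same quantity from the summed embedding (linear in the set size)."""
    m = E.shape[0]
    g = E.sum(axis=0)
    return (g @ g - m) / (m * (m - 1))

rng = np.random.default_rng(0)
E = embed_features(rng.normal(size=(20, 6)))
mean_cos_distance = 1.0 - mean_cosine_fast(E)
```

Both functions need at least two features in the set; the fast version avoids forming the full similarity matrix.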

[0048] Step 3.6: When the candidate feature set $S$ does not yet meet the homogeneity requirement, principal component analysis (PCA) is performed on the feature embedding submatrix corresponding to $S$, and $S$ is divided along the median of the first principal component score into two subsets $S_L$ and $S_R$, where $S_L$ and $S_R$ denote the left and right sub-feature sets obtained from the partitioning, respectively.

[0049] Step 3.7: Define the weighted mean cosine distance of the subsets as:

[0050] $$\bar{d}_{\mathrm{split}}(S) = \frac{|S_L|\,\bar{d}(S_L) + |S_R|\,\bar{d}(S_R)}{|S|}$$

[0051] Step 3.8: Define the splitting improvement rate of the candidate feature set $S$ as:

[0052] $$\rho(S) = \frac{\bar{d}(S) - \bar{d}_{\mathrm{split}}(S)}{\bar{d}(S)}$$

[0053] where $\rho(S)$ represents the degree of improvement in structural homogeneity obtained by the binary division of the candidate feature set $S$.

[0054] Step 3.9: When $\rho(S) < \tau$, the improvement in structural homogeneity from continuing to split the current candidate feature set $S$ is insufficient, so splitting stops and the current candidate feature set is taken directly as a final feature module; when $\rho(S) \ge \tau$, the left and right sub-feature sets $S_L$ and $S_R$ are partitioned recursively; when the number of generated feature sets reaches the preset upper limit $K_{\max}$, partitioning stops, and all feature sets obtained after recursive granular-sphere partitioning of the $v$-th omics are output as the final feature modules $\{ G_k^{(v)} \}_{k=1}^{K_v}$. Here $\tau$ denotes the stopping threshold for granular-sphere partitioning.
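The recursive partitioning of Steps 3.6 to 3.9 can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: the improvement rate is taken as the relative drop in mean cosine distance, sets are processed breadth-first, and the parameters `tau`, `max_modules`, and `min_size` as well as all function names are hypothetical.

```python
import numpy as np

def mean_cos_dist(E):
    """Mean cosine distance of unit-norm rows, clamped at zero against round-off."""
    m = E.shape[0]
    if m < 2:
        return 0.0
    g = E.sum(axis=0)
    return max(0.0, 1.0 - (g @ g - m) / (m * (m - 1)))

def split_by_pc1(E, idx):
    """Split a feature set at the median of its first principal component score."""
    sub = E[idx] - E[idx].mean(axis=0, keepdims=True)
    U, s, _ = np.linalg.svd(sub, full_matrices=False)
    scores = U[:, 0] * s[0]
    left = scores <= np.median(scores)
    return idx[left], idx[~left]

def granular_sphere_partition(E, tau=0.05, max_modules=50, min_size=2, eps=1e-12):
    """E: (n_features, n_samples) centered, unit-norm feature embeddings."""
    modules, queue = [], [np.arange(E.shape[0])]
    while queue:
        idx = queue.pop(0)
        # Splitting replaces one set by two; check the upper limit first.
        within_budget = len(modules) + len(queue) + 2 <= max_modules
        if len(idx) < 2 * min_size or not within_budget:
            modules.append(idx)
            continue
        left, right = split_by_pc1(E, idx)
        if len(left) == 0 or len(right) == 0:
            modules.append(idx)  # degenerate split, keep the set whole
            continue
        d_parent = mean_cos_dist(E[idx])
        d_split = (len(left) * mean_cos_dist(E[left])
                   + len(right) * mean_cos_dist(E[right])) / len(idx)
        gain = (d_parent - d_split) / (d_parent + eps)
        if gain < tau:
            modules.append(idx)      # improvement too small: final module
        else:
            queue.extend([left, right])
    return modules
```

As a usage example, eight features made of four copies each of two orthogonal directions are separated into two modules of four features, after which further splits bring no improvement and recursion stops.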

[0055] Furthermore, steps 4 to 6 specifically include the following steps:

[0056] Step 4.1: For the $k$-th feature module $G_k^{(v)}$ in the $v$-th omics, the module mean skeleton is calculated from the corresponding module submatrix:

[0057] $$m_{ik}^{(v)} = \frac{1}{|G_k^{(v)}|} \sum_{j \in G_k^{(v)}} \left[ X_{\mathrm{tmp}}^{(v)} \right]_{ij}$$

[0058] where $m_{ik}^{(v)}$ denotes the module mean skeleton value of the $i$-th patient sample on the $k$-th feature module of the $v$-th omics.

[0059] Step 4.2: Perform first principal component analysis on the module submatrix to obtain the first principal component skeleton of the module:

[0060] $$p_k^{(v)} = \mathrm{PC}_1\!\left( X_{\mathrm{tmp},\,G_k}^{(v)} \right)$$

[0061] where $\mathrm{PC}_1(\cdot)$ denotes the operation of extracting the first principal component score of a matrix, and $X_{\mathrm{tmp},\,G_k}^{(v)}$ denotes the module submatrix composed of the feature columns of the temporary completion matrix of the $v$-th omics that belong to the $k$-th feature module $G_k^{(v)}$.

[0062] Step 4.3: Perform column-level standardization on the first principal component skeleton of the module to obtain the standardized skeleton coordinates:

[0063] $$u_k^{(v)} = \mathrm{Std}\!\left( p_k^{(v)} \right)$$

[0064] where $\mathrm{Std}(\cdot)$ denotes the column-level standardization operation, and $u_k^{(v)}$ denotes the standardized skeleton coordinates of the $k$-th feature module of the $v$-th omics.
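A minimal sketch of the three skeleton variables of Steps 4.1 to 4.3 follows, computing the first principal component score through an SVD of the column-centered module submatrix. The function name `module_skeleton` and the `eps` guard are assumptions; note also that the sign of a principal component is arbitrary, so the skeleton is defined only up to a sign flip.

```python
import numpy as np

def module_skeleton(X_filled, module_idx, eps=1e-8):
    """Skeleton variables of one feature module.

    X_filled: (n_samples, n_features) temporarily completed matrix.
    module_idx: integer indices of the features in the module.
    Returns (mean skeleton, first principal component skeleton,
    standardized skeleton coordinates), each of length n_samples.
    """
    sub = X_filled[:, module_idx]
    mean_skel = sub.mean(axis=1)                      # per-sample module mean
    centered = sub - sub.mean(axis=0, keepdims=True)  # column-center before PCA
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    pc1 = U[:, 0] * s[0]                              # first PC score per sample
    coords = (pc1 - pc1.mean()) / (pc1.std() + eps)   # standardized coordinates
    return mean_skel, pc1, coords

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(12, 5))
mean_skel, pc1, coords = module_skeleton(X_demo, np.array([0, 2, 3]))
```

Each module thus contributes a handful of per-sample variables, which is what makes the skeleton representation low-dimensional compared with the raw feature columns.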

[0065] Step 5.1: Using the standardized skeleton coordinates as structural explanatory variables, construct a structural regression design matrix with an intercept term:

[0066] $$D_k^{(v)} = \left[ \mathbf{1}_n,\; u_k^{(v)} \right]$$

[0067] Step 5.2: For any feature $j$ in the feature module $G_k^{(v)}$, let its response vector be:

[0068] $$y_j^{(v)} = \left[ X^{(v)} \right]_{:,j}$$

[0069] where $[X^{(v)}]_{:,j}$ denotes the column vector of values of the $j$-th feature of the $v$-th omics across all patient samples, and $y_j^{(v)}$ denotes the response vector obtained by taking this column as the response. Define the set of observed samples of feature $j$ as:

[0070] $$\Omega_j^{(v)} = \left\{\, i \;\middle|\; M_{ij}^{(v)} = 1 \,\right\}$$

[0071] where $M_{ij}^{(v)} = 1$ indicates that the value of the $j$-th feature of the $i$-th patient sample in the $v$-th omics is observed rather than missing.

[0072] Step 5.3: Fit the ridge regression parameters on the observed sample set $\Omega_j^{(v)}$:

[0073] $$\beta_j^{(v)} = \left( D_{\Omega}^{\top} D_{\Omega} + \lambda_r I \right)^{-1} D_{\Omega}^{\top} y_{\Omega}$$

[0074] where $D_{\Omega}$ denotes the submatrix obtained by extracting from the structural regression design matrix $D_k^{(v)}$ the rows corresponding to the observed sample set $\Omega_j^{(v)}$, $\lambda_r$ denotes the ridge regularization coefficient of the skeleton recovery stage, $I$ denotes the identity matrix, $y_{\Omega}$ denotes the values of the response vector $y_j^{(v)}$ on the observed sample set $\Omega_j^{(v)}$, and $\beta_j^{(v)}$ denotes the ridge regression parameters of the $j$-th feature in the $v$-th omics.

[0075] Step 5.4: Based on the entry-level observation mask matrix constructed in Step 2, determine the observed sample set and the entry-level missing positions of each feature within this feature module. For an entry-level missing position $(i, j)$, the missing value is recovered according to the following formula:

[0076] $$\hat{x}_{ij}^{(v)} = d_i^{\top} \beta_j^{(v)}$$

[0077] where $\hat{x}_{ij}^{(v)}$ denotes the recovered value of the $j$-th feature of the $i$-th patient sample in the $v$-th omics, $\beta_j^{(v)}$ denotes the ridge regression parameters of the $j$-th feature in the $v$-th omics, and $d_i^{\top}$ denotes the row of the structural regression design matrix $D_k^{(v)}$ corresponding to the $i$-th patient sample. The candidate recovery matrix is composed of the recovered values at all entry-level missing positions together with the original observed values, and is used for the subsequent multi-expert skeleton restoration integration.
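The per-feature recovery of Steps 5.1 to 5.4 amounts to a two-parameter ridge regression on the observed samples followed by prediction at the missing ones. The sketch below assumes a design matrix made of an intercept column and one standardized skeleton coordinate, and it penalizes the intercept along with the slope because the text describes a plain identity-matrix penalty; the function name `recover_feature` is hypothetical.

```python
import numpy as np

def recover_feature(y, observed, coords, lam=1.0):
    """Ridge-recover the missing entries of one feature from its module skeleton.

    y: (n,) feature values, NaN at missing positions.
    observed: (n,) boolean entry-level mask for this feature.
    coords: (n,) standardized skeleton coordinates of the feature's module.
    """
    D = np.column_stack([np.ones_like(coords), coords])  # intercept + skeleton
    D_obs, y_obs = D[observed], y[observed]
    # Closed-form ridge solution on the observed samples only.
    beta = np.linalg.solve(D_obs.T @ D_obs + lam * np.eye(2), D_obs.T @ y_obs)
    y_hat = y.copy()                       # observed values are kept unchanged
    y_hat[~observed] = D[~observed] @ beta
    return y_hat, beta

coords_demo = np.array([-1.0, 0.0, 1.0, 2.0])
y_demo = np.array([-1.0, 2.0, 5.0, np.nan])   # observed part follows 2 + 3 * coords
obs_demo = np.array([True, True, True, False])
y_rec, beta_demo = recover_feature(y_demo, obs_demo, coords_demo, lam=0.0)
```

With `lam=0.0` the fit is ordinary least squares and the missing entry is recovered as 8.0; a positive `lam` shrinks the coefficients and stabilizes the fit when few samples are observed.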

[0078] Step 6.1: Sample a subset of coordinates from the originally observed positions to construct a pseudo-missing validation set, and build multiple skeleton restoration experts; each skeleton restoration expert performs granular-sphere partitioning, skeleton induction, and structure-constrained restoration using a different random perturbation or a different temporary completion initialization to obtain a candidate completion matrix:

[0079] $$\hat{X}^{(v,e)}, \quad e = 1, \dots, E$$

[0080] where $\hat{X}^{(v,e)}$ denotes the candidate completion matrix obtained by the $e$-th skeleton restoration expert in the $v$-th omics.

[0081] Step 6.2: Evaluate the restoration errors of the different skeleton restoration experts on the pseudo-missing validation set, and calculate the ensemble weight of each skeleton restoration expert from these errors:

[0082]

[0083] where $w_e^{(v)}$ denotes the ensemble weight of the $e$-th skeleton restoration expert in the $v$-th omics, $\mathrm{err}_e^{(v)}$ denotes the restoration error of the $e$-th skeleton restoration expert of the $v$-th omics on the pseudo-missing validation set, and $\mathrm{err}_{e'}^{(v)}$ denotes the restoration error of the $e'$-th skeleton restoration expert of the $v$-th omics on the pseudo-missing validation set.

[0084] Step 6.3: Perform element-wise weighted integration of the candidate completion matrices of the skeleton restoration experts according to their ensemble weights to obtain the final completion matrix $\hat{X}^{(v)}$ of the $v$-th omics:

[0085] $$\hat{X}^{(v)} = \sum_{e=1}^{E} w_e^{(v)}\, \hat{X}^{(v,e)}$$
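The ensemble of Steps 6.1 to 6.3 can be sketched as follows. The exact weight formula is not recoverable from the text, so the sketch assumes normalized inverse-error weights and root-mean-square error on the pseudo-missing positions; both choices, and the function name `ensemble_experts`, are assumptions made only for illustration.

```python
import numpy as np

def ensemble_experts(candidates, X_true, val_mask, eps=1e-8):
    """Weight candidate completions by their error on held-out observed entries.

    candidates: list of (n, d) candidate completion matrices, one per expert.
    X_true: (n, d) matrix holding the true values at the validation positions.
    val_mask: (n, d) boolean mask of the pseudo-missing validation positions.
    """
    errors = np.array([
        np.sqrt(np.mean((C[val_mask] - X_true[val_mask]) ** 2))
        for C in candidates
    ])
    inv = 1.0 / (errors + eps)       # assumed rule: lower error, higher weight
    weights = inv / inv.sum()
    fused = sum(w * C for w, C in zip(weights, candidates))
    return fused, weights, errors

X_val = np.array([[1.0, 2.0], [3.0, 4.0]])
mask_val = np.array([[True, False], [False, True]])
experts = [X_val + 0.1, X_val + 0.3]          # two experts with different bias
fused, weights, errors = ensemble_experts(experts, X_val, mask_val)
```

In this toy case the expert with one third of the error receives three times the weight, and the fused matrix lies between the two candidates.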

[0086] Furthermore, steps 7 to 9 specifically include the following steps:

[0087] Step 7.1: For the $k$-th feature module $G_k^{(v)}$ in the $v$-th omics, calculate the centroid of all feature embedding vectors in this feature module:

[0088] $$c_k^{(v)} = \frac{1}{|G_k^{(v)}|} \sum_{j \in G_k^{(v)}} e_j^{(v)}$$

[0089] where $c_k^{(v)}$ denotes the centroid of the feature embedding vectors of the $k$-th feature module of the $v$-th omics.

[0090] Step 7.2: Calculate the cosine similarity between each feature embedding vector and the centroid within the feature module, and select the feature with the highest cosine similarity as the central feature of the feature module:

[0091] $$j_k^{*} = \arg\max_{j \in G_k^{(v)}} \cos\!\left( e_j^{(v)},\, c_k^{(v)} \right)$$

[0092] Step 7.3: Collect the indices of the central features of all feature modules into a central feature index vector, and extract the corresponding columns from the final completion matrix $\hat{X}^{(v)}$ to obtain the central feature matrix $C^{(v)}$.
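Central feature selection (Steps 7.1 to 7.3) reduces each module to the single feature whose embedding is most aligned with the module centroid. A minimal sketch, with the hypothetical function name `central_feature_indices`:

```python
import numpy as np

def central_feature_indices(E, modules, eps=1e-12):
    """Pick, per module, the feature closest in cosine similarity to the centroid.

    E: (n_features, n_samples) feature embedding vectors, one row per feature.
    modules: list of integer index arrays, one per feature module.
    """
    centers = []
    for idx in modules:
        centroid = E[idx].mean(axis=0)
        sims = (E[idx] @ centroid) / (
            np.linalg.norm(E[idx], axis=1) * np.linalg.norm(centroid) + eps
        )
        centers.append(int(idx[np.argmax(sims)]))
    return np.array(centers)

E_toy = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.0, -1.0]])
modules_toy = [np.array([0, 1, 2]), np.array([3])]
center_idx = central_feature_indices(E_toy, modules_toy)
# The central feature matrix is then the completed matrix restricted to
# these columns, for example X_final[:, center_idx].
```

In the toy example the second feature sits between the other two members of its module and is therefore selected; a single-feature module selects itself.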

[0093] Step 8.1: Concatenate column-wise the module mean skeletons, the module first principal component skeletons, and the standardized skeleton coordinates of each omics to obtain the cross-omics original structure matrix $T$.

[0094] Step 8.2: Perform PCA dimensionality reduction on the cross-omics original structure matrix $T$ to obtain the low-dimensional cross-omics consensus structure space $Z$:

[0095] $$Z = \mathrm{PCA}_r(T) \in \mathbb{R}^{n \times r}$$

[0096] where $r$ denotes the dimension of the cross-omics consensus structure space.
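Steps 8.1 and 8.2 can be sketched with a plain SVD-based PCA. The function name `consensus_space` is hypothetical, and the sketch assumes every skeleton block is defined for all samples, which holds when skeletons are computed from the temporary completion matrices.

```python
import numpy as np

def consensus_space(skeleton_blocks, r):
    """Concatenate per-omics skeleton blocks and reduce them to r PCA scores.

    skeleton_blocks: list of (n_samples, k_v) arrays of skeleton variables.
    Returns an (n_samples, r) matrix of principal component scores.
    """
    T = np.column_stack(skeleton_blocks)        # cross-omics structure matrix
    T = T - T.mean(axis=0, keepdims=True)       # center columns before PCA
    U, s, _ = np.linalg.svd(T, full_matrices=False)
    r = min(r, len(s))
    return U[:, :r] * s[:r]

rng = np.random.default_rng(2)
blocks = [rng.normal(size=(10, 3)), rng.normal(size=(10, 2))]
Z_demo = consensus_space(blocks, r=2)
```

The resulting columns are mutually orthogonal and ordered by explained variance, which keeps the later ridge reconstruction well conditioned.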

[0097] Step 9.1: For the $v$-th omics, the central feature matrix $C^{(v)}$ is initialized as a non-negative, row-normalized view soft representation matrix $S^{(v)}$.

[0098] Step 9.2: Based on the view availability mask matrix, construct the sample-level availability diagonal matrix of the $v$-th omics:

[0099] $$A^{(v)} = \mathrm{diag}\!\left( a_{1v}, a_{2v}, \dots, a_{nv} \right)$$

[0100] where $a_{iv} \in \{0, 1\}$ is the sample-level view availability mask, indicating whether the $i$-th patient sample has the $v$-th omics.

[0101] Step 9.3: Under mask-aware conditions, calculate the ridge structure reconstruction of the $v$-th omics in the cross-omics consensus structure space:

[0102] $$\hat{S}^{(v)} = Z \left( Z^{\top} A^{(v)} Z + \lambda_a I \right)^{-1} Z^{\top} A^{(v)} S^{(v)}$$

[0103] where $\hat{S}^{(v)}$ denotes the reconstruction result of the $v$-th omics in the consensus structure space, $\lambda_a$ denotes the ridge regularization coefficient of the structure alignment stage, and $I$ denotes the identity matrix.

[0104] Step 9.4: Define the structure alignment loss to measure the difference between the view soft representation matrix and its reconstruction result in the cross-omics consensus structure space; construct an update direction based on the residual term corresponding to the structure alignment loss, and use this update direction to iteratively update the view soft representation matrix. The structure alignment loss is:

[0105]

[0106] The structural alignment residual gradient is:

[0107]

[0108] The soft representation is updated as follows:

[0109]

[0110] Step 9.5: Project the updated soft representation matrix onto the non-negativity and row-normalization constraints, that is, clip negative entries and then normalize each row.
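The alignment loop of Steps 9.1 to 9.5 can be sketched as below. Several details are assumptions because the corresponding formulas did not survive extraction: the loss is taken as the squared Frobenius norm of the masked residual, the update is a fixed-step move against that residual, and row normalization is taken as dividing each row by its sum. The names `align_view`, `project_rows`, `step`, and `n_iter` are hypothetical.

```python
import numpy as np

def project_rows(S, eps=1e-12):
    """Clip negatives, then scale each row to sum to one (assumed L1 rule)."""
    S = np.maximum(S, 0.0)
    return S / (S.sum(axis=1, keepdims=True) + eps)

def align_view(S, Z, avail, lam=1.0, step=0.5, n_iter=20):
    """Pull one view's soft representation toward its consensus reconstruction.

    S: (n, k) soft representation; Z: (n, r) consensus structure space;
    avail: (n,) boolean availability of this omics for each sample.
    """
    a = avail.astype(float)[:, None]
    ZtAZ = Z.T @ (a * Z) + lam * np.eye(Z.shape[1])
    losses = []
    for _ in range(n_iter):
        W = np.linalg.solve(ZtAZ, Z.T @ (a * S))   # mask-aware ridge fit
        residual = a * (S - Z @ W)                 # zero for unavailable samples
        losses.append(float((residual ** 2).sum()))
        S = project_rows(S - step * residual)      # assumed update rule
    return S, losses

Z_toy = np.array([[1.0], [-1.0], [1.0], [-1.0]])
S_toy = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8], [0.3, 0.7]])
avail_toy = np.array([True, True, True, False])
S_new, losses = align_view(S_toy, Z_toy, avail_toy, lam=1.0, step=0.5, n_iter=1)
```

Because the residual is masked, rows of samples that lack this omics are never modified by the alignment, which is the point of the mask-aware formulation.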

[0111] Furthermore, steps 10 to 11 specifically include the following steps:

[0112] Step 10.1: For the $v$-th omics, compute the soft representation of the patient samples after availability masking:

[0113] $$\tilde{S}^{(v)} = A^{(v)} S^{(v)}$$

[0114] where $\tilde{S}^{(v)}$ denotes the soft representation of the $v$-th omics after sample availability masking.

[0115] Step 10.2: Use the truncated sum of singular values as the structural complexity of the omics:

[0116] $$c_v = \sum_{q=1}^{Q} \sigma_q\!\left( \tilde{S}^{(v)} \right)$$

[0117] where $c_v$ denotes the structural complexity of the $v$-th omics, $\sigma_q(\tilde{S}^{(v)})$ denotes the $q$-th singular value of the matrix $\tilde{S}^{(v)}$, and $Q$ denotes the number of retained (truncated) singular values.

[0118] Step 10.3: Standardize the structural complexity of all omics and convert it into an intermediate score of view reliability:

[0119]

[0120] where $z_v$ denotes the standardized intermediate view reliability score of the $v$-th omics, $\mathrm{mean}(\cdot)$ denotes the mean function, and $\mathrm{std}(\cdot)$ denotes the standard deviation function.

[0121] Step 10.4: Shift the intermediate score of view reliability to a non-negative score:

[0122]

[0123] where $r_v$ denotes the non-negative score of the $v$-th omics, $z_v$ denotes the standardized intermediate view reliability score of the $v$-th omics, and $V$ denotes the total number of omics.

[0124] Step 10.5: Update the view weights using power normalization to obtain the global view weights:

[0125] $$w_v = \frac{r_v^{\gamma}}{\sum_{u=1}^{V} r_u^{\gamma}}$$

[0126] where $\gamma$ denotes the view weight adjustment parameter. Views with lower structural complexity and greater structural stability receive higher reliability scores and thus greater weight in the final fusion.
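The weighting of Steps 10.1 to 10.5 can be sketched as follows. The standardization and shift formulas are missing from the text, so the sketch makes two labeled assumptions: the standardized complexity is negated so that lower complexity yields a higher score (as the closing sentence of Step 10.5 requires), and the shift subtracts the minimum score and adds a small constant. The names `view_weights`, `q`, and `gamma` are hypothetical.

```python
import numpy as np

def view_weights(soft_reps, avail, q=3, gamma=1.0, eps=1e-8):
    """Global view weights from truncated singular-value sums.

    soft_reps: list of (n, k_v) soft representation matrices, one per omics.
    avail: (n, V) view availability mask (boolean or 0/1).
    """
    complexity = []
    for v, S in enumerate(soft_reps):
        masked = S * avail[:, v].astype(float)[:, None]   # zero unavailable rows
        sv = np.linalg.svd(masked, compute_uv=False)       # descending order
        complexity.append(sv[:q].sum())
    c = np.array(complexity)
    # Assumption: negate so that low complexity maps to a high score.
    z = -(c - c.mean()) / (c.std() + eps)
    # Assumption: shift by the minimum so that all scores are positive.
    score = z - z.min() + eps
    w = score ** gamma
    return w / w.sum(), c

S_low = np.full((4, 2), 0.5)                               # rank one
S_high = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
avail_all = np.ones((4, 2), dtype=bool)
w_demo, c_demo = view_weights([S_low, S_high], avail_all)
```

A side effect of the assumed minimum shift is that the least reliable view receives a weight close to zero; a different offset would soften this, and the choice here is illustrative only.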

[0127] Step 11.1: For the first For each patient sample, define its available omics set:

[0128]

[0129] in, Indicates the first A set of available omics indexes for each patient sample.

[0130] Step 11.2: Based on the global view weights $w_v$ and the sample-level view availability mask $m_{iv}$, construct the sample-level normalized view weights:

[0131] $$w_{iv} = \frac{m_{iv}\, w_v}{\sum_{u=1}^{V} m_{iu}\, w_u}$$

[0132] where $w_{iv}$ denotes the sample-level normalized view weight of the $v$-th omics for the $i$-th patient sample, and $V$ denotes the total number of omics.

[0133] Step 11.3: Weight the updated soft representation of each omics by the sample-level normalized view weights and concatenate the results to obtain the fused representation of the $i$-th patient sample:

[0134] $$f_i = \big[\, w_{i1} z_i^{(1)},\; w_{i2} z_i^{(2)},\; \dots,\; w_{iV} z_i^{(V)} \,\big]$$

[0135] where $z_i^{(v)}$ denotes the updated soft representation vector of the $i$-th patient sample in the $v$-th omics, and $w_{iv} z_i^{(v)}$ denotes that vector after weighting by the sample-level normalized view weight $w_{iv}$.

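[0136] Step 11.4: Row-normalize each fused representation $f_i$ to obtain $\bar{f}_i$, and stack the fused representations of all $n$ patient samples row by row to obtain the final fused embedding matrix:

[0137] $$F = \big[\, \bar{f}_1;\; \bar{f}_2;\; \dots;\; \bar{f}_n \,\big]$$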

[0138] An incomplete multi-omics cancer subtype identification system includes: a data acquisition and preprocessing module, a mask construction module, a granular feature module construction module, a module-level skeleton extraction module, an entry-level missing item recovery module, a multi-expert skeleton recovery integration module, a module-centric feature representation construction module, a cross-omics consensus structure space construction module, a mask-aware skeleton consensus alignment module, a structure reliability adaptive view weighting module, a sample-level mask-aware fusion module, and a cancer subtype output module; each of the above modules has the function of implementing steps 1-12 above.

[0139] An electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: when the processor executes the computer program, it implements the steps of the incomplete multi-omics cancer subtype identification method described above.

[0140] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the incomplete multi-omics cancer subtype identification method described above.

[0141] Compared with the prior art, the present invention has the following significant advantages:

[0142] (1) This invention addresses the coexistence, in real clinical multi-omics data, of entry-level missing values within an omics and the absence of entire omics for some samples, and proposes a structure-first incomplete multi-omics cancer subtype identification method. By constructing feature modules within each omics through granular-sphere partitioning and extracting module-level skeleton representations, high-dimensional omics features are organized into structurally homogeneous low-dimensional skeleton representations. This prioritizes maintaining the internal structural consistency of the data under missing and noisy conditions and improves the stability and reliability of the cancer subtype identification results.

[0143] (2) This invention uses the standardized skeleton coordinates of each feature module as structural explanatory variables for entry-level missing value recovery, and combines them with a pseudo-missing validation set to construct the multi-expert skeleton recovery ensemble result. The recovery process is therefore constrained by the module structure, which avoids the spurious correlations and structural bias that traditional numerical imputation methods tend to introduce, and improves both the recovery quality of incomplete omics data and the subsequent clustering performance.

[0144] (3) This invention constructs a cross-omics consensus structure space and combines view availability masking, adaptive view weighting based on structural reliability, and sample-level mask-aware fusion to achieve unbiased multi-omics integration of samples that are missing entire omics. The resulting cancer subtypes show statistical significance, biological consistency, and clinical application value in survival analysis, clinical label enrichment, and molecular expression pattern analysis.

Attached Figure Description

[0145] Figure 1 is an overall flowchart of the incomplete multi-omics cancer subtype identification method of this invention.

[0146] Figure 2 is a schematic diagram of the structure of the cancer subtype identification algorithm of this invention.

[0147] Figure 3 is a heatmap of PAM50 gene expression obtained on the BRCA dataset by this invention.

[0148] Figure 4 shows the Kaplan-Meier survival curves of the different cancer subtypes obtained on the BRCA dataset by this invention.

Detailed Implementation

[0149] As shown in Figure 1 and Figure 2, an incomplete multi-omics cancer subtype identification method includes the following steps:

[0150] Step 1: Acquisition and preprocessing of incomplete multi-omics data: Acquire multi-omics data from the same batch of patient samples to form a matrix set of incomplete multi-omics data, and perform patient sample ID alignment, abnormal sample removal, missing marker identification, column-level normalization, and scale unification on each omics data to obtain a matrix set of preprocessed incomplete multi-omics data.

[0151] Step 2: Construct the entry-level observation mask matrix and the view availability mask matrix: An entry-level observation mask matrix is constructed for each omics and is used to determine the set of observed samples and the missing positions in the subsequent entry-level missing value recovery. An entry-level missing value is a value that is missing for some patient samples on certain features within an omics; its position is determined by the entry-level observation mask matrix. A view availability mask matrix is also constructed to mark whether each omics is available for each patient sample, and is used in the subsequent mask-aware skeleton consensus alignment and sample-level mask-aware fusion.

[0152] Step 3: Construct feature modules within each omics based on granular-sphere partitioning: For each omics, temporarily fill the missing entries with the median of the observed values of each feature column to obtain a temporary completion matrix without missing values; treat each feature as a vector along the sample dimension, construct its feature embedding vector, and perform recursive granular-sphere partitioning based on the directional consistency between features to obtain the feature modules within the omics.

[0153] Step 4: Extract module-level skeleton representation: For each feature module in each omics, extract sample-level skeleton variables within the feature module to obtain the module-level skeleton representation. The module-level skeleton representation includes the module mean skeleton, the module's first principal component skeleton, and the standardized skeleton coordinates.

[0154] Step 5: Entry-level missing value recovery based on the skeleton structure: For each feature module, determine the set of observed samples and the entry-level missing positions of each feature in the module according to the entry-level observation mask matrix constructed in Step 2. Using the standardized skeleton coordinates of the feature module as structural explanatory variables, establish a structural regression design matrix for the features in the module and recover the values at the entry-level missing positions to obtain the candidate recovery matrix.

[0155] Step 6: Construct the multi-expert skeleton recovery ensemble result: Construct multiple skeleton restoration experts, each of which independently performs granular-sphere partitioning, skeleton induction, and structure-constrained recovery to obtain a candidate completion matrix; evaluate the recovery errors of the experts on the pseudo-missing validation set, and compute a weighted ensemble of the candidate completion matrices according to those errors to obtain the final completion matrix of each omics.

[0156] Step 7: Construct the central feature matrix of the feature module: For each feature module, calculate the centroid of the feature module in the feature embedding space, and select the feature closest to the centroid of the feature module as the central feature of the feature module; according to the index of the central feature, extract the corresponding central feature from the final completion matrix of each omics obtained in Step 6 to obtain the central feature matrix.

[0157] Step 8: Construct the cross-omics consensus structure space: Concatenate the module-level skeleton representations of all omics to construct the cross-omics original structure matrix, and perform dimensionality reduction on the cross-omics original structure matrix to obtain the cross-omics consensus structure space.

[0158] Step 9: Perform mask-aware skeleton consensus alignment: Initialize the view soft representation matrix based on the central feature matrix of each omics, construct the sample-level availability diagonal matrix according to the view availability mask matrix, reconstruct the view soft representation matrix of each omics to the cross-omics consensus structure space under mask-aware conditions, and update the soft representation of each omics by minimizing the structure alignment loss.

[0159] Step 10: Adaptive view weighting based on structural reliability: Calculate the structural complexity of each omics and convert the structural complexity into an intermediate score of view reliability. Update the view weights based on the intermediate score of view reliability to obtain the global view weights.

[0160] Step 11: Perform sample-level mask-aware fusion: For each patient sample, the global view weights are renormalized at the sample level according to the available omics set, and the updated soft representations of different omics are weighted and concatenated to obtain the final fusion embedding matrix.

[0161] Step 12: Output cancer subtype clustering results: Perform clustering on the final fusion embedding matrix to obtain cancer subtype labels, and perform survival analysis, clinical label enrichment test and visualization analysis based on the cancer subtype labels.

[0162] Furthermore, the specific steps for preprocessing the incomplete multi-omics data in step 1 are as follows:

[0163] Step 1.1, Patient Sample ID Alignment and Abnormal Sample Removal: Align each omics data according to the patient sample ID, and retain only the patient samples that can be matched at the sample level.

[0164] Step 1.2, Missing Item Identification: Identify missing items in each omics and mark them as missing values.

[0165] Step 1.3, Column-level normalization and scale unification: For non-negative feature columns with a scale greater than a preset threshold, perform a logarithmic transformation:

[0166]

[0167] Calculate the mean and standard deviation of each feature column based on the observed values, and perform column-level standardization:

[0168] $$x_{ij}^{(v)} = \frac{\ell_{ij}^{(v)} - \mu_j^{(v)}}{\sigma_j^{(v)} + \epsilon}$$

[0169] where $r_{ij}^{(v)}$ denotes the original observed value of the $j$-th feature of the $i$-th patient sample in the $v$-th omics, $\ell_{ij}^{(v)}$ denotes the feature value after the logarithmic transformation (equal to $r_{ij}^{(v)}$ for columns that are not transformed), $x_{ij}^{(v)}$ denotes the standardized feature value, $\mu_j^{(v)}$ and $\sigma_j^{(v)}$ denote the mean and standard deviation of the $j$-th feature of the $v$-th omics over the observed samples, and $\epsilon$ denotes a very small positive number that prevents division by zero.
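The preprocessing of Step 1.3 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess_omics`, the use of `NaN` to mark missing entries, the choice of `log1p` as the logarithmic transformation, the interpretation of "scale" as the observed column maximum, and the default threshold are all assumptions introduced here.

```python
import numpy as np

def preprocess_omics(X, scale_threshold=50.0, eps=1e-8):
    """Log-transform large-scale non-negative columns, then standardize each
    column using its observed (non-NaN) entries only.

    Returns the standardized matrix (NaN kept at missing entries) and the
    entry-level observation mask (True = observed)."""
    X = np.asarray(X, dtype=float).copy()
    observed = ~np.isnan(X)
    for j in range(X.shape[1]):
        rows = observed[:, j]
        col = X[rows, j]
        if col.size == 0:
            continue  # fully missing column: nothing to transform
        # Assumption: "scale" means the observed maximum; log1p is one common choice.
        if col.min() >= 0 and col.max() > scale_threshold:
            col = np.log1p(col)
        # Mean and standard deviation are computed on observed values only.
        X[rows, j] = (col - col.mean()) / (col.std() + eps)
    return X, observed

# Toy example with two missing entries.
X_raw = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 800.0], [np.nan, 50.0]])
X_std, entry_mask = preprocess_omics(X_raw)
```

Missing entries stay missing after this step, so the returned mask can serve directly as the entry-level observation mask of Step 2.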

[0170] Furthermore, step 3 specifically includes the following steps:

[0171] Step 3.1: For each omics, temporarily fill in the missing entries. For the $j$-th feature in the $v$-th omics, denote its vector along the sample dimension as the sample vector:

[0172] $$\tilde{x}_j^{(v)} = \tilde{X}^{(v)}_{:,j} \in \mathbb{R}^{n}$$

[0173] where the temporary completion matrix $\tilde{X}^{(v)}$ denotes the matrix obtained for the $v$-th omics, in the granular-sphere partitioning stage, by temporarily imputing missing entries with the median of the observed values; it is used only for feature embedding construction and granular-sphere partitioning and is not regarded as the final missing value recovery result. $\tilde{X}^{(v)}_{:,j}$ denotes the $j$-th feature column of the temporary completion matrix of the $v$-th omics, $\tilde{x}_j^{(v)}$ denotes the sample vector of the $j$-th feature in the $v$-th omics, and $n$ denotes the number of patient samples.

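[0174] Step 3.2: Center the sample vector and apply $\ell_2$ normalization to obtain the feature embedding vector:

[0175] $$e_j^{(v)} = \frac{\tilde{x}_j^{(v)} - \bar{x}_j^{(v)} \mathbf{1}_n}{\big\| \tilde{x}_j^{(v)} - \bar{x}_j^{(v)} \mathbf{1}_n \big\|_2}$$

[0176] where $e_j^{(v)}$ denotes the normalized feature embedding vector of the $j$-th feature in the $v$-th omics, $\tilde{x}_j^{(v)}$ denotes the sample vector of that feature, $\bar{x}_j^{(v)}$ denotes the mean of $\tilde{x}_j^{(v)}$ across the sample dimension, $\mathbf{1}_n$ denotes the all-ones vector of length $n$, and $\|\cdot\|_2$ denotes the $\ell_2$ norm of a vector.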
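Steps 3.1 and 3.2 can be illustrated with the short Python sketch below. The function name `feature_embeddings`, the `NaN` encoding of missing entries, and the small constant `eps` guarding against constant columns are assumptions made for this illustration.

```python
import numpy as np

def feature_embeddings(X, eps=1e-12):
    """Median-impute NaNs column-wise (temporary completion), then centre each
    feature column over the samples and scale it to unit L2 norm.

    Returns (X_tmp, E), where row j of E is the embedding of feature j."""
    X = np.asarray(X, dtype=float)
    med = np.nanmedian(X, axis=0)              # median of observed values per feature
    X_tmp = np.where(np.isnan(X), med, X)      # temporary completion matrix
    centred = X_tmp - X_tmp.mean(axis=0)
    norms = np.linalg.norm(centred, axis=0)
    E = (centred / (norms + eps)).T            # shape: (n_features, n_samples)
    return X_tmp, E

# Toy example: 3 patient samples, 2 features, one missing entry.
X_demo = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 9.0]])
X_tmp, E = feature_embeddings(X_demo)
```

Because every embedding is centred and has unit norm, the inner product of two embeddings equals the Pearson correlation of the two imputed feature columns, which is what makes the directional consistency measure of the following steps meaningful.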

[0177] Step 3.3: For any candidate feature set $G \subseteq \mathcal{F}^{(v)}$ to be partitioned, the internal mean cosine similarity is defined as:

[0178] $$\rho(G) = \frac{2}{m(m-1)} \sum_{\substack{p, q \in G \\ p < q}} \big(e_p^{(v)}\big)^{\top} e_q^{(v)}$$

[0179] where $m = |G|$ denotes the number of features in the candidate feature set, $\rho(G)$ reflects the consistency of the feature directions within the set, $d_v$ denotes the feature dimension of the $v$-th omics, $p$ and $q$ denote different features in the candidate feature set $G$, $e_p^{(v)}$ and $e_q^{(v)}$ denote the normalized feature embedding vectors of the $p$-th and $q$-th features in the $v$-th omics, and $\mathcal{F}^{(v)} = \{1, \dots, d_v\}$ denotes the complete feature set of the $v$-th omics.

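[0180] Step 3.4: To avoid the cost of explicitly computing all pairwise cosine similarities, let:

[0181] $$u_G = \sum_{p \in G} e_p^{(v)}$$

[0182] where $u_G$ denotes the sum of the feature embedding vectors in the candidate feature set $G$. Because every feature embedding vector has unit $\ell_2$ norm, $\|u_G\|_2^2 = m + 2\sum_{p<q} (e_p^{(v)})^{\top} e_q^{(v)}$ with $m = |G|$, so the mean cosine similarity $\rho(G)$ is equivalently calculated as:

[0183] $$\rho(G) = \frac{\|u_G\|_2^2 - m}{m(m-1)}$$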

[0184] Step 3.5: Define the average cosine distance of the candidate feature set $G$ as:

[0185] $$\delta(G) = 1 - \rho(G)$$

[0186] where $\rho(G)$ denotes the internal mean cosine similarity of $G$. The smaller $\delta(G)$ is, the more consistent the feature directions within the candidate feature set and the stronger its structural homogeneity.
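The equivalence used in Steps 3.3 to 3.5 can be checked numerically. The sketch below compares the explicit pairwise computation with the sum-vector shortcut on random unit-norm embeddings; the function names are hypothetical, and the shortcut is valid only under the assumption that every embedding row has unit L2 norm.

```python
import numpy as np

def mean_cosine_pairwise(E):
    """Mean cosine similarity over all distinct pairs, computed explicitly
    from the Gram matrix (quadratic in the number of features)."""
    m = E.shape[0]
    G = E @ E.T
    return (G.sum() - np.trace(G)) / (m * (m - 1))

def mean_cosine_fast(E):
    """Same quantity via the sum vector: (||u||^2 - m) / (m (m - 1)).
    Requires unit-norm rows; linear in the number of features."""
    m = E.shape[0]
    if m < 2:
        return 1.0  # a single feature is trivially consistent with itself
    u = E.sum(axis=0)
    return (u @ u - m) / (m * (m - 1))

def mean_cosine_distance(E):
    """Average cosine distance of a candidate feature set."""
    return 1.0 - mean_cosine_fast(E)

rng = np.random.default_rng(0)
E_demo = rng.normal(size=(6, 20))
E_demo /= np.linalg.norm(E_demo, axis=1, keepdims=True)   # unit-norm rows
```

The shortcut replaces an O(m²) pairwise computation with a single vector sum, which matters when a candidate set contains thousands of features.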

[0187] Step 3.6: When the candidate feature set $G$ does not yet meet the homogeneity requirement, perform principal component analysis (PCA) on the feature embedding submatrix corresponding to $G$ and divide $G$ into two subsets $G_L$ and $G_R$ along the median of the first principal component scores, where $G_L$ and $G_R$ denote the left and right sub-feature sets obtained from the partitioning.

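[0188] Step 3.7: Define the weighted average cosine distance of the two subsets as:

[0189] $$\delta_{\mathrm{split}}(G) = \frac{|G_L|\,\delta(G_L) + |G_R|\,\delta(G_R)}{|G_L| + |G_R|}$$ where $G_L$ and $G_R$ denote the left and right sub-feature sets of the candidate feature set $G$, $|\cdot|$ denotes the number of features in a set, and $\delta(\cdot)$ denotes the average cosine distance of a set.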

[0190] Step 3.8: Define the splitting improvement rate of the candidate feature set $G$ as:

[0191]

[0192] where the splitting improvement rate represents the degree of improvement in structural homogeneity obtained by the binary division of the candidate feature set $G$.

[0193] Step 3.9: When the splitting improvement rate of the current candidate feature set $G$ is below the stopping threshold $\tau$, continuing the binary splitting brings insufficient improvement in structural homogeneity, so splitting stops and $G$ is taken directly as a final feature module. Otherwise, the left and right sub-feature sets $G_L$ and $G_R$ obtained from the partitioning are partitioned recursively. Partitioning also stops when the number of generated feature sets reaches the preset upper limit. All feature sets obtained after the recursive granular-sphere partitioning of the $v$-th omics are output as its final feature modules. Here $\tau$ denotes the stopping threshold for granular-sphere division.
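The recursive partitioning of Steps 3.6 to 3.9 can be sketched as below. Several details are assumptions introduced for this illustration and are not confirmed by the text: the improvement rate is taken as the relative decrease in average cosine distance, a minimum subset size `min_size` is enforced, and the names `granular_sphere_partition`, `tau`, and `max_modules` are hypothetical.

```python
import numpy as np

def mean_cos_dist(E):
    """Average cosine distance of unit-norm rows, via the sum-vector shortcut."""
    m = E.shape[0]
    if m < 2:
        return 0.0
    u = E.sum(axis=0)
    return 1.0 - (u @ u - m) / (m * (m - 1))

def granular_sphere_partition(E, tau=0.05, max_modules=50, min_size=2, eps=1e-12):
    """E: (n_features, n_samples) unit-norm embeddings.
    Returns a list of index arrays, one per feature module."""
    pending = [np.arange(E.shape[0])]
    modules = []
    while pending:
        idx = pending.pop()
        # Stop if the set is too small, or if one more split would exceed the cap.
        if idx.size < 2 * min_size or len(modules) + len(pending) + 2 > max_modules:
            modules.append(idx)
            continue
        sub = E[idx] - E[idx].mean(axis=0)
        U, S, _ = np.linalg.svd(sub, full_matrices=False)
        score = U[:, 0] * S[0]                     # first principal component scores
        med = np.median(score)
        left, right = idx[score <= med], idx[score > med]
        if left.size < min_size or right.size < min_size:
            modules.append(idx)
            continue
        d_parent = mean_cos_dist(E[idx])
        d_split = (left.size * mean_cos_dist(E[left])
                   + right.size * mean_cos_dist(E[right])) / idx.size
        gain = (d_parent - d_split) / (d_parent + eps)   # assumed relative improvement
        if gain < tau:
            modules.append(idx)                    # insufficient improvement: keep as module
        else:
            pending.extend([left, right])          # recurse on both halves
    return modules

# Toy data: two groups of 10 features, each scattered around its own direction.
rng = np.random.default_rng(0)
base = rng.normal(size=(2, 30))
E_demo = np.vstack([base[0] + 0.1 * rng.normal(size=(10, 30)),
                    base[1] + 0.1 * rng.normal(size=(10, 30))])
E_demo -= E_demo.mean(axis=1, keepdims=True)
E_demo /= np.linalg.norm(E_demo, axis=1, keepdims=True)
modules_demo = granular_sphere_partition(E_demo)
```

An explicit work list is used instead of recursion so that the cap on the number of feature sets can be checked against the sets already finished and those still waiting to be processed.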

[0194] Furthermore, steps 4 to 6 specifically include the following steps:

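[0195] Step 4.1: For the $k$-th feature module $M_k^{(v)}$ in the $v$-th omics, compute the module mean skeleton from the corresponding module submatrix:

[0196] $$b_{ik}^{(v)} = \frac{1}{|M_k^{(v)}|} \sum_{j \in M_k^{(v)}} \tilde{x}_{ij}^{(v)}$$

[0197] where $b_{ik}^{(v)}$ denotes the module mean skeleton value of the $i$-th patient sample on the $k$-th feature module of the $v$-th omics, and $\tilde{x}_{ij}^{(v)}$ denotes the entry of the temporary completion matrix of the $v$-th omics for the $i$-th patient sample and the $j$-th feature.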

[0198] Step 4.2: Perform principal component analysis on the module submatrix and take the first principal component to obtain the module's first principal component skeleton:

[0199] $$p_k^{(v)} = \mathrm{PC}_1\big(\tilde{X}^{(v)}_{M_k}\big)$$

[0200] where $\mathrm{PC}_1(\cdot)$ denotes the operation of extracting the first principal component scores of a matrix, and $\tilde{X}^{(v)}_{M_k}$ denotes the module submatrix formed by the feature columns of the temporary completion matrix of the $v$-th omics that belong to the $k$-th feature module.

[0201] Step 4.3: Apply column-level standardization to the module's first principal component skeleton to obtain the standardized skeleton coordinates:

[0202] $$s_k^{(v)} = \mathrm{Std}\big(p_k^{(v)}\big)$$

[0203] where $\mathrm{Std}(\cdot)$ denotes the column-level standardization operation, $p_k^{(v)}$ denotes the module's first principal component skeleton, and $s_k^{(v)}$ denotes the standardized skeleton coordinates of the $k$-th feature module in the $v$-th omics.
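The three skeleton variables of Steps 4.1 to 4.3 can be computed for one feature module as in the following sketch. The function name `module_skeleton` is hypothetical, and the sign convention for the first principal component (oriented to correlate positively with the module mean) is an assumption added here to remove the usual PCA sign ambiguity.

```python
import numpy as np

def module_skeleton(X_tmp, module_idx, eps=1e-8):
    """X_tmp: (n_samples, n_features) temporary completion matrix.
    module_idx: column indices of one feature module.
    Returns (mean skeleton, PC1 skeleton, standardized coordinates),
    each a vector of length n_samples."""
    B = X_tmp[:, module_idx]                   # module submatrix
    mean_skel = B.mean(axis=1)                 # per-sample mean over the module's features
    Bc = B - B.mean(axis=0)                    # centre columns before PCA
    U, S, _ = np.linalg.svd(Bc, full_matrices=False)
    pc1 = U[:, 0] * S[0]                       # first principal component scores
    # Assumed sign convention: align PC1 with the mean skeleton.
    if np.dot(pc1, mean_skel - mean_skel.mean()) < 0:
        pc1 = -pc1
    coord = (pc1 - pc1.mean()) / (pc1.std() + eps)
    return mean_skel, pc1, coord

rng = np.random.default_rng(1)
X_tmp_demo = rng.normal(size=(8, 5))
mean_skel, pc1, coord = module_skeleton(X_tmp_demo, [0, 2, 3])
```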

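[0204] Step 5.1: Using the standardized skeleton coordinates as structural explanatory variables, construct a structural regression design matrix with an intercept term:

[0205] $$A_k^{(v)} = \big[\, \mathbf{1}_n,\; s_k^{(v)} \,\big]$$ where $\mathbf{1}_n$ denotes the all-ones column of length $n$ that provides the intercept term, and $s_k^{(v)}$ denotes the standardized skeleton coordinates of the $k$-th feature module in the $v$-th omics.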

[0206] Step 5.2: For any feature $j$ in the feature module $M_k^{(v)}$, let its response vector be:

[0207] $$y_j^{(v)} = X^{(v)}_{:,j}$$

[0208] where $X^{(v)}_{:,j}$ denotes the column vector of the values of the $j$-th feature of the $v$-th omics across all patient samples, and $y_j^{(v)}$ denotes the response vector obtained by taking that feature as the response. Define the set of observed samples of feature $j$ as:

[0209] $$\mathcal{O}_j^{(v)} = \{\, i \mid o_{ij}^{(v)} = 1 \,\}$$

[0210] where $o_{ij}^{(v)} = 1$ indicates that the value of the $j$-th feature of the $i$-th patient sample in the $v$-th omics is observed rather than missing.

[0211] Step 5.3: Fit the ridge regression parameters on the observed sample set $\mathcal{O}_j^{(v)}$:

[0212] $$\beta_j^{(v)} = \big( A_{\mathcal{O}}^{\top} A_{\mathcal{O}} + \lambda_{\mathrm{rec}} I \big)^{-1} A_{\mathcal{O}}^{\top}\, y_{j,\mathcal{O}}^{(v)}$$

[0213] where $A_{\mathcal{O}}$ denotes the submatrix of the structural regression design matrix obtained by extracting the rows corresponding to the observed sample set $\mathcal{O}_j^{(v)}$, $\lambda_{\mathrm{rec}}$ denotes the ridge regularization coefficient of the skeleton reconstruction stage, $I$ denotes the identity matrix, $y_{j,\mathcal{O}}^{(v)}$ denotes the values of the response vector on the observed sample set, and $\beta_j^{(v)}$ denotes the ridge regression parameters of the $j$-th feature in the $v$-th omics.

[0214] Step 5.4: Based on the entry-level observation mask matrix constructed in Step 2, determine the observed sample set and the entry-level missing positions of each feature within the feature module. For an entry-level missing position $(i, j)$, the missing value is recovered according to the following formula:

[0215] $$\hat{x}_{ij}^{(v)} = a_i^{\top} \beta_j^{(v)}$$

[0216] where $\hat{x}_{ij}^{(v)}$ denotes the recovered value of the $j$-th feature of the $i$-th patient sample in the $v$-th omics, $\beta_j^{(v)}$ denotes the ridge regression parameters of the $j$-th feature in the $v$-th omics, and $a_i^{\top}$ denotes the row of the structural regression design matrix corresponding to the $i$-th patient sample. The candidate recovery matrix is composed of the recovered values at all entry-level missing positions together with the original observed values, and is used in the subsequent multi-expert skeleton recovery ensemble.
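Steps 5.1 to 5.4 amount to a ridge regression of each feature on its module's skeleton coordinate, fitted on observed samples and evaluated at missing ones. The sketch below shows this for a single feature; the function name `recover_feature`, the default regularization value, and the choice to penalize the intercept together with the slope (a plain identity matrix in the penalty) are assumptions of this illustration.

```python
import numpy as np

def recover_feature(x, observed, coord, lam=1.0):
    """x: (n,) feature column (values at missing rows are ignored).
    observed: (n,) boolean entry-level mask for this feature.
    coord: (n,) standardized skeleton coordinates of the feature's module.
    Returns the completed column and the fitted ridge parameters."""
    A = np.column_stack([np.ones_like(coord), coord])    # design matrix with intercept
    A_obs, y_obs = A[observed], x[observed]
    # Closed-form ridge solution on the observed samples only.
    beta = np.linalg.solve(A_obs.T @ A_obs + lam * np.eye(2), A_obs.T @ y_obs)
    x_hat = x.copy()
    x_hat[~observed] = A[~observed] @ beta               # observed entries stay unchanged
    return x_hat, beta

# Toy check: a feature that is exactly linear in the skeleton coordinate.
coord_demo = np.linspace(-1.5, 1.5, 11)
x_true = 2.0 + 3.0 * coord_demo
obs_demo = np.ones(11, dtype=bool)
obs_demo[[3, 7]] = False
x_in = x_true.copy()
x_in[~obs_demo] = np.nan
x_rec, beta_demo = recover_feature(x_in, obs_demo, coord_demo, lam=1e-8)
```

With a near-zero penalty the toy feature is recovered almost exactly; larger values of `lam` shrink the fitted parameters and trade bias for stability when few samples are observed.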

[0217] Step 6.1: Hold out part of the coordinates of the originally observed positions to construct a pseudo-missing validation set, and build multiple skeleton restoration experts. Each expert performs granular-sphere partitioning, skeleton induction, and structure-constrained recovery using a different random perturbation or a different temporary completion initialization to obtain a candidate completion matrix:

[0218]

[0219] where $\hat{X}^{(v,e)}$ denotes the candidate completion matrix obtained by the $e$-th skeleton restoration expert for the $v$-th omics.

[0220] Step 6.2: Evaluate the recovery error of each skeleton restoration expert on the pseudo-missing validation set, and compute the ensemble weight of each expert from these errors:

[0221]

[0222] where $\pi_e^{(v)}$ denotes the ensemble weight of the $e$-th skeleton restoration expert for the $v$-th omics, $\varepsilon_e^{(v)}$ denotes the recovery error of the $e$-th skeleton restoration expert on the pseudo-missing validation set, and $\varepsilon_{e'}^{(v)}$ denotes the recovery error of the $e'$-th skeleton restoration expert on the same set.

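[0223] Step 6.3: Combine the candidate completion matrices of the skeleton restoration experts element-wise, weighted according to their recovery errors, to obtain the final completion matrix $\hat{X}^{(v)}$ of the $v$-th omics:

[0224] $$\hat{X}^{(v)} = \sum_{e} \pi_e^{(v)}\, \hat{X}^{(v,e)}$$ where $\pi_e^{(v)}$ denotes the ensemble weight and $\hat{X}^{(v,e)}$ the candidate completion matrix of the $e$-th skeleton restoration expert, and the sum runs over all experts.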
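The ensemble of Steps 6.1 to 6.3 can be sketched as follows. The exact weighting formula is not recoverable from the text, so this example assumes normalized inverse-error weights with root-mean-square error on the pseudo-missing entries; the function name `ensemble_completions` and all variable names are hypothetical.

```python
import numpy as np

def ensemble_completions(candidates, X_ref, val_mask, eps=1e-8):
    """candidates: list of (n, d) candidate completion matrices, one per expert.
    X_ref: (n, d) matrix holding the true values at the held-out positions.
    val_mask: (n, d) boolean mask of the pseudo-missing validation entries.
    Returns (fused matrix, expert weights, expert errors)."""
    errors = np.array([
        np.sqrt(np.mean((C[val_mask] - X_ref[val_mask]) ** 2)) for C in candidates
    ])
    inv = 1.0 / (errors + eps)          # assumption: weight inversely to validation error
    weights = inv / inv.sum()
    fused = np.tensordot(weights, np.stack(candidates), axes=1)   # element-wise weighted sum
    return fused, weights, errors

# Toy example: expert 1 is off by 0.1 everywhere, expert 2 by 1.0.
X_ref_demo = np.arange(12.0).reshape(3, 4)
val_demo = np.zeros((3, 4), dtype=bool)
val_demo[0, 1] = val_demo[2, 3] = True
cands = [X_ref_demo + 0.1, X_ref_demo + 1.0]
fused_demo, w_demo, err_demo = ensemble_completions(cands, X_ref_demo, val_demo)
```

Because the weights sum to one and every candidate keeps the original values at observed positions, the fused matrix also keeps them; only recovered entries are actually blended.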

[0225] Furthermore, steps 7 to 9 specifically include the following steps:

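[0226] Step 7.1: For the $k$-th feature module $M_k^{(v)}$ in the $v$-th omics, compute the centroid of all feature embedding vectors in the module:

[0227] $$g_k^{(v)} = \frac{1}{|M_k^{(v)}|} \sum_{j \in M_k^{(v)}} e_j^{(v)}$$

[0228] where $g_k^{(v)}$ denotes the centroid of the feature embedding vectors of the $k$-th feature module in the $v$-th omics, and $e_j^{(v)}$ denotes the feature embedding vector of the $j$-th feature.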

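[0229] Step 7.2: Calculate the cosine similarity between each feature embedding vector in the feature module and the module centroid, and select the feature with the highest cosine similarity as the central feature of the feature module:

[0230] $$j_k^{\star} = \arg\max_{j \in M_k^{(v)}} \cos\big(e_j^{(v)}, g_k^{(v)}\big)$$ where $j_k^{\star}$ denotes the index of the central feature of the $k$-th feature module $M_k^{(v)}$, $e_j^{(v)}$ denotes the feature embedding vector of the $j$-th feature, and $g_k^{(v)}$ denotes the module centroid.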

[0231] Step 7.3: Collect the indices of the central features of all feature modules into a central feature index vector, and extract the corresponding columns from the final completion matrix $\hat{X}^{(v)}$ to obtain the central feature matrix $C^{(v)}$.
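The central feature selection of Steps 7.1 to 7.3 can be sketched as below. The function names and the layout of the embedding array (one row per feature) are assumptions of this illustration.

```python
import numpy as np

def central_feature_index(E, module_idx, eps=1e-12):
    """E: (n_features, n_samples) feature embeddings.
    Returns the global index of the module feature whose embedding has the
    highest cosine similarity with the module centroid."""
    module_idx = np.asarray(module_idx)
    M = E[module_idx]
    centroid = M.mean(axis=0)
    cos = (M @ centroid) / (np.linalg.norm(M, axis=1) * np.linalg.norm(centroid) + eps)
    return int(module_idx[np.argmax(cos)])

def central_feature_matrix(X_final, E, modules):
    """Pick one central feature per module and extract its column from the
    final completion matrix X_final of shape (n_samples, n_features)."""
    idx = np.array([central_feature_index(E, m) for m in modules])
    return X_final[:, idx], idx

# Toy example: three features at 0, 30 and 60 degrees; the middle one is central.
ang = np.deg2rad([0.0, 30.0, 60.0])
E_demo = np.column_stack([np.cos(ang), np.sin(ang)])      # (3 features, 2 samples)
X_final_demo = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
C_demo, idx_demo = central_feature_matrix(X_final_demo, E_demo, [np.array([0, 1, 2])])
```

Selecting an actual feature rather than the centroid itself keeps every column of the central feature matrix tied to a measured feature of the omics.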

[0232] Step 8.1: Concatenate the module mean skeletons, the modules' first principal component skeletons, and the standardized skeleton coordinates of all omics column-wise to obtain the cross-omics original structure matrix $T$.

[0233] Step 8.2: Apply PCA dimensionality reduction to the cross-omics original structure matrix $T$ to obtain the low-dimensional cross-omics consensus structure space $H$:

[0234] $$H = \mathrm{PCA}_r(T) \in \mathbb{R}^{n \times r}$$

[0235] where $r$ denotes the dimension of the cross-omics consensus structure space and $n$ denotes the number of patient samples.
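Step 8 can be sketched with an SVD-based PCA as follows. The function name `consensus_space` is hypothetical, and the sketch assumes that every omics contributes a skeleton row for every patient sample; how rows of samples lacking an omics are filled before concatenation is not specified in this passage and is left out.

```python
import numpy as np

def consensus_space(skeleton_blocks, r):
    """skeleton_blocks: list of (n_samples, k_v) skeleton matrices, one per omics.
    Returns the (n_samples, r) PCA scores of the column-wise concatenation."""
    T = np.hstack(skeleton_blocks)             # cross-omics original structure matrix
    Tc = T - T.mean(axis=0)                    # centre columns before PCA
    U, sv, _ = np.linalg.svd(Tc, full_matrices=False)
    r = min(r, sv.size)
    return U[:, :r] * sv[:r]                   # scores on the first r components

rng = np.random.default_rng(4)
blocks_demo = [rng.normal(size=(10, 4)), rng.normal(size=(10, 3))]
H_demo = consensus_space(blocks_demo, r=3)
```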

[0236] Step 9.1: For the $v$-th omics, initialize the central feature matrix $C^{(v)}$ as a non-negative, row-normalized view soft representation matrix $Z^{(v)}$.

[0237] Step 9.2: Based on the view availability mask matrix, construct the sample-level availability diagonal matrix of the $v$-th omics:

[0238] $$D^{(v)} = \mathrm{diag}\big(m_{1v}, m_{2v}, \dots, m_{nv}\big)$$

[0239] where $m_{iv}$ denotes the sample-level view availability mask, indicating whether the $i$-th patient sample has the $v$-th omics, and $n$ denotes the number of patient samples.

[0240] Step 9.3: Under mask-aware conditions, compute the ridge structure reconstruction of the $v$-th omics in the cross-omics consensus structure space:

[0241]

[0242] where $R^{(v)}$ denotes the reconstruction result of the $v$-th omics in the consensus structure space, $\lambda_{\mathrm{align}}$ denotes the ridge regularization coefficient of the structure alignment stage, and $I$ denotes the identity matrix.

[0243] Step 9.4: Define the structure alignment loss to measure the difference between the view soft representation matrix and its reconstruction result in the cross-omics consensus structure space; construct an update direction based on the residual term corresponding to the structure alignment loss, and use this update direction to iteratively update the view soft representation matrix. The structure alignment loss is:

[0244]

[0245] The structural alignment residual gradient is:

[0246]

[0247] The soft representation is updated as follows:

[0248]

[0249] Step 9.5: Apply a non-negativity projection and a row-wise normalization projection to the updated soft representation matrix so that it satisfies the non-negativity constraint and the row normalization constraint.
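One possible reading of Steps 9.1 to 9.5 is sketched below. The formulas for the reconstruction, the loss, and the update are not recoverable from this text, so the sketch makes its own assumptions: the reconstruction is a ridge regression of the soft representation on the consensus space using available samples only, the loss is the squared masked residual, the update is a fixed-step move along the negative residual, and row normalization means rows summing to one. All names (`align_view`, `project_rows`, `step`, `n_iter`) are hypothetical.

```python
import numpy as np

def project_rows(Z, eps=1e-12):
    """Clip negatives to zero, then scale each row to sum to (about) one."""
    Z = np.maximum(Z, 0.0)
    return Z / (Z.sum(axis=1, keepdims=True) + eps)

def align_view(Z, H, avail, lam=1.0, step=0.5, n_iter=20):
    """Z: (n, k) initial soft representation of one omics.
    H: (n, r) cross-omics consensus structure space.
    avail: (n,) 0/1 availability of this omics per patient sample."""
    d = np.asarray(avail, dtype=float)[:, None]    # diagonal of the availability matrix
    Z = project_rows(Z)
    losses = []
    for _ in range(n_iter):
        # Masked ridge regression of Z on H (assumed form of the reconstruction).
        W = np.linalg.solve(H.T @ (d * H) + lam * np.eye(H.shape[1]), H.T @ (d * Z))
        R = H @ W
        resid = d * (Z - R)                        # zero for samples lacking this omics
        losses.append(float((resid ** 2).sum()))
        Z = project_rows(Z - step * resid)         # update, then project onto constraints
    return Z, losses

rng = np.random.default_rng(2)
Z0_demo = rng.random((12, 4))
H_demo = rng.normal(size=(12, 3))
avail_demo = np.array([1] * 9 + [0] * 3)
Z_aligned, loss_hist = align_view(Z0_demo, H_demo, avail_demo)
```

Under these assumptions, rows of samples that lack the omics receive no update and keep their projected initial values, which is the effect the availability mask is meant to have.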

[0250] Furthermore, steps 10 to 11 specifically include the following steps:

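[0251] Step 10.1: For the $v$-th omics, compute the soft representation of the patient samples after availability masking:

[0252] $$\tilde{Z}^{(v)} = D^{(v)} Z^{(v)}$$

[0253] where $\tilde{Z}^{(v)}$ denotes the soft representation of the $v$-th omics after sample availability masking, $Z^{(v)}$ denotes the updated view soft representation matrix of the $v$-th omics, and $D^{(v)}$ denotes the sample-level availability diagonal matrix of the $v$-th omics.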

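[0254] Step 10.2: Use the truncated sum of singular values as the structural complexity of each omics:

[0255] $$c_v = \sum_{t=1}^{\kappa} \sigma_t\big(\tilde{Z}^{(v)}\big)$$

[0256] where $c_v$ denotes the structural complexity of the $v$-th omics, $\sigma_t(\tilde{Z}^{(v)})$ denotes the $t$-th singular value of the matrix $\tilde{Z}^{(v)}$ (the availability-masked soft representation of the $v$-th omics), and $\kappa$ denotes the number of truncated singular values.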

[0257] Step 10.3: Standardize the structural complexity of all omics and convert it into an intermediate score for view reliability:

[0258]

[0259] where $q_v$ denotes the standardized intermediate view reliability score of the $v$-th omics, $\mathrm{mean}(\cdot)$ denotes the mean function, and $\mathrm{std}(\cdot)$ denotes the standard deviation function.

[0260] Step 10.4: Shift the intermediate score of view reliability to a non-negative score:

[0261]

[0262] where $a_v$ denotes the non-negative score of the $v$-th omics, $q_v$ denotes the standardized intermediate view reliability score of the $v$-th omics, and $V$ denotes the total number of omics.

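[0263] Step 10.5: Update the view weights by power normalization to obtain the global view weights:

[0264] $$w_v = \frac{a_v^{\gamma}}{\sum_{u=1}^{V} a_u^{\gamma}}$$

[0265] where $w_v$ denotes the global view weight of the $v$-th omics, $a_v$ denotes its non-negative score from Step 10.4, $V$ denotes the total number of omics, and $\gamma$ denotes the view weight adjustment parameter. Views with lower structural complexity and greater structural stability receive higher reliability scores and therefore greater weight in the final fusion.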
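The weighting of Steps 10.1 to 10.5 can be sketched as follows. The standardization and shift formulas are not recoverable from this text, so the sketch assumes a negated z-score (so that lower complexity gives a higher score) followed by subtraction of the minimum score plus a small constant. The names `global_view_weights`, `k`, and `gamma` are hypothetical.

```python
import numpy as np

def global_view_weights(Z_list, avail, k=3, gamma=1.0, eps=1e-8):
    """Z_list: list of (n, k_v) updated soft representations, one per omics.
    avail: (n, V) 0/1 view availability mask matrix.
    Returns (global view weights, structural complexities)."""
    complexity = []
    for v, Z in enumerate(Z_list):
        Zm = avail[:, [v]] * Z                          # zero rows of samples lacking omics v
        sv = np.linalg.svd(Zm, compute_uv=False)        # singular values, descending
        complexity.append(sv[:k].sum())                 # truncated sum of singular values
    c = np.asarray(complexity)
    q = -(c - c.mean()) / (c.std() + eps)               # assumed: lower complexity, higher score
    a = q - q.min() + eps                               # assumed shift to non-negative scores
    w = a ** gamma                                      # power normalization
    return w / w.sum(), c

# Toy example: three views whose complexity grows in the ratio 1 : 2 : 3.
rng = np.random.default_rng(3)
base_demo = rng.random((10, 4))
Z_views = [base_demo, 2 * base_demo, 3 * base_demo]
w_demo, c_demo = global_view_weights(Z_views, np.ones((10, 3)))
```

With this particular shift the least reliable view receives a weight close to zero; a larger additive constant would soften that effect, which is one of the design choices the sketch leaves open.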

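[0266] Step 11.1: For the $i$-th patient sample, define its set of available omics:

[0267] $$\mathcal{A}_i = \{\, v \mid m_{iv} = 1 \,\}$$

[0268] where $\mathcal{A}_i$ denotes the set of indices of the omics available for the $i$-th patient sample, and $m_{iv}$ denotes the sample-level view availability mask, equal to 1 when the $i$-th patient sample has the $v$-th omics and 0 otherwise.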

[0269] Step 11.2: Based on the global view weights $w_v$ and the sample-level view availability mask $m_{iv}$, construct the sample-level normalized view weights:

[0270] $$w_{iv} = \frac{m_{iv}\, w_v}{\sum_{u=1}^{V} m_{iu}\, w_u}$$

[0271] where $w_{iv}$ denotes the sample-level normalized view weight of the $v$-th omics for the $i$-th patient sample, and $V$ denotes the total number of omics.

[0272] Step 11.3: Weight the updated soft representation of each omics by the sample-level normalized view weights and concatenate the results to obtain the fused representation of the $i$-th patient sample:

[0273] $$f_i = \big[\, w_{i1} z_i^{(1)},\; w_{i2} z_i^{(2)},\; \dots,\; w_{iV} z_i^{(V)} \,\big]$$

[0274] where $z_i^{(v)}$ denotes the updated soft representation vector of the $i$-th patient sample in the $v$-th omics, and $w_{iv} z_i^{(v)}$ denotes that vector after weighting by the sample-level normalized view weight $w_{iv}$.

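[0275] Step 11.4: Row-normalize each fused representation $f_i$ to obtain $\bar{f}_i$, and stack the fused representations of all $n$ patient samples row by row to obtain the final fused embedding matrix:

[0276] $$F = \big[\, \bar{f}_1;\; \bar{f}_2;\; \dots;\; \bar{f}_n \,\big]$$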
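The sample-level fusion of Steps 11.1 to 11.4 can be sketched as follows. The function name `fuse_samples` is hypothetical, and the use of the L2 norm for the final row normalization is an assumption, since the text does not state which norm is used.

```python
import numpy as np

def fuse_samples(Z_list, avail, w, eps=1e-12):
    """Z_list: list of (n, k_v) updated soft representations, one per omics.
    avail: (n, V) 0/1 view availability mask matrix.
    w: (V,) global view weights.
    Returns the (n, sum k_v) fused embedding and the (n, V) sample-level weights."""
    masked_w = avail * w[None, :]                                  # drop unavailable views
    W = masked_w / (masked_w.sum(axis=1, keepdims=True) + eps)     # renormalize per sample
    F = np.hstack([W[:, [v]] * Z for v, Z in enumerate(Z_list)])   # weight, then concatenate
    F = F / (np.linalg.norm(F, axis=1, keepdims=True) + eps)       # assumed L2 row normalization
    return F, W

# Toy example: sample 0 has both omics, sample 1 only the first, sample 2 only the second.
Z_views = [np.ones((3, 2)), np.ones((3, 3))]
avail_demo = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
F_demo, W_demo = fuse_samples(Z_views, avail_demo, np.array([0.75, 0.25]))
```

For a sample with a single available omics the renormalized weight of that omics becomes one and the blocks of the missing omics are zero, so the missing view contributes nothing to the fused row.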

[0277] An incomplete multi-omics cancer subtype identification system includes: a data acquisition and preprocessing module, a mask construction module, a granular feature module construction module, a module-level skeleton extraction module, an entry-level missing item recovery module, a multi-expert skeleton recovery integration module, a module-centric feature representation construction module, a cross-omics consensus structure space construction module, a mask-aware skeleton consensus alignment module, a structure reliability adaptive view weighting module, a sample-level mask-aware fusion module, and a cancer subtype output module; each of the above modules has the function of implementing steps 1-12 above.

[0278] An electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: when the processor executes the computer program, it implements the steps of the incomplete multi-omics cancer subtype identification method described above.

[0279] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the incomplete multi-omics cancer subtype identification method described above.

[0280] Specific embodiments of the present invention are as follows:

[0281] This embodiment uses breast cancer (BRCA) multi-omics data as an example to specifically illustrate the incomplete multi-omics cancer subtype identification method described in this invention. The BRCA dataset used in this embodiment comes from the TCGA public database and contains three omics data from the same batch of patient samples: mRNA expression data, DNA methylation data, and miRNA expression data, along with patient clinical follow-up information and clinicopathological information. The BRCA dataset contains 624 patient samples, with an mRNA expression feature dimension of 20531, a DNA methylation feature dimension of 5000, and a miRNA expression feature dimension of 1046.

[0282] In the data preprocessing stage, the different omics data were first strictly aligned according to patient sample IDs, retaining only patient samples that corresponded at the sample level. Then, missing entries in each omics data were identified, and the missing locations were uniformly marked as missing values. For non-negative feature columns with large scales, logarithmic transformation was performed, and the mean and standard deviation of each feature column were calculated based on the observed values, followed by column-level standardization. Through the above process, a matrix set of preprocessed, incomplete multi-omics data was obtained for subsequent analysis. Subsequently, an entry-level observation mask matrix was constructed for each omics to determine the set of observed samples and missing locations during the entry-level missing data recovery process. Simultaneously, a view availability mask matrix was constructed to mark the availability status of each patient sample corresponding to each omics, and was used for subsequent mask-aware skeleton consensus alignment and sample-level mask-aware fusion. The sample size and feature dimension statistics of the BRCA cancer dataset are shown in Table 1.

[0283] Table 1: Overview of the BRCA Cancer Dataset

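[0284]
Omics | Number of patient samples | Feature dimension
mRNA expression | 624 | 20531
DNA methylation | 624 | 5000
miRNA expression | 624 | 1046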

[0285] Based on the above data input, this invention first performs granular-sphere skeleton induction, structure-constrained missing value recovery, and multi-expert skeleton recovery ensembling within each omics to obtain the final completion matrix, module-level skeleton representation, and central feature representation of each omics. The module-level skeleton representations of the different omics are then used jointly to construct the cross-omics consensus structure space. With this consensus structure space as a structural anchor, mask-aware skeleton consensus alignment is performed on the view soft representation matrices initialized from the central feature matrices of each omics. The intermediate view reliability score is then calculated from the structural complexity of the aligned representation of each omics, and the global view weights are updated. Sample-level mask-aware fusion is performed according to the omics actually available for each patient sample, yielding the fusion embedding matrix used for BRCA patient subtype identification. Finally, K-means clustering is performed on this fusion embedding matrix to divide the BRCA patients into 5 cancer subtypes, and the corresponding subtype labels are output.

[0286] To evaluate the effectiveness of this invention on the incomplete multi-omics cancer subtype identification task, this embodiment uses two evaluation indicators: the significance of survival differences and clinical label enrichment. First, the log-rank test is used to measure the overall survival differences between the predicted subtypes, and the result is reported as a significance index for which higher values indicate more significant survival stratification between subtypes. Second, clinical label enrichment analysis is conducted to verify the clinical interpretability of the subtype results: the chi-square test is used for discrete clinicopathological parameters, and the Kruskal-Wallis test is used for numerical clinical parameters. The number of clinical parameters significantly associated with the subtypes at the chosen significance level is counted as the number of enriched labels. These indicators verify the effectiveness and application value of the identified cancer molecular subtypes from the perspectives of both statistical significance and clinical interpretability.

[0287] This embodiment systematically compares the method of the present invention with a variety of mainstream multi-omics cancer subtype identification methods, and some experimental results are shown in Table 2.

[0288] Table 2: Experimental Results of the BRCA Cancer Dataset

[0289]

[0290] In terms of survival analysis, the survival-difference significance index obtained by the method of this invention on the BRCA dataset exceeds that of most comparative methods, indicating that the breast cancer subtypes identified by the method differ significantly in overall survival time and effectively reflect the stratification of patient prognostic risk. In terms of clinical label enrichment analysis, this invention identified five clinicopathological parameters significantly associated with the subtypes, demonstrating that the obtained cancer subtypes are not only statistically significant but also strongly correlated with clinical phenotypes.

[0291] Furthermore, to verify the biological consistency of the BRCA subtypes identified in this invention, this embodiment also introduces PAM50 molecular typing as an external reference standard. PAM50 is a classic molecular typing gene signature widely used in breast cancer research, which can classify breast cancer samples into molecular subtypes such as Basal-like, HER2-enriched, Luminal A, Luminal B, and Normal-like. To avoid the direct influence of known typing genes on the clustering results, this embodiment removes the gene features corresponding to PAM50 from the mRNA expression data during model training, so that they do not participate in the unsupervised clustering learning of this invention. After clustering is completed, the samples are then post-typed using the PAM50 gene expression pattern, and the correspondence between the clustering results of this invention and the PAM50 molecular subtypes is compared.

[0292] Experimental results show that even without using the PAM50 genes during training, the BRCA clustering results obtained by this invention still recover molecular subtype structures highly consistent with the classic PAM50 classification. For example, some clusters mainly correspond to Basal-like samples, some clusters mainly correspond to Normal-like samples, and Luminal A and Luminal B samples also show significant enrichment in different clusters. This indicates that this invention does not rely on known subtyping marker genes for direct classification, but can automatically capture the intrinsic molecular structure of breast cancer from other multi-omics features.
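The comparison between cluster labels and PAM50 subtypes described above amounts to a contingency table. The following is a minimal sketch of how such a table could be built; the function name `contingency` and the toy labels are hypothetical and are not data from this embodiment.

```python
import numpy as np

def contingency(cluster_labels, reference_labels):
    """Count samples for every (cluster, reference subtype) pair."""
    clusters, ci = np.unique(cluster_labels, return_inverse=True)
    refs, ri = np.unique(reference_labels, return_inverse=True)
    table = np.zeros((clusters.size, refs.size), dtype=int)
    np.add.at(table, (ci, ri), 1)           # accumulate one count per sample
    return clusters, refs, table

# Toy labels for illustration only (not experimental results).
cluster_labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
pam50_labels = np.array(["Basal-like", "Basal-like", "HER2-enriched",
                         "Luminal A", "Luminal A",
                         "Luminal B", "Luminal B", "Luminal A"])

clusters, refs, table = contingency(cluster_labels, pam50_labels)
dominant = refs[table.argmax(axis=1)]             # most frequent subtype per cluster
purity = table.max(axis=1) / table.sum(axis=1)    # share of that subtype per cluster
```

A cluster whose `purity` is close to 1 corresponds mainly to a single PAM50 subtype, which is the kind of correspondence the paragraph above reports qualitatively.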

[0293] Finally, as shown in Figure 3 and Figure 4, the system platform can visualize the multi-omics clustering and analysis results obtained by this invention, including the Kaplan-Meier survival curves corresponding to the cancer subtypes and PAM50 gene expression heatmaps. Through these visualizations, users can intuitively observe survival differences, clinical correlations, and differences in molecular expression characteristics among the cancer subtypes, which improves the understandability and practicality of the analysis results. In summary, this embodiment demonstrates that this invention can achieve stable, reliable, and biologically meaningful BRCA cancer subtype identification under incomplete multi-omics conditions.

Claims

1. An incomplete multi-omics cancer subtype identification method, characterized in that it comprises the following steps:

Step 1: Acquisition and preprocessing of incomplete multi-omics data: acquire multi-omics data of the same batch of patient samples to form a matrix set of incomplete multi-omics data, and perform patient sample ID alignment, abnormal sample removal, missing item identification, column-level normalization and scale unification on each omics data to obtain a matrix set of preprocessed incomplete multi-omics data;

Step 2: Construct the entry-level observation mask matrix and the view availability mask matrix: an entry-level observation mask matrix is constructed for each omics and is used to determine the set of observed samples and the missing positions in the subsequent entry-level missing value recovery process; entry-level missing values refer to values of some patient samples that are missing on certain features within a specific omics, and their positions are determined by the entry-level observation mask matrix; a view availability mask matrix is also constructed to mark whether each omics is available for each patient sample, and is used for the subsequent mask-aware skeleton consensus alignment and sample-level mask-aware fusion;

Step 3: Construct feature modules within each omics based on granular-sphere partitioning: for each omics, temporarily fill in missing entries with the median of the observed values of each feature column to obtain a temporary completion matrix without missing values; treat each feature as a vector along the sample dimension, construct feature embedding vectors, and recursively partition granular spheres based on the directional consistency between features to obtain the feature modules within the omics;

Step 4: Extract the module-level skeleton representation: for each feature module in each omics, extract sample-level skeleton variables within the feature module to obtain the module-level skeleton representation, which includes the module mean skeleton, the module first principal component skeleton, and the standardized skeleton coordinates;

Step 5: Entry-level missing value recovery based on the skeleton structure: for each feature module, determine the set of observed samples and the entry-level missing positions of each feature in the feature module according to the entry-level observation mask matrix constructed in Step 2; use the standardized skeleton coordinates of the feature module as structural explanatory variables, establish a structural regression design matrix for each feature in the feature module, and recover the values at the entry-level missing positions to obtain the candidate recovery matrix;

Step 6: Construct the multi-expert skeleton recovery ensemble result: construct multiple skeleton recovery experts, each of which independently performs the granular-sphere partitioning, skeleton induction, and structure-constrained recovery processes to obtain a candidate completion matrix; evaluate the recovery errors of the different skeleton recovery experts on a pseudo-missing validation set, and perform a weighted ensemble of the candidate completion matrices according to the recovery errors to obtain the final completion matrix of each omics;

Step 7: Construct the central feature matrix of the feature modules: for each feature module, calculate the centroid of the feature module in the feature embedding space, and select the feature closest to the centroid as the central feature of the feature module; according to the indices of the central features, extract the corresponding columns from the final completion matrix of each omics obtained in Step 6 to obtain the central feature matrix;

Step 8: Construct the cross-omics consensus structure space: concatenate the module-level skeleton representations of all omics to construct the cross-omics original structure matrix, and perform dimensionality reduction on the cross-omics original structure matrix to obtain the cross-omics consensus structure space;

Step 9: Perform mask-aware skeleton consensus alignment: initialize the view soft representation matrix based on the central feature matrix of each omics, construct the sample-level availability diagonal matrix according to the view availability mask matrix, reconstruct the view soft representation matrix of each omics in the cross-omics consensus structure space under mask-aware conditions, and update the soft representation of each omics by minimizing the structure alignment loss;

Step 10: Adaptive view weighting based on structural reliability: calculate the structural complexity of each omics and convert it into an intermediate view reliability score; update the view weights based on the intermediate view reliability scores to obtain the global view weights;

Step 11: Perform sample-level mask-aware fusion: for each patient sample, renormalize the global view weights at the sample level according to the set of available omics, then weight and concatenate the updated soft representations of the different omics to obtain the final fusion embedding matrix;

Step 12: Output cancer subtype clustering results: perform clustering on the final fusion embedding matrix to obtain cancer subtype labels, and perform survival analysis, clinical label enrichment testing and visualization analysis based on the cancer subtype labels.
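Steps 2 and 3 of claim 1 rely on two masks and a temporary median fill. The sketch below shows one way to build them with NumPy, assuming that missing entries are encoded as NaN and that a sample whose entire row is NaN in an omics is treated as lacking that omics; the function names `build_masks` and `median_fill` and both encoding conventions are assumptions, not part of the claims.

```python
import numpy as np

def build_masks(omics):
    """omics: list of (n, d_v) float arrays with NaN marking missing entries."""
    # Entry-level observation mask: True where a value was actually observed.
    entry_masks = [~np.isnan(X) for X in omics]
    # View availability mask: 1 if the sample has at least one observed entry
    # in that omics, 0 otherwise (assumed convention).
    view_avail = np.stack([m.any(axis=1) for m in entry_masks], axis=1).astype(float)
    return entry_masks, view_avail

def median_fill(X, mask):
    """Temporary fill with the per-column median of observed values.
    Assumes every column has at least one observed value."""
    med = np.nanmedian(X, axis=0)
    return np.where(mask, X, med)

X1 = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, np.nan]])
X2 = np.array([[1.0], [np.nan], [2.0]])
entry_masks, view_avail = build_masks([X1, X2])
X1_tmp = median_fill(X1, entry_masks[0])
```

Here `X1_tmp` is only the temporary matrix used for feature embedding and partitioning; it is not the final recovery result.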

2. The incomplete multi-omics cancer subtype identification method according to claim 1, characterized in that the specific steps for preprocessing the incomplete multi-omics data in Step 1 are as follows:

Step 1.1: Patient sample ID alignment and abnormal sample removal: align the data of each omics according to the patient sample ID, and retain only the patient samples that can be matched at the sample level;

Step 1.2: Missing item identification: identify the missing items in each omics and uniformly mark them as missing values;

Step 1.3: Column-level normalization and scale unification: for non-negative feature columns with a scale greater than a preset threshold, apply a logarithmic transformation to the original observation $x^{(v)}_{ij}$ to obtain $\tilde{x}^{(v)}_{ij}$; then calculate the mean and standard deviation of each feature column based on the observed values, and perform column-level standardization:

$$z^{(v)}_{ij} = \frac{\tilde{x}^{(v)}_{ij} - \mu^{(v)}_{j}}{\sigma^{(v)}_{j} + \varepsilon}$$

where $x^{(v)}_{ij}$ denotes the original observation of the $j$-th feature of the $i$-th patient sample in the $v$-th omics; $\tilde{x}^{(v)}_{ij}$ denotes the feature value after logarithmic transformation; $z^{(v)}_{ij}$ denotes the standardized feature value; $\mu^{(v)}_{j}$ denotes the mean of the $j$-th feature of the $v$-th omics over the observed samples; $\sigma^{(v)}_{j}$ denotes the standard deviation of the $j$-th feature of the $v$-th omics over the observed samples; and $\varepsilon$ denotes a very small positive number that protects against division by zero.
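Step 1.3 can be sketched as follows. The claim does not state the form of the logarithm or how the "scale" of a column is measured, so the use of `np.log1p`, the column maximum as the scale, and the names `preprocess_columns` and `scale_threshold` are assumptions made for illustration only.

```python
import numpy as np

def preprocess_columns(X, scale_threshold=50.0, eps=1e-8):
    """Log-transform large non-negative columns, then z-score each column
    using observed (non-NaN) values only. NaN entries stay NaN."""
    X = X.astype(float).copy()
    col_min = np.nanmin(X, axis=0)
    col_max = np.nanmax(X, axis=0)
    # Assumption: a column qualifies if it is non-negative and its maximum
    # exceeds the preset threshold.
    use_log = (col_min >= 0) & (col_max > scale_threshold)
    X[:, use_log] = np.log1p(X[:, use_log])      # log(1 + x), assumed form
    mu = np.nanmean(X, axis=0)                   # mean over observed samples
    sd = np.nanstd(X, axis=0)                    # std over observed samples
    return (X - mu) / (sd + eps)                 # eps guards against division by zero

X = np.array([[0.0, 1.0],
              [100.0, 2.0],
              [np.nan, 3.0],
              [1000.0, np.nan]])
Z = preprocess_columns(X)
```

Because the statistics are computed with the NaN-aware functions, missing entries neither contribute to the column mean and standard deviation nor get overwritten.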

3. The incomplete multi-omics cancer subtype identification method according to claim 2, characterized in that Step 3 specifically includes the following steps:

Step 3.1: For each omics, temporarily fill in the missing entries. Let the $v$-th omics contain $d_v$ features, and denote the vector of its $j$-th feature along the sample dimension as the sample vector:

$$f^{(v)}_{j} = \bar{X}^{(v)}_{:,j} \in \mathbb{R}^{n}$$

where $\bar{X}^{(v)}$ denotes the temporary completion matrix of the $v$-th omics, i.e. the matrix obtained in the granular-sphere partitioning stage by temporarily imputing missing entries with the median of the observations; it is used only for feature embedding construction and granular-sphere partitioning and is not regarded as the final missing item recovery result; $\bar{X}^{(v)}_{:,j}$ denotes the $j$-th feature column of the temporary completion matrix of the $v$-th omics; $f^{(v)}_{j}$ denotes the sample vector corresponding to the $j$-th feature of the $v$-th omics; and $n$ denotes the number of patient samples;

Step 3.2: Center and $\ell_2$-normalize the feature vectors to obtain the feature embedding vectors:

$$e^{(v)}_{j} = \frac{f^{(v)}_{j} - \operatorname{mean}(f^{(v)}_{j})\,\mathbf{1}_n}{\left\lVert f^{(v)}_{j} - \operatorname{mean}(f^{(v)}_{j})\,\mathbf{1}_n \right\rVert_2}$$

where $e^{(v)}_{j}$ denotes the normalized feature embedding vector of the $j$-th feature of the $v$-th omics; $\operatorname{mean}(f^{(v)}_{j})$ denotes the mean of the sample vector across the sample dimension; and $\lVert\cdot\rVert_2$ denotes the $\ell_2$ norm of a vector;

Step 3.3: For any candidate feature set $S$ to be partitioned, the internal mean cosine similarity is defined as:

$$\rho(S) = \frac{1}{m(m-1)} \sum_{\substack{p, q \in S \\ p \neq q}} \left(e^{(v)}_{p}\right)^{\top} e^{(v)}_{q}$$

where $m = |S|$ denotes the number of features in the candidate feature set; $\rho(S)$ reflects the consistency of feature directions within the set; $d_v$ denotes the feature dimension of the $v$-th omics; $p$ and $q$ denote different features in the candidate feature set $S$; $e^{(v)}_{p}$ and $e^{(v)}_{q}$ denote the normalized feature embedding vectors of the $p$-th and $q$-th features of the $v$-th omics; and $S \subseteq \mathcal{F}^{(v)} = \{1, \dots, d_v\}$, where $\mathcal{F}^{(v)}$ denotes the complete set of features of the $v$-th omics;

Step 3.4: To avoid the computational cost of explicitly calculating all pairwise cosine similarities, let:

$$s_S = \sum_{p \in S} e^{(v)}_{p}$$

where $s_S$ is the sum of the feature embedding vectors in the candidate feature set; since every embedding vector has unit norm, the mean cosine similarity is equivalently calculated as:

$$\rho(S) = \frac{\lVert s_S \rVert_2^{2} - m}{m(m-1)}$$

Step 3.5: Define the average cosine distance of the candidate feature set $S$ as:

$$\delta(S) = 1 - \rho(S)$$

where a smaller $\delta(S)$ indicates more consistent feature directions within the candidate feature set and stronger structural homogeneity;

Step 3.6: When the candidate feature set $S$ does not yet meet the homogeneity requirement, perform principal component analysis (PCA) on the feature embedding submatrix corresponding to $S$, and divide $S$ into two subsets $S_L$ and $S_R$ along the median of the first principal component score, where $S_L$ and $S_R$ denote the left and right sub-feature sets obtained from the partitioning, respectively;

Step 3.7: Define the weighted average cosine distance of the subsets as:

$$\delta_{\mathrm{split}}(S) = \frac{|S_L|\,\delta(S_L) + |S_R|\,\delta(S_R)}{|S_L| + |S_R|}$$

Step 3.8: Define the splitting improvement rate of the candidate feature set $S$ as:

$$\eta(S) = \frac{\delta(S) - \delta_{\mathrm{split}}(S)}{\delta(S)}$$

where $\eta(S)$ represents the degree of improvement in structural homogeneity that binary division brings to the candidate feature set $S$;

Step 3.9: When the splitting improvement rate $\eta(S)$ falls below the stopping threshold $\tau$, the improvement in structural homogeneity from further binary splitting of the current candidate feature set is insufficient, so splitting stops and the current candidate feature set is taken directly as a final feature module; otherwise, recursive partitioning continues on the left and right sub-feature sets $S_L$ and $S_R$; when the number of generated feature sets reaches the preset upper limit $K_{\max}$, partitioning stops, and all feature sets obtained after recursive granular-sphere partitioning of the $v$-th omics are output as the final feature modules $\{G^{(v)}_{k}\}$; where $\tau$ denotes the stopping threshold for granular-sphere splitting.
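The recursive partitioning of claim 3 can be sketched in NumPy as below. The relative form of the improvement rate, the minimum set size `min_size`, the breadth-first order and all function names are assumptions for illustration; the sum-vector shortcut of Step 3.4 is used for the mean cosine distance and relies on every embedding row having unit norm.

```python
import numpy as np

def embed(X_tmp):
    """Center each feature over samples and scale to unit norm -> (d, n) rows."""
    F = X_tmp - X_tmp.mean(axis=0)
    return (F / (np.linalg.norm(F, axis=0) + 1e-12)).T

def mean_cos_distance(E):
    """1 - mean pairwise cosine similarity, via the sum vector (unit rows assumed)."""
    m = E.shape[0]
    if m < 2:
        return 0.0
    s = E.sum(axis=0)
    return 1.0 - (s @ s - m) / (m * (m - 1))

def split(E, idx):
    """Split a feature set at the median of its first principal component score."""
    sub = E[idx] - E[idx].mean(axis=0)
    _, _, vt = np.linalg.svd(sub, full_matrices=False)
    score = sub @ vt[0]
    left = score <= np.median(score)
    return idx[left], idx[~left]

def partition(E, tau=0.05, max_modules=20, min_size=4):
    modules, queue = [], [np.arange(E.shape[0])]
    while queue:
        idx = queue.pop(0)
        d = mean_cos_distance(E[idx])
        n_sets = len(modules) + len(queue) + 1       # feature sets generated so far
        if idx.size < min_size or d <= 0 or n_sets >= max_modules:
            modules.append(idx)
            continue
        L, R = split(E, idx)
        if L.size == 0 or R.size == 0:
            modules.append(idx)
            continue
        d_split = (L.size * mean_cos_distance(E[L])
                   + R.size * mean_cos_distance(E[R])) / idx.size
        if (d - d_split) / d < tau:                  # insufficient improvement: stop
            modules.append(idx)
        else:
            queue += [L, R]
    return modules

rng = np.random.default_rng(0)
a, b = rng.normal(size=50), rng.normal(size=50)
X_tmp = np.column_stack([a + 0.01 * rng.normal(size=50) for _ in range(3)]
                        + [b + 0.01 * rng.normal(size=50) for _ in range(3)])
E = embed(X_tmp)
modules = partition(E)
```

In this toy input the first three features follow one latent signal and the last three follow another, so the partition separates them into two modules.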

4. The incomplete multi-omics cancer subtype identification method according to claim 3, characterized in that Steps 4 to 6 specifically include the following steps:

Step 4.1: For the $k$-th feature module $G^{(v)}_{k}$ of the $v$-th omics, calculate the module mean skeleton from the corresponding module submatrix:

$$b^{(v)}_{ik} = \frac{1}{|G^{(v)}_{k}|} \sum_{j \in G^{(v)}_{k}} \bar{X}^{(v)}_{ij}$$

where $b^{(v)}_{ik}$ denotes the module mean skeleton value of the $i$-th patient sample on the $k$-th feature module of the $v$-th omics;

Step 4.2: Perform first principal component analysis on the module submatrix to obtain the module first principal component skeleton:

$$p^{(v)}_{k} = \mathrm{PC}_1\!\left(\bar{X}^{(v)}_{G_k}\right)$$

where $\mathrm{PC}_1(\cdot)$ denotes the operation of extracting the first principal component score of a matrix, and $\bar{X}^{(v)}_{G_k}$ denotes the module submatrix composed of the feature columns of the temporary completion matrix $\bar{X}^{(v)}$ of the $v$-th omics that belong to the $k$-th feature module $G^{(v)}_{k}$;

Step 4.3: Perform column-level standardization on the module first principal component skeleton to obtain the standardized skeleton coordinates:

$$u^{(v)}_{k} = \mathrm{Std}\!\left(p^{(v)}_{k}\right)$$

where $\mathrm{Std}(\cdot)$ denotes the column-level standardization operation, and $u^{(v)}_{k}$ denotes the standardized skeleton coordinates of the $k$-th feature module of the $v$-th omics;

Step 5.1: Using the standardized skeleton coordinates as structural explanatory variables, construct a structural regression design matrix with an intercept term:

$$A^{(v)}_{k} = \left[\mathbf{1}_n,\; u^{(v)}_{k}\right]$$

Step 5.2: For any feature $j$ in the feature module $G^{(v)}_{k}$, let its response vector be:

$$y^{(v)}_{j} = X^{(v)}_{:,j}$$

where $X^{(v)}_{:,j}$ denotes the column vector of values of the $j$-th feature of the $v$-th omics across all patient samples, and $y^{(v)}_{j}$ denotes the response vector obtained by taking it as the response; define the set of observed samples of feature $j$ as:

$$\Omega^{(v)}_{j} = \left\{\, i : M^{(v)}_{ij} = 1 \,\right\}$$

where $M^{(v)}_{ij} = 1$ indicates that the value of the $j$-th feature of the $i$-th patient sample in the $v$-th omics is observed rather than missing;

Step 5.3: Fit the ridge regression parameters on the observed sample set $\Omega^{(v)}_{j}$:

$$\hat{\beta}^{(v)}_{j} = \left(A_{\Omega}^{\top} A_{\Omega} + \lambda_r I\right)^{-1} A_{\Omega}^{\top} y_{\Omega}$$

where $A_{\Omega}$ denotes the submatrix obtained by extracting from the structural regression design matrix $A^{(v)}_{k}$ the rows corresponding to the observed sample set $\Omega^{(v)}_{j}$; $\lambda_r$ denotes the ridge regularization coefficient of the skeleton recovery stage; $I$ denotes the identity matrix; $y_{\Omega}$ denotes the values of the response vector $y^{(v)}_{j}$ on the observed sample set $\Omega^{(v)}_{j}$; and $\hat{\beta}^{(v)}_{j}$ denotes the ridge regression parameters corresponding to the $j$-th feature of the $v$-th omics;

Step 5.4: Based on the entry-level observation mask matrix constructed in Step 2, determine the observed sample set and the entry-level missing positions of each feature within the feature module; for an entry-level missing position $(i, j)$, the corresponding missing value is recovered as:

$$\hat{x}^{(v)}_{ij} = a_i^{\top} \hat{\beta}^{(v)}_{j}$$

where $\hat{x}^{(v)}_{ij}$ denotes the recovered value of the $j$-th feature of the $i$-th patient sample in the $v$-th omics; $\hat{\beta}^{(v)}_{j}$ denotes the ridge regression parameters corresponding to the $j$-th feature of the $v$-th omics; and $a_i^{\top}$ denotes the row vector of the structural regression design matrix $A^{(v)}_{k}$ corresponding to the $i$-th patient sample; the candidate recovery matrix is composed of the recovered values at all entry-level missing positions together with the original observations, and is used for the subsequent multi-expert skeleton recovery integration;

Step 6.1: Hold out a portion of the originally observed positions to construct a pseudo-missing validation set, and build multiple skeleton recovery experts; each skeleton recovery expert performs granular-sphere partitioning, skeleton induction, and structure-constrained recovery using a different random perturbation or a different temporary completion initialization, yielding the candidate completion matrices $\hat{X}^{(v,e)}$, where $\hat{X}^{(v,e)}$ denotes the candidate completion matrix obtained by the $e$-th skeleton recovery expert in the $v$-th omics;

Step 6.2: Evaluate the recovery errors of the different skeleton recovery experts on the pseudo-missing validation set, and calculate the ensemble weight $\omega^{(v)}_{e}$ of each skeleton recovery expert from these recovery errors, where $\omega^{(v)}_{e}$ denotes the ensemble weight of the $e$-th skeleton recovery expert in the $v$-th omics and $\ell^{(v)}_{e}$ denotes the recovery error of the $e$-th skeleton recovery expert in the $v$-th omics on the pseudo-missing validation set;

Step 6.3: Perform element-wise weighted integration of the candidate completion matrices of the skeleton recovery experts according to the ensemble weights to obtain the final completion matrix $\hat{X}^{(v)}$ of the $v$-th omics:

$$\hat{X}^{(v)} = \sum_{e} \omega^{(v)}_{e}\, \hat{X}^{(v,e)}$$

5. The incomplete multi-omics cancer subtype identification method according to claim 4, characterized in that Steps 7 to 9 specifically include the following steps:

Step 7.1: For the $k$-th feature module $G^{(v)}_{k}$ of the $v$-th omics, calculate the centroid of all feature embedding vectors in this feature module:

$$c^{(v)}_{k} = \frac{1}{|G^{(v)}_{k}|} \sum_{j \in G^{(v)}_{k}} e^{(v)}_{j}$$

where $c^{(v)}_{k}$ denotes the centroid of the feature embedding vectors of the $k$-th feature module of the $v$-th omics;

Step 7.2: Calculate the cosine similarity between each feature embedding vector and the centroid within the feature module, and select the feature with the highest cosine similarity as the central feature of the feature module;

Step 7.3: Compile the indices of the central features of all feature modules into a central feature index vector, and extract the corresponding columns from the final completion matrix $\hat{X}^{(v)}$ to obtain the central feature matrix $C^{(v)}$;

Step 8.1: Concatenate the module mean skeletons, module first principal component skeletons, and standardized skeleton coordinates of all omics column-wise to obtain the cross-omics original structure matrix $B$;

Step 8.2: Perform PCA dimensionality reduction on the cross-omics original structure matrix $B$ to obtain the low-dimensional cross-omics consensus structure space $Z \in \mathbb{R}^{n \times r}$, where $r$ denotes the dimension of the cross-omics consensus structure space;

Step 9.1: For the $v$-th omics, initialize the central feature matrix $C^{(v)}$ as a non-negative, row-normalized view soft representation matrix $H^{(v)}$;

Step 9.2: Construct the sample-level availability diagonal matrix of the $v$-th omics from the view availability mask matrix:

$$D^{(v)} = \operatorname{diag}\!\left(a_{1v}, a_{2v}, \dots, a_{nv}\right)$$

where $a_{iv}$ is the sample-level view availability mask, indicating whether the $i$-th patient sample has the $v$-th omics;

Step 9.3: Under mask-aware conditions, calculate the ridge structure reconstruction of the $v$-th omics in the cross-omics consensus structure space:

$$R^{(v)} = Z \left(Z^{\top} D^{(v)} Z + \lambda_a I\right)^{-1} Z^{\top} D^{(v)} H^{(v)}$$

where $R^{(v)}$ denotes the reconstruction result of the $v$-th omics in the consensus structure space; $\lambda_a$ denotes the ridge regularization coefficient of the structure alignment stage; and $I$ denotes the identity matrix;

Step 9.4: Define the structure alignment loss to measure the difference between the view soft representation matrix $H^{(v)}$ and its reconstruction result $R^{(v)}$ in the cross-omics consensus structure space; construct an update direction from the residual term corresponding to the structure alignment loss, and use this update direction to iteratively update the view soft representation matrix;

Step 9.5: Project the updated soft representation matrix onto the non-negative, row-normalized set so that it satisfies the non-negativity constraint and the row normalization constraint.
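Steps 4.2 to 5.4 of claim 4 can be illustrated with the following NumPy sketch of skeleton-based ridge recovery for a single feature module. The function names, the default `lam`, and the use of an SVD to obtain the first principal component score are assumptions; the ridge penalty is applied to the intercept as well, matching the identity-matrix form stated in Step 5.3.

```python
import numpy as np

def skeleton_coordinate(X_tmp_module):
    """Standardized first principal component score of a module submatrix."""
    centered = X_tmp_module - X_tmp_module.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pc1 = u[:, 0] * s[0]                         # first PC score per sample
    return (pc1 - pc1.mean()) / (pc1.std() + 1e-8)

def recover_module(X_module, mask, X_tmp_module, lam=1e-2):
    """Recover missing entries of one module by ridge regression on the skeleton.
    X_module: (n, m) data with NaN at missing entries; mask: True where observed;
    X_tmp_module: temporarily completed version used only to derive the skeleton."""
    n = X_module.shape[0]
    A = np.column_stack([np.ones(n), skeleton_coordinate(X_tmp_module)])
    out = X_module.copy()
    for j in range(X_module.shape[1]):
        obs = mask[:, j]
        if obs.all() or not obs.any():
            continue                             # nothing to recover, or nothing to fit
        A_o, y_o = A[obs], X_module[obs, j]
        beta = np.linalg.solve(A_o.T @ A_o + lam * np.eye(2), A_o.T @ y_o)
        out[~obs, j] = A[~obs] @ beta            # fill only the missing positions
    return out

rng = np.random.default_rng(1)
t = rng.normal(size=40)                          # one latent signal drives the module
X_true = t[:, None] * np.array([1.0, 2.0, -1.0, 0.5]) + np.array([0.0, 1.0, 2.0, 3.0])
X_obs = X_true.copy()
X_obs[0, 0] = np.nan
X_obs[5, 1] = np.nan
mask = ~np.isnan(X_obs)
X_tmp = np.where(mask, X_obs, np.nanmedian(X_obs, axis=0))
X_rec = recover_module(X_obs, mask, X_tmp)
```

Observed entries are copied through unchanged, so the candidate recovery matrix differs from the input only at the entry-level missing positions. A multi-expert ensemble as in Steps 6.1 to 6.3 would repeat this with different perturbations and average the outputs with error-dependent weights.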

6. The incomplete multi-omics cancer subtype identification method according to claim 5, characterized in that Steps 10 and 11 specifically include the following steps:

Step 10.1: For the $v$-th omics, compute the soft representation of the patient samples after availability masking:

$$\tilde{H}^{(v)} = D^{(v)} H^{(v)}$$

where $D^{(v)}$ denotes the sample-level availability diagonal matrix of the $v$-th omics, $H^{(v)}$ denotes its updated view soft representation matrix, and $\tilde{H}^{(v)}$ denotes the soft representation of the $v$-th omics after sample availability masking;

Step 10.2: Use the truncated sum of singular values as the structural complexity of the omics:

$$\kappa_v = \sum_{t=1}^{T} \sigma_t\!\left(\tilde{H}^{(v)}\right)$$

where $\kappa_v$ denotes the structural complexity of the $v$-th omics; $\sigma_t(\tilde{H}^{(v)})$ denotes the $t$-th singular value of the matrix $\tilde{H}^{(v)}$; and $T$ denotes the number of singular values retained;

Step 10.3: Standardize the structural complexity of all omics and convert it into the intermediate view reliability score $q_v$, where $q_v$ denotes the standardized intermediate view reliability score of the $v$-th omics, $\operatorname{mean}(\cdot)$ denotes the mean function, and $\operatorname{std}(\cdot)$ denotes the standard deviation function, both taken over all omics;

Step 10.4: Shift the intermediate view reliability scores to non-negative scores $s_v$, where $s_v$ denotes the non-negative score of the $v$-th omics, $q_v$ denotes the standardized intermediate view reliability score of the $v$-th omics, and $V$ denotes the total number of omics;

Step 10.5: Update the view weights using power normalization to obtain the global view weights:

$$w_v = \frac{s_v^{\gamma}}{\sum_{u=1}^{V} s_u^{\gamma}}$$

where $\gamma$ denotes the view weight adjustment parameter; the lower the structural complexity and the more stable the structure of a view, the higher its reliability score and the greater its weight in the final fusion;

Step 11.1: For the $i$-th patient sample, define its set of available omics:

$$\mathcal{A}_i = \left\{\, v : a_{iv} = 1 \,\right\}$$

where $\mathcal{A}_i$ denotes the set of indices of the omics available for the $i$-th patient sample;

Step 11.2: Based on the global view weights $w_v$ and the sample-level view availability mask $a_{iv}$, construct the sample-level normalized view weights:

$$w_{iv} = \frac{a_{iv}\, w_v}{\sum_{u \in \mathcal{A}_i} w_u}$$

where $w_{iv}$ denotes the sample-level normalized view weight of the $i$-th patient sample for the $v$-th omics;

Step 11.3: Weight the updated soft representations of each omics by the sample-level normalized view weights and concatenate them to obtain the fusion representation of the $i$-th patient sample:

$$g_i = \left[\, w_{i1} h^{(1)}_{i},\; w_{i2} h^{(2)}_{i},\; \dots,\; w_{iV} h^{(V)}_{i} \,\right]$$

where $h^{(v)}_{i}$ denotes the updated soft representation vector of the $i$-th patient sample in the $v$-th omics, and $w_{iv} h^{(v)}_{i}$ denotes the result of weighting it by the sample-level normalized view weight;

Step 11.4: Perform row normalization on $g_i$ to obtain $\bar{g}_i$, and stack the fusion representations of all patient samples row by row to obtain the final fusion embedding matrix:

$$F = \left[\, \bar{g}_1^{\top};\; \bar{g}_2^{\top};\; \dots;\; \bar{g}_n^{\top} \,\right]$$

7. An incomplete multi-omics cancer subtype identification system, characterized in that it comprises: a data acquisition and preprocessing module, a mask construction module, a granular-sphere feature module construction module, a module-level skeleton extraction module, an entry-level missing item recovery module, a multi-expert skeleton recovery integration module, a module central feature representation construction module, a cross-omics consensus structure space construction module, a mask-aware skeleton consensus alignment module, a structure reliability adaptive view weighting module, a sample-level mask-aware fusion module, and a cancer subtype output module; these modules respectively implement Steps 1 to 12 of the method according to any one of claims 1 to 6.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: When the processor executes the computer program, it implements the steps of the incomplete multi-omics cancer subtype identification method according to any one of claims 1 to 6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that: When the computer program is executed by the processor, it implements the steps of the incomplete multi-omics cancer subtype identification method according to any one of claims 1 to 6.