A machine learning-based multi-omics molecular subtyping classification method for NK / T-cell lymphoma

By using machine learning methods on multi-omics datasets, core classification features were selected and an XGBoost model was trained, which solved the problem of inaccurate molecular subtype classification of NK/T cell lymphoma and achieved highly accurate and stable subtype classification.

CN122245436A · Pending · Publication Date: 2026-06-19 · SUN YAT SEN UNIVERSITY CANCER CENTER (CANCER HOSPITAL AFFILIATED TO SUN YAT SEN UNIVERSITY CANCER RESEARCH INSTITUTE OF SUN YAT SEN UNIVERSITY) +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SUN YAT SEN UNIVERSITY CANCER CENTER (CANCER HOSPITAL AFFILIATED TO SUN YAT SEN UNIVERSITY CANCER RESEARCH INSTITUTE OF SUN YAT SEN UNIVERSITY)
Filing Date
2026-03-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies suffer from inaccurate selection of molecular subtype classification features for NK/T-cell lymphoma and poor robustness of classification models, resulting in inaccurate classification.

Method used

A machine learning-based multi-omics approach is applied to a multi-dimensional dataset of NK/T cell lymphoma samples. Core classification features are screened with an XGBoost classifier and pre-evaluated by linear discriminant analysis; an XGBoost classification model is then trained on these features, with hyperparameter optimization to improve classification accuracy.

Benefits of technology

This study achieved precise classification of four molecular subtypes of NK/T-cell lymphoma, improving the accuracy and robustness of classification and revealing the potential for subtype-specific treatment.



Abstract

This invention belongs to the field of bioinformatics technology and discloses a machine learning-based multi-omics molecular subtyping method for NK / T-cell lymphoma. The method includes: acquiring a multi-dimensional dataset of NK / T-cell lymphoma samples, which contains molecular feature data and corresponding four-category molecular subtype labels; iteratively partitioning the multi-dimensional dataset and training an XGBoost classifier using the partitioned datasets to obtain core classification features; the feature importance of the XGBoost classifier is represented by the average gain of the features across all partitions; using linear discriminant analysis to pre-evaluate the four-category discriminant potential of the core classification features; if the pre-evaluation result does not meet the target, the core classification features are reacquired; and based on the core classification features, training an XGBoost classification model using the multi-dimensional dataset to obtain a target classification model for NK / T-cell lymphoma molecular subtyping. This method enables accurate classification of NK / T-cell lymphoma four-category molecular subtypes.

Description

Technical Field

[0001] This invention belongs to the field of bioinformatics technology, specifically relating to a multi-omics molecular subtyping method for NK / T cell lymphoma based on machine learning.

Background Technology

[0002] NK / T-cell lymphoma is a highly heterogeneous malignant lymphoma, with significant differences in clinical prognosis and treatment response among patients with different molecular subtypes. Accurate molecular subtype classification is a crucial prerequisite for achieving individualized treatment of NK / T-cell lymphoma.

[0003] Currently, the classification of molecular subtypes of NK / T-cell lymphoma mainly relies on single-dimensional feature analysis such as genomics, transcriptomics, or proteomics. This results in problems such as inaccurate feature selection, poor robustness of classification models, and susceptibility to overfitting, leading to insufficient accuracy in the classification of the four molecular subtypes of NK / T-cell lymphoma.

Summary of the Invention

[0004] The purpose of this invention is to provide a multi-omics molecular subtyping classification method and system for NK / T cell lymphoma based on machine learning, which solves the problems of inaccurate selection of molecular subtype classification features and poor robustness of classification models in the prior art, and achieves accurate classification of four molecular subtypes of NK / T cell lymphoma.

[0005] The first aspect of this invention discloses a multi-omics molecular subtyping method for NK / T-cell lymphoma based on machine learning, comprising:

[0006] A multidimensional dataset of NK / T cell lymphoma samples is obtained. The multidimensional dataset contains molecular feature data of the samples and corresponding four-category molecular subtype labels. The molecular feature data includes at least one of proteomics expression data, transcriptomics data, and genomic variation data.

[0007] The multidimensional dataset is iteratively partitioned and an XGBoost classifier is trained using the partitioned dataset to obtain core classification features. The feature importance of the XGBoost classifier is represented by the average gain of the features across all partitions.

[0008] Linear discriminant analysis is used to pre-evaluate the four-class classification potential of the core classification features. If the pre-evaluation results do not meet the target, the core classification features are re-acquired.

[0009] Based on the core classification features, the XGBoost classification model is trained using the multi-dimensional dataset to obtain a target classification model for molecular subtyping of NK / T cell lymphoma.

[0010] In some implementations, when training an XGBoost classification model, the hyperparameters of the XGBoost classification model are optimized through grid search to maximize classification accuracy, and the optimal parameter configuration is determined by cross-validation on the training data.

[0011] In some implementations, the iterative partitioning of the multi-dimensional dataset and the training of an XGBoost classifier using the partitioned dataset to obtain core classification features includes:

[0012] The multidimensional dataset is iteratively divided into several training and validation sets using random seeds;

[0013] For each training set obtained from the partitioning, train the XGBoost classifier optimized for four-class classification;

[0014] In the XGBoost classifier obtained in each iteration of training, all features are ranked by importance. The importance ranking results of all iterations of training are aggregated to obtain a cumulative ranking. The cumulative ranking is sorted in ascending order, and the top preset number of features are selected as the core classification features.

[0015] In some implementations, the expression for calculating the average gain of the feature across all partitions is:

[0016] $$\overline{\mathrm{Gain}}(f) = \frac{1}{T}\sum_{t=1}^{T}\mathrm{Gain}_t(f),$$

[0017] where $T$ represents the number of decision trees in the XGBoost classifier, and $\mathrm{Gain}_t(f)$ represents the accuracy improvement brought about by feature $f$ when it is used to split in the $t$-th tree. $\mathrm{Gain}_t(f)$ is calculated as:

[0018] $$\mathrm{Gain}_t(f) = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma,$$

[0019] where $G_L$ represents the sum of gradients for the samples in the left subtree, $G_R$ represents the sum of gradients for the samples in the right subtree, $H_L$ is the sum of the second derivatives of the samples in the left subtree, $H_R$ is the sum of the second derivatives of the samples in the right subtree, $\lambda$ is the regularization parameter, and $\gamma$ is the minimum gain threshold for splitting the decision tree.

[0020] In some implementations, a confusion matrix is used, and each category is treated as a binary classification task against the other categories (one-vs-rest). Receiver operating characteristic curves are plotted and the area under the curve is calculated to evaluate the performance of the XGBoost classification model.

[0021] In some implementations, when acquiring multidimensional datasets of NK / T cell lymphoma samples, subtypes of NK / T cell lymphoma samples are independently identified from four omics detection platforms: somatic mutation, copy number variation, transcriptomics, and proteomics. The subtype identification results of each platform are converted into binary indicator matrices, consensus clustering is performed on the binary indicator matrices, and the optimal number of clusters is selected based on the relative increasing trend of the area under the cumulative distribution function to obtain the subtype classification of NK / T cell lymphoma.

[0022] The second aspect of this invention discloses a machine learning-based multi-omics molecular subtyping system for NK / T-cell lymphoma, comprising:

[0023] The dataset module is used to obtain a multidimensional dataset of NK / T cell lymphoma samples. The multidimensional dataset contains molecular feature data of the samples and corresponding four-category molecular subtype labels. The molecular feature data includes at least one of proteomics expression data, transcriptomics data, and genomic variation data.

[0024] The core classification feature module is used to iteratively partition the multi-dimensional dataset and train an XGBoost classifier using the partitioned dataset to obtain core classification features. The feature importance of the XGBoost classifier is represented by the average gain of the features across all partitions.

[0025] The pre-evaluation module is used to pre-evaluate the four-class classification potential of the core classification features using linear discriminant analysis. When the pre-evaluation result does not meet the target, the core classification features are re-acquired.

[0026] The training module is used to train the XGBoost classification model based on the core classification features and the multi-dimensional dataset to obtain a target classification model for molecular subtyping of NK / T cell lymphoma.

[0027] In some implementations, the core classification feature module includes an iterative partitioning unit, a classifier training unit, and a ranking unit;

[0028] The iterative partitioning unit is used to iteratively partition the multidimensional dataset into several training sets and validation sets using a random seed;

[0029] The classifier training unit is used to train an XGBoost classifier optimized for four-class classification for each training set obtained from the partition.

[0030] The ranking unit is used to rank the importance of all features in the XGBoost classifier obtained in each iteration of training, aggregate the importance ranking results of all iterations of training to obtain a cumulative ranking, sort the cumulative ranking in ascending order, and select the top preset number of features as the core classification features.

[0031] A third aspect of the present invention discloses an electronic device, including a memory storing executable program code and a processor coupled to the memory; the processor calls the executable program code stored in the memory to execute the machine learning-based NK / T cell lymphoma multi-omics molecular subtyping classification method disclosed in the first aspect.

[0032] The fourth aspect of the present invention discloses a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute the machine learning-based NK / T cell lymphoma multi-omics molecular subtyping classification method disclosed in the first aspect.

[0033] The beneficial effects of this invention are that it integrates molecular feature data such as genomics, transcriptomics, proteomics, and single-cell RNA sequencing data, optimizes the XGBoost classifier for the four molecular subtypes of NK / T-cell lymphoma, fully explores the feature differences between subtypes, obtains core classification features, pre-evaluates the discriminative potential of the core classification features through linear discriminant analysis to ensure the effective distinguishing ability of the core classification features, and then trains the XGBoost classification model based on the core classification features to improve the classification accuracy and robustness of the four molecular subtypes of NK / T-cell lymphoma.

Attached Figure Description

[0034] The accompanying drawings illustrate specific examples of the technical solutions described in this invention and, together with the detailed embodiments, form part of the specification, serving to explain the technical solutions, principles, and effects of this invention.

[0035] Unless otherwise specified or defined, the same reference numerals in different figures represent the same or similar technical features, and different reference numerals may be used to represent the same or similar technical features.

[0036] Figure 1 is a flowchart of a multi-omics molecular subtyping classification method for NK / T cell lymphoma based on machine learning according to an embodiment of the present invention;

[0037] Figure 2 is a flowchart illustrating the process of obtaining core classification features according to an embodiment of the present invention;

[0038] Figure 3 is a schematic diagram of the structure of the machine learning-based NK / T cell lymphoma multi-omics molecular subtyping classification system according to an embodiment of the present invention;

[0039] Figure 4 is a schematic diagram of the structure of an electronic device according to an embodiment of the present invention.

Detailed Implementation

[0040] Unless otherwise specified or defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. When combined with the technical solutions of the invention in a real-world scenario, all technical and scientific terms used herein may also have meanings corresponding to the purpose of achieving the technical solutions of the invention. The terms "first," "second," etc., used herein are merely for distinguishing names and do not represent a specific number or order. The term "and / or," as used herein, includes any and all combinations of one or more of the associated listed items.

[0041] It should be noted that when a component is considered "fixed" to another component, it can be directly fixed to the other component or there can be an intervening component; when a component is considered "connected" to another component, it can be directly connected to the other component or there can be an intervening component; when a component is considered "mounted" on another component, it can be directly mounted on the other component or there can be an intervening component; when a component is considered "placed" on another component, it can be directly placed on the other component or there can be an intervening component.

[0042] Unless otherwise specified or defined, the terms "described" or "the" as used herein refer to the technical features or technical content mentioned or described prior to the relevant section, which may be the same as or similar to the technical features or technical content mentioned herein. Furthermore, the terms "comprising" and "having," and any variations thereof, as used herein, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the steps or units listed, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to such processes, methods, products, or apparatus.

[0043] To facilitate understanding of the present invention, specific embodiments of the present invention will be described in more detail below with reference to the accompanying drawings.

[0044] NK / T-cell lymphoma is a highly aggressive and heterogeneous malignant tumor. Currently, the application of multi-omics classification in clinical practice is limited by technical complexity, lack of standardized protocols, and high costs.

[0045] To address this challenge, this invention utilizes machine learning algorithms to develop a proteomics-based molecular subtype prediction model for NK / T-cell lymphoma. This model demonstrates consistent discriminative ability in both the training and independent validation cohorts, showing high consistency with multi-omics subtype classification results. By employing a streamlined proteome, the model achieves efficient molecular subtype classification while maintaining predictive accuracy. In the independent validation cohort, the model-predicted subtypes exhibited varying complete remission rates and progression-free survival, revealing the molecular heterogeneity of NK / T-cell lymphoma and demonstrating its potential for subtype-specific precision therapy.

[0046] As shown in Figure 1, the embodiment of the present invention specifically includes the following steps:

[0047] Step S100: Obtain a multidimensional dataset of NK / T cell lymphoma samples. The multidimensional dataset includes molecular feature data of the samples and corresponding four-category molecular subtype labels. The molecular feature data includes at least one of proteomics expression data, transcriptomics data, and genomic variation data.

[0048] Multi-omics analysis of newly diagnosed NK / T-cell lymphoma patients identified four molecular subtypes with distinct biological characteristics and therapeutic targets: C1 immune-depleted subtype, C2 immune desert subtype, C3 proliferative subtype, and C4 immune inflammatory subtype. The four subtype labels correspond to the aforementioned molecular subtype categories.

[0049] Among them, the immune-depleted subtype is characterized by CD8+ T cell dysfunction and high expression of PD-1 and TIM3; functional experiments showed that dual blockade of PD-1 and TIM3 had a synergistic effect. The immune desert subtype is characterized by immune cell deficiency and CD70 overexpression; preclinical models showed its sensitivity to CD70-targeted CAR-NK cell therapy. The proliferative subtype is characterized by CDK11B amplification and enhanced cell cycle activity; CDK inhibitors combined with PD-1 blockade therapy enhanced antitumor activity. The immune inflammatory subtype is rich in cytotoxic T cell infiltration and is associated with good clinical prognosis.

[0050] First, patient and clinicopathological data were collected: This example included 470 newly diagnosed NK / T-cell lymphoma patients. Patients were included if they met the following criteria: histologically confirmed NK / T-cell lymphoma and had formalin-fixed paraffin-embedded (FFPE) tumor tissue available for multi-omics analysis. Patients who had received systemic therapy (including chemotherapy, immunotherapy, or radiotherapy) prior to tumor tissue collection were excluded. Clinical and pathological information was collected from all patients. Baseline characteristics included age, sex, clinical stage, Eastern Cooperative Oncology Group (ECOG) performance status score, tumor location, lymph node involvement, immunohistochemical (IHC) markers, and circulating Epstein-Barr virus (EBV) DNA levels. Tumor EBV infection status was determined by EBV-encoded RNA (EBER) in situ hybridization (ISH), and circulating EBV DNA levels were detected by quantitative polymerase chain reaction (qPCR). The histological diagnosis of NK / T-cell lymphoma was independently assessed by two experienced hematologists according to WHO classification criteria. Thirty-five untreated tumor samples from this cohort were selected for single-cell RNA sequencing (scRNA-seq). For spatial transcriptome analysis, 14 FFPE tumor samples were analyzed using Xenium spatial transcriptome technology; these samples were from 14 corresponding patients in the scRNA-seq cohort.

[0051] Secondly, whole-exome sequencing (WES) data were collected and subjected to quality control, somatic mutation and copy number variation analysis, and mutation characterization. Bulk transcriptome sequencing data were also collected and quality controlled. Whole-protein expression data were obtained using LC-MS / MS technology and then processed. The above data collection, processing, and analysis procedures are standard techniques in this field and will not be elaborated further.

[0052] After obtaining the above data, Cluster-of-Clusters Assignment (COCA) was performed. Specifically, subtypes of the NK / T-cell lymphoma samples were identified independently on each of four omics detection platforms: somatic mutation, copy number variation, transcriptomics, and proteomics. The subtype identification results from each platform were converted into binary indicator matrices. These binary indicator matrices were then imported into the ConsensusClusterPlus R package for consensus clustering. Clustering used the PAM algorithm with Pearson correlation distance, and 1000 iterations of 80% resampling were performed for each candidate number of clusters from 2 to 10. The optimal number of clusters was selected based on the relative increase in the area under the cumulative distribution function (CDF) curve to obtain the subtype classification of NK / T-cell lymphoma.
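The conversion of per-platform subtype calls into a binary indicator matrix can be sketched as follows. This is a minimal illustration only: the function name `build_indicator_matrix`, the platform keys, and the example subtype labels are hypothetical, and the consensus clustering itself (performed in the ConsensusClusterPlus R package according to the text) is not reproduced here.

```python
import numpy as np

def build_indicator_matrix(platform_labels):
    """One-hot encode each platform's subtype calls into 0/1 columns.

    platform_labels maps a platform name to a list of per-sample subtype
    calls (same sample order for every platform). Returns the stacked 0/1
    matrix and the matching "platform:subtype" column names.
    """
    columns, blocks = [], []
    for platform, labels in platform_labels.items():
        levels, codes = np.unique(np.asarray(labels), return_inverse=True)
        block = np.zeros((len(codes), len(levels)), dtype=int)
        block[np.arange(len(codes)), codes] = 1  # one 1 per sample per platform
        blocks.append(block)
        columns.extend(f"{platform}:{level}" for level in levels)
    return np.hstack(blocks), columns

# Hypothetical subtype calls for five samples on four platforms.
calls = {
    "mutation": ["M1", "M2", "M1", "M2", "M1"],
    "cnv": ["A", "A", "B", "B", "A"],
    "rna": ["R1", "R2", "R3", "R1", "R2"],
    "protein": ["P1", "P2", "P1", "P2", "P1"],
}
matrix, columns = build_indicator_matrix(calls)
```

Each row then carries exactly one 1 per platform, so samples that agree across platforms have correlated rows, which is what the downstream consensus clustering exploits.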

[0053] Single-cell RNA sequencing data were also collected, and unsupervised clustering and cell type annotation, single-cell copy number variation inference, differential expression and functional enrichment analysis, cell subpopulation distribution preference analysis, and Xenium in situ analysis were performed. These methods are standard techniques in this field and will not be elaborated further here.

[0054] The collected molecular characteristic data and analysis results, along with the four-category molecular subtype labels for each sample determined in conjunction with clinicopathological analysis results, constitute a multidimensional dataset. Molecular characteristic data may include proteomics expression data (such as whole protein expression data obtained by LC-MS / MS detection) and / or genomic variation data (such as somatic mutation data and copy number variation data obtained by whole exome sequencing (WES)).

[0055] Step S200: Iteratively partition the multi-dimensional dataset and train the XGBoost classifier using the partitioned dataset to obtain core classification features. The feature importance of the XGBoost classifier is represented by the average gain of the features across all partitions.

[0056] By combining multiple rounds of random validation with XGBoost classifier training, the core classification features that are most discriminative for the four-class classification task are selected from a large number of candidate variables, ensuring the reliability of feature selection and the generalization ability of the model.

[0057] As shown in Figure 2, the specific steps of this embodiment include:

[0058] Step S210: Iteratively divide the multidimensional dataset into several training and validation sets using random seeds;

[0059] We set 1000 unique random seeds, and based on each random seed, iteratively divided the NK / T cell lymphoma dataset into training and validation sets at a ratio of 4:1, completing 1000 dataset partitions. By using a large number of random seeds to achieve multiple rounds of different sample allocations, we can effectively eliminate the randomness of a single dataset partition and provide a stable data foundation for subsequent feature importance assessment.
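A minimal sketch of the seeded 4:1 partitioning described above is given below. The helper name `iter_seeded_splits` and the use of consecutive integers as seeds are assumptions made for illustration; the text does not say whether the splits are stratified by subtype, so this sketch uses a plain random permutation.

```python
import numpy as np

def iter_seeded_splits(n_samples, n_splits=1000, train_fraction=0.8):
    """Yield (seed, train_idx, valid_idx), one partition per unique seed."""
    n_train = int(round(n_samples * train_fraction))
    for seed in range(n_splits):  # assumption: seeds are 0 .. n_splits - 1
        rng = np.random.default_rng(seed)
        order = rng.permutation(n_samples)
        yield seed, order[:n_train], order[n_train:]

# Small demonstration: 5 partitions of 100 samples (the text uses 1000 seeds).
splits = list(iter_seeded_splits(n_samples=100, n_splits=5))
```

Because each partition depends only on its seed, any individual split can be regenerated later for inspection.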

[0060] Step S220: For the training set obtained from each partition, train an XGBoost classifier optimized for four-class classification;

[0061] For each training set obtained from the partitioning, an XGBoost classifier optimized for four-class classification is trained. This XGBoost classifier uses a loss function adapted to four-class classification tasks (such as a multi-class log loss function) to ensure that the model can effectively learn the feature differences between different subtypes.

[0062] Step S230: Rank all features by importance in the XGBoost classifier obtained in each iteration of training, aggregate the importance ranking results of all iterations of training, obtain the cumulative ranking, sort the cumulative ranking in ascending order, and select the top preset number of features as the core classification features.

[0063] In each iteration of training, the XGBoost classifier outputs the "contribution / importance" of each feature to the classification result, using this as the core evaluation metric to rank all features. The feature ranking results from 1000 iterations are then aggregated. Specifically, this can be done by calculating the average rank of each feature across the 1000 rankings, sorting the average rank from smallest to largest, and selecting the top 20%-30% of features by average rank as the core classification features. This process ensures that the selected features maintain stable high importance across different data partitioning scenarios, improving the predictive reliability of the features.
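The rank aggregation can be sketched as follows, assuming the per-iteration importances have already been collected into an array with one row per iteration and one column per feature. The function name `select_core_features` and the toy numbers are hypothetical. In practice the rows might be filled from XGBoost's gain importance (for example `Booster.get_score(importance_type="gain")`, with features that were never used in a split set to 0), but that wiring is an assumption and is not shown.

```python
import numpy as np

def select_core_features(importances, feature_names, top_fraction=0.25):
    """Average per-iteration ranks and keep the best-ranked features."""
    importances = np.asarray(importances, dtype=float)
    # Rank 1 = most important feature within each iteration (stable on ties).
    order = np.argsort(-importances, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(order.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, order.shape[1] + 1)
    mean_rank = ranks.mean(axis=0)
    # Sort mean ranks in ascending order and keep the top fraction.
    n_keep = max(1, int(round(top_fraction * len(feature_names))))
    keep = np.argsort(mean_rank, kind="stable")[:n_keep]
    return [feature_names[i] for i in keep], mean_rank

# Toy importances: 3 iterations x 4 features (hypothetical values).
toy = [
    [0.5, 0.1, 0.3, 0.1],
    [0.4, 0.2, 0.3, 0.1],
    [0.2, 0.1, 0.6, 0.1],
]
core, mean_rank = select_core_features(toy, ["f0", "f1", "f2", "f3"], top_fraction=0.5)
```

Averaging ranks rather than raw gains keeps a single iteration with unusually large gain values from dominating the selection.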

[0064] Specifically, the expression for calculating the average gain of the feature across all partitions in each iteration is:

[0065] $$\overline{\mathrm{Gain}}(f) = \frac{1}{T}\sum_{t=1}^{T}\mathrm{Gain}_t(f),$$

[0066] where $T$ represents the number of decision trees in the XGBoost classifier, and $\mathrm{Gain}_t(f)$ represents the accuracy improvement brought about by feature $f$ when it is used to split in the $t$-th tree. $\mathrm{Gain}_t(f)$ is calculated as:

[0067] $$\mathrm{Gain}_t(f) = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma,$$

[0068] where $G_L$ represents the sum of gradients for the samples in the left subtree, $G_R$ represents the sum of gradients for the samples in the right subtree, $H_L$ is the sum of the second derivatives of the samples in the left subtree, $H_R$ is the sum of the second derivatives of the samples in the right subtree, $\lambda$ is the regularization parameter, and $\gamma$ is the minimum gain threshold for splitting the decision tree.
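As a worked illustration of the quantities defined above, the sketch below computes a split gain from the left and right gradient and second-derivative sums, assuming the standard XGBoost split-gain form (each child scored as the squared gradient sum divided by the second-derivative sum plus the regularization parameter, the parent score subtracted, the result halved, and the minimum gain threshold deducted). The function names and the numeric inputs are hypothetical.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of one split from gradient (g) and second-derivative (h) sums."""
    def score(g, h):
        return g * g / (h + lam)

    children = score(g_left, h_left) + score(g_right, h_right)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (children - parent) - gamma

def average_gain(per_tree_gains):
    """Mean of a feature's per-tree gains, as in the averaging expression."""
    return sum(per_tree_gains) / len(per_tree_gains)

# Left child: G = -4, H = 3; right child: G = 5, H = 4; lambda = 1, gamma = 0.
gain = split_gain(-4.0, 3.0, 5.0, 4.0)    # 0.5 * (4 + 5 - 0.125) = 4.4375
mean_gain = average_gain([gain, 1.5625])  # (4.4375 + 1.5625) / 2 = 3.0
```

A split whose children have gradient sums of opposite sign, as here, separates samples the current model errs on in opposite directions, which is why its gain is large.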

[0069] This embodiment uses 1000 random seeds to achieve multi-round dataset partitioning. Combined with XGBoost feature average gain calculation and ranking aggregation, it effectively eliminates the bias caused by the randomness of data partitioning. The core classification features selected have stable high predictive ability in different scenarios, which significantly improves the reliability of feature selection.

[0070] Step S300: Use linear discriminant analysis to pre-evaluate the four-class classification potential of the core classification features. If the pre-evaluation results do not meet the target, re-acquire the core classification features.

[0071] Linear discriminant analysis (LDA) was used to pre-evaluate the four-category discrimination potential of the selected core classification features. LDA verifies whether these features can truly distinguish the four NK / T-cell lymphoma subtypes based on the data distribution by maximizing the separation between different categories and minimizing the variance within each category. If the LDA results show significant discrimination potential (high between-class separation, low within-class variance, and satisfactory discrimination accuracy), it proves that the discriminative ability of the core features is an inherent property of the data itself, rather than a random result of model iteration, and the subsequent XGBoost modeling can proceed safely. If the LDA results show poor discrimination potential, it indicates that the selected core features actually have no effective discriminative ability, and it is necessary to return to step S200 to re-obtain the core classification features. Therefore, linear discriminant analysis can effectively verify the discriminative ability of core features for the four-category subtypes of NK / T-cell lymphoma.
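The pre-evaluation gate can be sketched with scikit-learn's `LinearDiscriminantAnalysis`, using cross-validated accuracy as the discrimination measure. The text does not specify the target, so the 0.8 accuracy threshold, the 5-fold setting, the function name `lda_pre_evaluation`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def lda_pre_evaluation(X, y, target_accuracy=0.8, cv=5):
    """Return (meets_target, mean cross-validated LDA accuracy)."""
    lda = LinearDiscriminantAnalysis()
    scores = cross_val_score(lda, X, y, cv=cv, scoring="accuracy")
    mean_acc = float(scores.mean())
    return mean_acc >= target_accuracy, mean_acc

rng = np.random.default_rng(0)
y = np.repeat(np.arange(4), 40)  # four subtypes, 40 synthetic samples each
# Informative features: each class is shifted along its own axis.
X_good = 3.0 * np.eye(4, 5)[y] + rng.normal(scale=0.5, size=(160, 5))
# Uninformative features: pure noise, unrelated to the labels.
X_noise = rng.normal(size=(160, 5))

ok_good, acc_good = lda_pre_evaluation(X_good, y)
ok_noise, acc_noise = lda_pre_evaluation(X_noise, y)
```

When the gate fails, as it does for the noise features here, the workflow in the text returns to step S200 to re-acquire the core classification features.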

[0072] Step S400: Based on the core classification features, train the XGBoost classification model using a multi-dimensional dataset to obtain the target classification model for molecular subtyping of NK / T cell lymphoma.

[0073] Using core classification features, the multi-dimensional dataset is re-partitioned into training and test sets (the split ratio can still be 4:1). An XGBoost classification model is trained on the training set, and its hyperparameters are optimized using grid search to maximize classification accuracy. Optimized hyperparameters include at least two of the following: learning rate, tree depth, number of leaf nodes, minimum sample weights, etc. Simultaneously, 5-fold cross-validation is used to determine the optimal parameter configuration, further improving the model's generalization ability. Optimizing hyperparameters through grid search and determining the optimal configuration through cross-validation effectively reduces the risk of model overfitting.
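The grid search with 5-fold cross-validation can be sketched with scikit-learn's `GridSearchCV`. To keep the sketch dependent on scikit-learn only, it uses `GradientBoostingClassifier` as a stand-in estimator; this is not the XGBoost model of the embodiment. XGBoost's scikit-learn wrapper `XGBClassifier` could be passed to the same helper with a grid over parameters such as `learning_rate`, `max_depth`, and `min_child_weight`. The helper name `tune_classifier`, the grid values, the stratified split, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def tune_classifier(estimator, param_grid, X_train, y_train, cv=5):
    """Grid search maximizing cross-validated accuracy on the training set."""
    search = GridSearchCV(estimator, param_grid, scoring="accuracy", cv=cv)
    search.fit(X_train, y_train)
    return search

rng = np.random.default_rng(0)
y = np.repeat(np.arange(4), 30)  # four synthetic subtypes
X = 3.0 * np.eye(4, 6)[y] + rng.normal(size=(120, 6))

# 4:1 split; stratifying by subtype is an assumption, not stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}  # hypothetical grid
search = tune_classifier(
    GradientBoostingClassifier(n_estimators=30, random_state=0),  # stand-in model
    param_grid,
    X_train,
    y_train,
)
test_accuracy = search.score(X_test, y_test)  # refit best model on held-out set
```

Keeping the test set out of the search means the selected configuration is chosen from training-set folds only, so the held-out accuracy is not biased by the tuning.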

[0074] In this embodiment, a confusion matrix is also used, and each category is treated as a binary classification task against the other categories (one-vs-rest). Receiver Operating Characteristic (ROC) curves are plotted and the Area Under the Curve (AUC) is calculated to evaluate the performance of the XGBoost classification model. Specifically, the multi-class classification is transformed into multiple binary classifications, and the AUC of each binary classification is calculated and averaged to evaluate the model's ability to distinguish between categories. The confusion matrix is used to visualize the predictive performance for each category, clearly showing the model's classification accuracy and misclassification rate for each subtype. This dual evaluation strategy provides both a detailed view of the performance for each category and a comprehensive indicator of the model's overall discriminative ability.
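The dual evaluation can be sketched with scikit-learn metrics, given true labels and a matrix of predicted class probabilities. The function name `evaluate_four_class`, the integer class coding 0 to 3 (standing for C1 to C4), and the toy probabilities are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.preprocessing import label_binarize

def evaluate_four_class(y_true, proba, classes=(0, 1, 2, 3)):
    """One-vs-rest AUC per class, their mean, and the confusion matrix."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    class_list = list(classes)
    y_onehot = label_binarize(y_true, classes=class_list)
    # Each class against the rest: one binary AUC per column.
    per_class_auc = {
        c: float(roc_auc_score(y_onehot[:, i], proba[:, i]))
        for i, c in enumerate(class_list)
    }
    macro_auc = float(np.mean(list(per_class_auc.values())))
    y_pred = np.asarray(class_list)[proba.argmax(axis=1)]
    cm = confusion_matrix(y_true, y_pred, labels=class_list)
    return per_class_auc, macro_auc, cm

# Toy predictions for eight samples; the fourth sample is misclassified.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
proba = [
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.20, 0.30, 0.40, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.20, 0.60, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.05, 0.05, 0.10, 0.80],
]
per_class_auc, macro_auc, cm = evaluate_four_class(y_true, proba)
```

In this toy case every one-vs-rest AUC is 1.0 even though one sample is misclassified, which illustrates why the confusion matrix is reported alongside the AUC: ranking quality and hard-label accuracy can differ.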

[0075] In some implementations, to further verify the performance advantages of the XGBoost classification model, a logistic regression (LR) model and a support vector machine (SVM) classifier are selected, and their performance is evaluated on the same core classification features and dataset and compared with that of the XGBoost classification model.

[0076] In some implementations, the LDA validation results output specific discrimination details (such as which subtypes are well distinguished, which two subtypes are easily confused, and which feature contributes the most to inter-class separation). These discriminative details can guide XGBoost modeling. For example, if LDA shows that subtype 1 and subtype 2 are easily confused, then during subsequent XGBoost modeling, hyperparameters can be adjusted (such as increasing tree depth or learning rate), or derived features that can distinguish these two subtypes can be added to the features. If certain core features contribute the most in LDA, then during subsequent XGBoost training, the importance of these features can be emphasized to ensure that the model focuses on the core discrimination dimensions.

[0077] Testing showed that the target classification model in this embodiment achieved the best overall performance. In the training cohort (n = 251), the area under the curve (AUC) of the target classification model was 1.00, 0.96, and 0.95, respectively, with AUCs of 0.95 and 0.95 for subtypes C1-C4. Confusion matrix analysis showed high accuracy for C1 (66 / 70 correct) and C4 (61 / 69 correct), while the misclassification rate was low between subtypes C2 and C3. In the internal test cohort (n = 62), the model maintained robust performance, with AUCs of 0.94, 0.84, 0.84, and 0.89 for C1-C4. The corresponding confusion matrix again showed high accuracy for subtypes C1 and C4 (16 / 18 and 15 / 17 correct, respectively). Kaplan-Meier analysis performed in an external validation cohort (n = 157) showed that the proteomics-based classifier effectively stratified patients according to progression-free survival (PFS). Across the entire cohort, the C4 subtype exhibited the best PFS, while the C2 subtype showed the shortest (P < 0.0001). A similar pattern was observed in the subgroup receiving immunochemotherapy, where the C4 subtype maintained the best PFS, followed by the C1 subtype, and finally the C2 and C3 subtypes (P < 0.0001). Across the entire cohort, the C4 subtype had the highest complete response rate (91%), while the C2 subtype had the lowest (42%) (P < 0.0001). In patients receiving immunochemotherapy, both the C4 and C1 subtypes achieved a complete response rate of 94%, higher than the 50% for the C2 subtype and the 61% for the C3 subtype (P = 0.0047). Overall, these data suggest that the target classification model can reproduce the subtypes defined by multi-omics analysis.

[0078] In summary, this embodiment integrates multi-dimensional molecular feature data, including genomics, transcriptomics, proteomics, and single-cell RNA sequencing data, to optimize the XGBoost classifier for the four molecular subtypes of NK / T-cell lymphoma. It fully explores the feature differences between subtypes to obtain core classification features. The discriminative potential of these core features is pre-assessed using LDA to ensure their effective distinguishing ability. The XGBoost classification model is then trained using these core classification features to improve the accuracy and robustness of the four molecular subtype classification of NK / T-cell lymphoma. The selected core classification features can be further developed into diagnostic biomarkers for NK / T-cell lymphoma molecular subtypes, and the constructed target classification model can be directly applied to the subtype classification of NK / T-cell lymphoma patients in clinical practice.

[0079] As shown in Figure 3, based on the above machine learning-based multi-omics molecular subtyping method for NK / T-cell lymphoma, this invention also discloses a machine learning-based multi-omics molecular subtyping system for NK / T-cell lymphoma, including:

[0080] Data set module 600 is used to acquire a multidimensional dataset of NK / T cell lymphoma samples. The multidimensional dataset includes molecular feature data of the samples and corresponding four-category molecular subtype labels. The molecular feature data includes at least one of proteomics expression data, transcriptomics data, and genomic variation data.

[0081] The core classification feature module 610 is used to iteratively divide the multi-dimensional dataset and train an XGBoost classifier using the divided dataset to obtain core classification features. The feature importance of the XGBoost classifier is represented by the average gain of the features across all divisions.
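To make the gain-based importance concrete, the sketch below computes the standard XGBoost split gain from the gradient and second-derivative sums of the left and right children, and then averages a feature's gains over the trees of a classifier. The function names, the default values of `lam` and `gamma`, and the per-tree gain values are illustrative assumptions. A trained XGBoost model would report these quantities itself.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Standard XGBoost gain of a split.

    g_* are sums of first-order gradients and h_* are sums of second-order
    derivatives for the samples falling in the left and right children.
    lam is the L2 regularization parameter; gamma is the minimum gain
    threshold required to make a split.
    """
    def score(g, h):
        return g * g / (h + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def average_gain(gain_per_tree):
    """Average each feature's gain over the trees of the classifier.

    gain_per_tree maps a feature name to a list with one entry per tree
    (0.0 when the feature is not used in that tree).
    """
    return {f: sum(gains) / len(gains) for f, gains in gain_per_tree.items()}


# Worked example with hypothetical numbers.
gain = split_gain(g_left=-4.0, h_left=3.0, g_right=6.0, h_right=5.0,
                  lam=1.0, gamma=0.5)
importance = average_gain({
    "protein_A": [2.0, 4.0, 0.0],
    "protein_B": [1.0, 0.0, 0.5],
})
print(round(gain, 4), importance)
```

In the worked example the left and right children score 4.0 and 6.0, the unsplit parent scores 4/9, and the gain is half the difference minus the threshold `gamma`.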

[0082] The pre-evaluation module 620 is used to pre-evaluate the four-class classification potential of the core classification features using linear discriminant analysis. When the pre-evaluation result does not meet the target, the core classification features are re-acquired.
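The control flow of the pre-evaluation step can be sketched as a loop that re-acquires features until the LDA-based score reaches the target. Everything named here is hypothetical: `acquire_features` stands in for the core classification feature module, `lda_score` for whatever LDA criterion is used (for example, cross-validated accuracy), and `max_rounds` is an added safeguard that the text does not specify.

```python
def select_until_discriminative(acquire_features, lda_score, target, max_rounds=5):
    """Re-acquire core features until the LDA pre-evaluation meets the target.

    acquire_features(round_index) returns a candidate feature list;
    lda_score(features) returns a number where higher means better separation.
    """
    for round_index in range(max_rounds):
        features = acquire_features(round_index)
        score = lda_score(features)
        if score >= target:
            # Pre-evaluation passed: hand these features to model training.
            return features, score, round_index + 1
    raise RuntimeError("no feature set met the LDA target within max_rounds")


# Stub inputs (hypothetical) to show the loop behavior.
candidates = [["P1", "P2"], ["P1", "P3"], ["P3", "P4"]]
scores = {("P1", "P2"): 0.62, ("P1", "P3"): 0.81, ("P3", "P4"): 0.90}

features, score, rounds = select_until_discriminative(
    acquire_features=lambda i: candidates[i],
    lda_score=lambda f: scores[tuple(f)],
    target=0.8,
    max_rounds=3,
)
print(features, score, rounds)
```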

[0083] Training module 630 is used to train an XGBoost classification model based on the core classification features and the multi-dimensional dataset to obtain a target classification model for molecular subtyping of NK / T cell lymphoma.

[0084] In some implementations, the core classification feature module includes an iterative partitioning unit, a classifier training unit, and a ranking unit;

[0085] The iterative partitioning unit is used to iteratively partition the multidimensional dataset into several training sets and validation sets using a random seed;

[0086] The classifier training unit is used to train an XGBoost classifier optimized for four-class classification for each training set obtained from the partition.

[0087] The ranking unit is used to rank the importance of all features in the XGBoost classifier obtained in each iteration of training, aggregate the importance ranking results of all iterations of training to obtain a cumulative ranking, sort the cumulative ranking in ascending order, and select the top preset number of features as the core classification features.
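The ranking unit's aggregation can be sketched as follows: rank features within each iteration (rank 1 is the most important), sum the ranks across iterations, sort the sums in ascending order, and keep the top features. The function names, the alphabetical tie-breaking, and the importance values are illustrative assumptions. The sketch also assumes every feature receives an importance value in every iteration.

```python
def rank_features(importance):
    """Map each feature to its rank within one iteration (1 = most important).

    Ties are broken alphabetically so the result is deterministic.
    """
    ordered = sorted(importance, key=lambda f: (-importance[f], f))
    return {f: rank for rank, f in enumerate(ordered, start=1)}


def core_features(importances_per_iteration, top_k):
    """Sum per-iteration ranks and return the top_k features with the
    smallest cumulative rank, together with all cumulative ranks."""
    cumulative = {}
    for importance in importances_per_iteration:
        for feature, rank in rank_features(importance).items():
            cumulative[feature] = cumulative.get(feature, 0) + rank
    # Ascending order: a lower cumulative rank means consistently important.
    ordered = sorted(cumulative, key=lambda f: (cumulative[f], f))
    return ordered[:top_k], cumulative


# Hypothetical average-gain importances from three random partitions.
iterations = [
    {"A": 5.0, "B": 3.0, "C": 1.0, "D": 0.5},
    {"A": 2.0, "B": 4.0, "C": 0.1, "D": 1.0},
    {"A": 6.0, "B": 1.0, "C": 2.0, "D": 0.2},
]

selected, cumulative = core_features(iterations, top_k=2)
print(selected, cumulative)
```

Summing ranks rather than raw gains keeps a single partition with unusually large gain values from dominating the selection.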

[0088] As shown in Figure 4, an embodiment of the present invention discloses an electronic device, including a memory 401 storing executable program code and a processor 402 coupled to the memory 401;

[0089] The processor 402 calls the executable program code stored in the memory 401 to execute the machine learning-based NK / T cell lymphoma multi-omics molecular subtyping classification method described in the above embodiments.

[0090] This invention also discloses a computer-readable storage medium storing a computer program that causes a computer to execute the machine learning-based NK / T cell lymphoma multi-omics molecular subtyping classification method described in the above embodiments.

[0091] The above embodiments illustrate the technical solution of the present invention by way of example and fully describe its technical solution, purpose, and effect. They are intended to give the public a more thorough and comprehensive understanding of the disclosure of the present invention, not to limit its scope of protection.

[0092] The above embodiments are not an exhaustive list based on the present invention, and there may be many other embodiments not listed. Any substitutions and improvements made without departing from the concept of the present invention are within the protection scope of the present invention.

Claims

1. A machine learning-based multi-omics molecular subtyping method for NK / T-cell lymphoma, characterized in that it comprises: acquiring a multidimensional dataset of NK / T-cell lymphoma samples, wherein the multidimensional dataset contains molecular feature data of the samples and corresponding four-category molecular subtype labels, and the molecular feature data includes at least one of proteomics expression data, transcriptomics data, and genomic variation data; iteratively partitioning the multidimensional dataset and training an XGBoost classifier using the partitioned dataset to obtain core classification features, wherein the feature importance of the XGBoost classifier is represented by the average gain of the features across all partitions; using linear discriminant analysis to pre-evaluate the four-class discriminative potential of the core classification features, and re-acquiring the core classification features if the pre-evaluation result does not meet the target; and, based on the core classification features, training an XGBoost classification model using the multidimensional dataset to obtain a target classification model for molecular subtyping of NK / T-cell lymphoma.

2. The machine learning-based multi-omics molecular subtyping method for NK / T-cell lymphoma as described in claim 1, characterized in that, when training the XGBoost classification model, the hyperparameters of the XGBoost classification model are optimized through grid search to maximize classification accuracy, and the optimal parameter configuration is determined by cross-validation on the training data.

3. The machine learning-based multi-omics molecular subtyping method for NK / T-cell lymphoma as described in claim 1, characterized in that iteratively partitioning the multidimensional dataset and training an XGBoost classifier using the partitioned dataset to obtain core classification features includes: iteratively dividing the multidimensional dataset into several training sets and validation sets using random seeds; for each training set obtained from the partitioning, training an XGBoost classifier optimized for four-class classification; and ranking all features by importance in the XGBoost classifier obtained in each training iteration, aggregating the importance ranking results of all training iterations to obtain a cumulative ranking, sorting the cumulative ranking in ascending order, and selecting the top preset number of features as the core classification features.

4. The machine learning-based multi-omics molecular subtyping method for NK / T-cell lymphoma as described in claim 3, characterized in that the average gain of a feature across all partitions is calculated as:

$$\mathrm{Gain}(f) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{Gain}(f, t)$$

where $T$ represents the number of decision trees in the XGBoost classifier, and $\mathrm{Gain}(f, t)$ represents the accuracy improvement brought about by feature $f$ when it is used as a split in tree $t$, calculated as:

$$\mathrm{Gain}(f, t) = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

where $G_L$ is the sum of gradients for the samples in the left subtree, $G_R$ is the sum of gradients for the samples in the right subtree, $H_L$ is the sum of the second derivatives of the samples in the left subtree, $H_R$ is the sum of the second derivatives of the samples in the right subtree, $\lambda$ is the regularization parameter, and $\gamma$ is the minimum gain threshold for splitting the decision tree.

5. The machine learning-based multi-omics molecular subtyping method for NK / T-cell lymphoma as described in claim 1, characterized in that the performance of the XGBoost classification model is evaluated using a confusion matrix and by plotting receiver operating characteristic (ROC) curves and calculating the area under each curve, with each category treated as a binary classification task against all other categories.

6. The machine learning-based multi-omics molecular subtyping method for NK / T-cell lymphoma as described in claim 1, characterized in that, when acquiring the multidimensional dataset of NK / T-cell lymphoma samples, subtypes of the NK / T-cell lymphoma samples are independently identified on four omics detection platforms: somatic mutation, copy number variation, transcriptomics, and proteomics; the subtype identification results of each platform are converted into binary indicator matrices; consensus clustering is performed on the binary indicator matrices; and the optimal number of clusters is selected based on the relative increase in the area under the cumulative distribution function curve, to obtain the subtype classification of NK / T-cell lymphoma.

7. A machine learning-based multi-omics molecular subtyping system for NK / T-cell lymphoma, characterized in that it comprises: a dataset module, used to acquire a multidimensional dataset of NK / T-cell lymphoma samples, wherein the multidimensional dataset contains molecular feature data of the samples and corresponding four-category molecular subtype labels, and the molecular feature data includes at least one of proteomics expression data, transcriptomics data, and genomic variation data; a core classification feature module, used to iteratively partition the multidimensional dataset and train an XGBoost classifier using the partitioned dataset to obtain core classification features, wherein the feature importance of the XGBoost classifier is represented by the average gain of the features across all partitions; a pre-evaluation module, used to pre-evaluate the four-class discriminative potential of the core classification features using linear discriminant analysis, wherein the core classification features are re-acquired when the pre-evaluation result does not meet the target; and a training module, used to train an XGBoost classification model based on the core classification features and the multidimensional dataset to obtain a target classification model for molecular subtyping of NK / T-cell lymphoma.

8. The machine learning-based multi-omics molecular subtyping system for NK / T-cell lymphoma as described in claim 7, characterized in that the core classification feature module includes an iterative partitioning unit, a classifier training unit, and a ranking unit; the iterative partitioning unit is used to iteratively partition the multidimensional dataset into several training sets and validation sets using a random seed; the classifier training unit is used to train an XGBoost classifier optimized for four-class classification for each training set obtained from the partitioning; and the ranking unit is used to rank the importance of all features in the XGBoost classifier obtained in each training iteration, aggregate the importance ranking results of all training iterations to obtain a cumulative ranking, sort the cumulative ranking in ascending order, and select the top preset number of features as the core classification features.

9. An electronic device, characterized in that it includes a memory storing executable program code and a processor coupled to the memory; the processor calls the executable program code stored in the memory to execute the machine learning-based multi-omics molecular subtyping method for NK / T-cell lymphoma according to any one of claims 1-6.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, wherein the computer program causes a computer to execute the machine learning-based multi-omics molecular subtyping method for NK / T-cell lymphoma according to any one of claims 1-6.