A combination of protein biomarkers for screening multiple cancer types and its applications

The combination of protein biomarkers constructed using mass spectrometry and machine learning algorithms has solved the problems of insufficient sensitivity and specificity in multi-cancer screening, enabling accurate identification and classification of multiple cancers and improving the accuracy and efficiency of cancer screening.

CN122307104A · Pending · Publication Date: 2026-06-30 · GENESEEQ TECH INC +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
GENESEEQ TECH INC
Filing Date
2026-05-29
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies lack a systematically constructed, clinically validated combination of shared protein biomarkers for multiple cancer types and its standardized detection protocol. This results in insufficient sensitivity and specificity of existing serum/plasma tumor biomarkers in early cancer screening and cancer differentiation, making it impossible to effectively achieve non-invasive and efficient screening for multiple cancer types.

Method used

Using mass spectrometry combined with machine learning algorithms, a combination of protein biomarkers was constructed, including multiple proteins such as IGLV4-60 and IGLV2-18. Through a stratified screening strategy, plasma samples were used to identify multiple cancer types, including colorectal cancer, gastric cancer, liver cancer, lung cancer, ovarian cancer, and pancreatic cancer.

Benefits of technology

It identifies and classifies multiple cancer types with high sensitivity and specificity: it distinguishes healthy samples from cancer samples, differentiates the individual cancer types, and has potential for clinical application.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a combination of protein biomarkers for screening multiple cancer types and its applications, belonging to the field of cancer diagnostic technology. The combination includes 164 candidate proteins such as IGLV4-60 and TPM3. This invention also provides an in vitro detection method, which involves extracting proteins from plasma or serum samples of subjects, enzymatically digesting them, and then using liquid chromatography-mass spectrometry to obtain the relative expression levels of the proteins; subsequently, the expression levels are input into a classification model constructed based on a machine learning algorithm. This invention employs a stratified screening strategy of first binary classification and then multi-class classification, which can not only distinguish between healthy individuals and cancer patients with high sensitivity and specificity, but also further accurately classify colorectal cancer, gastric cancer, liver cancer, lung cancer, ovarian cancer, and pancreatic cancer. This invention achieves non-invasive, high-throughput parallel screening and tissue tracing of multiple cancer types, possessing extremely high clinical application value.

Description

Technical Field

[0001] This invention relates to a combination of protein biomarkers for screening multiple cancer types and its applications, belonging to the field of cancer diagnostic technology.

Background Technology

[0002] Malignant tumors are a leading cause of death worldwide, with both incidence and mortality rates continuing to rise. For solid tumors such as stomach cancer, liver cancer, lung cancer, pancreatic cancer, ovarian cancer, and colorectal cancer, diagnosis currently relies primarily on gastroscopy / colonoscopy or histopathological biopsy. These methods are invasive, have poor patient compliance, and often lead to missed opportunities for early detection and intervention. Therefore, there is an urgent need to develop a non-invasive, efficient, and highly sensitive screening method for multiple cancers to improve patient survival rates and intervention outcomes.

[0003] Plasma, a readily available and information-rich bodily fluid, contains a large number of proteins that reflect the body's physiological and pathological states, making it an important sample source for the development of clinical disease biomarkers. Recent studies have shown that combinations of plasma protein biomarkers exhibit higher sensitivity and specificity compared to single biomarkers, and their concentration changes can accurately reflect systemic physiological alterations in an individual during tumor development and progression. Therefore, developing multi-indicator combined detection methods based on plasma proteomics shows significant potential in multi-cancer identification and personalized risk assessment.

[0004] Currently, commonly used serum / plasma tumor markers (such as CEA, CA19-9, AFP, CA125, etc.) are mostly single indicators, primarily used for disease monitoring and efficacy evaluation. Their sensitivity and specificity are generally insufficient in early cancer screening and cancer differentiation, and they are not suitable for parallel screening of multiple cancer types. Furthermore, the expression of these traditional markers may be interfered with by inflammation, age, or other non-tumor factors, further limiting their clinical value.

[0005] Mass spectrometry, as a crucial tool in proteomics research, possesses high-throughput, high-sensitivity, and high-precision qualitative and quantitative capabilities, and has been widely applied in the screening, validation, and quantitative analysis of disease-related protein biomarkers. Furthermore, the development of various targeted and non-targeted mass spectrometry detection technologies has laid the technological foundation for high-precision and high-throughput protein biomarker detection. Simultaneous quantification of multiple proteins in plasma using mass spectrometry holds promise for constructing precise models for the early detection of multiple cancer types.

[0006] Nevertheless, there is currently a lack of a systematically constructed, clinically validated, and mass spectrometry-based combination of shared protein biomarkers for multiple cancer types, along with standardized detection protocols. Existing research largely focuses on single cancer types or specific proteins, lacking integrated cross-cancer identification models and systematic validation of candidate protein combinations in real-world clinical cohorts. Therefore, developing a protein combination, detection method, and its application system in screening and classification that can accurately identify multiple cancer types in plasma is of significant scientific importance and has broad clinical application prospects.

Summary of the Invention

[0007] To achieve the goal of accurately identifying multiple cancer types in plasma, this invention provides a combination of protein biomarkers for screening multiple cancer types.

[0008] A combination of protein biomarkers for screening multiple cancers consists of the following proteins: IGLV4-60, IGLV2-18, IGLV3-12, IGHV3OR16-12, IGKV1OR2-108, IGHV3OR15-7, IGHV3OR16-10, TPM3, PARVB, GPX3, IGHV1-45, IGHV3-49, IGHV3-21, IGLV9-49, IGLL5 or IGLC1, GP1BA, IGHV1-3, IGHV1-18, IGHV4-61, IGKV2-24 or IGKV2D-24, IGHV2-70, A0A0J9YY99, SEPP1, ALDOB, CCDC110, LDHB, IGLC7, CALR, AGT, HSPA5, HSP90B1, IGFBP3, PROC, DST, HSPA8 or HSPA2, CFL1, F7, MYL6, SERPINA3, COMP, ZYX, GNPTG, CP, MYL12A or MYL12B, C3, PDLIM1, CLIC1, APOL1, WDR1, VNN1, SDPR, LDHA, F13A1, HPR, CA1, IGF2, IGKV3-20, IGLV1-47, IGLV2-11, IGLV3-1, IGHV3-7, IGHV2-5, IGKC, SLC4A1, CRP, FN1, PPBP, PF4, ANG, F11, CAT, LCAT, IGLV7-43, VWF, GAPDH, ITGB3, S100A8, SERPINA5, BCHE, S100A9, ENO1, CSF1R, HSP90AA1, ITGA2B, LPA, PLEK, FCGR3A, SPARC, SAA1, SAA2, IGLC3, IGHV1-8, IGKV1D-13, SRGN, SLC2A1, CETP, ACTN1, LAMP2, PRG2, SELL, PKM, ANPEP, IGLL1, EPB42, IGFBP2, VCL, VCAM1, PZP, FLNA, IGFBP4, STOM, GRN, PRDX6, BLVRB, PRDX2, MAN1A1, MYH9, IGFALS, TAGLN2, BTD, LIMS1, HSPA7 or HSPA6, CRISP3, APOC4, INHBC, CDH13, MTPN, DEFA3 or DEFA1, ACTB or ACTG1, RAP1B or RAP1A, LYZ, YWHAZ, TPM4, ACTA1 or ACTC1 or ACTG2 or ACTA2, TUBA1B or TUBA1A or TUBA3E, HBB, HBA1, IGLV3-21, CAP1, FGL1, SPP2, ILK, PCOLCE, PON3, RSU1, TGFBI, ADIPOQ, ECM1, TUBB, PI16, FERMT3, CFHR4, CNDP1, RARRES2, MENT, COLEC11, CFHR5, CDHR2, APMAP, CRTAC1, C1RL, PCYOX1, TLN1, KBTBD3.

[0009] The aforementioned protein biomarker combination is used in the preparation of products for in vitro screening, diagnosis, or auxiliary diagnosis of various cancers.

[0010] The application includes a stratified screening strategy: First stage: Construct a binary classification model using the protein biomarker combination for preliminary classification of cancer patients and healthy individuals; Second stage: After preliminary classification into cancer patients, construct a multi-classification model using the protein biomarker combination for further classification of different cancer types; The different cancer types include one or more of colorectal cancer, gastric cancer, liver cancer, lung cancer, ovarian cancer, and pancreatic cancer.

[0011] The method also includes the following steps: a. Obtaining in vitro plasma or serum samples from the subject and extracting a whole protein mixture; b. Denaturing, reducing, alkylating, and enzymatically digesting the whole protein mixture to generate corresponding peptides; c. Separating the enzymatically digested peptides using liquid chromatography; d. Introducing the separated peptides into a mass spectrometer for detection to obtain mass spectra of primary precursor ions and secondary fragment ions; e. Calculating the relative expression levels of corresponding proteins in the protein biomarker combination based on the mass spectra and the signal responses of characteristic polypeptides of the proteins; f. Inputting the relative expression levels into a pre-trained classification model to output the subject's health status or cancer category.

[0012] The specific process parameters for extracting the whole protein mixture and enzymatic digestion described in steps a and b include: extracting proteins using solvent precipitation; diluting plasma or serum samples with a 20-300 mM ammonium bicarbonate solution at a volume ratio of 1:5-1:20; then adding pre-cooled methanol for precipitation, with a volume ratio of the diluent to methanol of 1:1-3:1; redissolving the precipitated protein in 6-10 M urea; adding a reducing agent to a final concentration of 2-10 mM for denaturation and reduction; then adding an alkylating agent to a final concentration of 5-20 mM for a light-protected reaction; diluting the urea concentration to below 1 M; and then adding digestive enzymes at an enzyme-to-substrate mass ratio of 1:20-1:1000 for enzymatic digestion.

[0013] The separation conditions for liquid chromatography described in step c are as follows: the column specification is a C18 column with a particle size of 1.5-3.0 μm; mobile phase A is an aqueous solution containing 0.01-1.0% formic acid, and mobile phase B is a 60-95% acetonitrile solution containing 0.01-1.0% formic acid; the separation gradient duration is 20-150 minutes.

[0014] The mass spectrometry detection parameters in step d include: for the first-stage (MS1) scan, a scan range of 300-1800 m/z, a resolution of 50,000-300,000, an automatic gain control (AGC) target of 100%-600%, an ion injection time of 10-200 ms, and precursor charge states of 2-5; for the second-stage (MS2) scan, 3-30 data-dependent MS2 spectra, an isolation window of 0.5-5.0 m/z, a resolution of 10,000-100,000, an AGC target of 20%-400%, an ion injection time of 20-400 ms, and a higher-energy collisional dissociation (HCD) collision energy of 20%-40%.

[0015] In step e, proteins are identified by searching the mass spectra against a human protein database using search software, and quantified using a label-free protein quantification algorithm: the ratios of shared peptides are calculated pairwise between samples, and protein abundance is reconstructed by least-squares optimization, giving the relative expression level of each protein.
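The pairwise-ratio and least-squares idea in [0015] can be illustrated with a small sketch. This is not the search software's implementation: it works in log2 space, assumes every pair of samples shares at least one peptide (in which case the least-squares solution with a sum-to-zero constraint reduces to a row mean), and the peptide names and intensities are made-up illustration values.

```python
import math
from statistics import median


def relative_log2_abundance(peptides):
    """Estimate a protein's relative log2 abundance in each sample.

    peptides maps a peptide name to its intensities, one per sample
    (None or 0 means the peptide was not detected in that sample).
    """
    n = len(next(iter(peptides.values())))
    # r[i][j] = median log2 ratio (sample i / sample j) over shared peptides
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            ratios = [
                math.log2(v[i] / v[j])
                for v in peptides.values()
                if v[i] and v[j]
            ]
            if not ratios:
                raise ValueError(f"samples {i} and {j} share no peptide")
            r[i][j] = median(ratios)
            r[j][i] = -r[i][j]
    # Minimising sum over pairs of (r_ij - (x_i - x_j))^2 with sum(x) = 0
    # gives x_i = mean_j r_ij when all pairs are observed.
    return [sum(row) / n for row in r]


# Hypothetical peptides of one protein measured in three samples
profile = relative_log2_abundance({
    "PEPTIDE_A": [100.0, 200.0, 400.0],
    "PEPTIDE_B": [10.0, 20.0, 40.0],
    "PEPTIDE_C": [None, 50.0, 100.0],
})
```

In this toy input each sample holds twice as much protein as the previous one, so consecutive entries of `profile` differ by one log2 unit even though the peptides have very different absolute intensities.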

[0016] The process of constructing the pre-trained classification model in step f includes: screening differentially expressed proteins using a linear model based on the empirical Bayesian method, with the screening criteria being an absolute value of expression fold change ≥ 1.5 and a p-value < 0.05; optimizing the model using a K-fold cross-validation strategy, where K is 3-10; and constructing the classification model based on at least one machine learning algorithm selected from the following: Least Absolute Shrinkage and Selection Operator (LASSO), Random Forest (RF), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost).
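As a rough illustration of the screening criteria in [0016], the sketch below filters proteins by fold change and p-value. It does not perform the empirical Bayes linear modeling itself: fold changes and p-values are taken as given inputs. It also assumes that "absolute value of fold change ≥ 1.5" means a change of at least 1.5-fold in either direction; the protein names and numbers are hypothetical.

```python
import math


def select_differential(stats, min_fold=1.5, max_p=0.05):
    """Return proteins passing both the fold-change and p-value cut-offs.

    stats maps a protein name to (fold_change, p_value), where
    fold_change is the cancer/healthy abundance ratio (must be > 0).
    """
    log_cut = math.log2(min_fold)
    return sorted(
        name
        for name, (fold_change, p_value) in stats.items()
        if abs(math.log2(fold_change)) >= log_cut and p_value < max_p
    )


# Hypothetical test statistics, for illustration only
toy_stats = {
    "PROT_A": (2.10, 0.001),  # up 2.1-fold, significant: kept
    "PROT_B": (0.50, 0.010),  # down 2-fold, significant: kept
    "PROT_C": (1.20, 0.001),  # change too small: dropped
    "PROT_D": (3.00, 0.200),  # not significant: dropped
}
selected = select_differential(toy_stats)
```

Working on the log2 scale treats up- and down-regulation symmetrically, so a protein at 0.5-fold is kept just like one at 2-fold.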

[0017] The classification model mentioned in step f includes a binary classification model and a multi-class classification model; the output of the subject's health status or cancer category specifically includes: firstly, using the binary classification model to initially distinguish between healthy and cancer samples; then, using the multi-class classification model to further identify samples determined to be cancerous in order to distinguish specific cancer types; the specific cancer types include colorectal cancer, gastric cancer, liver cancer, lung cancer, ovarian cancer, and pancreatic cancer.
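The two-stage logic of [0017] can be sketched as a small dispatch function. The two models are passed in as plain callables, so any trained classifier could be plugged in; the 0.5 decision threshold, the function names, and the toy stand-in models are assumptions for illustration and do not come from the patent.

```python
CANCER_TYPES = (
    "colorectal", "gastric", "liver", "lung", "ovarian", "pancreatic",
)


def stratified_screen(features, cancer_probability, type_scores, threshold=0.5):
    """Stage 1: healthy vs. cancer. Stage 2: cancer type, only if stage 1 is positive."""
    if cancer_probability(features) < threshold:
        return "healthy"
    scores = type_scores(features)
    # Pick the cancer type with the highest score from the multi-class model
    return max(CANCER_TYPES, key=lambda cancer: scores.get(cancer, 0.0))


# Toy stand-ins for the trained binary and multi-class models
def toy_binary_model(features):
    return 0.9 if features[0] > 1.0 else 0.1


def toy_multiclass_model(features):
    return {"gastric": 0.7, "lung": 0.2, "liver": 0.1}


result_low = stratified_screen([0.2, 0.0], toy_binary_model, toy_multiclass_model)
result_high = stratified_screen([1.8, 0.0], toy_binary_model, toy_multiclass_model)
```

Here `result_low` is `"healthy"` and the multi-class model is never called for it; `result_high` passes the first stage and is assigned `"gastric"`.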

[0018] The beneficial effects of this invention are:

[0019] 1. Strong ability to identify multiple cancer types: The combination of plasma protein biomarkers screened and verified by this invention can effectively distinguish between healthy individuals and six common malignant tumors: colorectal cancer, gastric cancer, liver cancer, lung cancer, ovarian cancer, and pancreatic cancer. It has good broad-spectrum ability and improves the feasibility and accuracy of simultaneous screening for multiple cancer types.

[0020] 2. High sensitivity and high specificity: The protein combination described in this invention exhibits excellent diagnostic performance in multiple cancer types and has potential for clinical application.

[0021] 3. Precise detection combined with algorithm-assisted intelligent discrimination: Mass spectrometry has advantages such as high detection throughput, good repeatability, and accurate quantification. Furthermore, by introducing machine learning models, it can achieve intelligent classification among multiple cancer types, reduce human error, and improve diagnostic efficiency.

Description of the Attached Figures

[0022] Figure 1: Distribution of feature contribution (SHAP) values in the cancer vs. healthy classification model.

[0023] Figure 2: Distribution of feature contribution (SHAP) values in the colorectal cancer classification model.

[0024] Figure 3: Distribution of feature contribution (SHAP) values in the liver cancer classification model.

[0025] Figure 4: Distribution of feature contribution (SHAP) values in the lung cancer classification model.

[0026] Figure 5: Distribution of feature contribution (SHAP) values in the pancreatic cancer classification model.

[0027] Figure 6: Distribution of feature contribution (SHAP) values in the gastric cancer classification model.

[0028] Figure 7: Distribution of feature contribution (SHAP) values in the ovarian cancer classification model.

[0029] Figure 8: Validation set receiver operating characteristic (ROC) curve (cancer vs. healthy individuals).

[0030] Figure 9: Validation set receiver operating characteristic (ROC) curve (colorectal cancer stratification).

[0031] Figure 10: Validation set receiver operating characteristic (ROC) curve (gastric cancer stratification).

[0032] Figure 11: Validation set receiver operating characteristic (ROC) curve (pancreatic cancer stratification).

[0033] Figure 12: Validation set receiver operating characteristic (ROC) curve (hepatocellular carcinoma stratification).

[0034] Figure 13: Validation set receiver operating characteristic (ROC) curve (lung cancer stratification).

[0035] Figure 14: Validation set receiver operating characteristic (ROC) curve (ovarian cancer stratification).

Detailed Implementation

[0036] This invention provides a combination of protein biomarkers for screening various cancers and their applications. It includes the following steps:

[0037] 1. Sample collection:

[0038] Collect plasma or serum samples and extract a mixture of total proteins using solvent precipitation or magnetic bead enrichment.

[0039] 2. Enzymatic digestion of proteins:

[0040] The protein mixture is then subjected to enzymatic digestion, with trypsin being a commonly used digestive enzyme. This step is used to obtain the corresponding peptides of the proteins.

[0041] 3. Peptide separation:

[0042] Liquid chromatography (LC) was used to separate the peptides after enzymatic digestion in order to improve the coverage of mass spectrometry analysis.

[0043] 4. Mass spectrometry analysis:

[0044] The separated peptides were introduced into a mass spectrometer for detection and analysis. The mass-to-charge ratio (m / z) of the primary precursor ion and secondary fragment ions of the peptides was detected to obtain the peptide mass spectra of each protein.

[0045] 5. Data Processing:

[0046] Mass spectrometry results were used to analyze the characteristic peptides of each protein. Based on the primary precursor ion spectrum and the m/z values and abundances of the resolved secondary fragment peaks, the relative expression level of each protein was determined. Differential proteins between cancer patients and healthy individuals were analyzed using methods such as t-tests, and a model was constructed for screening and differentiating multiple cancer types.

[0047] In the following embodiments, protein biomarker detection and analysis are performed through the following steps:

[0048] 1. Sample collection:

[0049] A total of 266 plasma samples were collected from patients and healthy volunteers who had signed informed consent forms (26 with colorectal cancer, 32 with gastric cancer, 18 with liver cancer, 28 with lung cancer, 15 with ovarian cancer, 13 with pancreatic cancer, and 135 healthy individuals). Protein mixtures can be obtained by solvent precipitation or magnetic bead enrichment; this embodiment uses solvent precipitation. The specific steps are as follows: an aliquot of plasma was diluted 1:10 in 50-200 mM ammonium bicarbonate solution. Pre-cooled methanol was then added to the diluted plasma at a ratio of 2:1 (diluted plasma : methanol). After vortex mixing, the mixture was centrifuged, the supernatant was discarded, and the precipitate was vacuum dried and stored at -80°C until use.

[0050] 2. Enzymatic digestion of proteins:

[0051] The protein mixture is enzymatically digested to obtain the corresponding peptides; trypsin is a commonly used digestive enzyme. The specific steps are as follows: dissolve the dried protein precipitate in 8 M urea, add 50 mM TCEP solution as a reducing agent to a final concentration of 5 mM, and heat to denature and reduce; then add 100 mM IAA solution as an alkylating agent to a final concentration of 10 mM and react at room temperature in the dark for 30 minutes. Dilute the urea to below 1 M with 50-200 mM ammonium bicarbonate solution, then add Lys-C and trypsin sequentially at an enzyme/substrate mass ratio of 1:50-1:500. After mixing and reacting overnight, add 10% TFA to a final concentration of 0.1% to terminate the reaction. After column desalting, vacuum dry the peptides and reconstitute them in 0.1% FA before use.
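The concentrations in [0051] translate into pipetting volumes by simple mass balance. The sketch below works through one case under stated assumptions: the 45 µL starting volume and the 100 µg protein amount are hypothetical, and the slight dilution of the reducing agent by the later alkylating-agent addition is ignored.

```python
def stock_volume(sample_ul, stock_conc, final_conc):
    """Volume of stock to add so the mixture reaches final_conc (same units)."""
    if final_conc >= stock_conc:
        raise ValueError("final concentration must be below stock concentration")
    return sample_ul * final_conc / (stock_conc - final_conc)


def diluent_volume(sample_ul, current_conc, target_conc):
    """Volume of diluent that lowers current_conc to target_conc."""
    return sample_ul * (current_conc / target_conc - 1.0)


start_ul = 45.0                                   # hypothetical volume of protein in 8 M urea
reducer_ul = stock_volume(start_ul, 50.0, 5.0)    # 50 mM stock to 5 mM final
after_reduction_ul = start_ul + reducer_ul
alkylator_ul = stock_volume(after_reduction_ul, 100.0, 10.0)  # 100 mM stock to 10 mM final
after_alkylation_ul = after_reduction_ul + alkylator_ul

# Urea has been diluted by the two additions; bring it down to 1 M
urea_now_molar = 8.0 * start_ul / after_alkylation_ul
buffer_ul = diluent_volume(after_alkylation_ul, urea_now_molar, 1.0)
final_ul = after_alkylation_ul + buffer_ul

# Enzyme amount at a 1:50 enzyme/substrate mass ratio for 100 ug of protein
enzyme_ug = 100.0 / 50.0
```

With these inputs the reducing agent comes to 5 µL, the alkylating agent to about 5.6 µL, and the final volume at exactly 1 M urea to 360 µL (eight times the starting volume); since the text calls for urea below 1 M, slightly more buffer than `buffer_ul` would be added in practice.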

[0052] 3. Peptide separation:

[0053] Liquid chromatography (LC) was used to separate the reconstituted peptides to improve the coverage of mass spectrometry analysis. The LC system was a Vanquish Neo UHPLC system, and the mass spectrometer was an Orbitrap Exploris 480 (ThermoFisher Scientific). The column specifications were 2 μm × 75 μm × 15 cm C18 column. Mobile phase A was water (containing 0.1% formic acid), and mobile phase B was 80% acetonitrile (containing 0.1% formic acid). The separation gradient time was 102 min.

[0054] 4. Mass spectrometry analysis:

[0055] The separated peptides were introduced into a mass spectrometer for detection and analysis. Mass spectra of different proteins were obtained by detecting the mass-to-charge ratio (m/z) of the primary precursor ion and secondary fragment ions. This example uses a non-targeted detection method, but the approach is not limited to non-targeted analysis and is also applicable to targeted detection methods. The mass spectrometer was an Orbitrap Exploris 480 (Thermo Fisher Scientific). The MS1 scan range was 375-1550 m/z, with a resolution of 120,000, an AGC target of 300%, an ion injection time (IT) of 100 ms, and precursor charge states of 2-5. The MS2 scan acquired 15 data-dependent MS2 spectra, with an isolation window of 2 m/z, a resolution of 30,000, an AGC target of 50%, an ion injection time of 100 ms, and an HCD collision energy of 32%.
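A quick consistency check can confirm that the instrument settings of this embodiment fall inside the ranges given in paragraph [0014]. The sketch below encodes both as dictionaries; the key names are hypothetical labels chosen for this example, while the numbers are taken from [0014] and [0055].

```python
# Parameter ranges from paragraph [0014], as inclusive (low, high) pairs
CLAIMED_RANGES = {
    "ms1_scan_start_mz": (300, 1800),
    "ms1_scan_end_mz": (300, 1800),
    "ms1_resolution": (50_000, 300_000),
    "ms1_agc_percent": (100, 600),
    "ms1_injection_ms": (10, 200),
    "ms2_spectra": (3, 30),
    "ms2_isolation_mz": (0.5, 5.0),
    "ms2_resolution": (10_000, 100_000),
    "ms2_agc_percent": (20, 400),
    "ms2_injection_ms": (20, 400),
    "hcd_energy_percent": (20, 40),
}

# Settings used in the embodiment, paragraph [0055]
EMBODIMENT = {
    "ms1_scan_start_mz": 375,
    "ms1_scan_end_mz": 1550,
    "ms1_resolution": 120_000,
    "ms1_agc_percent": 300,
    "ms1_injection_ms": 100,
    "ms2_spectra": 15,
    "ms2_isolation_mz": 2.0,
    "ms2_resolution": 30_000,
    "ms2_agc_percent": 50,
    "ms2_injection_ms": 100,
    "hcd_energy_percent": 32,
}


def out_of_range(settings, ranges):
    """Return the names of settings that fall outside their allowed range."""
    return sorted(
        name
        for name, value in settings.items()
        if not ranges[name][0] <= value <= ranges[name][1]
    )


violations = out_of_range(EMBODIMENT, CLAIMED_RANGES)
```

For the values above `violations` is empty, meaning every setting of the embodiment lies within the stated ranges.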

[0056] 5. Data Processing:

[0057] Mass spectrometry data were searched with MaxQuant 2.0.3.0 against the UniProt human database (downloaded in August 2024) to identify the characteristic peptides of each protein. Based on the primary precursor ion spectrum and the m/z values and abundances of the resolved secondary fragment peaks, the relative expression level of each protein was determined. Proteins identified in both healthy and cancer samples were tested for significant differences using t-tests or limma linear models based on empirical Bayesian methods. The selection criteria were an absolute fold change (|Fold Change|) ≥ 1.5, 2, 3, or 4, together with a p-value < 0.05. The samples were randomly divided into a discovery set and a validation set at a ratio of 7:3. The linear model method in the limma package was used to identify differentially expressed proteins between cancer patients and healthy individuals, and the selected proteins were used to build models for screening and differentiating multiple cancer types. The overall modeling process adopted a hierarchical screening strategy: first, a model distinguished healthy samples from cancer samples, flagging suspected cancer cases in the population; then the cancer samples were subjected to multi-class identification to differentiate the cancer types, forming a hierarchical screening system of "first binary classification, then multi-classification". The classification accuracy of models built on differentially expressed proteins under the different screening parameters is shown in Table 2. With the optimal threshold of |Fold Change| ≥ 2, the contribution of each differentially expressed protein to the different classification models is shown in Figures 1-7 (the proteins are listed in Table 1).
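The 7:3 split can be sketched as follows, using the per-group sample counts from [0049]. The text does not say whether the split was stratified by group; this sketch assumes it was, with the validation share rounded up within each group, and uses an arbitrary random seed.

```python
import random

# Per-group sample counts listed in paragraph [0049]
COHORT = {
    "colorectal": 26, "gastric": 32, "liver": 18, "lung": 28,
    "ovarian": 15, "pancreatic": 13, "healthy": 135,
}


def stratified_split(labels, val_num=3, val_den=10, seed=0):
    """Split sample indices into discovery and validation sets within each class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    discovery, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Ceiling of len * val_num / val_den using integer arithmetic
        n_val = -((-len(indices) * val_num) // val_den)
        validation.extend(indices[:n_val])
        discovery.extend(indices[n_val:])
    return sorted(discovery), sorted(validation)


labels = [group for group, count in COHORT.items() for _ in range(count)]
discovery_idx, validation_idx = stratified_split(labels)

validation_counts = {}
for index in validation_idx:
    validation_counts[labels[index]] = validation_counts.get(labels[index], 0) + 1
```

Under these assumptions the validation set holds 8 colorectal, 10 gastric, 6 liver, 9 lung, 5 ovarian, 4 pancreatic and 41 healthy samples, which matches the validation-set composition reported with the results in [0063].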

[0058] Table 1 Protein biomarkers

[0059]
Number | Protein Group | Gene Name
1 | A0A075B6I1 | IGLV4-60
2 | A0A075B6J9 | IGLV2-18
3 | A0A075B6K2 | IGLV3-12
4 | A0A075B7B8 | IGHV3OR16-12
5 | A0A075B7D4 | IGKV1OR2-108
6 | A0A075B7D8 | IGHV3OR15-7
7 | A0A075B7F0 | IGHV3OR16-10
8 | A0A087WWU8 | TPM3
9 | A0A087WZB5 | PARVB
10 | A0A087X1J7 | GPX3
11 | A0A0A0MS14 | IGHV1-45
12 | A0A0A0MS15 | IGHV3-49
13 | A0A0B4J1V1 | IGHV3-21
14 | A0A0B4J1Y8 | IGLV9-49
15 | A0A0B4J231 | IGLL5;IGLC1
16 | A0A0C4DGZ8 | GP1BA
17 | A0A0C4DH29 | IGHV1-3
18 | A0A0C4DH31 | IGHV1-18
19 | A0A0C4DH41 | IGHV4-61
20 | A0A0C4DH68 | IGKV2-24;IGKV2D-24
21 | A0A0J9YVU5 | IGHV2-70
22 | A0A0J9YY99 | A0A0J9YY99
23 | A0A182DWH7 | SEPP1
24 | A0A3B3IS80 | ALDOB
25 | A0A494C037 | CCDC110
26 | A0A5F9ZHM4 | LDHB
27 | A0A5H1ZRQ7 | IGLC7
28 | A0A7P0T861 | CALR
29 | A0A7P0T8D1 | AGT
30 | A0A7P0TAI0 | HSPA5
31 | A0A7P0TAY2 | HSP90B1
32 | A6XND1 | IGFBP3
33 | E7END6 | PROC
34 | E9PHM6 | DST
35 | E9PKE3 | HSPA8;HSPA2
36 | E9PP50 | CFL1
37 | F5H8B0 | F7
38 | F8W1R7 | MYL6
39 | G3V3A0 | SERPINA3
40 | G3XAP6 | COMP
41 | H0Y2Y8 | ZYX
42 | H0YEA7 | GNPTG
43 | H7C5R1 | CP
44 | J3QRS3 | MYL12A;MYL12B
45 | M0R1Q1 | C3
46 | O00151 | PDLIM1
47 | O00299 | CLIC1
48 | O14791 | APOL1
49 | O75083 | WDR1
50 | O95497 | VNN1
51 | O95810 | SDPR
52 | P00338 | LDHA
53 | P00488 | F13A1
54 | P00739 | HPR
55 | P00915 | CA1
56 | P01344 | IGF2
57 | P01619 | IGKV3-20
58 | P01700 | IGLV1-47
59 | P01706 | IGLV2-11
60 | P01715 | IGLV3-1
61 | P01780 | IGHV3-7
62 | P01817 | IGHV2-5
63 | P01834 | IGKC
64 | P02730 | SLC4A1
65 | P02741 | CRP
66 | P02751 | FN1
67 | P02775 | PPBP
68 | P02776 | PF4
69 | P03950 | ANG
70 | P03951 | F11
71 | P04040 | CAT
72 | P04180 | LCAT
73 | P04211 | IGLV7-43
74 | P04275 | VWF
75 | P04406 | GAPDH
76 | P05106 | ITGB3
77 | P05109 | S100A8
78 | P05154 | SERPINA5
79 | P06276 | BCHE
80 | P06702 | S100A9
81 | P06733 | ENO1
82 | P07333 | CSF1R
83 | P07900 | HSP90AA1
84 | P08514 | ITGA2B
85 | P08519 | LPA
86 | P08567 | PLEK
87 | P08637 | FCGR3A
88 | P09486 | SPARC
89 | P0DJI8 | SAA1
90 | P0DJI9 | SAA2
91 | P0DOY3 | IGLC3
92 | P0DP01 | IGHV1-8
93 | P0DP09 | IGKV1D-13
94 | P10124 | SRGN
95 | P11166 | SLC2A1
96 | P11597 | CETP
97 | P12814 | ACTN1
98 | P13473 | LAMP2
99 | P13727 | PRG2
100 | P14151 | SELL
101 | P14618 | PKM
102 | P15144 | ANPEP
103 | P15814 | IGLL1
104 | P16452 | EPB42
105 | P18065 | IGFBP2
106 | P18206 | VCL
107 | P19320 | VCAM1
108 | P20742 | PZP
109 | P21333 | FLNA
110 | P22692 | IGFBP4
111 | P27105 | STOM
112 | P28799 | GRN
113 | P30041 | PRDX6
114 | P30043 | BLVRB
115 | P32119 | PRDX2
116 | P33908 | MAN1A1
117 | P35579 | MYH9
118 | P35858 | IGFALS
119 | P37802 | TAGLN2
120 | P43251 | BTD
121 | P48059 | LIMS1
122 | P48741 | HSPA7;HSPA6
123 | P54108 | CRISP3
124 | P55056 | APOC4
125 | P55103 | INHBC
126 | P55290 | CDH13
127 | P58546 | MTPN
128 | P59666 | DEFA3;DEFA1
129 | P60709 | ACTB;ACTG1
130 | P61224 | RAP1B;RAP1A
131 | P61626 | LYZ
132 | P63104 | YWHAZ
133 | P67936 | TPM4
134 | P68133 | ACTA1;ACTC1;ACTG2;ACTA2
135 | P68363 | TUBA1B;TUBA1A;TUBA3E
136 | P68871 | HBB
137 | P69905 | HBA1
138 | P80748 | IGLV3-21
139 | Q01518 | CAP1
140 | Q08830 | FGL1
141 | Q13103 | SPP2
142 | Q13418 | ILK
143 | Q15113 | PCOLCE
144 | Q15166 | PON3
145 | Q15404 | RSU1
146 | Q15582 | TGFBI
147 | Q15848 | ADIPOQ
148 | Q16610 | ECM1
149 | Q5ST81 | TUBB
150 | Q6UXB8 | PI16
151 | Q86UX7 | FERMT3
152 | Q92496 | CFHR4
153 | Q96KN2 | CNDP1
154 | Q99969 | RARRES2
155 | Q9BUN1 | MENT
156 | Q9BWP8 | COLEC11
157 | Q9BXR6 | CFHR5
158 | Q9BYE9 | CDHR2
159 | Q9HDC9 | APMAP
160 | Q9NQ79 | CRTAC1
161 | Q9NZP8 | C1RL
162 | Q9UHG3 | PCYOX1
163 | Q9Y490 | TLN1
164 | U3KQF6 | KBTBD3

This invention, based on the R language environment, employs four algorithms to construct predictive models: Least Absolute Shrinkage and Selection Operator (LASSO), Random Forest (RF), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost). During the model tuning phase, all models use repeated 5-fold cross-validation, with the area under the receiver operating characteristic curve (AUC) as the core evaluation metric for hyperparameter optimization. Specific parameter settings are as follows:

(1) LASSO model: Implemented by calling the glmnet method in the caret package, with the distribution family set to binomial (family = "binomial"), the mixing parameter (α) fixed at 1 so that the L1 penalty is applied, and the grid search for the regularization penalty coefficient (λ) covering 20 logarithmically spaced values between 10^-3 and 10^1.
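The λ grid and the repeated 5-fold scheme described above can be sketched in a few lines. The patent's models were built in R; Python is used here purely for illustration. The number of repeats (3), the seed, and the 20-sample toy size are assumptions, since the text only states "repeated 5-fold cross-validation".

```python
import random


def log_grid(low_exponent, high_exponent, n_points):
    """n_points values spaced evenly on a log10 scale, end points included."""
    step = (high_exponent - low_exponent) / (n_points - 1)
    return [10.0 ** (low_exponent + k * step) for k in range(n_points)]


def repeated_kfold(n_samples, k=5, repeats=3, seed=0):
    """Yield (repeat, fold, train_indices, held_out_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle for every repeat
        for fold in range(k):
            held_out = order[fold::k]
            held_out_set = set(held_out)
            train = [i for i in order if i not in held_out_set]
            yield repeat, fold, train, held_out


lambdas = log_grid(-3, 1, 20)  # 20 candidate penalties from 1e-3 to 1e1
splits = list(repeated_kfold(n_samples=20, k=5, repeats=3))
```

Each candidate λ would be scored by its mean AUC over all `splits`, and the value with the highest mean retained.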

[0060] (2) RF Model: Training is performed using the RF method from the caret package and the randomForest package. The number of decision trees (ntree) is fixed at 500, and a grid search is used to optimize the number of candidate features (mtry) at node splits, with the search interval set to... (where p is the total number of input features).

[0061] (3) SVM Model: The svmRadial method of the caret package is used, which calls the e1071 package to build a classifier based on the Radial Basis Function (RBF). Its hyperparameters are also optimized through grid search, with the candidate search intervals for the penalty coefficient (C) and kernel parameter (σ) both set to 2^-2 to 2^2.
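For orientation, the radial basis function kernel and a candidate grid over C and σ can be written out as below. Two points are assumptions: the kernel is parameterised as exp(-σ·‖x − y‖²), and the grid steps through integer powers of two, since the text gives only the end points of the search interval.

```python
import math
from itertools import product


def rbf_kernel(x, y, sigma):
    """Radial basis function kernel: exp(-sigma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)


# Candidate (C, sigma) pairs: integer powers of two from 2^-2 to 2^2
exponents = range(-2, 3)
svm_grid = [(2.0 ** c, 2.0 ** s) for c, s in product(exponents, exponents)]

same_point = rbf_kernel([0.5, 1.0], [0.5, 1.0], sigma=1.0)
unit_apart = rbf_kernel([0.0], [1.0], sigma=1.0)
```

A larger σ makes the kernel fall off faster with distance, giving a more local decision boundary, while a larger C penalises training errors more heavily.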

[0062] (4) XGBoost model: Built based on the xgboost package, its objective function is set to binary logistic regression (objective = "binary:logistic"), and the model evaluation metric is set to AUC. During training, the number of parallel computing threads is set to 4 (nthread = 4), and early stopping mechanism is enabled, with the early stopping tolerance rounds set to 10 (early_stopping_rounds = 10) to effectively prevent model overfitting by automatically selecting the optimal number of iterations.
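The early-stopping rule mentioned above can be illustrated independently of any boosting library: training stops once the validation AUC has failed to improve for a set number of consecutive rounds, and the best round so far is kept. The AUC sequence below is made up purely to show the behaviour.

```python
def best_round_with_early_stopping(auc_by_round, patience=10):
    """Return (best_round, best_auc), stopping after `patience` rounds without improvement."""
    best_round, best_auc = 0, float("-inf")
    for current_round, auc in enumerate(auc_by_round):
        if auc > best_auc:
            best_round, best_auc = current_round, auc
        elif current_round - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round, best_auc


# Hypothetical validation AUCs: improve until round 5, then plateau.
# The late jump at the end is never reached because training has stopped.
toy_auc = [0.70, 0.75, 0.80, 0.84, 0.86, 0.88] + [0.87] * 20 + [0.95]
stopped_at = best_round_with_early_stopping(toy_auc, patience=10)
```

With a tolerance of 10 rounds the loop stops at round 15 and reports round 5 as the best iteration, which is how the number of boosting iterations is chosen automatically.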

[0063] The results showed that the classification models constructed from combinations of differentially expressed proteins performed well, as shown in Figures 8-14. In the validation set, the AUC of the models distinguishing healthy individuals from cancer patients ranged from 0.887 to 0.961. The AUCs of the models distinguishing each cancer type in the validation set were as follows: colorectal cancer AUC = 0.818-0.954, gastric cancer AUC = 0.938-0.994, pancreatic cancer AUC = 0.763-0.868, liver cancer AUC = 0.856-0.933, lung cancer AUC = 0.788-0.862, and ovarian cancer AUC = 0.735-0.968. The classification results of the validation set are shown in Table 3. Of 41 truly healthy samples, the model correctly identified 39 (TN) and misclassified only 2 as cancer, giving a specificity of 95.12%. This corresponds to a low false positive rate (4.88%), which limits unnecessary follow-up for healthy individuals and supports potential clinical screening applications. Of 42 true cancer samples (all validation cancer types: 8 colorectal, 10 gastric, 6 liver, 9 lung, 5 ovarian, and 4 pancreatic), 38 were detected as positive by the model (TP) and only 4 were missed (FN), giving an overall detection sensitivity (TPR) of 90.48%. The positive predictive value (PPV) in the validation set was 95.00% and the negative predictive value (NPV) was 90.70%, further supporting the reliability of the detection results. In cancer typing, the model correctly predicted 5 of 8 colorectal cancer samples, 10 of 10 gastric cancer samples, 2 of 6 liver cancer samples, 6 of 9 lung cancer samples, 4 of 5 ovarian cancer samples, and 3 of 4 pancreatic cancer samples.
In summary, the technical solution proposed in this invention can distinguish cancer patients from healthy individuals with high sensitivity and specificity, and can also trace the tissue origin of the tumor in most cases, with particularly high accuracy for certain cancers such as gastric cancer.
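The validation-set figures quoted in [0063] follow directly from the confusion-matrix counts, as the short calculation below shows. The function and variable names are chosen for this example; the counts are those reported in the text.

```python
def screening_metrics(tp, fn, tn, fp):
    """Standard binary screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts reported for the validation set
metrics = screening_metrics(tp=38, fn=4, tn=39, fp=2)
percentages = {name: round(value * 100, 2) for name, value in metrics.items()}

# Cancer typing results: (correctly typed, total) per cancer type
typing = {
    "colorectal": (5, 8), "gastric": (10, 10), "liver": (2, 6),
    "lung": (6, 9), "ovarian": (4, 5), "pancreatic": (3, 4),
}
correct = sum(hit for hit, _ in typing.values())
total = sum(count for _, count in typing.values())
typing_accuracy = correct / total
```

The metrics reproduce the reported 90.48% sensitivity, 95.12% specificity, 95.00% PPV and 90.70% NPV. Summing the per-type counts gives 30 correctly typed samples out of 42, about 71%.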

[0064] Table 2. Classification accuracy results of differential protein feature modeling under different screening parameters.

[0065] Table 3 shows the accuracy results of the model's predictions on the validation set.

[0066]

[0067] It is evident that the protein biomarkers of this invention have good accuracy in cancer type stratification.

Claims

1. A combination of protein biomarkers for screening multiple cancers, characterized in that it is composed of the following proteins: IGLV4-60, IGLV2-18, IGLV3-12, IGHV3OR16-12, IGKV1OR2-108, IGHV3OR15-7, IGHV3OR16-10, TPM3, PARVB, GPX3, IGHV1-45, IGHV3-49, IGHV3-21, IGLV9-49, IGLL5 or IGLC1, GP1BA, IGHV1-3, IGHV1-18, IGHV4-61, IGKV2-24 or IGKV2D-24, IGHV2-70, A0A0J9YY99, SEPP1, ALDOB, CCDC110, LDHB, IGLC7, CALR, AGT, HSPA5, HSP90B1, IGFBP3, PROC, DST, HSPA8 or HSPA2, CFL1, F7, MYL6, SERPINA3, COMP, ZYX, GNPTG, CP, MYL12A or MYL12B, C3, PDLIM1, CLIC1, APOL1, WDR1, VNN1, SDPR, LDHA, F13A1, HPR, CA1, IGF2, IGKV3-20, IGLV1-47, IGLV2-11, IGLV3-1, IGHV3-7, IGHV2-5, IGKC, SLC4A1, CRP, FN1, PPBP, PF4, ANG, F11, CAT, LCAT, IGLV7-43, VWF, GAPDH, ITGB3, S100A8, SERPINA5, BCHE, S100A9, ENO1, CSF1R, HSP90AA1, ITGA2B, LPA, PLEK, FCGR3A, SPARC, SAA1, SAA2, IGLC3, IGHV1-8, IGKV1D-13, SRGN, SLC2A1, CETP, ACTN1, LAMP2, PRG2, SELL, PKM, ANPEP, IGLL1, EPB42, IGFBP2, VCL, VCAM1, PZP, FLNA, IGFBP4, STOM, GRN, PRDX6, BLVRB, PRDX2, MAN1A1, MYH9, IGFALS, TAGLN2, BTD, LIMS1, HSPA7 or HSPA6, CRISP3, APOC4, INHBC, CDH13, MTPN, DEFA3 or DEFA1, ACTB or ACTG1, RAP1B or RAP1A, LYZ, YWHAZ, TPM4, ACTA1 or ACTC1 or ACTG2 or ACTA2, TUBA1B or TUBA1A or TUBA3E, HBB, HBA1, IGLV3-21, CAP1, FGL1, SPP2, ILK, PCOLCE, PON3, RSU1, TGFBI, ADIPOQ, ECM1, TUBB, PI16, FERMT3, CFHR4, CNDP1, RARRES2, MENT, COLEC11, CFHR5, CDHR2, APMAP, CRTAC1, C1RL, PCYOX1, TLN1, KBTBD3.

2. The use of the protein biomarker combination of claim 1 in the preparation of products for in vitro screening, diagnosis or auxiliary diagnosis of various cancers.

3. The application according to claim 2, characterized in that the application includes a stratified screening strategy: in a first stage, a binary classification model is constructed using the protein biomarker combination to initially classify subjects as cancer patients or healthy individuals; in a second stage, for subjects initially classified as cancer patients, a multi-class classification model is constructed using the protein biomarker combination to further distinguish different types of cancer; the different types of cancer include one or more of the following: colorectal cancer, gastric cancer, liver cancer, lung cancer, ovarian cancer, and pancreatic cancer.
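The two-stage strategy of claim 3 can be sketched as follows. This is only an illustrative outline, not the applicant's implementation: the function `stratified_screen`, the 0.5 decision threshold, and the two toy models are hypothetical stand-ins for trained classifiers, and the input is assumed to be a vector of relative protein expression levels.

```python
from typing import Callable, Dict, Sequence

# Cancer types named in the claim.
CANCER_TYPES = ("colorectal", "gastric", "liver", "lung", "ovarian", "pancreatic")


def stratified_screen(
    expression: Sequence[float],
    binary_model: Callable[[Sequence[float]], float],
    multiclass_model: Callable[[Sequence[float]], Dict[str, float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the source
) -> str:
    """Stage 1: healthy vs. cancer. Stage 2: cancer type, only if stage 1 says cancer."""
    if binary_model(expression) < threshold:
        return "healthy"
    scores = multiclass_model(expression)
    # Pick the cancer type with the highest score; unseen types score 0.
    return max(CANCER_TYPES, key=lambda name: scores.get(name, 0.0))


# Toy stand-ins for trained models (hypothetical, for illustration only).
def toy_binary(expression: Sequence[float]) -> float:
    return 0.9 if sum(expression) > 1.0 else 0.1


def toy_multiclass(expression: Sequence[float]) -> Dict[str, float]:
    return {"gastric": 0.6, "lung": 0.3, "liver": 0.1}


print(stratified_screen([0.2, 0.3], toy_binary, toy_multiclass))
print(stratified_screen([0.9, 0.8], toy_binary, toy_multiclass))
```

Keeping the two stages as separate callables means the multi-class model is only ever evaluated on samples that the binary model has already flagged, which mirrors the order described in the claim.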

4. The application according to claim 3, characterized in that it further includes the following steps: a. obtaining an in vitro plasma or serum sample from the subject and extracting the whole protein mixture; b. denaturing, reducing, alkylating, and enzymatically digesting the whole protein mixture to generate the corresponding peptides; c. separating the digested peptides by liquid chromatography; d. introducing the separated peptides into a mass spectrometer for detection to obtain first-stage (MS1) precursor ion spectra and second-stage (MS2) fragment ion spectra; e. based on the mass spectra, calculating the relative expression level of each protein in the protein biomarker combination of claim 1 from the signal response of that protein's characteristic peptides; f. inputting the relative expression levels into a pre-trained classification model to output the health status or cancer category of the subject.

5. The application according to claim 4, characterized in that the specific process parameters for extracting the whole protein mixture and for enzymatic digestion in steps a and b include: extracting proteins by solvent precipitation; diluting the plasma or serum sample with a 20-300 mM ammonium bicarbonate solution at a volume ratio of 1:5-1:20; then adding pre-cooled methanol for precipitation, with a volume ratio of the diluent to methanol of 1:1-3:1; redissolving the precipitated protein in 6-10 M urea; adding a reducing agent to a final concentration of 2-10 mM for denaturation and reduction; then adding an alkylating agent to a final concentration of 5-20 mM and reacting protected from light; diluting the urea concentration to below 1 M; and then adding a digestion enzyme at an enzyme-to-substrate mass ratio of 1:20-1:1000 for enzymatic digestion.
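As a worked illustration of the ratios in claim 5, the sketch below computes reagent volumes for one sample. It rests on several assumptions that the claim text does not settle: a ratio of 1:10 is read as one part sample to ten parts buffer, "diluent" is read as the diluted sample, and the default values (10 µL of plasma, 50 µL of 8 M urea, 100 µg of protein, 1:50 enzyme) are hypothetical examples chosen within the claimed ranges. The function name `prep_volumes` is likewise invented for this example.

```python
def prep_volumes(
    plasma_ul: float,
    buffer_parts: float = 10.0,          # sample:buffer = 1:buffer_parts (claimed 1:5 to 1:20)
    diluted_to_methanol: float = 1.0,    # diluted sample:methanol (claimed 1:1 to 3:1)
    redissolve_ul: float = 50.0,         # hypothetical urea volume used to redissolve the pellet
    urea_molar: float = 8.0,             # claimed 6 to 10 M
    protein_ug: float = 100.0,           # hypothetical protein amount
    substrate_per_enzyme: float = 50.0,  # enzyme:substrate = 1:substrate_per_enzyme (claimed 1:20 to 1:1000)
) -> dict:
    """Return reagent volumes (µL) and enzyme mass (µg) for one sample."""
    # Range checks mirror the ranges recited in the claim.
    if not 5 <= buffer_parts <= 20:
        raise ValueError("sample:buffer ratio must be between 1:5 and 1:20")
    if not 1 <= diluted_to_methanol <= 3:
        raise ValueError("diluted sample:methanol ratio must be between 1:1 and 3:1")
    if not 6 <= urea_molar <= 10:
        raise ValueError("urea concentration must be 6-10 M")
    if not 20 <= substrate_per_enzyme <= 1000:
        raise ValueError("enzyme:substrate ratio must be between 1:20 and 1:1000")

    buffer_ul = plasma_ul * buffer_parts
    diluted_ul = plasma_ul + buffer_ul
    methanol_ul = diluted_ul / diluted_to_methanol
    # C1 * V1 = C2 * V2: total volume at which urea reaches exactly 1 M.
    # The digestion volume must be larger than this to get below 1 M.
    min_digest_ul = redissolve_ul * urea_molar / 1.0
    enzyme_ug = protein_ug / substrate_per_enzyme
    return {
        "buffer_ul": buffer_ul,
        "methanol_ul": methanol_ul,
        "min_digest_ul": min_digest_ul,
        "enzyme_ug": enzyme_ug,
    }


print(prep_volumes(10.0))
```

With the default values, 50 µL of 8 M urea has to be brought to more than 400 µL before digestion, which is the dilution step the claim describes as "below 1 M".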

6. The application according to claim 4, characterized in that the liquid chromatography separation conditions in step c are as follows: the column is a C18 column with a particle size of 1.5-3.0 μm; mobile phase A is an aqueous solution containing 0.01-1.0% formic acid, and mobile phase B is a 60-95% acetonitrile solution containing 0.01-1.0% formic acid; the separation gradient duration is 20-150 minutes.

7. The application according to claim 4, characterized in that the mass spectrometry detection parameters in step d include: for the first-stage (MS1) scan, a scan range of 300-1800 m/z, a resolution of 50000-300000, an automatic gain control (AGC) target of 100%-600%, an ion injection time of 10-200 ms, and precursor ion charge states of 2 to 5; for the second-stage (MS2) scan, 3-30 data-dependent MS2 spectra, an isolation window of 0.5-5.0 m/z, a resolution of 10000-100000, an automatic gain control (AGC) target of 20%-400%, an ion injection time of 20-400 ms, and a higher-energy collisional dissociation (HCD) collision energy of 20-40%.

8. The application according to claim 4, characterized in that, in step e, protein identification is performed by searching the mass spectra against a human protein database using search software, and quantification is performed using a label-free protein quantification algorithm: the ratios of the peptides shared between each pair of samples are calculated, and protein abundance is reconstructed by least squares optimization, thereby obtaining the relative expression level of the protein.
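The quantification idea in claim 8 (pairwise ratios of shared peptides, followed by a least squares reconstruction of protein abundance) can be sketched for a single protein as below. This is a simplified illustration, not the algorithm actually used in the application. It assumes that every pair of samples shares at least one peptide, uses the median log2 ratio per pair, and returns a profile centered at zero; under those assumptions the least squares solution has the closed form used in `reconstruct_profile`. The function names and the intensities are invented for the example.

```python
import math
from statistics import median
from typing import Dict, List, Optional


def pairwise_log_ratios(peptides: Dict[str, List[Optional[float]]]) -> List[List[float]]:
    """Median log2 ratio of shared peptide intensities for every pair of samples.

    peptides maps a peptide name to its intensity in each sample (None if missing).
    """
    n = len(next(iter(peptides.values())))
    ratios = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = [
                math.log2(v[i] / v[j]) for v in peptides.values() if v[i] and v[j]
            ]
            if not shared:
                raise ValueError(f"samples {i} and {j} share no peptide")
            ratios[i][j] = median(shared)
            ratios[j][i] = -ratios[i][j]
    return ratios


def reconstruct_profile(ratios: List[List[float]]) -> List[float]:
    """Least squares log2 abundances x minimizing sum over pairs of (r_ij - (x_i - x_j))^2.

    With all pairs present and the constraint sum(x) = 0, the minimizer is the row mean.
    """
    n = len(ratios)
    return [sum(row) / n for row in ratios]


# Made-up intensities for one protein measured in three samples.
example = {
    "pepA": [100.0, 200.0, 400.0],
    "pepB": [10.0, 20.0, 40.0],
    "pepC": [50.0, None, 200.0],  # missing in the second sample
}
print(reconstruct_profile(pairwise_log_ratios(example)))
```

Working with ratios of shared peptides rather than summed intensities means a peptide missing from one sample, such as `pepC` here, does not distort that sample's estimate. When some sample pairs share no peptide, the closed form no longer applies and a general least squares solver would be needed.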

9. The application according to claim 4, characterized in that the process of constructing the pre-trained classification model in step f includes: screening differentially expressed proteins using a linear model based on the empirical Bayes method, with the screening criteria being an absolute expression fold change ≥ 1.5 and a p-value < 0.05; optimizing the model using a K-fold cross-validation strategy, with K ranging from 3 to 10; and constructing the classification model based on at least one machine learning algorithm selected from the following: least absolute shrinkage and selection operator (LASSO) model, random forest model, support vector machine, and extreme gradient boosting model.
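The screening thresholds and the K-fold split in claim 9 can be illustrated with the short sketch below. It does not implement the empirical Bayes linear model itself; it assumes that log2 fold changes and p-values have already been computed by such a model. It also assumes that "absolute fold change ≥ 1.5" means at least 1.5-fold up or down, that is |log2 FC| ≥ log2(1.5). The function names and the statistics in `example_stats` are hypothetical values for illustration, not results from the application.

```python
import math
from typing import Dict, List, Tuple


def select_differential(
    stats: Dict[str, Tuple[float, float]],
    min_fold: float = 1.5,
    max_p: float = 0.05,
) -> List[str]:
    """Keep proteins with |log2 FC| >= log2(min_fold) and p-value < max_p.

    stats maps a protein name to (log2 fold change, p-value).
    """
    cutoff = math.log2(min_fold)
    return [
        name
        for name, (log2_fc, p_value) in stats.items()
        if abs(log2_fc) >= cutoff and p_value < max_p
    ]


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, held-out) index splits; k is limited to the claimed range 3-10.

    Folds are interleaved without shuffling or class stratification, for simplicity.
    """
    if not 3 <= k <= 10:
        raise ValueError("K must be between 3 and 10")
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    return [
        ([i for i in range(n_samples) if i not in held_out], held_out)
        for held_out in folds
    ]


# Hypothetical statistics: (log2 fold change, p-value).
example_stats = {
    "GPX3": (0.9, 0.01),
    "CRP": (1.2, 0.2),     # large change but not significant
    "SAA1": (-0.7, 0.003),
    "ALDOB": (0.3, 0.001), # significant but change too small
}
print(select_differential(example_stats))
print(len(k_fold_indices(10, 5)))
```

In practice each training split would be used to fit one of the listed models (LASSO, random forest, support vector machine, or extreme gradient boosting) and the held-out split to score it; that fitting step is omitted here.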

10. The application according to claim 4 or 9, characterized in that the classification model in step f includes a binary classification model and a multi-class classification model; outputting the health status or cancer category of the subject specifically includes: first, using the binary classification model to initially distinguish healthy samples from cancer samples; then, using the multi-class classification model to further classify the samples determined to be cancerous in order to identify the specific cancer type; the specific cancer types include colorectal cancer, gastric cancer, liver cancer, lung cancer, ovarian cancer, and pancreatic cancer.