Method, system and storage medium for protein cleavage site identification based on mass spectrometry data
By constructing a protein site coverage matrix and performing statistical significance analysis, protein cleavage sites are identified automatically. This addresses the low throughput and the dependence on prior knowledge of existing technologies and achieves efficient identification of protein cleavage sites without manual intervention.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUNAN UNIV
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-12
Abstract
Description
Technical Field
[0001] This invention relates to the field of protein detection technology, and in particular to a method, system, and storage medium for identifying protein cleavage sites based on mass spectrometry data.
Background Technology
[0002] Protein cleavage is a crucial post-translational processing mechanism widely observed in eukaryotes and prokaryotes, playing a key role in protein maturation, activation, inactivation, subcellular localization, signal transduction, immune responses, and pathogen infection. Cleavage at specific sites produces fragments with different structures and functions; therefore, accurate identification of protein cleavage sites is essential for elucidating protein processing mechanisms, clarifying regulatory networks in biological processes, and discovering functionally valuable fragments.
[0003] Studies of protein cleavage sites primarily rely on low-throughput experimental validation (e.g., Western blotting), sequence prediction methods based on known protease cleavage rules, or manual comparison and empirical judgment of peptides identified by mass spectrometry. While these methods are applicable in specific scenarios, they generally fall short in systematicity, sensitivity, and scalability. First, traditional experimental methods (such as immunoblotting and N-terminal sequencing) have low throughput and are time-consuming. They typically require the design of antibodies or primers for each candidate cleavage event, making them unsuitable for analyzing the global spectrum of cleavage events in large-scale samples, dynamic time series, or complex tissue systems. Furthermore, these methods often detect only pre-hypothesized cleavage products and struggle to discover unknown or unexpected cleavage sites.
[0004] Secondly, sequence prediction methods based on known enzyme specificity (such as SignalP, ProP, SitePrediction, etc.) rely heavily on existing enzyme-substrate knowledge bases and lack the ability to identify novel proteases, atypical cleavage patterns, or enzyme families that have not yet been annotated. Therefore, when faced with cleavage events in species that have not been fully studied or under non-classical physiological or pathological conditions, the predictive reliability of these methods decreases significantly, and they are prone to missing novel cleavage sites with important biological functions.
[0005] Third, existing mainstream mass spectrometry data analysis workflows (such as MaxQuant, Proteome Discoverer, and PEAKS) primarily focus on peptide identification, protein qualitative and quantitative analysis. Their outputs are typically presented as peptide lists or protein coverage, lacking dedicated algorithm modules for cleavage site identification, statistical testing frameworks (such as cleavage enrichment analysis and background noise modeling), and result visualization tools (such as cleavage maps, heatmaps, and structure mappings). Therefore, researchers often rely on manual comparison of mass spectra or consulting N-terminal / C-terminal peptide distributions for empirical judgment. This process is both subjective and inefficient, severely limiting the systematic mining and reproducibility assessment of large-scale cleavage site data.
[0006] In summary, existing methods have significant shortcomings in terms of throughput, reliance on prior knowledge, and the ability to professionally interpret cleavage signals in mass spectrometry data. Current protein cleavage site studies usually rely on manual comparison, low-throughput verification, or prediction methods based on known enzyme specificity, which are difficult to adapt to the systematic identification of unknown or non-classical cleavage events in complex mass spectrometry data.
Summary of the Invention
[0007] The technical problem to be solved by the present invention is to provide a method, system and storage medium for identifying protein cleavage sites based on mass spectrometry data, which does not require any prior knowledge and only requires mass spectrometry data of the species to comprehensively identify the cleavage sites of any protein in the species.
[0008] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is a method for identifying protein cleavage sites based on mass spectrometry data, comprising the following steps:
S1. Preprocess the raw mass spectrometry data to obtain the mass spectrometry-derived peptides corresponding to the input precursor protein sequence;
S2. Align the mass spectrometry-derived peptides to the corresponding precursor protein sequences to construct a protein site coverage matrix;
S3. Based on the protein site coverage matrix, identify potential cleavage regions and potential cleavage sites in the protein;
S4. Perform statistical significance analysis on potential cleavage sites in the protein and extract the neighborhood sequences of the cleavage sites;
S5. Use the statistical significance of potential cleavage sites in the protein to filter the potential cleavage sites and obtain high-confidence protein cleavage sites.
[0009] This invention models the coverage characteristics of mass spectrometry-derived peptides in protein sequences (constructing a protein site coverage matrix) to identify potential cleavage regions, cleavage sites, and their neighboring sequence features in precursor proteins from mass spectrometry-derived peptides. No prior knowledge is required; only mass spectrometry data of the species are needed to comprehensively identify the cleavage sites of any protein in that species. This invention integrates preprocessing, cleavage site identification, and high-confidence filtering into a complete workflow. Starting from the raw mass spectrometry-identified peptides, it directly outputs cleavage site results and visualization information, automatically and efficiently extracting high-confidence cleavage sites from raw mass spectrometry peptide data without manual comparison or empirical judgment. It also provides intuitive and interactive visualization output, significantly improving the systematic nature and reproducibility of protein cleavage site research.
[0010] The method of the present invention further includes: S6. Output the protein cleavage site identification results; the identification results include the cleavage site neighborhood sequence corresponding to the high-confidence protein cleavage site, the statistical significance of potential cleavage sites in the protein, the cleavage region, the cleavage site location, and the coverage depth.
[0011] In this invention, the specific implementation process of step S1 includes: Clustering the raw mass spectrometry data yields a clustered spectral library; The input protein sequence and the spectral library are used as input to the Comet software to obtain the mass spectrometry-derived peptides corresponding to the precursor protein sequence.
[0012] The specific implementation process of step S2 includes: Each mass spectrometry-derived peptide is compared with its corresponding precursor protein sequence to determine the start and end positions of each mass spectrometry-derived peptide in the precursor protein. Using each amino acid site of the precursor protein as an index, the number of peptides covering each amino acid site, their frequency of occurrence, or quantitative intensity are counted to obtain the coverage of each amino acid site. The position frequency matrix is then constructed using the coverage to obtain the protein site coverage matrix. In the position frequency matrix, the coverage $C_i$ of the i-th amino acid site is defined as the number of mass spectrometry-identified peptides covering the i-th amino acid site, and is calculated as follows:
$$C_i = \sum_{j=1}^{N} I\left(s_j \le i \le e_j\right)$$
where N represents the total number of mass spectrometry-identified peptides aligned to the precursor protein, $s_j$ and $e_j$ represent the start and end positions of the j-th mass spectrometry-derived peptide in the precursor protein, respectively, and $I(\cdot)$ is an indicator function that takes the value 1 when the j-th mass spectrometry-derived peptide covers the i-th amino acid site, and 0 otherwise.
[0013] This invention statistically analyzes the support of each amino acid site being covered by peptides, which facilitates the extraction of potential cleavage information from peptide distribution.
[0014] The specific implementation process of step S3 includes: Based on the protein site coverage matrix, a coverage curve over the amino acid sites of the precursor protein is constructed, and the coverage curve is smoothed. Peak detection is performed on the smoothed coverage curve to identify local coverage enrichment peaks. Each local coverage enrichment peak is defined as a candidate peak, the left and right boundaries of each candidate peak define a potential cleavage region, and the center position of the candidate peak is defined as a potential cleavage site.
[0015] After a protein is cleaved, the surrounding area often exhibits localized abnormal enrichment of peptide coverage. This invention identifies potential cleavage events by measuring coverage distribution and uses statistical methods to identify local enrichment peaks. This enables efficient, large-scale, and high-confidence identification of protein cleavage sites without relying on pre-defined cleavage enzyme rules.
[0016] In step S4, the specific implementation process of performing statistical significance analysis on potential cleavage sites in the protein includes: The P-value, used to characterize the statistical significance of the corresponding potential cleavage site, is calculated using the following formula:
$$P = P(X \ge x) = 1 - \sum_{k=0}^{x-1} \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad \lambda = \frac{1}{M} \sum_{i=1}^{M} C_i$$
where X is a random variable that follows a Poisson distribution with parameter λ, M represents the total number of sites with a coverage greater than 0, $C_i$ denotes the coverage of the i-th site with non-zero coverage, and x is the observed coverage of the center site of a candidate peak.
[0017] This invention performs statistical tests on candidate cleavage sites, which helps to distinguish real cleavage events from random noise.
[0018] In step S4, the specific implementation process of extracting the neighborhood sequence of a cleavage site includes: Map the left boundary, center position, and right boundary of each candidate peak back to the coordinates of the corresponding precursor protein sequence; Extract the precursor protein sequence within a predetermined range upstream and downstream of each potential cleavage site and use it as the neighborhood sequence of that cleavage site.
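The neighborhood extraction described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `neighborhood_sequence`, the parameter `flank` (the "predetermined range"), its default of 5 residues, and the choice to clip the window at the protein termini are all assumptions made for the example.

```python
def neighborhood_sequence(protein_sequence: str, site: int, flank: int = 5) -> str:
    """Return the residues within `flank` positions of a potential cleavage site.

    `site` is a 1-based position in the precursor protein. The window is
    clipped at the N- and C-termini (an assumption; the source does not say
    how terminal sites are handled).
    """
    if not 1 <= site <= len(protein_sequence):
        raise ValueError("site must lie within the protein sequence")
    if flank < 0:
        raise ValueError("flank must be non-negative")
    start = max(1, site - flank)                    # 1-based, inclusive
    end = min(len(protein_sequence), site + flank)  # 1-based, inclusive
    return protein_sequence[start - 1:end]


if __name__ == "__main__":
    # Toy 10-residue sequence, potential site at position 5, 2 residues each side.
    print(neighborhood_sequence("MKTAYIAKQR", site=5, flank=2))
```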
[0019] The specific implementation process of step S5 includes: Remove candidate peaks located within low-quality alignment regions, where a low-quality alignment region is a region in which the number of uniquely mappable supporting peptides is less than a preset threshold $T_p$; For the remaining candidate peaks, retain those whose P-value is less than the preset significance threshold α; Calculate the relative coverage score R of the retained candidate peaks, and define potential cleavage sites that simultaneously meet the following conditions as high-confidence cleavage sites: the number of uniquely mappable supporting peptides is greater than 2, the P-value is less than the preset significance threshold α, and the relative coverage score R is not lower than the preset threshold. The relative coverage score R is calculated from $C_{peak}$, the coverage of the candidate peak center site, and the average background coverage of the corresponding precursor protein.
[0020] Since low-quality alignment regions or candidate peaks with fewer supporting peptides are prone to false positive results, it is necessary to further filter the aforementioned potential cleavage sites to further reduce random noise interference and improve the robustness of cleavage site detection.
[0021] As an inventive concept, the present invention also provides a protein cleavage site identification system based on mass spectrometry data, including a memory, a processor, and a computer program stored in the memory; the processor executes the computer program to implement the steps of the above method.
[0022] As an inventive concept, the present invention also provides a computer-readable storage medium having a computer program / instructions stored thereon; when the computer program / instructions are executed by a processor, they implement the steps of the above-described method.
[0023] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. This invention requires no prior knowledge; it can comprehensively identify the cleavage sites of any protein in a species using only mass spectrometry data from that species. It eliminates reliance on known enzyme-specificity rules or protease annotation databases. Because they depend on prior knowledge, traditional methods suffer a sharp decline in predictive ability, or fail completely, when faced with novel proteases, non-classical cleavage patterns, or insufficiently studied species. This invention only requires standard mass spectrometry peptide data from any species (including model organisms, non-model organisms, and even multi-species mixed samples or metaproteomics data) to systematically identify the cleavage sites of any protein in the corresponding proteome. It identifies protein cleavage sites directly from peptide information obtained by mass spectrometry, offering a high degree of automation and suitability for large-scale sample analysis. Because it does not depend on known protease cleavage rules, it can also identify cleavage events that follow unknown or non-classical cleavage patterns, expanding the applicability of protein cleavage site identification.
[0024] 2. By integrating preprocessing, cleavage site identification, and high-confidence filtering into a complete workflow, this invention can directly output cleavage site results and visualization information from raw mass spectrometry peptide identification. This invention constructs a fully automated and standardized processing pipeline from raw mass spectrometry peptides to biological conclusions, significantly reducing the user's operational threshold and analysis time costs. The entire process requires no manual intervention, avoiding the inefficiency and subjective bias of traditional methods that rely on manual comparison.
[0025] 3. Based on protein coverage distribution, smoothing, peak detection, and Poisson statistical tests, this invention achieves systematic identification of potential cleavage events, reducing random noise interference and improving the robustness of cleavage site detection. In this invention, smoothing reduces false triggering by random noise, peak detection provides sensitive identification proportional to signal intensity, and the Poisson test achieves rigorous statistical quality control. After cleavage site identification, this invention further combines supporting peptide number, relative coverage, and statistical tests for filtering, obtaining high-confidence cleavage site results and reducing the false positive rate.
[0026] 4. This invention models the coverage characteristics of peptides derived from mass spectrometry within protein sequences and uses statistical methods to identify local enrichment peaks. This enables efficient, large-scale, and high-confidence identification of protein cleavage sites without relying on pre-defined cleavage enzyme rules. This invention can simultaneously output the cleavage site location, neighboring amino acid sequences, and a visual map, facilitating subsequent cleavage motif analysis, protease substrate studies, functional fragment mining, and biological mechanism elucidation.
[0027] 5. This invention is easy to use, directly identifying protein cleavage sites from peptides derived by mass spectrometry, and automatically outputting the site location, neighboring sequences, and coverage map. It integrates noise reduction, peak detection, and Poisson significance test filtering processes, resulting in more systematic and reliable identification results.
[0028] 6. This invention has strong versatility and can be applied to protein cleavage site identification, protein processing mechanism research, post-translational processing analysis, protease substrate screening and related biotechnology fields, and has good prospects for promotion and application.
Attached Figure Description
[0029] Figure 1 is a schematic flowchart of the protein cleavage site identification method of Example 1 of this application;
Figure 2 compares the running time of peptide identification by database search software with that of cleavage site identification by the method of this application at scales of 100, 1000, and 10000 spectra in Example 2 of this application;
Figure 3 shows the memory usage of the method of Example 2 of this application when run on 100 spectra;
Figure 4 shows the memory usage of the method of Example 2 of this application when run on 1000 spectra;
Figure 5 shows the memory usage of the method of Example 2 of this application when run on 10000 spectra;
Figure 6 shows the results of the functional enrichment analysis of host protein cleavage substrates under SARS-CoV-2 infection conditions in Example 3 of this application;
Figure 7 shows the cleavage site analysis results for the spike glycoprotein (UniProt accession P0DTC2) in Example 3 of this application;
Figure 8 shows the cleavage site analysis results for the ORF3a protein (UniProt accession P0DTC3) in Example 3 of this application;
Figure 9 shows the cleavage site analysis results for the ORF1a polyprotein (UniProt accession P0DTC1) in Example 3 of this application;
Figure 10 shows the cleavage site analysis results for the nucleocapsid protein (UniProt accession P0DTC9) in Example 3 of this application;
Figure 11 shows the cleavage site analysis results for the ORF1ab polyprotein (UniProt accession P0DTD1) in Example 3 of this application;
Figure 12 shows the cleavage site analysis results for the membrane protein (UniProt accession P0DTC5) in Example 3 of this application.
Detailed Implementation
[0030] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0031] Example 1 This embodiment provides a method for identifying protein cleavage sites based on mass spectrometry. Referring to Figure 1, the method includes the following steps: 1. Preprocess the raw mass spectrometry data to obtain peptide information identified by mass spectrometry, and match the peptides to their corresponding protein sequences. The peptide data identified by mass spectrometry are obtained through mass spectrometry experiments combined with spectral searching. The peptide information can be sourced from publicly available proteomics repositories such as ProteomeCentral, or from raw mass spectrometry data of in-house samples after identification and analysis.
[0032] Mass spectrometry experiments typically generate a large number of peptides and their corresponding spectral information, which originate from precursor proteins in the sample being tested. To identify protein cleavage sites, it is necessary to first match the mass spectrometry-derived peptides with the protein sequence and determine the specific location of each peptide within the protein.
[0033] Specifically, in this step, the user inputs the protein sequence, and the clustered spectral library is used as input data. Peptide-spectrum matching is performed with the Comet software to obtain the mass spectrometry-derived peptides and their corresponding precursor proteins. At the same time, based on the observed mass shift information, the peptides are annotated with post-translational modifications to provide basic data for subsequent cleavage site identification.
[0034] 2. Align the peptides obtained from mass spectrometry to the corresponding precursor protein sequences to construct a protein site coverage matrix.
[0035] In order to extract potential cleavage information from peptide distribution, it is necessary to statistically analyze the support of each amino acid site being covered by the peptide.
[0036] Specifically, in this step, each peptide is aligned with its precursor protein sequence to determine the start and end positions of each peptide within the precursor protein. Then, using each amino acid site of the precursor protein as an index, the number of peptides covering that site, their frequency of occurrence, or quantitative intensity is counted to construct a position-frequency matrix (Formula 1). The position-frequency matrix characterizes the support strength of peptides at different positions in the precursor protein, thus forming the protein coverage distribution required for subsequent analysis. For example, if a precursor protein is 300 amino acids long, and a 10-amino acid identification peptide aligns to positions 51-60 of that protein, then that peptide covers positions 51-60 of the precursor protein once. If other peptides also align to the same position, the coverage counts are accumulated. After completing the above alignment for all peptides, the number of peptides covering each amino acid site of the precursor protein, their frequency of occurrence, or quantitative intensity is counted to construct a position-frequency matrix, which characterizes the support strength of peptides at different positions in the protein sequence, thereby forming a protein site coverage matrix.
[0037] Formula 1: The coverage $C_i$ of the i-th amino acid site is defined as the number of mass spectrometry-identified peptides covering this site, and is calculated as:
$$C_i = \sum_{j=1}^{N} I\left(s_j \le i \le e_j\right)$$
where N represents the total number of mass spectrometry-identified peptides that were matched to the precursor protein, $s_j$ and $e_j$ represent the start and end positions of the j-th peptide in the precursor protein, respectively, and $I(\cdot)$ is an indicator function that takes the value 1 when the j-th peptide covers the i-th amino acid site, and 0 otherwise.
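As an illustration of Formula 1, the following sketch counts, for each site, the peptides whose aligned interval contains it. It is a minimal example under stated assumptions: the function name `site_coverage` is hypothetical, positions are 1-based and inclusive as in the 51-60 example above, and only the peptide-count variant of coverage is shown (not frequency or quantitative intensity).

```python
from typing import List, Tuple


def site_coverage(protein_length: int, peptides: List[Tuple[int, int]]) -> List[int]:
    """Compute C_i for i = 1..protein_length.

    `peptides` holds (s_j, e_j) pairs: 1-based, inclusive start and end
    positions of each aligned peptide. The returned list is 0-indexed, so
    coverage[i - 1] is the coverage of site i.
    """
    coverage = [0] * protein_length
    for start, end in peptides:
        if not 1 <= start <= end <= protein_length:
            raise ValueError(f"peptide ({start}, {end}) lies outside the protein")
        # The indicator I(s_j <= i <= e_j) is 1 exactly for these sites.
        for i in range(start, end + 1):
            coverage[i - 1] += 1
    return coverage


if __name__ == "__main__":
    # A 300-residue protein with one peptide at 51-60 and a second at 55-70.
    cov = site_coverage(300, [(51, 60), (55, 70)])
    print(cov[50], cov[54], cov[60])  # sites 51, 55 and 61
```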
[0038] 3. Based on the site coverage matrix, identify potential cleavage regions and potential cleavage sites in proteins.
[0039] After a protein is cleaved, the surrounding area often exhibits a localized aberrant enrichment of peptide coverage, thus potential cleavage events can be identified through coverage distribution.
[0040] Specifically, in this step, coverage curves for each amino acid site of the precursor protein are constructed based on the aforementioned site coverage matrix, and the coverage curves are smoothed using a moving average method to reduce local noise and enhance peak signals. Subsequently, peak detection is performed on the smoothed coverage curves to identify locally enriched peaks. Each locally enriched peak is defined as a candidate peak, and each candidate peak corresponds to a potential cleavage region, which is defined by the left and right boundaries of the candidate peak. The center position of the candidate peak is defined as the potential cleavage site.
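The smoothing and peak detection of this step might look like the sketch below. The source specifies a moving average but not the peak detection rule, so the rule used here is an assumption: a candidate peak is a local maximum (or plateau) of the smoothed curve with positive coverage, its boundaries are found by walking downhill on each side, and the center of the plateau is taken as the potential cleavage site. The function names and the default window of 5 (the value used in Example 2) are illustrative choices.

```python
from typing import List, Tuple


def moving_average(values: List[float], window: int = 5) -> List[float]:
    """Centered moving average; the window is truncated at the sequence ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def find_candidate_peaks(curve: List[float]) -> List[Tuple[int, int, int]]:
    """Return (left, center, right) 1-based positions of each candidate peak."""
    peaks = []
    n = len(curve)
    i = 0
    while i < n:
        j = i  # extend over a run of equal values (a plateau)
        while j + 1 < n and curve[j + 1] == curve[i]:
            j += 1
        left_ok = i == 0 or curve[i - 1] < curve[i]
        right_ok = j == n - 1 or curve[j + 1] < curve[i]
        has_neighbor = i > 0 or j < n - 1
        if curve[i] > 0 and left_ok and right_ok and has_neighbor:
            left, right = i, j
            while left > 0 and curve[left - 1] < curve[left]:
                left -= 1  # walk downhill to the left boundary
            while right + 1 < n and curve[right + 1] < curve[right]:
                right += 1  # walk downhill to the right boundary
            peaks.append((left + 1, (i + j) // 2 + 1, right + 1))
        i = j + 1
    return peaks


if __name__ == "__main__":
    coverage = [0, 0, 1, 3, 5, 3, 1, 0, 0]
    print(find_candidate_peaks(moving_average(coverage, window=3)))
```

A production pipeline would likely add a minimum peak height or prominence so that small ripples in the smoothed curve are not reported as candidates; that choice is left open here because the source does not state one.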
[0041] 4. Perform statistical significance analysis on potential cleavage sites and extract neighborhood sequence information of the cleavage sites.
[0042] To distinguish real cleavage events from random noise, statistical tests are performed on the candidate cleavage sites.
[0043] Specifically, in this step, the left boundary, center position, and right boundary of each candidate peak are mapped back to the coordinates of the corresponding precursor protein sequence, wherein the precursor protein sequence is the protein sequence corresponding to the aforementioned peptide alignment. Further, using the center position of the candidate peak as a potential cleavage site, amino acid sequences within a preset range upstream and downstream of this center position are extracted as the neighborhood sequence of the cleavage site. Simultaneously, a Poisson significance test is performed on each candidate peak. Let the observed coverage of the center position of a candidate peak be x, and let the average coverage of all non-zero coverage sites in the precursor protein be used as the background parameter λ; then:
$$\lambda = \frac{1}{M} \sum_{i=1}^{M} C_i$$
where M represents the total number of sites with a coverage greater than 0, and $C_i$ represents the coverage of the i-th non-zero coverage site.
[0044] Based on a Poisson distribution with parameter λ, the probability that the coverage of the candidate peak's center site is not less than x is calculated as the P-value:
$$P = P(X \ge x) = 1 - \sum_{k=0}^{x-1} \frac{\lambda^{k} e^{-\lambda}}{k!}$$
where X is a random variable following a Poisson distribution with parameter λ. The P-value is used to characterize the statistical significance of the corresponding potential cleavage site, and the neighborhood sequence of the cleavage site serves as the local sequence feature information of the potential cleavage site, which is used for subsequent result output, cleavage motif analysis, or further screening analysis.
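The background parameter λ and the upper-tail P-value can be computed directly from their definitions, as in the following sketch. The function names are hypothetical, and the example assumes count-based (integer) coverage; a non-integer x is rounded up before the tail is summed, which is a choice made for this illustration only.

```python
import math
from typing import List


def background_lambda(coverage: List[float]) -> float:
    """Mean coverage over the M sites whose coverage is greater than 0."""
    nonzero = [c for c in coverage if c > 0]
    if not nonzero:
        raise ValueError("no site has non-zero coverage")
    return sum(nonzero) / len(nonzero)


def poisson_upper_tail(x: float, lam: float) -> float:
    """P(X >= x) for X ~ Poisson(lam), computed as 1 - sum_{k < x} pmf(k)."""
    k_max = math.ceil(x)
    if k_max <= 0:
        return 1.0
    term = math.exp(-lam)  # pmf(0)
    cdf = 0.0
    for k in range(k_max):
        cdf += term
        term *= lam / (k + 1)  # pmf(k + 1) from pmf(k)
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    coverage = [0, 1, 1, 2, 9, 2, 1, 0]
    lam = background_lambda(coverage)
    print(lam, poisson_upper_tail(9, lam))
```

For very large λ, `math.exp(-lam)` underflows to zero, so a library routine is the safer choice; for example, `scipy.stats.poisson.sf(x - 1, lam)` returns the same upper-tail probability for integer x.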
[0045] 5. Filter the potential cleavage sites using their statistical significance to obtain high-confidence protein cleavage sites. Specifically, in this step, candidate peaks located in low-quality alignment regions are first removed, where a low-quality alignment region is one in which the number of uniquely mappable supporting peptides is less than a preset threshold $T_p$. For the remaining candidate peaks, significance screening is performed based on the P-values calculated in step 4, and candidate peaks with P-values less than a preset significance threshold α are retained. The relative coverage score R of each retained candidate peak is then calculated from $C_{peak}$, the coverage of the center site of the candidate peak, and the average background coverage of the corresponding precursor protein. Finally, potential cleavage sites that simultaneously meet the following conditions are defined as high-confidence cleavage sites: the number of supporting peptides that can be uniquely mapped is greater than 2, the P-value is less than the preset significance threshold α (preferably 0.05), and the relative coverage score R is not lower than the preset threshold (preferably 0.6). Accordingly, the neighborhood sequences of the cleavage sites corresponding to the high-confidence cleavage sites are also retained as output results.
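The filtering cascade of this step can be sketched as below. This is an illustration only: the `CandidatePeak` record and the function name are hypothetical, the relative coverage score is taken as a precomputed input because its formula is not reproduced in this text, and the default `t_p=3` is an assumption chosen to agree with the "greater than 2" condition. The defaults for α and the R threshold follow the values 0.05 and 0.6 stated above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidatePeak:
    center: int               # potential cleavage site (1-based)
    unique_peptides: int      # uniquely mappable supporting peptides
    p_value: float            # Poisson upper-tail P-value from step 4
    relative_coverage: float  # relative coverage score R (precomputed)


def filter_high_confidence(
    peaks: List[CandidatePeak],
    t_p: int = 3,        # assumed value for the preset threshold T_p
    alpha: float = 0.05,
    r_min: float = 0.6,
) -> List[CandidatePeak]:
    """Apply the three filters in the order given in the text."""
    kept = []
    for peak in peaks:
        if peak.unique_peptides < t_p:
            continue  # low-quality alignment region
        if not peak.p_value < alpha:
            continue  # not statistically significant
        if peak.unique_peptides > 2 and peak.relative_coverage >= r_min:
            kept.append(peak)
    return kept


if __name__ == "__main__":
    candidates = [
        CandidatePeak(center=120, unique_peptides=5, p_value=0.001, relative_coverage=0.8),
        CandidatePeak(center=240, unique_peptides=2, p_value=0.001, relative_coverage=0.9),
        CandidatePeak(center=310, unique_peptides=6, p_value=0.200, relative_coverage=0.9),
        CandidatePeak(center=415, unique_peptides=4, p_value=0.010, relative_coverage=0.4),
    ]
    print([p.center for p in filter_high_confidence(candidates)])
```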
[0046] 6. Output the protein cleavage site identification results and visualize them.
[0047] To facilitate the interpretation of results and subsequent applications, it is necessary to provide a unified output and visualization of the recognition results.
[0048] Specifically, in this step, a result file including the cleavage region, cleavage site location, coverage depth, P value and its neighboring sequence information is output, and a coverage curve is automatically generated to show the distribution of the cleavage site; in some embodiments, an amino acid conservation map and a cleavage site sequence motif map can also be further generated.
[0049] Example 2 To illustrate the effectiveness of the protein cleavage site identification method, this embodiment compares the identification performance of the protein cleavage site identification method (CleavageFinder) of this application on known Arabidopsis thaliana and human protein cleavage site data. In this embodiment, the relevant parameters of CleavageFinder are set as follows: the coverage curve smoothing window is set to 5, that is, a sliding window of 5 amino acid sites in length is used to smooth the site coverage curve; the statistical significance screening condition for candidate cleavage sites is set to a p-value less than 1; the supporting peptide filtering condition is set to a number of uniquely mappable supporting peptides greater than 2; and the relative coverage score R is set to a screening condition not lower than a preset threshold of 0.6.
[0050] Table 1 shows the identification results for protein cleavage sites in Arabidopsis and humans in Example 2. On the Arabidopsis dataset, CleavageFinder correctly identified 60 of the 85 proteins with known cleavage sites, with a precision of 0.7229, a recall of 0.7059, and an F1 score of 0.7143. On the human dataset, it correctly identified 433 of the 686 proteins with known cleavage sites, with a precision of 0.6405, a recall of 0.6312, and an F1 score of 0.6358. These results show that the method of this application identifies protein cleavage sites well in both plant and animal data and can recover known cleavage events from mass spectrometry data, indicating good cross-species applicability and stability.
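The reported metrics are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below recomputes recall from the stated counts and F1 from the stated precision and recall; it does not recompute precision, because the number of predicted proteins is not given in the text. The function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (correctly identified, total with known sites, reported precision, reported recall)
    datasets = {
        "Arabidopsis": (60, 85, 0.7229, 0.7059),
        "Human": (433, 686, 0.6405, 0.6312),
    }
    for name, (hits, total, precision, recall) in datasets.items():
        print(name, round(hits / total, 4), round(f1_score(precision, recall), 4))
```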
[0051] Table 1. Identification results of protein cleavage sites in Arabidopsis thaliana and humans in Example 2.
[0052] To illustrate the resource consumption of the method of this application at different data scales, the running time and memory usage of CleavageFinder were tested on 100, 1000, and 10000 spectra.
[0053] Figure 2 compares the runtime of peptide identification by the database search software with that of cleavage site identification by the method of this application at scales of 100, 1000, and 10000 spectra. As Figure 2 shows, the time consumed by both steps increases with the number of spectra, but cleavage site identification is consistently much faster than the database search. Specifically, with 100 spectra the database search takes 82.82 seconds while cleavage site identification takes only 0.44 seconds; with 1000 spectra, the database search takes 354.78 seconds while cleavage site identification takes 2.92 seconds; and with 10000 spectra, the database search takes 896.99 seconds while cleavage site identification takes 25.17 seconds. Even with 10000 spectra, the cleavage site identification step accounts for only a small portion of the overall process. Once the preceding peptide database search is complete, the method can therefore finish the subsequent cleavage site identification quickly, without adding a significant time burden to the overall mass spectrometry data processing workflow, which makes it suitable for large-scale spectral data and gives it practical application value.
Figures 3 to 5 show the memory usage curves of the method of this application when run on 100, 1000, and 10000 spectra, respectively. As Figures 3 to 5 show, the peak memory usage increases with the number of spectra, but the growth is gradual, indicating that the method has good memory scalability. Figure 3 shows that with 100 spectra, memory usage rises quickly and stabilizes at around 13 MB, with a peak of approximately 13.19 MB. Figure 4 shows that with 1000 spectra, the peak memory usage increases to approximately 15.62 MB. Figure 5 shows that with 10000 spectra, the peak memory usage is approximately 30.53 MB. Although the input data volume increases by two orders of magnitude, the memory usage remains low and shows no abnormal fluctuations at any scale, so the method runs stably and is suitable for typical computing environments.
As Figures 2 to 5 show, the method in Embodiment 2 of this application performs well in terms of both runtime and memory usage. On the one hand, the cleavage site identification step takes much less time than the preceding database search step, indicating high computational efficiency. On the other hand, memory usage stays low across the tested scales and grows steadily with data volume, indicating good scalability and stability. Therefore, the method of this application is suitable for protein cleavage site identification and analysis of large-scale mass spectrometry data.
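The scale of these differences can be made concrete with a short calculation on the figures reported above. The following is a minimal sketch: the dictionary layout and function name are illustrative, and the "combined runtime" is assumed to be just the sum of the two reported steps, not the whole workflow.

```python
# Runtime (seconds) and peak memory (MB) exactly as reported in the text.
reported = {
    100:   {"search_s": 82.82,  "cleavage_s": 0.44,  "peak_mb": 13.19},
    1000:  {"search_s": 354.78, "cleavage_s": 2.92,  "peak_mb": 15.62},
    10000: {"search_s": 896.99, "cleavage_s": 25.17, "peak_mb": 30.53},
}


def cleavage_time_share(row):
    """Fraction of the combined runtime spent on cleavage site identification.

    Assumption: combined runtime = database search + cleavage site step only.
    """
    return row["cleavage_s"] / (row["search_s"] + row["cleavage_s"])


shares = {n: cleavage_time_share(row) for n, row in reported.items()}

# Growth of peak memory between the smallest and largest tested inputs.
memory_growth = reported[10000]["peak_mb"] / reported[100]["peak_mb"]

for n, share in shares.items():
    print(f"{n:>6} spectra: cleavage step = {share:.2%} of combined runtime")
print(f"peak memory grows {memory_growth:.2f}x for a 100x larger input")
```

Under this assumption the cleavage site step stays below 3% of the combined runtime at every tested scale, and peak memory grows by a factor of roughly 2.3 while the input grows 100-fold.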
[0054] Example 3. To illustrate the application of the protein cleavage site identification method provided in this application to a viral infection scenario, protein cleavage events during SARS-CoV-2 infection of host cells were systematically analyzed using the CleavageFinder method of this application. The analyzed data comprised 4 infected groups and 4 uninfected groups. In this embodiment, the parameters were set as follows. First, candidate peaks detected on the precursor protein coverage curve were taken as potential cleavage regions, and the center position of each candidate peak was defined as a potential cleavage site. In the statistical significance test of the candidate peaks, the average coverage of sites with non-zero coverage was used as the background parameter λ, and candidate sites with a P-value below a preset significance threshold, preferably 0.05, were retained as significant candidate sites. Further, only candidate sites supported by at least 2 peptides were retained. In addition, each candidate site was normalized by the total number of peptides of the corresponding precursor protein to calculate a relative coverage score, and sites with a relative coverage score not lower than a preset threshold, preferably 0.5, were retained. Protein cleavage events showed good reproducibility across biological replicates in both the infected and uninfected groups: 84.28% and 84.09% of the cleaved proteins, respectively, were detected in at least two biological replicates. Based on these results, the subsequent analysis was limited to cleaved proteins and cleavage sites that recurred in at least two samples from either the infected or the uninfected group.
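The final restriction, keeping only sites that recur in at least two samples of the same group, can be sketched as follows. This is an illustrative sketch only: the function name, the tuple layout of the input, and the sample and protein identifiers are hypothetical and do not come from CleavageFinder.

```python
from collections import defaultdict


def reproducible_sites(calls, min_samples=2):
    """Keep (protein, site) pairs seen in >= min_samples samples of one group.

    calls: iterable of (group, sample_id, protein, site) tuples, one per
    cleavage site called in one sample (hypothetical input layout).
    """
    samples_by_site = defaultdict(lambda: defaultdict(set))
    for group, sample_id, protein, site in calls:
        # A set per group, so repeated calls in one sample count once.
        samples_by_site[(protein, site)][group].add(sample_id)
    return {
        key
        for key, per_group in samples_by_site.items()
        if any(len(samples) >= min_samples for samples in per_group.values())
    }


# Made-up calls for illustration.
calls = [
    ("infected", "I1", "P1", 120),
    ("infected", "I2", "P1", 120),    # second infected sample: P1/120 is kept
    ("uninfected", "U1", "P2", 45),
    ("infected", "I3", "P2", 45),     # one sample per group: not kept
    ("uninfected", "U1", "P3", 10),
    ("uninfected", "U1", "P3", 10),   # duplicate call in the same sample
]
kept = reproducible_sites(calls)
```

Counting distinct samples per group, rather than across groups, mirrors the wording "at least two samples from either the infected or uninfected group".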
[0055] Figure 6 shows the results of functional enrichment analysis of host protein cleavage substrates under SARS-CoV-2 infection conditions in Example 3 of this application. The cleavage substrates identified under infection conditions are mainly enriched in nuclear regulatory proteins, particularly in functional categories such as DNA damage response, transcriptional regulation, and RNA splicing. This indicates that viral infection can induce proteolytic reprogramming in the host nucleus, thereby affecting host genome stability, cell cycle homeostasis, and post-transcriptional processing. Further analysis revealed that the cleaved proteins found only in the infected group were significantly enriched in cytokines and translation factors, suggesting that host immune signaling and the protein translation system may be subject to targeted cleavage regulation during viral infection.
[0056] Figures 7 to 12 show the protein cleavage site analysis results under SARS-CoV-2 infection conditions in Example 3 of this application for the spike glycoprotein, ORF3a protein, ORF1a polyprotein, nucleocapsid protein, ORF1ab polyprotein, and membrane protein, respectively. The results indicate that the method of this application can effectively identify cleavage regions on protein sequences in SARS-CoV-2 infection-related samples and further determine potential cleavage sites. In each figure, the top panel shows the peaks formed by the coverage profile of the protein, the blue areas in the middle panel mark the cleavage regions identified for the protein, the red areas in the bottom panel mark the known cleavage sites of the protein, and the red boxes mark cleavage regions that correspond to known processing events.
In Figure 7 (spike glycoprotein), 7 cleavage regions were identified and 2 cleavage sites are known; the red box marks the cleavage regions mediated by the known proteases TMPRSS2 or CTSL. In Figure 8 (ORF3a protein), one cleavage region was identified and one cleavage site is known; the red box marks the cleavage region involved in the polymerization reaction of the protein. In Figure 9 (ORF1a polyprotein), 7 cleavage regions were identified and 10 cleavage sites are known; the red box marks the cleavage regions mediated by the PLPRO and 3CLPRO proteases. In Figure 10 (nucleocapsid protein), three cleavage regions were identified and one cleavage site is known; the red box marks the cleavage region mediated by the CASP6 protease. In Figure 11 (ORF1ab polyprotein), 8 cleavage regions were identified and 14 cleavage sites are known; the two red boxes on the right mark the cleavage regions mediated by 3CLPRO. In Figure 12 (membrane protein), one cleavage region was identified; the bottom panel contains no red area because the protein has no known cleavage sites. These results show that the potential cleavage regions identified by the method of this application agree well with known processing regions mediated by viral or host proteases, indicating that the method can be effectively used for the identification and analysis of known cleavage events.
Furthermore, the method of this application also detected several previously unreported potential cleavage sites, thereby expanding the known range of proteolytic events related to SARS-CoV-2 infection. This indicates that the method is applicable not only to the identification of known cleavage sites but also to the discovery of novel cleavage sites and the mining of potential viral protease substrates.
[0057] The above results indicate that the protein cleavage site identification method of this application can reliably identify host protein cleavage events during viral infection, and can further analyze the functional characteristics of cleavage substrates and discover new cleavage sites. It is applicable to the study of viral infection, host response and proteolytic regulation mechanisms.
[0058] Example 4. Embodiment 4 of the present invention provides a system corresponding to Embodiment 1 above, including a memory, a processor, and a computer program stored in the memory; the processor executes the computer program in the memory to implement the steps of the method in Embodiment 1 above.
[0059] In some implementations, the memory may be high-speed random access memory (RAM), and may also include non-volatile memory, such as at least one disk storage device.
[0060] In other implementations, the processor can be any type of general-purpose processor, such as a central processing unit (CPU) or a digital signal processor (DSP), and there is no limitation here.
[0061] Example 5. Embodiment 5 of the present invention provides a computer-readable storage medium corresponding to Embodiment 1 above, on which a computer program / instructions are stored. When the computer program / instructions are executed by a processor, they implement the steps of the method of Embodiment 1 above.
[0062] A computer-readable storage medium can be a tangible device that holds and stores instructions for use by an instruction execution device. A computer-readable storage medium can be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any combination thereof.
[0063] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. The solutions in the embodiments of this application can be implemented in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript, as well as Python and R for scientific computing and data analysis, etc., and this application does not limit this to any particular language.
[0064] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0065] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0066] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.
[0067] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.
Claims
1. A method for identifying protein cleavage sites based on mass spectrometry data, characterized in that it comprises the following steps: S1. preprocessing the raw mass spectrometry data to obtain the mass spectrometry-derived peptides corresponding to the input precursor protein sequence; S2. aligning the mass spectrometry-derived peptides to the corresponding precursor protein sequences to construct a protein site coverage matrix; S3. identifying, based on the protein site coverage matrix, potential cleavage regions and potential cleavage sites in the protein; S4. performing statistical significance analysis on the potential cleavage sites in the protein and extracting the neighborhood sequences of the cleavage sites; S5. filtering the potential cleavage sites using their statistical significance to obtain high-confidence protein cleavage sites.
2. The method for identifying protein cleavage sites based on mass spectrometry data according to claim 1, characterized in that it further comprises: S6. outputting the protein cleavage site identification results; the identification results include the cleavage site neighborhood sequence corresponding to each high-confidence protein cleavage site, the statistical significance of the potential cleavage sites in the protein, the cleavage region, the cleavage site location, and the coverage depth.
3. The method for identifying protein cleavage sites based on mass spectrometry data according to claim 1 or 2, characterized in that step S1 specifically comprises: clustering the raw mass spectrometry data to obtain a clustered spectral library; and using the input protein sequence and the spectral library as input to the Comet software to obtain the mass spectrometry-derived peptides corresponding to the precursor protein sequence.
4. The method for identifying protein cleavage sites based on mass spectrometry data according to claim 1 or 2, characterized in that step S2 specifically comprises: comparing each mass spectrometry-derived peptide with its corresponding precursor protein sequence to determine the start and end positions of each mass spectrometry-derived peptide in the precursor protein; using each amino acid site of the precursor protein as an index, counting the number of peptides covering each amino acid site, their frequency of occurrence, or their quantitative intensity to obtain the coverage of each amino acid site; and constructing a position frequency matrix from the coverage to obtain the protein site coverage matrix. In the position frequency matrix, the coverage $C_i$ of the i-th amino acid site is defined as the number of mass spectrometry-identified peptides covering the i-th amino acid site, and is calculated as follows: $C_i = \sum_{j=1}^{N} I(s_j \le i \le e_j)$; where N represents the total number of mass spectrometry-identified peptides aligned to the precursor protein, $s_j$ and $e_j$ represent the start and end positions of the j-th mass spectrometry-derived peptide in the precursor protein, respectively, and I is an indicator function that takes the value 1 when the j-th mass spectrometry-derived peptide covers the i-th amino acid site, and 0 otherwise.
5. The method for identifying protein cleavage sites based on mass spectrometry data according to claim 1 or 2, characterized in that step S3 specifically comprises: constructing, based on the protein site coverage matrix, a coverage curve over the amino acid sites of the precursor protein, and smoothing the coverage curve; performing peak detection on the smoothed coverage curve to identify local coverage enrichment peaks; and defining each local coverage enrichment peak as a candidate peak, wherein the left and right boundaries of each candidate peak define a potential cleavage region, and the center position of the candidate peak is defined as a potential cleavage site.
6. The method for identifying protein cleavage sites based on mass spectrometry data according to claim 5, characterized in that, in step S4, performing statistical significance analysis on the potential cleavage sites in the protein specifically comprises: calculating the P-value, which characterizes the statistical significance of the corresponding potential cleavage site, using the following formula: $P = P(X \ge x) = 1 - \sum_{k=0}^{x-1} \frac{\lambda^{k} e^{-\lambda}}{k!}$, with $\lambda = \frac{1}{M} \sum_{i:\, C_i > 0} C_i$; where X is a random variable that follows a Poisson distribution with parameter λ, M represents the total number of sites with a coverage greater than 0, $C_i$ denotes the coverage of the i-th amino acid site, and x is the observed coverage of the center site of a candidate peak.
7. The method for identifying protein cleavage sites based on mass spectrometry data according to claim 5, characterized in that, in step S4, extracting the neighborhood sequences of the cleavage sites specifically comprises: mapping the left boundary, center position, and right boundary of each candidate peak back to the coordinates of the corresponding precursor protein sequence; and extracting the precursor protein sequence within a predetermined range upstream and downstream of each potential cleavage site as the neighborhood sequence of that cleavage site.
8. The method for identifying protein cleavage sites based on mass spectrometry data according to claim 6, characterized in that step S5 specifically comprises: removing candidate peaks located within low-quality alignment regions, a low-quality alignment region being a region in which the number of uniquely mappable supporting peptides is less than a preset threshold $T_p$; for the remaining candidate peaks, retaining the candidate peaks whose P-value is less than the preset significance threshold α; and calculating the relative coverage score of the retained candidate peaks, and defining as high-confidence cleavage sites the potential cleavage sites that simultaneously meet the following conditions: the number of uniquely mappable supporting peptides is greater than 2, the P-value is less than the preset significance threshold α, and the relative coverage score R is not lower than the preset threshold. The relative coverage score R is calculated as follows: $R = C_{\text{peak}} / \bar{C}$; where $C_{\text{peak}}$ denotes the coverage of the candidate peak center site, and $\bar{C}$ denotes the average background coverage of the corresponding precursor protein.
9. A protein cleavage site identification system based on mass spectrometry data, comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program / instructions stored thereon, characterized in that, when the computer program / instructions are executed by a processor, they implement the steps of the method according to any one of claims 1 to 8.
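For orientation, steps S2 to S5 of the claimed method can be sketched in a few lines of pure Python. This is a minimal sketch under stated assumptions, not the CleavageFinder implementation: the smoother (a centered moving average), the peak detector (simple local maxima with boundaries extended downhill), the upper-tail form of the Poisson P-value, the definition of a supporting peptide (any peptide overlapping the candidate region), and the ratio form of the relative coverage score are all assumptions chosen for illustration, as are the function names, the toy sequence, and the toy peptides. The default thresholds (0.05, 2 peptides, 0.5) follow the preferred values of Example 3.

```python
import math


def coverage(protein_len, peptides):
    """Per-site coverage C_i for peptides given as 1-based inclusive (start, end)."""
    cov = [0] * protein_len
    for start, end in peptides:
        for pos in range(start, end + 1):
            cov[pos - 1] += 1
    return cov


def smooth(values, window=3):
    """Centered moving average (assumed smoother); the window shrinks at the ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def find_peaks(curve):
    """Local maxima as (left, center, right) 0-based indices (assumed detector).

    A flat run is a peak when both neighbours are lower (coverage outside the
    protein is treated as 0); boundaries are extended strictly downhill.
    """
    peaks, n, i = [], len(curve), 0
    while i < n:
        j = i
        while j + 1 < n and curve[j + 1] == curve[i]:
            j += 1
        before = curve[i - 1] if i > 0 else 0
        after = curve[j + 1] if j + 1 < n else 0
        if before < curve[i] and after < curve[i]:
            left, right = i, j
            while left > 0 and curve[left - 1] < curve[left]:
                left -= 1
            while right + 1 < n and curve[right + 1] < curve[right]:
                right += 1
            peaks.append((left, (i + j) // 2, right))
        i = j + 1
    return peaks


def poisson_upper_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam), as the complement of the lower tail."""
    term, lower = math.exp(-lam), 0.0
    for k in range(x):
        lower += term
        term *= lam / (k + 1)
    return max(0.0, 1.0 - lower)


def call_sites(sequence, peptides, alpha=0.05, min_support=2, min_score=0.5, flank=4):
    cov = coverage(len(sequence), peptides)
    nonzero = [c for c in cov if c > 0]
    if not nonzero:
        return []
    lam = sum(nonzero) / len(nonzero)  # mean coverage over covered sites
    results = []
    for left, center, right in find_peaks(smooth(cov)):
        x = cov[center]
        p_value = poisson_upper_tail(x, lam)
        # Assumption: a peptide supports a peak if it overlaps the region.
        support = sum(1 for s, e in peptides if s <= right + 1 and e >= left + 1)
        score = x / lam  # assumed form: peak coverage over mean background
        if p_value < alpha and support >= min_support and score >= min_score:
            results.append({
                "region": (left + 1, right + 1),  # back to 1-based coordinates
                "site": center + 1,
                "coverage": x,
                "p_value": p_value,
                "score": score,
                "neighborhood": sequence[max(0, center - flank): center + flank + 1],
            })
    return results


# Toy input (made up): uniform background of 1 plus eight peptides at sites 12-14.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLI"
peptides = [(1, 8), (9, 16), (17, 24), (25, 30)] + [(12, 14)] * 8
sites = call_sites(sequence, peptides)
```

On this toy input the sketch reports a single candidate centered at site 13 with cleavage region 10 to 16. A production implementation would more likely use an established peak detector and statistics library; the pure-Python versions here only keep the example self-contained.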