Peptide fragment-spectrogram matching credibility testing method and system, storage medium and device
A test method and reliability technology, applied in the field of computational proteomics, which can solve problems such as insufficient quality control of identification results
Active Publication Date: 2019-10-18
INST OF COMPUTING TECHNOLOGY - CHINESE ACAD OF SCI
AI-Extracted Technical Summary
Problems solved by technology
 The technical purpose of the present invention is to solve the problem of insufficient quality control of protein search engine identification results in the field of computational proteomics, and to study the relatio...
The invention provides a peptide fragment-spectrogram matching credibility testing method and system, a storage medium and a device. The method comprises the steps of: inputting the spectrum data of a result to be detected into an open search engine to obtain an identification result for the result to be detected; obtaining the restricted search engine's score for the result to be detected as a first score, and extracting the n candidate peptides with the highest first scores; obtaining the open search engine's score for the identification result as a second score, and extracting the n candidate peptides with the highest second scores; predicting a theoretical spectrum for each candidate peptide, calculating the cosine similarity between each theoretical spectrum and the spectrum data in the result to be detected, and determining the highest cosine similarity; extracting a four-dimensional feature composed of the first score, the second score, the cosine similarity and the highest cosine similarity of the result to be detected; and inputting the four-dimensional feature into an offline model trained with an SVM to obtain a credibility test result for the result to be detected.
 The credibility testing method proposed by the present invention is intended to solve the problems of the above-mentioned credibility testing methods. The present invention solves the following three technical problems: 1) testing the credibility of each identification result individually; 2) guaranteeing the accuracy of the test results; 3) testing large-scale identification results rapidly, efficiently and automatically.
 To address the problems referred to above, the present invention proposes the following key points:
 Key point 1: two evaluation indicators are proposed for credibility testing methods, FPR (False Positive Rate) and FNR (False Negative Rate). FPR measures the proportion of truly correct identification results that a credibility testing method judges to be suspicious, and FNR measures the proportion of truly incorrect identification results that the method judges to be credible. The smaller FPR and FNR are, the better the method distinguishes correct identification results from incorrect ones.
 Key point 2: the relationship between the evaluation indicators of the credibility testing method and the evaluation indicators of the search engine is studied, and a standard for the practical application of a credibility testing method is established. A credibility testing method can be used to exclude suspicious results from the identification results of a search engine. The lower the FPR, the higher the sensitivity of the identification results after the suspicious results are excluded; the lower the FNR, the higher the accuracy of the identification results after the suspicious results are excluded. The accuracy of the identification results improves after the suspicious results are excluded only when the sum of FPR and FNR is less than 1, and only a method that meets this condition is an effective credibility testing method. That is, the standard for the practical application of a credibility testing method is that the sum of its FPR and FNR is less than 1.
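 The threshold of 1 can be checked with a short derivation. The notation is assumed here and does not come from the source: T and F are the numbers of correct and incorrect identification results before testing (both greater than zero), accuracy is read as the fraction of retained results that are correct, and at least one result is retained after the exclusion.

```latex
\begin{align*}
P_{\text{before}} &= \frac{T}{T+F}, \qquad
P_{\text{after}} = \frac{T(1-\mathrm{FPR})}{T(1-\mathrm{FPR}) + F\cdot\mathrm{FNR}} \\[4pt]
P_{\text{after}} > P_{\text{before}}
  &\iff (1-\mathrm{FPR})(T+F) > T(1-\mathrm{FPR}) + F\cdot\mathrm{FNR} \\
  &\iff F(1-\mathrm{FPR}) > F\cdot\mathrm{FNR} \\
  &\iff \mathrm{FPR} + \mathrm{FNR} < 1
\end{align*}
```

 Under the same assumptions, the fraction of correct results that survives the exclusion is 1 - FPR, which is why a lower FPR corresponds to a higher sensitivity.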
 Key point 3: drawing on the open search method and the theoretical spectrum prediction method, four important features are extracted that characterize how well the peptide matches the spectrum in an identification result: 1) the pFind engine's score for the identification result (in the present invention, the result to be tested is an identification result of the pFind engine); 2) the Open-pFind engine's score for its identification result on the same spectrum (when searching mass spectrometry data, a search engine not only searches but also matches and scores the search results); 3) the cosine similarity between the pDeep-predicted spectrum and the original experimental spectrum (the spectrum of the identification result to be tested); 4) the highest cosine similarity between the pDeep-predicted spectra and the original experimental spectrum over the top three candidate peptides of pFind and the top three candidate peptides of Open-pFind, that is, the highest value among the six candidate peptides. The present invention trains an SVM on these features and proposes pValid, an automated method for testing the credibility of individual identification results.
 When the inventors studied how peptides match spectra in the identification results given by protein search engines, they found that most search engines overlook two main factors that affect identification accuracy. The first factor is the completeness of the search space. The conventional restricted search mode can only consider specific enzymatic cleavage forms and a small number of modification types. In biological experiments, however, experimental conditions such as reaction time and temperature often mean that not all peptides take the specific cleavage form, and the reagents used in the experiment often introduce unexpected modifications. The conventional restricted search mode cannot handle these unexpected cases, so its search space is more limited: once the correct result is absent from the search space, the correct identification result cannot be obtained. An open search can consider all enzymatic cleavage forms and all possible modification forms on the peptide, and searches for possible peptides in a more complete space. The identification results obtained in this way have survived more competition, so in theory the accuracy of open search results is generally higher than that of restricted search results.
 The second factor that affects how well a peptide matches a spectrum is the theoretical peak intensities of the fragment ions. When a protein search engine matches and scores a peptide against an experimental spectrum, it first generates a theoretical fragment ion spectrum for the peptide and then calculates the similarity between the generated theoretical spectrum and the experimental spectrum, which gives the score for the match between the peptide and the experimental spectrum. However, all protein search engines assign the same intensity to every theoretical fragment ion of a peptide, which contradicts the fact that the different fragment ions produced by peptide fragmentation have different peak intensities in experimental spectra. The present invention uses the theoretical spectrum prediction software pDeep to predict theoretical spectra, including fragment ion peak intensities, for all identified peptides. The present invention also takes into account that the search engine outputs rich candidate peptide information for each spectrum, and that the candidate peptides ranked second and third are strong competitors of the first-ranked peptide. Therefore, when predicting theoretical spectra, the top three candidate peptides of both the open search and the restricted search are considered, the similarity between the theoretical spectrum and the experimental spectrum is calculated for every candidate peptide, and the corresponding similarity features are extracted.
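 The similarity calculation described above can be sketched as follows. This is a minimal illustration, not the pValid implementation: it assumes each spectrum has already been reduced to a dictionary from a fragment ion label (for example "b2+" or "y5+") to a peak intensity, and the function name and the data layout are hypothetical.

```python
import math


def cosine_similarity(predicted, experimental):
    """Cosine similarity between two spectra given as {ion_label: intensity}.

    Ions missing from one spectrum are treated as having zero intensity there.
    Returns 0.0 when either spectrum has no signal at all.
    """
    ions = set(predicted) | set(experimental)
    dot = sum(predicted.get(i, 0.0) * experimental.get(i, 0.0) for i in ions)
    norm_p = math.sqrt(sum(v * v for v in predicted.values()))
    norm_e = math.sqrt(sum(v * v for v in experimental.values()))
    if norm_p == 0.0 or norm_e == 0.0:
        return 0.0
    return dot / (norm_p * norm_e)


# Hypothetical example: a predicted spectrum and an experimental spectrum
# that share two of three fragment ions.
predicted = {"b2+": 0.3, "y3+": 1.0, "y4+": 0.6}
experimental = {"b2+": 0.2, "y3+": 0.9, "b5+": 0.4}
similarity = cosine_similarity(predicted, experimental)
```

 Because cosine similarity depends only on the relative intensities, scaling every peak of a spectrum by the same factor leaves the value unchanged, which is why unequal predicted intensities matter while absolute intensity units do not.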
 In order to make the above-mentioned features and effects of the present invention more clear and understandable, the following specific examples are given together with the accompanying drawings for detailed description as follows.
 As shown in Figure 2, the technical scheme of the present invention can be divided into five steps:
 Step 1. Each spectrum identified by searching the mass spectrometry data is taken as a result to be detected, namely a peptide-spectrum match (PSM), and the spectrum data in the result to be detected is input into the open search engine to obtain an identification result for it. Although the result to be detected is a PSM, the input to the open search engine is the spectrum data alone (without peptide matching information); the open search engine searches the spectrum data again to obtain a new peptide-spectrum match, that is, it assigns a peptide to each spectrum. Protein search engines include restricted search engines and open search engines;
 Step 2, for each result to be detected, obtain the score given to it by the restricted search engine pFind as the first score, and extract the top n candidate peptides ranked by the first score, where n is a preset positive integer; n is 3 in this embodiment, but is not limited thereto. For each identification result, obtain the score given to it by the open search engine Open-pFind as the second score, and extract the top n candidate peptides ranked by the second score;
 Step 3: Use the theoretical spectrum prediction method pDeep to obtain the theoretical spectrum of each candidate peptide, calculate the cosine similarity between each theoretical spectrum and the spectrum in the result to be detected, and determine the highest cosine similarity.
 Step 4, extract a four-dimensional feature composed of the first score, the second score, the cosine similarity and the highest cosine similarity of the result to be detected. That is, for each result to be detected, the four-dimensional feature consists of the first score, the second score, the cosine similarity between its theoretical spectrum and the experimental spectrum, and the highest cosine similarity between theoretical spectrum and experimental spectrum among the 6 candidate peptides given by pFind and Open-pFind for the corresponding spectrum.
 Step 5, use the offline model trained by SVM to score the credibility, and judge the category of the identification result according to the score.
 Said step 1 also includes:
 Step 11, select an open search engine that has the same scoring mechanism as the restricted search engine.
 Step 12, setting the same search parameters in the open search engine as in the restricted search engine, except for the type of enzyme digestion and the type of modification.
 Said step 2 also includes:
 Step 21, according to the labeling form of the identification result (unlabeled or heavy isotope labeled), extract the candidate peptides of the same labeling form from the candidate peptide file. If the identification result is unlabeled, only the top three unlabeled candidate peptides are extracted; if the identification result is heavy isotope labeled, only the top three heavy isotope labeled candidate peptides are extracted.
 Step 22, process mutations in the candidate peptides: if a mutation occurs in a candidate peptide, replace the affected amino acid of the candidate peptide with the mutated amino acid.
 Step 23, handle out-of-range modifications in the candidate peptides: if a candidate peptide carries a modification that pDeep cannot predict, delete that candidate peptide from the prediction list.
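 Steps 21 to 23 amount to a small filtering pass over the candidate list, which can be sketched as below. Everything named here is hypothetical: the `Candidate` record, its fields, the 0-based mutation positions and the `supported_mods` set stand in for whatever the candidate peptide file and pDeep actually provide.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    sequence: str
    score: float
    heavy_labeled: bool = False
    mutations: list = field(default_factory=list)      # (0-based position, new residue)
    modifications: list = field(default_factory=list)  # modification names


def prepare_candidates(candidates, heavy_labeled, supported_mods, n=3):
    """Select and clean the candidates to send to spectrum prediction."""
    # Step 21: keep only candidates in the same labeling form, then take the top n.
    same_form = [c for c in candidates if c.heavy_labeled == heavy_labeled]
    top = sorted(same_form, key=lambda c: c.score, reverse=True)[:n]

    prepared = []
    for c in top:
        # Step 23: drop candidates carrying a modification the predictor cannot handle.
        if any(m not in supported_mods for m in c.modifications):
            continue
        # Step 22: write each mutation into the sequence itself.
        residues = list(c.sequence)
        for position, new_residue in c.mutations:
            residues[position] = new_residue
        prepared.append(
            Candidate("".join(residues), c.score, c.heavy_labeled, [], list(c.modifications))
        )
    return prepared
```

 In this sketch the top n are chosen before unsupported candidates are dropped, so fewer than n candidates may reach the prediction step; the source does not say whether lower-ranked candidates should fill the gap.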
 Said step 3 also includes:
 Step 31, set the same mass spectrometer and fragmentation energy parameters as the original experiment in the pDeep software.
 Step 32, generate theoretical spectra for all candidate peptides.
 Step 33, calculating the cosine similarity between the theoretical spectrum and the experimental spectrum of all candidate peptides.
 Said step 4 also includes:
 Step 41, extracting the scoring of the identification result by pFind as feature 1.
 Step 42, extracting Open-pFind's scoring of the identification result of the same spectrum as feature 2.
 Step 43, extracting the cosine similarity between the pDeep-predicted theoretical spectrum of the candidate peptide ranked first by pFind score and the experimental spectrum, as feature 3.
 Step 44, extracting the highest cosine similarity between the pDeep theoretical spectrum and the experimental spectrum among the six candidate peptides, as feature 4, and collecting the feature 1, feature 2, feature 3 and feature 4 as the four-dimensional feature.
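 Steps 41 to 44 can be summarized in a few lines. The sketch below is illustrative only: it assumes the cosine similarities have already been computed and are passed in rank order (best-scoring candidate first) for each engine, and the function and argument names are hypothetical.

```python
def build_features(pfind_score, open_pfind_score, pfind_similarities, open_pfind_similarities):
    """Assemble the four-dimensional feature for one result to be detected.

    pfind_similarities / open_pfind_similarities: cosine similarities of the
    candidate peptides from each engine, ordered by that engine's ranking.
    """
    if not pfind_similarities:
        raise ValueError("the top-ranked pFind candidate is required for feature 3")
    feature_1 = pfind_score                    # step 41
    feature_2 = open_pfind_score               # step 42
    feature_3 = pfind_similarities[0]          # step 43: top-ranked pFind candidate
    feature_4 = max(list(pfind_similarities) + list(open_pfind_similarities))  # step 44
    return (feature_1, feature_2, feature_3, feature_4)


# Hypothetical values for one spectrum with three candidates per engine.
features = build_features(12.5, 14.0, [0.82, 0.40, 0.35], [0.91, 0.50, 0.10])
```

 When feature 4 is clearly larger than feature 3, some competing candidate explains the experimental spectrum better than the peptide under test, which is the kind of signal the classifier can use.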
 Said step 5 also includes:
 Step 51, sample set construction method: use the intersection identification results of multiple engines (pFind, MaxQuant and PEAKS) as the label set, and use pFind to re-search the spectrum in the label set. Among the re-searched results, the identification results that are consistent with the multi-engine intersection are taken as positive samples, and those that are inconsistent with the multi-engine identification results are taken as negative samples.
 Step 52, extracting four-dimensional features from the samples in the sample set, and normalizing the features to [0,1] to obtain a training set.
 Step 53, train a classification model, such as LIBSVM, on the training set, using a radial basis kernel function.
 Step 54, analyze the prediction results of LIBSVM and calculate the corresponding FPR and FNR (Figure 1a). FPR is calculated as the proportion of positive samples that are predicted to be suspicious, and FNR is calculated as the proportion of negative samples that are predicted to be credible. If the FPR and FNR are not higher than the FPR (0.06%) and FNR (1.44%) of the synthetic peptide, the training is complete; otherwise, adjust the LIBSVM parameters and retrain the classification model. Finally, the LIBSVM offline model is obtained (see the offline model training process in Figure 1b).
 Step 55, use the same feature normalization method as the offline model to normalize the four-dimensional features of the identification results.
 Step 56, use the SVM offline model to give a credibility score to the identification result.
 Step 57, give the test result according to the score: if the score is higher than or equal to 0.5, the identification result is considered credible; otherwise, it is considered suspicious (the practical application process of pValid is shown by the "Practical usage workflow" marked with solid arrows in the center of Figure 1b).
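 The bookkeeping around the classifier in steps 52, 54, 55 and 57 can be sketched without the SVM itself. The code below makes several assumptions that the source does not state: normalization to [0,1] is done by min-max scaling with bounds learned from the training set, new values outside those bounds are clamped, and the credibility score is taken as given. All function names are hypothetical.

```python
def fit_min_max(rows):
    """Learn per-feature (min, max) bounds from the training feature rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(row, bounds):
    """Scale one feature row to [0, 1] with previously learned bounds."""
    scaled = []
    for value, (low, high) in zip(row, bounds):
        if high == low:
            scaled.append(0.0)  # constant feature: no information to scale
        else:
            scaled.append(min(1.0, max(0.0, (value - low) / (high - low))))
    return scaled


def fpr_fnr(is_positive_sample, predicted_credible):
    """FPR: positive samples predicted suspicious; FNR: negative samples predicted credible."""
    pairs = list(zip(is_positive_sample, predicted_credible))
    positives = [pred for label, pred in pairs if label]
    negatives = [pred for label, pred in pairs if not label]
    if not positives or not negatives:
        raise ValueError("both positive and negative samples are required")
    fpr = sum(1 for pred in positives if not pred) / len(positives)
    fnr = sum(1 for pred in negatives if pred) / len(negatives)
    return fpr, fnr


def classify(score, threshold=0.5):
    """Step 57: scores at or above the threshold are credible."""
    return "credible" if score >= threshold else "suspicious"
```

 Reusing the training bounds in `apply_min_max` at prediction time corresponds to step 55, which requires the same normalization for new identification results as for the offline model.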
 The following are system embodiments corresponding to the foregoing method embodiments, and this implementation manner may be implemented in cooperation with the foregoing implementation manners. The relevant technical details mentioned in the foregoing implementation manners are still valid in this implementation manner, and will not be repeated here in order to reduce repetition. Correspondingly, the relevant technical details mentioned in this implementation manner may also be applied in the foregoing implementation manners.
 The present invention also proposes a peptide-spectrum matching reliability inspection system, which includes:
 Module 1. The peptide-spectrum matching data obtained from the mass spectrometry experiment search is used as the result to be detected, and the spectrum data in the result to be detected is input into the open search engine to obtain the identification result of the result to be detected;
 Module 2. Obtain the score given to the result to be detected by the restricted search engine as the first score, and extract the top n candidate peptides ranked by the first score, where n is a preset positive integer; obtain the score given to the identification result by the open search engine as the second score, and extract the top n candidate peptides ranked by the second score;
 Module 3. Predict the theoretical spectrum of each candidate peptide, calculate the cosine similarity between each theoretical spectrum and the spectrum data in the result to be detected, and determine the highest cosine similarity;
 Module 4, extracting a four-dimensional feature consisting of the first score, the second score, cosine similarity and the highest value of the result to be detected;
 Module 5. Input the four-dimensional feature into the offline model trained by SVM, perform reliability scoring, and judge the category of the identification result according to the scoring, and use it as the reliability inspection result of the result to be detected.
 In the peptide-spectrum matching credibility checking system, the open search engine has the same scoring mechanism and search parameters as the restricted search engine.
 The described peptide-spectrum matching credibility checking system, wherein the module 4 includes:
 Module 41, extracting the first score given to the result to be detected by the restricted search engine as feature 1;
 Module 42, extracting the second score of the identification result of the open search engine as feature 2;
 Module 43, extracting the cosine similarity between the theoretical spectrogram of the candidate peptide with the highest first score and the spectrogram data in the result to be detected, as feature 3;
 Module 44, extracting the highest cosine similarity between the theoretical spectra of all candidate peptides and the spectrum data in the result to be detected, as feature 4, and integrating feature 1, feature 2, feature 3 and feature 4 as the four-dimensional feature.
 The described peptide-spectrum matching credibility checking system, wherein the training system of the offline model includes:
 Module 51. Using the multi-engine intersection identification results as the annotation set, re-searching the spectra in the annotation set with the restricted search engine, taking the re-searched identification results that are consistent with the multi-engine intersection as positive samples and those that are inconsistent as negative samples, and collecting the positive samples and negative samples as a sample set;
 Module 52. Extracting the four-dimensional feature from each sample in the sample set, and normalizing the four-dimensional features to [0,1] to obtain the training set;
 Module 53, training the classification model on the training set to obtain prediction results;
 Module 54, in the prediction results, counting the proportion of positive samples that are predicted to be suspicious, as the FPR, and the proportion of negative samples that are predicted to be credible, as the FNR; if both FPR and FNR are less than or equal to the preset thresholds, the training is completed; otherwise, the parameters of the classification model are adjusted and the classification model is retrained.
 The present invention also proposes a storage medium, which is used to store the program for executing the method for checking the reliability of the peptide-spectrum matching.
 As shown in Figure 3, the present invention also proposes a data processing device, which includes a processing unit and the storage medium, and the processing unit invokes and executes the program in the storage medium.