Antibacterial peptide recognition method and device based on secondary structure features and robust statistics
By constructing a standardized secondary structure feature library and MCD robust statistical modeling, and combining Mahalanobis distance squared for antimicrobial peptide identification, the problems of insufficient feature utilization and data bias in existing antimicrobial peptide identification methods are solved, and high-throughput and high-accuracy automated screening is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIAMEN OCEAN VOCATIONAL & TECH COLLEGE
- Filing Date
- 2026-05-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing methods for identifying antimicrobial peptides do not fully utilize secondary structure features, resulting in data bias and poor model robustness, leading to low identification accuracy and a lack of automated processes.
By constructing a standardized secondary structure feature library, employing the centered log-ratio (CLR) transformation for compositional data and MCD robust statistical modeling, and combining Mahalanobis distance squared for antimicrobial peptide identification, high-throughput and high-accuracy automated screening is achieved.
This improved the feature utilization rate of antimicrobial peptide recognition, reduced data bias, enhanced the model's anti-interference ability, and achieved automated recognition with high accuracy and reproducibility.
Figure CN122245452A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of biopeptide recognition technology, and in particular to an antimicrobial peptide recognition method and an antimicrobial peptide recognition device based on secondary structure features and robust statistics.
Background Technology
[0002] Antimicrobial peptides (AMPs) are a class of small peptides with broad-spectrum antimicrobial activity. Their secondary structure (helices, sheets, turns, coils, etc.) is a key characteristic determining their antimicrobial activity. Existing antimicrobial peptide recognition methods have the following core shortcomings: (1) The secondary structure features are not fully utilized. Only the proportion of a single structure (such as the α-helix ratio) is used for simple judgment. Pattern recognition is not performed on the overall distribution of the eight types of secondary structures, so key structural information is lost.
[0003] (2) The handling of compositional data is fundamentally biased. The secondary structure profile is closed (constant-sum) data with 8 components. Directly using percentages, Euclidean distance, or conventional PCA introduces closure bias, resulting in feature distortion and low recognition accuracy.
[0004] (3) The statistical model has poor robustness. Traditional mean and covariance estimation is easily affected by abnormal samples. A small number of atypical positive samples will significantly skew the model, causing threshold drift and high false positive / false negative rates.
[0005] (4) There is no standardized benchmark or full-process automation. A unified secondary structure feature benchmark library is lacking, data formats are inconsistent, and column names vary. Model training, feature transformation, sample discrimination, and result output have not been integrated into a single process, so many manual operations are required and the results are not reproducible.
Summary of the Invention
[0006] This invention aims to solve, at least to some extent, one of the technical problems in the aforementioned technologies. To this end, one objective of this invention is to propose an antimicrobial peptide identification method based on secondary structure features and robust statistics. By combining a standardized secondary structure feature library, the centered log-ratio (CLR) transformation for compositional data, and MCD robust statistical modeling, this method addresses the low feature utilization, large data bias, and weak anti-interference ability of traditional antimicrobial peptide identification methods, achieving high-throughput, high-accuracy, and highly reproducible automated identification and screening of antimicrobial peptide sequences.
[0007] The second objective of this invention is to propose an antimicrobial peptide recognition device based on secondary structure features and robust statistics.
[0008] To achieve the above objectives, a first aspect of the present invention proposes an antimicrobial peptide identification method based on secondary structure features and robust statistics. The method includes: acquiring raw secondary structure data of known antimicrobial peptides, wherein each raw secondary structure data of a known antimicrobial peptide includes eight secondary structure types and corresponding raw counts; performing structure count transformation and centering logarithmic ratio transformation on the raw secondary structure data of each known antimicrobial peptide to construct a secondary structure feature benchmark library; estimating the robust central mean vector and robust covariance matrix using the minimum covariance determinant based on the secondary structure feature benchmark library, and calculating the squared Mahalanobis distance to determine an identification threshold; acquiring secondary structure data of the peptide to be identified, standardizing the secondary structure data of the peptide to be identified to obtain the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified; and comparing the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified with the identification threshold to obtain the corresponding identification result.
[0009] The antimicrobial peptide identification method based on secondary structure features and robust statistics according to embodiments of the present invention has the following advantages: by combining a standardized secondary structure feature library, the CLR transformation for compositional data, and MCD robust statistical modeling, it addresses the low feature utilization, large data bias, and weak anti-interference ability of traditional antimicrobial peptide identification methods, and realizes high-throughput, high-accuracy, and highly reproducible automated identification and screening of antimicrobial peptide sequences.
[0010] In addition, the antimicrobial peptide recognition method based on secondary structure features and robust statistics proposed in the above embodiments of the present invention may also have the following additional technical features: Optionally, the eight types of secondary structure types include a structure name and a structure code corresponding to each structure name.
[0011] Optionally, the original secondary structure data of each known antimicrobial peptide is subjected to structure counting transformation and centered logarithmic ratio transformation, including: converting the eight types of secondary structures in the original secondary structure data of each known antimicrobial peptide into normalized percentage values according to the original counts, and the sum of the normalized percentage values of the eight types of secondary structures after transformation is 100%; adding the same pseudo-count to all percentage values, taking the logarithm and subtracting the geometric mean to obtain the centered logarithmic ratio transformation result.
[0012] Optionally, the robust center mean vector and robust covariance matrix are estimated using the minimum covariance determinant based on the secondary structure feature benchmark library, and the squared Mahalanobis distance is calculated to determine the recognition threshold based on the squared Mahalanobis distance. This includes: randomly selecting a fixed proportion of samples from all samples in the secondary structure feature benchmark library and iterating repeatedly to form a candidate subset; calculating the mean vector and covariance matrix of 8 types of secondary structures, as well as the determinant value of the covariance matrix, for each candidate subset, and selecting the subset with the smallest determinant value as the final effective subset; calculating the robust center mean vector and robust covariance matrix based on the final effective subset; calculating the squared Mahalanobis distance from all positive samples to the benchmark center based on the robust center mean vector and robust covariance matrix; sorting the squared Mahalanobis distances of all samples from smallest to largest, and finding the squared Mahalanobis distance value at a preset threshold as the recognition threshold.
[0013] Optionally, the squared Mahalanobis distances from all positive samples to the reference center are calculated using the following formula: $\mathrm{Mahalanobis}^2 = (x-\mu)^{\top}\,\Sigma^{+}\,(x-\mu)$
[0014] Where $x$ represents the centered log-ratio vector of a single sample; $\mu$ represents the robust central mean vector; $\Sigma^{+}$ denotes the Moore-Penrose pseudoinverse of the robust covariance matrix; and $(x-\mu)^{\top}$ denotes the transpose of the vector $(x-\mu)$.
[0015] Optionally, the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified is compared with an identification threshold to obtain the corresponding identification result, including: if the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified is less than or equal to the identification threshold, then the secondary structure of the peptide to be identified is determined to belong to the secondary structure distribution characteristics of a typical antimicrobial peptide; otherwise, the secondary structure of the peptide to be identified is determined to belong to the secondary structure distribution characteristics of an atypical antimicrobial peptide.
[0016] Optionally, after obtaining the corresponding recognition results, the process also includes: organizing the data information from the recognition process and generating structured reports and visualization charts.
[0017] To achieve the above objectives, a second aspect of the present invention proposes an antimicrobial peptide identification device based on secondary structure features and robust statistics, comprising: an acquisition module for acquiring raw secondary structure data of known antimicrobial peptides, wherein the raw secondary structure data of each known antimicrobial peptide includes 8 types of secondary structure and corresponding raw counts; a construction module for performing structure count transformation and centered log-ratio transformation on the raw secondary structure data of each known antimicrobial peptide to construct a secondary structure feature benchmark library; a first calculation module for estimating the robust central mean vector and robust covariance matrix using the minimum covariance determinant based on the secondary structure feature benchmark library, and calculating the squared Mahalanobis distance to determine an identification threshold based on the squared Mahalanobis distance; a second calculation module for acquiring secondary structure data of a peptide to be identified, and performing standardization processing on the secondary structure data of the peptide to be identified to obtain the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified; and a comparison and identification module for comparing the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified with the identification threshold to obtain the corresponding identification result.
Attached Figure Description
[0018] Figure 1 is a flowchart of the antimicrobial peptide identification method based on secondary structure features and robust statistics according to an embodiment of the present invention. Figure 2 is a flowchart of an antimicrobial peptide identification method based on secondary structure features and robust statistics according to an embodiment of the present invention. Figure 3 is a schematic diagram of the result according to an embodiment of the present invention. Figure 4 is a schematic diagram of the result according to an embodiment of the present invention. Figure 5 is a block diagram of an antimicrobial peptide recognition device based on secondary structure features and robust statistics according to an embodiment of the present invention.
Detailed Implementation
[0019] Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain the present invention, and should not be construed as limiting the present invention.
[0020] To better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention can be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to enable a more thorough understanding of the present invention and to fully convey the scope of the invention to those skilled in the art.
[0021] To better understand the above technical solutions, the following will provide a detailed explanation of the technical solutions in conjunction with the accompanying drawings and specific implementation methods.
[0022] Figure 1 is a flowchart of the antimicrobial peptide recognition method based on secondary structure features and robust statistics according to an embodiment of the present invention. As shown in Figure 1, this method for identifying antimicrobial peptides based on secondary structure features and robust statistics includes the following steps: S101, obtain raw secondary structure data of known antimicrobial peptides, wherein the raw secondary structure data of each known antimicrobial peptide includes 8 types of secondary structure and corresponding raw counts.
[0023] As an example, the eight secondary structure types include a structure name and a structure code corresponding to each structure name.
[0024] It should be noted that, as shown in Table 1 below, the secondary structures of antimicrobial peptides are divided into 8 categories, which constitute the core feature set for determining whether a sequence is an antimicrobial peptide: Table 1: Secondary structures and corresponding codes of antimicrobial peptides
[0025] As a specific example, suppose the raw secondary structure data of a known antimicrobial peptide is shown in Table 2 below, with a total of 60 amino acids: Table 2: Raw data on the secondary structures of known antimicrobial peptides
[0026] S102, perform structure counting transformation and centered logarithmic ratio transformation on the original secondary structure data of each known antimicrobial peptide to construct a secondary structure feature benchmark library.
[0027] As an example, the original secondary structure data of each known antimicrobial peptide is subjected to structure counting transformation and centered logarithmic ratio transformation, including: converting the eight types of secondary structures in the original secondary structure data of each known antimicrobial peptide into normalized percentage values according to the original counts, and the sum of the normalized percentage values of the eight types of secondary structures after transformation is 100%; adding the same pseudo-count to all percentage values, taking the logarithm and subtracting the geometric mean to obtain the centered logarithmic ratio transformation result.
[0028] As a specific implementation, the calculation logic for the normalized percentage is as follows: Class structure percentage = Original count of a single class structure ÷ Total count × 100%, ensuring that the sum of the percentages of the 8 classes is 100%. The normalized percentages calculated according to Table 2 above are shown in Table 3 below: Table 3: Raw counts converted to rows and 100% percentage normalized data
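The percentage normalization described above can be sketched as follows. This is a minimal illustration, not the patented implementation: Tables 1 and 2 are not reproduced in this text, so the one-letter codes in `CODES` (a DSSP-style alphabet) and the counts in `example_counts` are hypothetical values chosen only so that they sum to 60 residues and contain a zero count for the G class.

```python
# Hypothetical DSSP-style codes; the source's Table 1 is not available here.
CODES = ["H", "G", "I", "E", "B", "T", "S", "C"]


def counts_to_percent(counts):
    """Convert raw per-class residue counts into percentages summing to 100."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("a peptide needs at least one counted residue")
    return [c / total * 100.0 for c in counts]


# Illustrative counts only (not the values of Table 2); they sum to 60.
example_counts = [24, 0, 3, 9, 3, 6, 6, 9]
percent = counts_to_percent(example_counts)
percent_by_code = dict(zip(CODES, percent))
```

Dividing by the per-peptide total is what removes the dependence on sequence length, so peptides of different lengths become directly comparable.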
[0029] It should be noted that the difference in length of different polypeptide sequences affects the proportion of the structure. The effect of the difference in length of different polypeptide sequences on the proportion of the structure was eliminated by the normalization process, and the structural characteristics of all samples were unified in the form of percentage distribution.
[0030] Next, since the percentage of G-class structures is 0, its logarithm is undefined and cannot be taken directly. Therefore, the same pseudo-count of 0.5 is added to all percentage values. The results are shown in Table 4 below: Table 4: Results of Pseudo-counting
[0031] Next, the centered log-ratio (CLR) transformation is performed. The CLR transformation formula is: $\mathrm{clr}(x_i) = \ln x_i - \frac{1}{n}\sum_{j=1}^{n}\ln x_j$
[0032] Here, $x_i$ represents the numerical value of the i-th structure after adding the pseudo-count, and $n=8$ is the total number of structure categories. The term $\frac{1}{n}\sum_{j=1}^{n}\ln x_j$ is the mean of the logarithms of all categories, which equals the logarithm of their geometric mean.
[0033] The specific calculation results are shown in Table 5 below: Table 5: CLR Transformation Results
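A short sketch of the pseudo-count and CLR steps may help make the arithmetic concrete. The pseudo-count of 0.5 comes from the description; the input percentages are hypothetical (Table 3 is not reproduced in this text), and the function name `clr_transform` is an illustrative choice.

```python
import math


def clr_transform(percent, pseudo_count=0.5):
    """Centered log-ratio: log of each shifted value minus the mean log."""
    logs = [math.log(p + pseudo_count) for p in percent]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


# Hypothetical percentages for the 8 classes (sum to 100, one class is 0).
percent = [40.0, 0.0, 5.0, 15.0, 5.0, 10.0, 10.0, 15.0]
clr = clr_transform(percent)
```

By construction the CLR coordinates of each sample sum to zero, which is why the covariance matrix of CLR data is singular and a pseudoinverse is needed later when computing the Mahalanobis distance.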
[0034] Therefore, after processing the raw count data of all known antimicrobial peptides as described above, a standardized feature library can be obtained: the closure bias of the compositional data is eliminated, and the library can be directly used for subsequent robust statistical model training; the feature space of all samples is completely uniform, and there are no differences in naming, dimensions, or numerical range.
[0035] S103. Based on the secondary structure feature benchmark library, the robust center mean vector and robust covariance matrix are estimated using the minimum covariance determinant, and the squared Mahalanobis distance is calculated to determine the identification threshold based on the squared Mahalanobis distance.
[0036] As an example, the robust center mean vector and robust covariance matrix are estimated using the minimum covariance determinant based on the secondary structure feature benchmark library, and the squared Mahalanobis distance is calculated to determine the recognition threshold. This includes: randomly selecting a fixed proportion of samples from all samples in the secondary structure feature benchmark library and iterating repeatedly to form a candidate subset; calculating the mean vector and covariance matrix of 8 types of secondary structures, as well as the determinant value of the covariance matrix, for each candidate subset, and selecting the subset with the smallest determinant value as the final effective subset; calculating the robust center mean vector and robust covariance matrix based on the final effective subset; calculating the squared Mahalanobis distance from all positive samples to the benchmark center based on the robust center mean vector and robust covariance matrix; sorting the squared Mahalanobis distances of all samples from smallest to largest, and finding the squared Mahalanobis distance value at a preset threshold as the recognition threshold.
[0037] As an example, the squared Mahalanobis distance from all positive samples to the reference center is calculated according to the following formula: $\mathrm{Mahalanobis}^2 = (x-\mu)^{\top}\,\Sigma^{+}\,(x-\mu)$
[0038] Where $x$ represents the centered log-ratio vector of a single sample; $\mu$ represents the robust central mean vector; $\Sigma^{+}$ denotes the Moore-Penrose pseudoinverse of the robust covariance matrix; and $(x-\mu)^{\top}$ denotes the transpose of the vector $(x-\mu)$.
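The distance formula above can be sketched with NumPy, whose `np.linalg.pinv` computes the Moore-Penrose pseudoinverse. The function name `mahalanobis_sq` and the small 2-dimensional inputs are illustrative assumptions; in the method itself the vectors would have 8 CLR coordinates.

```python
import numpy as np


def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^+ (x - mu) per row of x."""
    diff = np.atleast_2d(np.asarray(x, dtype=float) - np.asarray(mu, dtype=float))
    sigma_pinv = np.linalg.pinv(np.asarray(sigma, dtype=float))
    # Row-wise quadratic form: sum_jk diff[i, j] * pinv[j, k] * diff[i, k]
    return np.einsum("ij,jk,ik->i", diff, sigma_pinv, diff)


# Full-rank toy case: identity covariance reduces to squared Euclidean distance.
d2_identity = mahalanobis_sq([3.0, 4.0], [0.0, 0.0], np.eye(2))

# Singular toy case: the pseudoinverse ignores the zero-variance direction.
d2_singular = mahalanobis_sq([2.0, 5.0], [0.0, 0.0], np.diag([1.0, 0.0]))
```

Using the pseudoinverse rather than a plain inverse matters here because CLR coordinates always sum to zero, so their covariance matrix cannot be inverted directly.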
[0039] As a specific implementation, the core objective of the MCD method is to select an optimal subset of 75% from all samples. The subset has the smallest determinant of its covariance matrix, indicating that the sample distribution within the subset is the most concentrated and the interference from outliers is minimal.
[0040] Calculation process: (1) Sampling settings: Total sample size n=20, each time 75% i.e. 15 samples are randomly selected to form a candidate subset, and a total of 500 iterations are performed.
[0041] (2) Single iteration calculation: For each candidate subset, calculate the mean vector of the 8 features of that subset and its covariance matrix, then calculate the determinant of the covariance matrix. The smaller the determinant, the lower the dispersion of the samples within the subset and the more concentrated their distribution.
[0042] (3) Optimal subset selection: After 500 iterations, the subset with the smallest covariance determinant is taken as the final valid subset of the MCD.
[0043] It should be noted that outlier samples in the example are excluded from the optimal subset, thereby suppressing outliers. Compared with the traditional method of directly calculating the mean and covariance of all samples, the anti-interference ability is improved by more than 60%.
[0044] (4) Output robust statistical parameters: Based on the final effective subset, the final robust center mean vector μ (8 elements, corresponding to the CLR mean of 8 types of structures) and robust covariance matrix Σ (8×8 matrix, reflecting the covariance relationship between 8 types of structures) are calculated.
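Steps (1) to (4) can be sketched as a random-subset search. This is a simplified illustration under stated assumptions, not the patented implementation: the benchmark data are synthetic (drawn from a Dirichlet distribution), and because CLR rows sum to zero the 8×8 covariance is singular, so the sketch compares a pseudo-determinant (the product of the 7 largest eigenvalues) instead of the ordinary determinant. The source does not say how this singularity is handled, so `pseudo_det` is an assumption.

```python
import numpy as np


def pseudo_det(cov):
    """Product of all eigenvalues except the smallest (assumed rank d - 1)."""
    eig = np.linalg.eigvalsh(cov)  # ascending order
    return float(np.prod(np.clip(eig[1:], 0.0, None)))


def mcd_random_search(clr, fraction=0.75, n_iter=500, seed=0):
    """Pick the random subset with the smallest (pseudo-)determinant."""
    clr = np.asarray(clr, dtype=float)
    n = clr.shape[0]
    h = int(round(fraction * n))  # 15 of 20 in the described example
    rng = np.random.default_rng(seed)
    best_det, best_idx = np.inf, None
    for _ in range(n_iter):
        idx = rng.choice(n, size=h, replace=False)
        det = pseudo_det(np.cov(clr[idx], rowvar=False))
        if det < best_det:
            best_det, best_idx = det, np.sort(idx)
    subset = clr[best_idx]
    return subset.mean(axis=0), np.cov(subset, rowvar=False), best_idx


# Synthetic benchmark library: 20 samples, 8 classes, pseudo-count 0.5.
data_rng = np.random.default_rng(1)
shifted = data_rng.dirichlet(np.full(8, 5.0), size=20) * 100.0 + 0.5
logs = np.log(shifted)
clr_library = logs - logs.mean(axis=1, keepdims=True)

mu, sigma, support = mcd_random_search(clr_library)
```

Pure random sampling is the scheme the description outlines; library implementations of MCD such as `sklearn.covariance.MinCovDet` refine candidate subsets iteratively instead, and may need full-rank coordinates rather than raw CLR values.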
[0045] Next, the squared Mahalanobis distance from all positive samples to the baseline center is calculated. The 20 squared Mahalanobis distance values are sorted from smallest to largest, and the value at the 95th percentile is used as the identification threshold to ensure that 95% of the known positive samples can be identified as typical antimicrobial peptides, while excluding a few abnormal outliers.
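The threshold selection can be illustrated with a small percentile helper. The source does not specify how the 95th percentile is interpolated for only 20 values, so linear interpolation between order statistics is an assumption here, and the distances 1 to 20 are placeholder values rather than real results.

```python
def percentile_linear(values, q):
    """q-th percentile using linear interpolation between sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("at least one distance is required")
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# Placeholder squared distances for 20 positive samples.
d2_positive = [float(v) for v in range(1, 21)]
threshold = percentile_linear(d2_positive, 95.0)
n_typical = sum(1 for d in d2_positive if d <= threshold)
```

With these placeholder values, 19 of the 20 samples fall at or below the threshold, matching the stated goal of retaining 95% of known positives.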
[0046] It should be noted that the identification threshold eliminates the interference of outlier positive samples, and compared with the traditional threshold method of the mean plus 3 times the standard deviation, the stability and identification accuracy are significantly improved.
[0047] S104: Obtain the secondary structure data of the peptide to be identified, and standardize the secondary structure data of the peptide to be identified to obtain the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified.
[0048] It should be noted that the standardization process is the same as the steps described above: reading in the Excel data to be identified; matching its columns against the benchmark library standard, filling missing values with 0, and converting all values to numeric form; performing the same percentage normalization and CLR transformation; and calculating the squared Mahalanobis distance from the sample to the benchmark center.
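The column matching and zero filling can be sketched on plain dictionaries. The codes in `CODES` and the matching rule (trim whitespace, ignore case) are assumptions, since the source does not list the benchmark column names or its matching logic; reading the Excel file itself is left out.

```python
# Hypothetical benchmark column names (DSSP-style codes).
CODES = ["H", "G", "I", "E", "B", "T", "S", "C"]


def standardize_row(raw_row, codes=CODES):
    """Align one input row to the benchmark columns, filling gaps with 0."""
    cleaned = {str(key).strip().upper(): value for key, value in raw_row.items()}
    row = {}
    for code in codes:
        try:
            row[code] = float(cleaned.get(code))
        except (TypeError, ValueError):
            # Missing column, empty cell, or non-numeric text becomes 0.
            row[code] = 0.0
    return row


# Illustrative messy input: padded and lowercase names, text, missing cells.
raw = {" h ": "10", "e": 5, "T": None, "notes": "candidate"}
standardized = standardize_row(raw)
```

Columns that are not part of the benchmark standard, such as `notes` in this example, are dropped so every candidate ends up in the same 8-dimensional feature space as the benchmark library.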
[0049] S105, compare the squared Mahalanobis distance corresponding to the secondary structure data of the polypeptide to be identified with the identification threshold to obtain the corresponding identification result.
[0050] As an example, the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified is compared with the identification threshold to obtain the corresponding identification result, including: if the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified is less than or equal to the identification threshold, then the secondary structure of the peptide to be identified is determined to belong to the secondary structure distribution characteristics of a typical antimicrobial peptide; otherwise, the secondary structure of the peptide to be identified is determined to belong to the secondary structure distribution characteristics of an atypical antimicrobial peptide.
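The decision rule reduces to a single inclusive comparison, sketched below. The label strings and the candidate distances are illustrative; the value 21.27 is the identification threshold reported for the embodiment's visualization results.

```python
def classify(d2, threshold):
    """Label a candidate by comparing its squared distance to the threshold."""
    return "typical" if d2 <= threshold else "atypical"


threshold = 21.27
# Illustrative squared distances for three candidate peptides.
labels = [classify(d2, threshold) for d2 in (8.4, 21.27, 35.0)]
```

The comparison is inclusive, so a candidate whose distance equals the threshold exactly is still labeled typical.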
[0051] Therefore, by using the identification threshold as a benchmark, we can obtain the identification tags (typical / atypical antimicrobial peptides), corresponding Mahalanobis distance values, and the proportion data of 8 types of structures for all candidate sequences to be identified.
[0052] As an example, after obtaining the corresponding recognition results, the process also includes: organizing the data information of the recognition process and generating structured reports and visualization charts.
[0053] In other words, the output includes a CSV report and four types of visualization charts. The CSV report includes serial number / ID, 8 types of structure, Mahalanobis distance, and identification label. The four types of visualization charts include PCA distribution chart of the benchmark library and candidate sequences, Mahalanobis distance distribution histogram, Top 20 ranking chart of outlier sequences, and classification comparison chart of identification results.
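A minimal sketch of the CSV report, using Python's standard `csv` module, is given below. The column names, number formatting, and the two records are hypothetical; the source only states which kinds of fields the report contains. The visualization charts are not covered.

```python
import csv
import io

# Hypothetical structure codes used as report columns.
CODES = ["H", "G", "I", "E", "B", "T", "S", "C"]


def write_report(records, stream):
    """Write ID, 8 structure percentages, squared distance, and label."""
    writer = csv.writer(stream)
    writer.writerow(["id", *CODES, "mahalanobis_sq", "label"])
    for rec in records:
        percents = [f"{rec['percent'][code]:.2f}" for code in CODES]
        writer.writerow([rec["id"], *percents, f"{rec['d2']:.4f}", rec["label"]])


# Illustrative records only; not results from the source.
records = [
    {"id": "pep_001", "d2": 12.3456, "label": "typical",
     "percent": dict(zip(CODES, [40, 0, 5, 15, 5, 10, 10, 15]))},
    {"id": "pep_002", "d2": 30.5, "label": "atypical",
     "percent": dict(zip(CODES, [5, 5, 0, 50, 10, 10, 10, 10]))},
]

buffer = io.StringIO()
write_report(records, buffer)
report_lines = buffer.getvalue().splitlines()
```

Writing to an in-memory buffer keeps the example free of file paths; passing an opened file object instead would produce the report on disk.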
[0054] The visualization results are shown in Figures 3 and 4, which jointly verify the effectiveness of the antimicrobial peptide identification method of this application: the PCA distribution map shows that the secondary structure features of most candidate peptides overlap with the distribution of known antimicrobial peptides and fall within the 95% confidence interval; the Mahalanobis distance distribution histogram shows that the squared Mahalanobis distance of the vast majority of candidate sequences is lower than the identification threshold of 21.27, indicating that the method can effectively screen out a large number of candidate sequences that meet the typical secondary structure features of antimicrobial peptides.
[0055] In summary, as shown in Figure 2, the antimicrobial peptide identification method based on secondary structure features and robust statistics in this application first loads positive sample data of known antimicrobial peptide secondary structures. After standardized matching of the 8 structure column names, conversion of counts to normalized percentages, and the centered log-ratio (CLR) transformation, a benchmark library is constructed. Then, the central mean and covariance matrix are obtained through robust estimation by MCD. The squared Mahalanobis distance of the positive samples is calculated and the 95th percentile is taken as the identification threshold. Subsequently, candidate sequence data to be identified is loaded. After the same standardization, cleaning, and transformation, the squared Mahalanobis distance is calculated for each sample and compared with the threshold. Finally, the identification results are output in CSV format and visualization charts such as the PCA distribution map and the distance distribution map are generated, realizing high-throughput and high-accuracy automated screening of antimicrobial peptides.
[0056] To implement the above embodiments, as shown in Figure 5, an embodiment of the invention also proposes an antimicrobial peptide recognition device based on secondary structure features and robust statistics, including: an acquisition module 10, a construction module 20, a first calculation module 30, a second calculation module 40, and a comparison and recognition module 50.
[0057] The system comprises the following modules: Acquisition module 10 acquires raw secondary structure data of known antimicrobial peptides, where each raw secondary structure data includes eight secondary structure types and their corresponding raw counts; Construction module 20 performs structure count transformation and centering logarithmic ratio transformation on the raw secondary structure data of each known antimicrobial peptide to construct a secondary structure feature benchmark library; First calculation module 30 estimates the robust central mean vector and robust covariance matrix using the minimum covariance determinant based on the secondary structure feature benchmark library, and calculates the squared Mahalanobis distance to determine the identification threshold; Second calculation module 40 acquires the secondary structure data of the peptide to be identified, standardizes the secondary structure data, and obtains the squared Mahalanobis distance corresponding to the secondary structure data; Comparison and identification module 50 compares the squared Mahalanobis distance corresponding to the secondary structure data with the identification threshold to obtain the corresponding identification result.
[0058] It should be noted that the above description and examples of the antimicrobial peptide recognition method based on secondary structure features and robust statistics are also applicable to the antimicrobial peptide recognition device based on secondary structure features and robust statistics in this embodiment, and will not be repeated here.
[0059] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0060] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0061] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0062] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0063] It should be noted that any reference signs placed between parentheses in the claims should not be construed as limiting the claims. The word "comprising" does not exclude the presence of components or steps not listed in the claims. The word "a" or "an" preceding a component does not exclude the presence of a plurality of such components. The invention can be implemented by means of hardware comprising several different components and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by the same item of hardware. The use of the words first, second, and third, etc., does not indicate any order. These words can be interpreted as names.
[0064] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.
[0065] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.
[0066] In the description of this invention, it should be understood that the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.
[0067] In this invention, unless otherwise explicitly specified and limited, the terms "installation," "connection," "linking," and "fixing," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two components or the interaction between two components. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.
[0068] In this invention, unless otherwise explicitly specified and limited, a first feature being "above" or "below" a second feature can mean that the first feature is in direct contact with the second feature, or that the first feature is in indirect contact with the second feature through an intermediate medium. Furthermore, the first feature being "above," "over," or "on top of" the second feature can mean that the first feature is directly above or diagonally above the second feature, or simply that the first feature is at a higher level than the second feature. The first feature being "below," "beneath," or "under" the second feature can mean that the first feature is directly below or diagonally below the second feature, or simply that the first feature is at a lower level than the second feature.
[0069] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms should not be construed as necessarily referring to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
[0070] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.
Claims
1. A method for identifying antimicrobial peptides based on secondary structure features and robust statistics, characterized in that it comprises the following steps: obtaining raw secondary structure data of known antimicrobial peptides, wherein the raw secondary structure data of each known antimicrobial peptide includes eight types of secondary structure and their corresponding raw counts; performing structure counting transformation and centered logarithmic ratio transformation on the raw secondary structure data of each known antimicrobial peptide to construct a secondary structure feature benchmark library; based on the secondary structure feature benchmark library, estimating a robust center mean vector and a robust covariance matrix using the minimum covariance determinant, and calculating squared Mahalanobis distances so as to determine a recognition threshold from the squared Mahalanobis distances; acquiring secondary structure data of a peptide to be identified, and performing standardization processing on the secondary structure data of the peptide to be identified to obtain the corresponding squared Mahalanobis distance; and comparing the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified with the recognition threshold to obtain a corresponding recognition result.
2. The method for identifying antimicrobial peptides based on secondary structure features and robust statistics as described in claim 1, characterized in that each of the eight types of secondary structure has a structure name and a structure code corresponding to that structure name.
3. The method for identifying antimicrobial peptides based on secondary structure features and robust statistics as described in claim 1, characterized in that performing structure counting transformation and centered logarithmic ratio transformation on the raw secondary structure data of each known antimicrobial peptide includes: converting the raw counts of the eight secondary structures in the raw secondary structure data of each known antimicrobial peptide into normalized percentage values, such that the normalized percentage values of the eight secondary structures sum to 100%; and adding the same pseudo-count to all percentage values, taking the logarithm, and subtracting the logarithm of the geometric mean to obtain the centered logarithmic ratio transformation result.
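For illustration only, the transformation recited in claim 3 can be sketched in Python with NumPy as follows. The function name `clr_from_counts`, the pseudo-count value `1e-6`, and the example counts are assumptions made for this sketch and are not taken from the specification. Subtracting the mean of the logarithms is equivalent to subtracting the logarithm of the geometric mean.

```python
import numpy as np


def clr_from_counts(counts, pseudo_count=1e-6):
    """Convert eight raw structure counts to a centered log-ratio vector.

    pseudo_count is a hypothetical value chosen for this sketch; the claim
    only requires that the same pseudo-count be added to every percentage.
    """
    counts = np.asarray(counts, dtype=float)
    if counts.shape != (8,):
        raise ValueError("expected exactly eight raw counts")
    total = counts.sum()
    if total <= 0:
        raise ValueError("raw counts must sum to a positive value")

    # Structure counting transformation: percentages that sum to 100.
    percent = 100.0 * counts / total

    # Same pseudo-count on every component so that zeros have a finite log.
    logs = np.log(percent + pseudo_count)

    # Mean of logs equals the log of the geometric mean.
    return logs - logs.mean()


# Hypothetical raw counts for one peptide (eight structure types).
example_clr = clr_from_counts([12, 0, 3, 0, 0, 4, 2, 9])
```

By construction, the components of the returned vector sum to zero (up to floating-point error), which is a general property of the centered logarithmic ratio transformation.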
4. The method for identifying antimicrobial peptides based on secondary structure features and robust statistics as described in claim 1, characterized in that estimating the robust center mean vector and the robust covariance matrix using the minimum covariance determinant based on the secondary structure feature benchmark library, and calculating the squared Mahalanobis distance to determine the recognition threshold, includes: randomly selecting a fixed proportion of samples from all samples in the secondary structure feature benchmark library, and repeating the selection multiple times to form multiple candidate subsets; for each candidate subset, calculating the mean vector and covariance matrix of the eight types of secondary structures and the determinant of the covariance matrix, and selecting the subset with the smallest determinant as the final effective subset; calculating the robust center mean vector and the robust covariance matrix from the final effective subset; calculating the squared Mahalanobis distances from all positive samples to the baseline center based on the robust center mean vector and the robust covariance matrix; and sorting all samples by squared Mahalanobis distance in ascending order, and taking the squared Mahalanobis distance value located at a preset threshold position in the sorted order as the recognition threshold.
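As a hedged illustration of the subset search recited in claim 4, the following Python sketch draws random subsets of a fixed proportion, keeps the one whose covariance determinant is smallest, and returns the mean vector and covariance matrix of that subset. The function name `min_det_subset` and the default values of `fraction`, `n_trials`, and `seed` are assumptions of this sketch. The sketch also assumes that each subset covariance matrix is nonsingular; for centered logarithmic ratio vectors, whose components sum to zero, the ordinary determinant is numerically close to zero, and how that case is handled is not specified here, so the demonstration uses generic full-rank data.

```python
import numpy as np


def min_det_subset(X, fraction=0.75, n_trials=500, seed=0):
    """Random-subset search for the smallest covariance determinant.

    X has shape (n_samples, n_features). fraction, n_trials and seed are
    hypothetical defaults, not values from the specification.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    # Subset size: fixed proportion of the samples, at least p + 1 rows.
    h = min(n, max(p + 1, int(round(fraction * n))))

    rng = np.random.default_rng(seed)
    best_det, best_idx = np.inf, None
    for _ in range(n_trials):
        idx = rng.choice(n, size=h, replace=False)
        det = np.linalg.det(np.cov(X[idx], rowvar=False))
        if det < best_det:
            best_det, best_idx = det, idx

    subset = X[best_idx]
    robust_mean = subset.mean(axis=0)
    robust_cov = np.cov(subset, rowvar=False)
    return robust_mean, robust_cov, np.sort(best_idx)


# Demonstration on synthetic data with two gross outliers (rows 0 and 1).
demo_rng = np.random.default_rng(1)
demo_X = demo_rng.normal(size=(40, 3))
demo_X[:2] += 50.0
demo_mean, demo_cov, demo_idx = min_det_subset(demo_X, fraction=0.5, n_trials=200)
```

Subsets that contain an outlier have a much larger covariance determinant than clean subsets, so the selected subset tends to exclude the outliers, which is the property that makes the resulting mean and covariance estimates robust.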
5. The method for identifying antimicrobial peptides based on secondary structure features and robust statistics as described in claim 4, characterized in that the squared Mahalanobis distance from each positive sample to the baseline center is calculated using the following formula: $\mathrm{Mahalanobis}^2 = (x-\mu)^{\mathsf{T}} \cdot \Sigma^{+} \cdot (x-\mu)$, where $x$ represents the centered logarithmic ratio vector of a single sample; $\mu$ represents the robust center mean vector; $\Sigma^{+}$ denotes the Moore-Penrose pseudoinverse of the robust covariance matrix; and $(x-\mu)^{\mathsf{T}}$ denotes the transpose of the vector $(x-\mu)$.
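The formula of claim 5 can be sketched in Python as shown below. This is a minimal illustration, not a definitive implementation; the function name `squared_mahalanobis` and the small example matrices are assumptions of this sketch. Because the components of a centered logarithmic ratio vector sum to zero, a covariance matrix estimated from such vectors is singular, which is one situation in which a pseudoinverse is used in place of the ordinary inverse.

```python
import numpy as np


def squared_mahalanobis(X, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T * pinv(cov) * (x - mu).

    X may be one vector or an array of shape (n_samples, n_features).
    Returns a 1-D array with one distance per row of X.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    diff = X - np.asarray(mu, dtype=float)
    # Moore-Penrose pseudoinverse also works when cov is singular.
    cov_pinv = np.linalg.pinv(np.asarray(cov, dtype=float))
    # Row-wise quadratic form diff_i^T * cov_pinv * diff_i.
    return np.einsum("ij,jk,ik->i", diff, cov_pinv, diff)


# Example with a singular, zero-sum covariance structure (hypothetical values).
singular_cov = np.eye(3) - np.ones((3, 3)) / 3.0
d2 = squared_mahalanobis([[1.0, -1.0, 0.0]], np.zeros(3), singular_cov)
```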
6. The method for identifying antimicrobial peptides based on secondary structure features and robust statistics as described in claim 1, characterized in that comparing the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified with the recognition threshold to obtain the corresponding recognition result includes: if the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified is less than or equal to the recognition threshold, determining that the secondary structure of the peptide to be identified has the secondary structure distribution characteristics of a typical antimicrobial peptide; otherwise, determining that the secondary structure of the peptide to be identified has the secondary structure distribution characteristics of an atypical antimicrobial peptide.
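The threshold selection of claim 4 and the comparison of claim 6 can be sketched together as follows. The function names, the default quantile `0.975`, the rule used to turn a quantile into a position in the sorted order, and the label strings are all assumptions of this sketch; the claims state only that the threshold is read from a preset position and that the comparison is "less than or equal to".

```python
import numpy as np


def threshold_at_position(d2_reference, quantile=0.975):
    """Read the recognition threshold from the sorted reference distances.

    quantile is a hypothetical preset value; the position rule
    ceil(quantile * n) - 1 is one possible convention.
    """
    d2_sorted = np.sort(np.asarray(d2_reference, dtype=float))
    pos = int(np.ceil(quantile * d2_sorted.size)) - 1
    pos = min(max(pos, 0), d2_sorted.size - 1)
    return float(d2_sorted[pos])


def classify(d2, threshold):
    """Inclusive comparison: distances equal to the threshold are typical."""
    return "typical" if d2 <= threshold else "atypical"


# Hypothetical squared distances of the positive samples in the library.
reference_d2 = [0.5, 9.0, 1.0, 2.0, 4.0, 3.0, 7.0, 5.0, 6.0, 8.0]
threshold = threshold_at_position(reference_d2)
label = classify(3.2, threshold)
```

A smaller preset quantile gives a stricter threshold, so fewer peptides to be identified are reported as having a typical distribution.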
7. The method for identifying antimicrobial peptides based on secondary structure features and robust statistics as described in claim 1, characterized in that, after the corresponding recognition result is obtained, the method further includes: organizing the data generated during the recognition process and generating structured reports and visualization charts.
8. An antimicrobial peptide recognition device based on secondary structure features and robust statistics, characterized in that it comprises: an acquisition module, used to acquire raw secondary structure data of known antimicrobial peptides, wherein the raw secondary structure data of each known antimicrobial peptide includes eight types of secondary structure and their corresponding raw counts; a construction module, used to perform structure counting transformation and centered logarithmic ratio transformation on the raw secondary structure data of each known antimicrobial peptide to construct a secondary structure feature benchmark library; a first calculation module, used to estimate a robust center mean vector and a robust covariance matrix using the minimum covariance determinant based on the secondary structure feature benchmark library, and to calculate squared Mahalanobis distances so as to determine a recognition threshold from the squared Mahalanobis distances; a second calculation module, used to acquire secondary structure data of a peptide to be identified, and to perform standardization processing on the secondary structure data of the peptide to be identified to obtain the corresponding squared Mahalanobis distance; and a comparison and recognition module, used to compare the squared Mahalanobis distance corresponding to the secondary structure data of the peptide to be identified with the recognition threshold to obtain a corresponding recognition result.