Allergen identification system based on allergen family characteristic peptide sets
By constructing an allergen discrimination system based on the characteristic peptide group of allergen families, the problem of low accuracy in identifying novel or unknown allergens in existing technologies has been solved, and efficient and accurate discrimination of unknown allergens has been achieved.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- THE SECOND AFFILIATED HOSPITAL OF GUANGZHOU MEDICAL UNIVERSITY
- Filing Date
- 2024-12-11
- Publication Date
- 2026-06-11
Abstract
Description
An allergen identification system based on allergen family characteristic peptide groups
Technical Field
[0001] This invention relates to the field of allergen identification technology, and in particular to an allergen identification system based on the characteristic peptide group of an allergen family.
Background Technology
[0002] Currently, with the incidence of allergic reactions rising year by year, rapid and accurate identification of allergens plays a crucial role in the diagnosis and prevention of allergic diseases. Most existing allergen identification methods are based on protein sequence feature extraction and classification algorithms, but their accuracy and generalization ability vary depending on the feature engineering techniques used. Commonly used allergen identification software includes ALLERGENFP, ALLERTOP, ALLERMATCH, and SORTALLER. They primarily rely on the transformation of protein amino acid sequence features and different classification algorithms to extract and screen protein information related to allergen characteristics.
[0003] Most existing allergen identification methods rely on human experience. While these methods provide some predictive criteria, they struggle to accurately identify novel or unknown allergens, which reduces the accuracy of allergen identification.
Summary of the Invention
[0004] This invention provides an allergen discrimination system based on the characteristic peptide group of an allergen family, which has wide applicability and stability in the face of novel or unknown allergens, thereby improving the accuracy of the discrimination of unknown allergens.
[0005] An embodiment of the present invention provides an allergen discrimination system based on allergen family characteristic peptide groups, comprising:
[0006] An allergen characteristic peptide group determination module, a distribution feature acquisition module, an allergen identification feature set construction module, an unknown allergen feature vector generation module, and a discrimination result determination module.
[0007] The aforementioned allergen characteristic peptide group determination module is used to obtain the allergen characteristic peptide group based on a preset allergen database and a preset protein database.
[0008] The aforementioned distribution feature acquisition module is used to determine, based on the allergen and the allergen feature peptide group, the enriched allergen, the functional features corresponding to the enriched allergen, the enrichment degree and frequency of the allergen, and obtain the distribution features of the enriched allergen based on the enrichment degree, frequency and functional features.
[0009] The aforementioned allergen identification feature set construction module is used to determine a stable allergen feature peptide set with high stability and a multi-targeting feature based on the aforementioned enriched allergen feature peptide set, and to combine the aforementioned multi-targeting feature, the aforementioned distribution feature and the aforementioned allergen feature peptide set to obtain an allergen identification feature set.
[0010] The aforementioned unknown allergen feature vector generation module is used to obtain the unknown allergen feature vector based on the unknown allergen sequence to be identified and the aforementioned stable allergen feature peptide group.
[0011] The aforementioned discrimination result determination module is used to input the aforementioned unknown allergen feature vector into a preset allergen discrimination model trained based on the aforementioned allergen identification feature set, so that the aforementioned preset allergen discrimination model determines the discrimination result of the aforementioned unknown allergen sequence based on the aforementioned unknown allergen feature vector, the aforementioned allergen identification feature set, and the aforementioned stable allergen feature peptide set.
[0012] Furthermore, based on the preset allergen database and preset protein database, the allergen characteristic peptide set is obtained, including:
[0013] Allergen sequences were extracted from the aforementioned pre-defined allergen database;
[0014] Non-allergen sequences are extracted from the aforementioned preset protein database according to preset keywords, and duplicates are removed from the aforementioned non-allergen sequences to obtain the first simplified non-allergen sequence.
[0015] Calculate the first identity between the above allergen sequence and the above first simplified non-allergen sequence, retain the first simplified non-allergen sequence whose first identity is less than a preset first identity threshold, and obtain the second simplified non-allergen sequence;
[0016] The above-mentioned second simplified non-allergen sequence is divided into several amino acid sequences according to a preset number of amino acids. The second identity of each amino acid sequence with the allergen sequence is calculated. The amino acid sequence with the second identity less than the preset second identity threshold and which does not contain a preset number of consecutive amino acids is determined as the third simplified non-allergen sequence.
[0017] Based on the above allergen sequence and the above third simplified non-allergen sequence, the above allergen characteristic peptide group was obtained.
[0018] Furthermore, based on the aforementioned allergen sequence and the aforementioned third simplified non-allergen sequence, the aforementioned allergen characteristic peptide set is obtained, including:
[0019] The allergen sequence was segmented into several allergen peptides by using a sliding window method according to a preset base length.
[0020] Calculate the similarity between each of the above-mentioned allergen peptide segments and each of the above-mentioned third simplified non-allergen sequences, and retain the allergen peptide segments with similarity not lower than a preset similarity threshold to obtain highly distinguishable allergen peptide segments.
[0021] Highly distinguishable allergen peptide segments that are adjacent to each other on the same allergen sequence are spliced together to obtain the first allergen characteristic peptide.
[0022] Repeat the sequence alignment operation until several third allergen characteristic peptides are generated;
[0023] Based on the aforementioned third allergen characteristic peptide, the aforementioned allergen characteristic peptide group was obtained;
[0024] The sequence alignment operations mentioned above include:
[0025] Obtain the current second allergen characteristic peptide; wherein, the initial second allergen characteristic peptide is the aforementioned first allergen characteristic peptide;
[0026] Calculate the third identity between each current second allergen characteristic peptide and each of the above-mentioned non-allergen sequences;
[0027] Compare the current third identity with the aforementioned preset first identity threshold;
[0028] If the current third identity is greater than the preset first identity threshold, the second allergen feature peptide corresponding to the current third identity is removed, and the second allergen feature peptides that are adjacent to each other on the above allergen sequence are spliced together to obtain the updated second allergen feature peptide.
[0029] If the third identity of all the current second allergen characteristic peptides is not greater than the above-preset first identity threshold, the current second allergen characteristic peptide is spliced with another second allergen characteristic peptide that is adjacent to it on the above allergen sequence to obtain several third allergen characteristic peptides.
[0030] Furthermore, based on the aforementioned third allergen characteristic peptide, the aforementioned allergen characteristic peptide group is obtained, including:
[0031] Based on the sequence characteristics of each third allergen characteristic peptide, a similarity matrix is constructed between each pair of third allergen characteristic peptides;
[0032] The similarity scores between each pair of the characteristic peptides of the third allergen are calculated based on the above similarity matrix.
[0033] The third allergen characteristic peptides with similarity scores not less than a preset similarity score threshold are clustered to obtain the above-mentioned allergen characteristic peptide group.
[0034] Furthermore, based on the allergen and its characteristic peptide group, the enriched allergen characteristic peptide group, the enriched allergen, the corresponding functional characteristics of the enriched allergen, and the enrichment degree and frequency of the allergen are determined, including:
[0035] Calculate the frequency and enrichment of each type of allergen in each of the above-mentioned allergen characteristic peptide groups;
[0036] The allergen characteristic peptide group with an enrichment degree greater than the preset enrichment degree threshold is identified as the first enriched allergen characteristic peptide group. The allergen with an enrichment degree greater than the preset enrichment degree threshold is taken as the enriched allergen, and the functional characteristics corresponding to the enriched allergen are obtained.
[0037] The above-mentioned enriched allergen characteristic peptide group is generated by adding the corresponding enriched allergen functional characteristics to the first enriched allergen characteristic peptide group.
[0038] Furthermore, based on the above-mentioned enriched allergen characteristic peptide group, a stable allergen characteristic peptide group with high stability and multi-targeting characteristics was identified, including:
[0039] Calculate the correlation weights between different enriched allergen characteristic peptide groups, and based on the correlation weights and functional characteristics, merge enriched allergen characteristic peptide groups with correlation weights not less than a preset correlation weight value or with the same functional characteristics to obtain functional allergen characteristic peptide groups.
[0040] Obtain the number of enriched allergens within the above-mentioned functional allergen characteristic peptide group, and the number of the above-mentioned third allergen characteristic peptide;
[0041] A functional allergen characteristic peptide group in which the number of enriched allergens is not less than a preset enriched allergen number threshold and the number of third allergen characteristic peptides is not less than a preset third allergen characteristic peptide number threshold is defined as a stable allergen characteristic peptide group with high stability.
[0042] Determine the number of different stable allergen characteristic peptide groups corresponding to the same enriched allergen;
[0043] If the above quantity is not less than the preset quantity threshold, then the corresponding enriched allergen is determined to be a multi-targeted allergen.
[0044] Based on the above-mentioned multi-targeted allergens, the above-mentioned multi-targeted characteristics were obtained.
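The selection of stable groups and multi-targeted allergens described in paragraphs [0040] to [0043] can be sketched as follows. This is a minimal illustration, not the patented implementation: the data layout (a dictionary of groups with `allergens` and `peptides` entries), the function names, and all threshold values are assumptions, since the source only calls them "preset".

```python
from collections import Counter

def select_stable_groups(groups, min_allergens=2, min_peptides=3):
    """Keep functional groups that meet both count thresholds.

    Threshold values are placeholders; the source only says "preset".
    """
    return {
        name: g for name, g in groups.items()
        if len(g["allergens"]) >= min_allergens and len(g["peptides"]) >= min_peptides
    }

def multi_target_allergens(stable_groups, min_groups=2):
    """Return enriched allergens that occur in at least `min_groups` stable groups."""
    counts = Counter(a for g in stable_groups.values() for a in set(g["allergens"]))
    return sorted(a for a, n in counts.items() if n >= min_groups)

# Hypothetical toy data: three functional groups.
groups = {
    "G1": {"allergens": {"A1", "A2"}, "peptides": ["p1", "p2", "p3"]},
    "G2": {"allergens": {"A1", "A3"}, "peptides": ["p4", "p5", "p6", "p7"]},
    "G3": {"allergens": {"A4"}, "peptides": ["p8"]},
}
stable = select_stable_groups(groups)
multi = multi_target_allergens(stable)
```

In this toy data, G3 fails both thresholds, and only A1 appears in two stable groups.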
[0045] Furthermore, based on the aforementioned multi-targeting allergens, the aforementioned multi-targeting characteristics are obtained, including:
[0046] Obtain the enrichment, sequence characteristics, physicochemical characteristics, and structural characteristics of the above-mentioned multi-targeted allergens;
[0047] Based on principal component analysis, the enrichment degree, the sequence characteristics, the physicochemical characteristics, and the structural characteristics, principal component characteristics are obtained, and these principal component characteristics are used as the multi-targeting characteristics.
[0048] Furthermore, based on the unknown allergen sequence to be identified and the aforementioned stable allergen characteristic peptide group, the unknown allergen feature vector is obtained, including:
[0049] The unknown allergen sequence was segmented according to the preset base length using the sliding window method to obtain several unknown allergen peptide segments.
[0050] The unknown allergen peptides are mapped to the stable allergen feature peptide groups to extract the first distribution features and the first multi-targeting features of the unknown allergen peptides in each stable allergen feature peptide group. Then, based on the first distribution features and the first multi-targeting features, the unknown allergen feature vector is generated.
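One simple way to picture the mapping in paragraphs [0049] and [0050] is shown below. It is a sketch under stated assumptions: it computes only a coverage-style distribution feature (the fraction of the unknown sequence's windows that occur in each stable group), omits the multi-targeting features, and uses exact substring matching and a window length of 6, none of which are specified in the source.

```python
def sliding_windows(seq, k):
    """All contiguous windows of length k, moving one residue at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def feature_vector(unknown_seq, stable_groups, k=6):
    """One entry per stable group (sorted by group name): the fraction of
    windows of the unknown sequence found inside any peptide of that group.

    Exact substring matching is an assumption made for illustration.
    """
    windows = sliding_windows(unknown_seq, k)
    vector = []
    for name in sorted(stable_groups):
        peptides = stable_groups[name]
        hits = sum(any(w in p for p in peptides) for w in windows)
        vector.append(hits / len(windows) if windows else 0.0)
    return vector

# Hypothetical stable groups, each a list of characteristic peptides.
stable_groups = {"G1": ["ACDEFGHIK"], "G2": ["LMNPQRSTV"]}
vec = feature_vector("ACDEFGLMNP", stable_groups)
```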
[0051] Furthermore, the aforementioned preset allergen discrimination model determines the discrimination result of the aforementioned unknown allergen sequence based on the aforementioned unknown allergen feature vector, the aforementioned allergen identification feature set, and the aforementioned stable allergen feature peptide set, including:
[0052] Extract the second distribution feature and the second multi-target feature corresponding to the same allergen from the above allergen identification feature set, and generate an allergen feature vector based on the above second distribution feature and the second multi-target feature.
[0053] Based on the above-mentioned feature vectors of unknown allergens and the feature vectors of allergens corresponding to different allergens, the feature similarity is calculated.
[0054] Calculate the third enrichment degree of the above-mentioned unknown allergen feature vectors in each of the above-mentioned stable allergen feature peptide groups;
[0055] The number of stable allergen characteristic peptide groups whose third enrichment degree is greater than a preset third enrichment degree threshold is counted to obtain a first quantity.
[0056] Based on the aforementioned feature similarity and the aforementioned first quantity, the discrimination result of the aforementioned unknown allergen sequence is determined.
[0057] Furthermore, the determination of the discrimination result of the unknown allergen sequence based on the aforementioned feature similarity and the aforementioned first quantity includes:
[0058] If the aforementioned feature similarity is greater than a preset feature similarity threshold, or if the aforementioned first quantity is greater than a preset first quantity threshold, then the aforementioned unknown allergen sequence is determined to be an allergen.
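The decision rule of paragraphs [0053] to [0058] can be illustrated with a short sketch. The source does not name the similarity measure or state how similarities to several allergens are combined, so cosine similarity and taking the maximum are assumptions here, as are all threshold values and function names.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_allergen(unknown_vec, allergen_vecs, group_enrichments,
                sim_threshold=0.8, enrich_threshold=0.5, count_threshold=1):
    """Apply the OR rule: high feature similarity, or enough enriched stable groups.

    group_enrichments holds the third enrichment degree of the unknown
    sequence in each stable allergen characteristic peptide group.
    """
    best_sim = max((cosine_similarity(unknown_vec, v) for v in allergen_vecs),
                   default=0.0)
    first_quantity = sum(e > enrich_threshold for e in group_enrichments)
    return best_sim > sim_threshold or first_quantity > count_threshold
```

Because the two conditions are joined by "or", a sequence with no close match among known allergen vectors is still flagged when it is enriched in enough stable groups.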
[0059] The embodiments of the present invention have the following beneficial effects:
[0060] This invention provides an allergen discrimination system based on allergen family characteristic peptide groups. The system includes: an allergen characteristic peptide group determination module, a distribution feature acquisition module, an allergen identification feature set construction module, an unknown allergen feature vector generation module, and a discrimination result determination module. The allergen characteristic peptide group determination module is used to obtain allergen characteristic peptide groups based on a preset allergen database and a preset protein database. The distribution feature acquisition module is used to determine, based on the allergens and the allergen characteristic peptide groups, the enriched allergens, the functional characteristics corresponding to the enriched allergens, and the enrichment degree and frequency of the allergens, and to obtain the distribution features of the enriched allergens based on the enrichment degree, frequency, and functional characteristics. The allergen identification feature set construction module is used to determine, based on the enriched allergen characteristic peptide groups, stable allergen characteristic peptide groups with high stability and multi-targeting features, and to combine the multi-targeting features, the distribution features, and the allergen characteristic peptide groups to obtain an allergen identification feature set. The unknown allergen feature vector generation module is used to obtain an unknown allergen feature vector based on the unknown allergen sequence to be identified and the stable allergen characteristic peptide groups. The discrimination result determination module is used to input the unknown allergen feature vector into a preset allergen discrimination model trained on the allergen identification feature set, so that the model determines the discrimination result of the unknown allergen sequence based on the unknown allergen feature vector, the allergen identification feature set, and the stable allergen characteristic peptide groups. Therefore, when identifying unknown allergens, this invention uses the feature vector of the unknown allergen together with a pre-constructed allergen identification feature set that includes multi-targeting features, distribution features, and allergen characteristic peptide groups. This fully integrates the multi-range features of allergens into the identification process, giving the system broad applicability and stability for novel or unknown allergens and thereby improving the accuracy of identifying unknown allergens.
Attached Figure Description
[0061] Figure 1 is a schematic diagram of an allergen discrimination system based on allergen family characteristic peptide groups provided in an embodiment of the present invention.
[0062] Figure 2 is a schematic diagram of the characteristic specificity curve between the allergen peptide and the third simplified non-allergen sequence provided in an embodiment of the present invention.
[0063] Figure 3 is a schematic diagram of the stability curve of a stable allergen characteristic peptide group provided in an embodiment of the present invention.
Detailed Implementation
[0064] The technical solutions of this invention will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this invention, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0065] As shown in Figure 1, an embodiment of the present invention provides an allergen discrimination system based on allergen family characteristic peptide groups, comprising:
[0066] An allergen characteristic peptide group determination module, a distribution feature acquisition module, an allergen identification feature set construction module, an unknown allergen feature vector generation module, and a discrimination result determination module.
[0067] The aforementioned allergen characteristic peptide group determination module is used to obtain the allergen characteristic peptide group based on a preset allergen database and a preset protein database.
[0068] In a preferred embodiment, the above-mentioned method of obtaining an allergen characteristic peptide set based on a preset allergen database and a preset protein database includes:
[0069] Allergen sequences were extracted from the aforementioned pre-defined allergen database;
[0070] Specifically, the aforementioned preset allergen database is the ALLEGENIA 2.0 allergen database. The sequences of all allergens are extracted from the ALLEGENIA 2.0 allergen database, and the extracted allergen sequences are converted into FASTA format to ensure data format consistency, thus obtaining the aforementioned allergen sequences.
[0071] Non-allergen sequences are extracted from the aforementioned preset protein database according to preset keywords, and duplicates are removed from the aforementioned non-allergen sequences to obtain the first simplified non-allergen sequence.
[0072] Specifically, the aforementioned preset protein database is the UniProt database. The UniProt database is searched using keywords such as "non-allergen", "nonallergenic", "hypoallergenic", "non-allergy", and "no allergenicity" to identify potential non-allergen candidate sequences. These candidates are then filtered by their functional annotations, and those not annotated as allergens are retained as the aforementioned non-allergen sequences. The non-allergen sequences are also converted to FASTA format to ensure data consistency.
[0073] Calculate the first identity between the above allergen sequence and the above first simplified non-allergen sequence, retain the first simplified non-allergen sequence whose first identity is less than a preset first identity threshold, and obtain the second simplified non-allergen sequence;
[0074] Preferably, the preset first identity threshold is 75%.
[0075] The above-mentioned second simplified non-allergen sequence is divided into several amino acid sequences according to a preset number of amino acids. The second identity of each amino acid sequence with the allergen sequence is calculated. The amino acid sequence with the second identity less than the preset second identity threshold and which does not contain a preset number of consecutive amino acids is determined as the third simplified non-allergen sequence.
[0076] Specifically, the FAO/WHO discrimination criteria are applied for screening: within an 80-amino-acid window, sequences whose identity to the allergen sequence is below the preset second identity threshold of 35%, and which do not contain the preset number of consecutive identical amino acids (i.e., a fragment of 6 consecutive identical amino acids), are taken as the third simplified non-allergen sequences. This screening step yields a purified set of non-allergen sequences, laying a foundation for subsequent feature extraction and analysis.
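The screening criteria in paragraph [0076] can be sketched as follows. This is a simplified illustration: it uses ungapped, position-by-position identity between windows, whereas practical tools compute identity from sequence alignments. The function names are hypothetical; the window size (80), identity threshold (35%), and run length (6) come from the text.

```python
def kmers(seq, k):
    """Set of all contiguous substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def windows(seq, size):
    """All windows of the given size; a shorter sequence is its own single window."""
    if len(seq) <= size:
        return [seq]
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]

def ungapped_identity(a, b):
    """Fraction of identical positions over the shorter length (no gaps; an assumption)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def passes_screen(candidate, allergens, window=80, max_identity=0.35, run=6):
    """True if the candidate non-allergen shares no 6-residue exact match with any
    allergen and no window pair reaches the identity threshold."""
    cand_kmers = kmers(candidate, run)
    for allergen in allergens:
        if cand_kmers & kmers(allergen, run):
            return False
        for cw in windows(candidate, window):
            for aw in windows(allergen, window):
                if ungapped_identity(cw, aw) >= max_identity:
                    return False
    return True

# Hypothetical toy sequences.
allergen = "ACDEFGHIKLMNPQRSTVWY" * 4   # 80 residues
shares_run = "ACDEFG" + "W" * 74        # contains an exact 6-residue match
unrelated = "G" * 80                    # low identity, no shared 6-mer
```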
[0077] Based on the above allergen sequence and the above third simplified non-allergen sequence, the above allergen characteristic peptide group was obtained.
[0078] Preferably, deduplication of the above-mentioned non-allergen sequences can remove redundant information, thereby simplifying the data structure and improving the accuracy of the data.
[0079] In this preferred embodiment, allergen characteristic peptide groups are obtained by using a preset allergen database and a preset protein database.
[0080] In another preferred embodiment, the allergen characteristic peptide set obtained based on the allergen sequence and the third simplified non-allergen sequence includes:
[0081] The allergen sequence was segmented into several allergen peptides by using a sliding window method according to a preset base length.
[0082] Calculate the similarity between each of the above-mentioned allergen peptide segments and each of the above-mentioned third simplified non-allergen sequences, and retain the allergen peptide segments with similarity not lower than a preset similarity threshold to obtain highly distinguishable allergen peptide segments.
[0083] Specifically, the BLAST tool is used to compare each allergen peptide segment with the third simplified non-allergen sequences, obtaining the similarity E between them. The preset similarity threshold is 0.5. The resulting highly distinguishable allergen peptide segments are allergen-specific peptides. A schematic diagram of the characteristic specificity curve between the allergen peptide segments and the third simplified non-allergen sequences is shown in Figure 2, where the "E value" in Figure 2 represents the aforementioned similarity E.
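The segmentation and filtering in paragraphs [0081] to [0083] can be sketched as below. The sketch does not call BLAST: `e_value_of` is a hypothetical callable standing in for whatever lookup returns the similarity E of a peptide against the non-allergen set. The window length is a placeholder because the source does not state the preset base length; the 0.5 threshold and the "not lower than" comparison follow the text.

```python
def sliding_windows(seq, length, step=1):
    """Return (start, peptide) pairs for each window of the given length."""
    return [(i, seq[i:i + length]) for i in range(0, len(seq) - length + 1, step)]

def distinguishable_peptides(allergen_seq, e_value_of, length=8, e_threshold=0.5):
    """Keep windows whose similarity E against the non-allergen set is
    not lower than the threshold.

    e_value_of: hypothetical callable mapping a peptide to its E value,
    e.g. a lookup into precomputed BLAST results.
    """
    return [(start, pep)
            for start, pep in sliding_windows(allergen_seq, length)
            if e_value_of(pep) >= e_threshold]

# Stub scorer for demonstration only: peptides containing "KKK" are treated
# as strongly matching a non-allergen (low E value).
def stub_e_value(peptide):
    return 0.01 if "KKK" in peptide else 5.0

kept = distinguishable_peptides("ACDEKKKFGH", stub_e_value, length=4)
kept_starts = [start for start, _ in kept]
```

Keeping the start offsets alongside each peptide makes it possible to tell later which retained segments are adjacent on the same allergen sequence.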
[0084] Highly distinguishable allergen peptide segments that are adjacent to each other on the same allergen sequence are spliced together to obtain the first allergen characteristic peptide.
[0085] Repeat the sequence alignment operation until several third allergen characteristic peptides are generated;
[0086] Based on the aforementioned third allergen characteristic peptide, the aforementioned allergen characteristic peptide group was obtained;
[0087] The sequence alignment operations mentioned above include:
[0088] Obtain the current second allergen characteristic peptide; wherein, the initial second allergen characteristic peptide is the aforementioned first allergen characteristic peptide;
[0089] Calculate the third identity between each current second allergen characteristic peptide and each of the above-mentioned non-allergen sequences;
[0090] Compare the current third identity with the aforementioned preset first identity threshold;
[0091] If the current third identity is greater than the preset first identity threshold, the second allergen feature peptide corresponding to the current third identity is removed, and the second allergen feature peptides that are adjacent to each other on the above allergen sequence are spliced together to obtain the updated second allergen feature peptide.
[0092] If the third identity of all the current second allergen characteristic peptides is not greater than the above-preset first identity threshold, the current second allergen characteristic peptide is spliced with another second allergen characteristic peptide that is adjacent to it on the above allergen sequence to obtain several third allergen characteristic peptides.
[0093] Specifically, highly distinguishable allergen peptide segments that lie on the same allergen sequence and are adjacent to each other are spliced together to form a preliminary allergen characteristic peptide (preAFFP), which is the first allergen characteristic peptide mentioned above. This ensures the continuity and integrity of the feature.
[0094] Specifically, a recursive feature comparison and elimination method is used to screen out the AFFP combination with minimal noise in a stepwise optimization manner, thereby generating information-rich and reliable allergen family feature peptides, namely several of the above-mentioned third allergen feature peptides.
[0095] Specifically, the first allergen characteristic peptide is compared with non-allergen sequences to identify peptides with high identity to the non-allergen sequences, and these peptides are marked as noise. Then, after removing the noise peptides, the remaining peptides are joined adjacently to strengthen the allergen-specific peptides, thereby improving the significance and reliability of the features. This recursive optimization method effectively reduces high-discrimination allergen peptides with low identity that may interfere with the accuracy of discrimination in each iteration. The recursive process continues until no more significant noise peptides are marked or removed, ultimately obtaining a noise-minimized AFFP combination, i.e., several third allergen characteristic peptides. This process ensures that the generated third allergen characteristic peptides have high specificity and reliability, accurately reflecting the core characteristics of the allergen.
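The splice, compare, and remove loop described in paragraphs [0093] to [0095] can be sketched as follows. Peptides are represented as half-open `(start, end)` intervals on one allergen sequence, and `identity_to_non_allergens` is a hypothetical callable returning the highest identity of a peptide to any non-allergen sequence. Both are assumptions made for illustration; the 75% default mirrors the preset first identity threshold.

```python
def splice_adjacent(intervals):
    """Merge half-open (start, end) intervals that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def refine_peptides(seq, intervals, identity_to_non_allergens, threshold=0.75):
    """Repeat: drop spliced peptides whose identity to non-allergens exceeds
    the threshold, re-splice the rest, and stop when nothing changes."""
    current = splice_adjacent(intervals)
    while True:
        kept = [(s, e) for s, e in current
                if identity_to_non_allergens(seq[s:e]) <= threshold]
        kept = splice_adjacent(kept)
        if kept == current:
            return [seq[s:e] for s, e in kept]
        current = kept

# Stub scorer for demonstration only: peptides starting with "K" count as noise.
def stub_identity(peptide):
    return 0.9 if peptide.startswith("K") else 0.1

seq = "ACDEFGHIKLMNPQ"
refined = refine_peptides(seq, [(0, 4), (2, 6), (8, 12)], stub_identity)
```

Here the first two windows are spliced into "ACDEFG", while "KLMN" is marked as noise and removed.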
[0096] In this preferred embodiment, a third allergen characteristic peptide is generated by repeatedly performing sequence alignment operations, and then an allergen characteristic peptide group is obtained based on the third allergen characteristic peptide.
[0097] In another preferred embodiment, the above-mentioned allergen characteristic peptide group obtained based on the above-mentioned third allergen characteristic peptide includes:
[0098] Based on the sequence characteristics of each third allergen characteristic peptide, a similarity matrix is constructed between each pair of third allergen characteristic peptides;
[0099] The similarity scores between each pair of the characteristic peptides of the third allergen are calculated based on the above similarity matrix.
[0100] The third allergen characteristic peptides with similarity scores not less than a preset similarity score threshold are clustered to obtain the above-mentioned allergen characteristic peptide group.
[0101] Specifically, the co-oscillation principle of the feature plane indicates that under the same biological function or specific effect, the expression patterns of different feature modules usually exhibit consistent fluctuation trends. That is, under similar conditions, their expression feature changes are similar and have strong synergy. Therefore, this embodiment, based on the co-oscillation principle of the feature plane, clusters and merges the third allergen feature peptides with highly correlated fluctuation trends, ultimately obtaining the aforementioned allergen feature peptide group. Here, a similarity matrix is constructed to quantify the correlation of fluctuation trends.
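The clustering step of paragraphs [0098] to [0100] can be illustrated with a small sketch. The source does not define how the similarity score is computed, so the standard-library `difflib.SequenceMatcher` ratio is used purely as a stand-in, and single-linkage grouping via union-find is likewise an assumption. The threshold value is a placeholder.

```python
from difflib import SequenceMatcher

def similarity_matrix(peptides):
    """Pairwise similarity scores (stand-in measure: SequenceMatcher ratio)."""
    n = len(peptides)
    return [[SequenceMatcher(None, peptides[i], peptides[j]).ratio()
             for j in range(n)] for i in range(n)]

def cluster(peptides, threshold=0.6):
    """Group peptides connected by a similarity score not less than the threshold."""
    sim = similarity_matrix(peptides)
    parent = list(range(len(peptides)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(peptides)):
        for j in range(i + 1, len(peptides)):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, pep in enumerate(peptides):
        groups.setdefault(find(i), []).append(pep)
    return sorted(groups.values())

clusters = cluster(["ACDEFGH", "ACDEFGK", "WWYYPPQ"])
```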
[0102] Preferably, this clustering method can group together functionally similar but loosely related third allergen characteristic peptides, making the clustered allergen characteristic peptide group more synergistic and representative, thereby improving the subsequent discrimination and generalization ability for unknown allergens.
[0103] In this preferred embodiment, an allergen characteristic peptide group is obtained by clustering the third allergen characteristic peptide.
[0104] The aforementioned distribution feature acquisition module is used to determine, based on the allergen and the allergen feature peptide group, the enriched allergen, the functional features corresponding to the enriched allergen, the enrichment degree and frequency of the allergen, and obtain the distribution features of the enriched allergen based on the enrichment degree, frequency and functional features.
[0105] In a preferred embodiment, based on the allergen and its characteristic peptide set, the enrichment of the allergen characteristic peptide set, the enriched allergen, the functional characteristics corresponding to the enriched allergen, the enrichment degree and frequency of the allergen are determined, including:
[0106] Calculate the frequency and enrichment of each type of allergen in each of the above-mentioned allergen characteristic peptide groups;
[0107] The allergen characteristic peptide group with an enrichment degree greater than the preset enrichment degree threshold is identified as the first enriched allergen characteristic peptide group. The allergen with an enrichment degree greater than the preset enrichment degree threshold is taken as the enriched allergen, and the functional characteristics corresponding to the enriched allergen are obtained.
[0108] The above-mentioned enriched allergen characteristic peptide group is generated by adding the corresponding enriched allergen functional characteristics to the first enriched allergen characteristic peptide group.
[0109] Specifically, the frequency and enrichment degree of each type of allergen in each allergen characteristic peptide group are calculated. When the enrichment degree of a certain type of allergen in a certain allergen characteristic peptide group is greater than the preset enrichment degree threshold of 90%, this type of allergen has a significant enrichment characteristic in that allergen characteristic peptide group.
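A minimal sketch of the frequency and enrichment calculation in paragraph [0109] follows. The source gives no formula, so this assumes frequency is the number of peptides in a group that derive from a given allergen type, and enrichment degree is that count divided by the group size. The data layout and function names are hypothetical.

```python
from collections import Counter

def enrichment_table(group_sources):
    """group_sources maps each peptide group to the allergen type of each of
    its peptides. Returns {group: {allergen_type: (frequency, enrichment)}}."""
    table = {}
    for group, sources in group_sources.items():
        counts = Counter(sources)
        table[group] = {t: (n, n / len(sources)) for t, n in counts.items()}
    return table

def enriched_pairs(table, threshold=0.9):
    """(group, allergen_type) pairs whose enrichment exceeds the threshold."""
    return sorted((group, t)
                  for group, row in table.items()
                  for t, (_, enrichment) in row.items()
                  if enrichment > threshold)

# Hypothetical toy data.
data = {
    "G1": ["pollen"] * 19 + ["mite"],
    "G2": ["mite", "food"],
}
table = enrichment_table(data)
pairs = enriched_pairs(table)
```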
[0110] Specifically, the aforementioned functional characteristics include: molecular functions (such as protein binding, catalytic activity, etc.) and biological processes (such as cellular processes, metabolic processes, etc.).
[0111] Preferably, functional features are added to the first enriched allergen characteristic peptide group to ensure that it has clear biological meaning and distinguishing ability in the subsequent discrimination of unknown allergen sequences, thereby enhancing the accuracy and interpretability of the discrimination results.
[0112] In this preferred embodiment, the enriched allergen characteristic peptide group, the enriched allergen, the functional characteristics corresponding to the enriched allergen, the enrichment degree and frequency of the allergen are determined by the allergen and the allergen characteristic peptide group.
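The enrichment step described above can be illustrated with a minimal sketch. The source does not give the enrichment formula, so the sketch assumes that frequency is the number of peptides in a group that originate from a given allergen type and that enrichment is that count divided by the group size. The names `enrichment_table` and `annotate_enriched`, the toy data, and the threshold of 0.6 are all hypothetical.

```python
from collections import Counter

def enrichment_table(groups):
    """Return {group_id: {allergen_type: (frequency, enrichment)}}.

    Assumption: enrichment is the share of a group's peptides that come
    from a given allergen type; the source does not state the formula.
    """
    table = {}
    for group_id, peptides in groups.items():
        counts = Counter(allergen for _peptide, allergen in peptides)
        total = sum(counts.values())
        table[group_id] = {a: (n, n / total) for a, n in counts.items()}
    return table

def annotate_enriched(groups, functions, threshold):
    """Keep (group, allergen) pairs above the threshold and attach the
    functional features of each enriched allergen to its group."""
    enriched = {}
    for group_id, stats in enrichment_table(groups).items():
        hits = {
            allergen: {"frequency": freq, "enrichment": enr,
                       "functions": functions.get(allergen, [])}
            for allergen, (freq, enr) in stats.items()
            if enr > threshold
        }
        if hits:  # groups with no enriched allergen are dropped
            enriched[group_id] = hits
    return enriched

# Hypothetical toy data: (peptide, allergen type) pairs per peptide group.
groups = {
    "G1": [("AKLT", "allergen_A"), ("KLTV", "allergen_A"), ("GGSA", "allergen_B")],
    "G2": [("MNPQ", "allergen_B"), ("NPQR", "allergen_B")],
}
functions = {
    "allergen_A": ["protein binding"],
    "allergen_B": ["catalytic activity"],
}
enriched = annotate_enriched(groups, functions, threshold=0.6)
```

In this toy case, `allergen_A` accounts for two of the three peptides in `G1` and is retained, while `allergen_B` in `G1` falls below the threshold and is not.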
[0113] The aforementioned allergen identification feature set construction module is used to determine a stable allergen feature peptide set with high stability and a multi-targeting feature based on the aforementioned enriched allergen feature peptide set, and to combine the aforementioned multi-targeting feature, the aforementioned distribution feature and the aforementioned allergen feature peptide set to obtain an allergen identification feature set.
[0114] In a preferred embodiment, based on the above-mentioned enriched allergen characteristic peptide group, a stable allergen characteristic peptide group with high stability and multi-targeting characteristics is determined, including:
[0115] Calculate the correlation weights between different enriched allergen characteristic peptide groups, and based on the correlation weights and functional characteristics, merge enriched allergen characteristic peptide groups with correlation weights not less than a preset correlation weight value or with the same functional characteristics to obtain functional allergen characteristic peptide groups.
[0116] Specifically, based on correlation weights and functional characteristics, enriched allergen characteristic peptide groups with similar functions and related fluctuation trends on the same dimension can be identified. After merging these groups, they can be clustered into larger functional allergen characteristic peptide groups.
[0117] Preferably, in this way, enriched allergen characteristic peptide groups with completely overlapping functions or highly consistent fluctuation trends are merged to reduce redundancy effectively, further enhance the stability of the functional allergen characteristic peptide groups, and make their functional orientation more specific.
[0118] Obtain the number of enriched allergens within the above-mentioned functional allergen characteristic peptide group, and the number of the above-mentioned third allergen characteristic peptide;
[0119] The functional allergen characteristic peptide group that has an enriched number of allergens not less than a preset enriched allergen number threshold and an enriched number of third allergen characteristic peptides not less than a preset third allergen characteristic peptide number threshold is defined as the stable allergen characteristic peptide group with high stability.
[0120] Specifically, a highly stable functional allergen characteristic peptide set meets the following conditions: it enriches at least two enriched allergens and contains at least two third allergen characteristic peptides. Functional allergen characteristic peptide sets that meet these conditions are considered stable because they maintain consistent performance across different allergens and characteristic peptides, thus providing a more representative feature set for subsequent feature analysis and model construction. A schematic diagram of the stability curves of a stable allergen characteristic peptide set is shown in Figure 3. In Figure 3, the vertical axis represents a functional allergen characteristic peptide set containing at least the number of enriched allergens corresponding to the "preset enriched allergen number threshold" on the horizontal axis, and at least the number of third allergen characteristic peptides corresponding to the preset third allergen characteristic peptide number threshold. "Allergen" in Figure 3 represents the aforementioned enriched allergens.
[0121] Determine the number of different stable allergen characteristic peptide groups corresponding to the same enriched allergen;
[0122] If the above quantity is not less than the preset quantity threshold, then the corresponding enriched allergen is determined to be a multi-targeted allergen.
[0123] Specifically, if a certain enriched allergen is significantly enriched in multiple different stable allergen characteristic peptide groups, it can be determined that it has multi-targeting properties.
[0124] Specifically, the above-mentioned multi-targeting can be further subdivided into the following three cases. Multi-targeting of targets: the same enriched allergen is enriched in stable allergen characteristic peptide groups with completely different functions, indicating that the enriched allergen can play a role in multiple biological functions. Multi-targeting within functional families: the enriched allergen is enriched in stable allergen characteristic peptide groups with similar functions but inconsistent fluctuation trends; these groups have low correlation and independent functions. Combined multi-targeting: the enriched allergen exhibits both of the first two types of multi-targeting at the same time and shows a dominant effect in a specific direction.
[0125] Based on the above-mentioned multi-targeted allergens, the above-mentioned multi-targeting characteristics are obtained.
[0126] In this preferred embodiment, a stable allergen characteristic peptide group with high stability and the multi-targeting characteristics are determined based on the enriched allergen characteristic peptide group.
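The merging, stability screening, and multi-targeting steps of this embodiment can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify how correlation weights are computed, so they are supplied here as a hypothetical lookup table; the function names, the toy groups, and the thresholds `min_weight=0.8` and `min_groups=2` are likewise hypothetical. The defaults of two enriched allergens and two third allergen characteristic peptides follow the conditions stated above.

```python
from itertools import combinations

def merge_groups(groups, weights, min_weight):
    """Merge enriched groups whose correlation weight is >= min_weight
    or whose functional features are identical (simple union-find)."""
    parent = {g: g for g in groups}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in combinations(sorted(groups), 2):
        w = weights.get((a, b), weights.get((b, a), 0.0))
        if w >= min_weight or groups[a]["functions"] == groups[b]["functions"]:
            parent[find(a)] = find(b)

    merged = {}
    for g, info in groups.items():
        slot = merged.setdefault(
            find(g), {"functions": set(), "allergens": set(), "peptides": set()})
        for key in slot:
            slot[key] |= set(info[key])
    return merged

def stable_groups(merged, min_allergens=2, min_peptides=2):
    """Keep functional groups that meet both count thresholds."""
    return {g: m for g, m in merged.items()
            if len(m["allergens"]) >= min_allergens
            and len(m["peptides"]) >= min_peptides}

def multi_target_allergens(stable, min_groups=2):
    """Enriched allergens that appear in at least min_groups stable groups."""
    counts = {}
    for m in stable.values():
        for allergen in m["allergens"]:
            counts[allergen] = counts.get(allergen, 0) + 1
    return {a for a, n in counts.items() if n >= min_groups}

# Hypothetical enriched groups and correlation weights.
groups = {
    "E1": {"functions": {"protein binding"}, "allergens": {"A", "B"}, "peptides": {"p1", "p2"}},
    "E2": {"functions": {"protein binding"}, "allergens": {"B", "C"}, "peptides": {"p3"}},
    "E3": {"functions": {"catalytic activity"}, "allergens": {"A", "C"}, "peptides": {"p4", "p5"}},
    "E4": {"functions": {"metabolic process"}, "allergens": {"D"}, "peptides": {"p6", "p7"}},
}
weights = {("E1", "E3"): 0.2, ("E3", "E4"): 0.1}

merged = merge_groups(groups, weights, min_weight=0.8)
stable = stable_groups(merged)
multi = multi_target_allergens(stable)
```

Here `E1` and `E2` merge because they share the same functional features, `E4` is dropped because it enriches only one allergen, and allergens `A` and `C` are flagged as multi-targeted because each appears in two stable groups.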
[0127] In another preferred embodiment, obtaining the multi-targeting characteristics based on the multi-targeted allergens includes:
[0128] Obtain the enrichment, sequence characteristics, physicochemical characteristics, and structural characteristics of the above-mentioned multi-targeted allergens;
[0129] Based on principal component analysis, the enrichment degree, the sequence characteristics, the physicochemical characteristics, and the structural characteristics, principal component characteristics are obtained, and these principal component characteristics are used as the multi-targeting characteristics.
[0130] Specifically, the enrichment distribution of various multi-targeted allergens within the characteristic peptide groups of each functional allergen was first statistically analyzed. Simultaneously, considering their sequence characteristics, physicochemical properties, and structural features, the enrichment degree and characteristic performance were quantified using enrichment rate. Next, principal component analysis was used to reduce the dimensionality of these comprehensive data, identifying principal components that could explain most of the data variation, thereby extracting the most representative multi-targeted features.
[0131] In this preferred embodiment, a multi-targeting characteristic is obtained based on multi-targeting allergens.
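A minimal sketch of the dimensionality reduction step is given below. The source does not specify how the sequence, physicochemical, and structural characteristics are encoded or how many components are retained, so the sketch assumes one numeric value per characteristic and extracts only the first principal component by power iteration in plain Python. All names and values are hypothetical; in practice a library PCA routine would normally be used.

```python
import math

def standardize(rows):
    """Z-score each column; constant columns are left at zero."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def covariance(z):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def first_component(cov, iterations=200):
    """Dominant eigenvector and eigenvalue by power iteration.

    The non-uniform start vector reduces the chance of starting orthogonal
    to the dominant direction; degenerate inputs are not handled.
    """
    v = [1.0 / (i + 1) for i in range(len(cov))]
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue

# Hypothetical rows: one multi-targeted allergen per row, with columns
# (enrichment, sequence score, physicochemical score, structural score).
rows = [
    [0.90, 0.10, 5.2, 0.30],
    [0.80, 0.20, 5.0, 0.35],
    [0.40, 0.70, 7.9, 0.80],
    [0.30, 0.80, 8.1, 0.85],
]
z = standardize(rows)
cov = covariance(z)
direction, eigenvalue = first_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
scores = [sum(a * b for a, b in zip(row, direction)) for row in z]
```

The values in `scores` play the role of the principal component characteristics, and `explained` is the share of total variance captured by the first component.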
[0132] The aforementioned unknown allergen feature vector generation module is used to obtain the unknown allergen feature vector based on the unknown allergen sequence to be identified and the aforementioned stable allergen feature peptide group.
[0133] In a preferred embodiment, obtaining the unknown allergen feature vector based on the unknown allergen sequence to be identified and the stable allergen characteristic peptide group includes:
[0134] The unknown allergen sequence was segmented according to the preset base length using the sliding window method to obtain several unknown allergen peptide segments.
[0135] The unknown allergen peptides are mapped to the stable allergen feature peptide groups to extract the first distribution features and the first multi-targeting features of the unknown allergen peptides in each stable allergen feature peptide group. Then, based on the first distribution features and the first multi-targeting features, the unknown allergen feature vector is generated.
[0136] Preferably, the unknown allergen sequence is preprocessed by standardization before using a sliding window for segmentation.
[0137] In this preferred embodiment, the unknown allergen sequence is segmented using the sliding window method to obtain several unknown allergen peptide segments. Then, an unknown allergen feature vector is generated using the unknown allergen peptide segments and the stable allergen feature peptide group.
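The segmentation and mapping steps can be sketched as below. The source does not define the standardization, the mapping rule, or the exact form of the first distribution and first multi-targeting features, so this sketch makes explicit assumptions: standardization means uppercasing and keeping only letters, mapping means exact membership of a window in a stable group, the distribution feature of a group is the fraction of windows that hit it, and the multi-targeting feature is approximated by the number of groups hit. All names and data are hypothetical.

```python
def standardize_sequence(seq):
    """Uppercase and keep only letters (assumed meaning of standardization)."""
    return "".join(ch for ch in seq.upper() if ch.isalpha())

def sliding_windows(seq, length, step=1):
    """Overlapping peptide segments of a fixed preset length."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]

def feature_vector(seq, stable_groups, length):
    """Per-group hit fractions followed by the number of groups hit."""
    windows = sliding_windows(standardize_sequence(seq), length)
    if not windows:  # sequence shorter than the window
        return [0.0] * (len(stable_groups) + 1)
    fractions = []
    for name in sorted(stable_groups):
        hits = sum(1 for w in windows if w in stable_groups[name])
        fractions.append(hits / len(windows))
    groups_hit = sum(1 for f in fractions if f > 0)
    return fractions + [float(groups_hit)]

# Hypothetical stable groups, each a set of characteristic peptides.
stable_groups = {"S1": {"AKL", "KLT"}, "S2": {"TVG"}, "S3": {"WWW"}}
vector = feature_vector("akl tvg", stable_groups, length=3)
```

For the toy sequence, four windows are produced (`AKL`, `KLT`, `LTV`, `TVG`); two hit `S1`, one hits `S2`, and none hits `S3`.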
[0138] The aforementioned discrimination result determination module is used to input the aforementioned unknown allergen feature vector into a preset allergen discrimination model trained based on the aforementioned allergen identification feature set, so that the aforementioned preset allergen discrimination model determines the discrimination result of the aforementioned unknown allergen sequence based on the aforementioned unknown allergen feature vector, the aforementioned allergen identification feature set, and the aforementioned stable allergen feature peptide set.
[0139] In a preferred embodiment, the preset allergen discrimination model determines the discrimination result of the unknown allergen sequence based on the unknown allergen feature vector, the allergen identification feature set, and the stable allergen feature peptide set, including:
[0140] Extract the second distribution feature and the second multi-target feature corresponding to the same allergen from the above allergen identification feature set, and generate an allergen feature vector based on the above second distribution feature and the second multi-target feature.
[0141] Based on the above-mentioned feature vectors of unknown allergens and the feature vectors of allergens corresponding to different allergens, the feature similarity is calculated.
[0142] Calculate the third enrichment degree of the above-mentioned unknown allergen feature vectors in each of the above-mentioned stable allergen feature peptide groups;
[0143] The number of stable allergen characteristic peptide groups with a third enrichment level greater than the preset third enrichment level is counted as the first quantity.
[0144] Based on the aforementioned feature similarity and the aforementioned first quantity, the discrimination result of the aforementioned unknown allergen sequence is determined.
[0145] Specifically, the aforementioned preset allergen discrimination model is a machine learning classifier, which is trained based on the allergen identification feature set.
[0146] Preferably, training the model using an allergen identification feature set enables the preset allergen discrimination model to learn and identify the feature patterns of different categories of allergens, ensuring its efficiency and wide applicability under different data conditions.
[0147] In this preferred embodiment, the discrimination result of the unknown allergen sequence is determined by a preset allergen discrimination model based on the feature vector of the unknown allergen, the above-mentioned allergen identification feature set, and the stable allergen feature peptide group.
[0148] In another preferred embodiment, determining the discrimination result of the unknown allergen sequence based on the aforementioned feature similarity and the aforementioned first quantity includes:
[0149] If the aforementioned feature similarity is greater than a preset feature similarity threshold, or if the aforementioned first quantity is greater than a preset first quantity threshold, then the aforementioned unknown allergen sequence is determined to be an allergen.
[0150] Specifically, if the aforementioned feature similarity is not greater than the preset feature similarity threshold and the aforementioned first quantity is not greater than the preset first quantity threshold, the aforementioned unknown allergen sequence is determined to be a non-allergen.
[0151] In this preferred embodiment, the discrimination result of the unknown allergen sequence is determined by feature similarity and a first quantity.
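The decision rule of this embodiment can be illustrated with a short sketch. The source does not name the similarity measure or say how similarities to different allergens are combined, so cosine similarity and the maximum over known allergens are used here as assumptions. The function names, vectors, and thresholds are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def discriminate(unknown_vec, allergen_vecs, group_enrichments,
                 sim_threshold, enrich_threshold, count_threshold):
    """Allergen if the best similarity OR the first quantity exceeds its
    threshold; non-allergen only if neither does."""
    similarity = max(
        (cosine_similarity(unknown_vec, v) for v in allergen_vecs.values()),
        default=0.0)
    # First quantity: stable groups whose third enrichment exceeds the preset value.
    first_quantity = sum(1 for e in group_enrichments if e > enrich_threshold)
    is_allergen = similarity > sim_threshold or first_quantity > count_threshold
    return {"similarity": similarity,
            "first_quantity": first_quantity,
            "label": "allergen" if is_allergen else "non-allergen"}

# Hypothetical allergen feature vectors and thresholds.
allergen_vecs = {"A": [1.0, 0.5, 0.0], "B": [0.0, 0.0, 1.0]}
thresholds = dict(sim_threshold=0.9, enrich_threshold=0.5, count_threshold=1)

by_similarity = discriminate([0.5, 0.25, 0.0], allergen_vecs, [0.1, 0.2, 0.0], **thresholds)
by_count = discriminate([0.0, 1.0, 0.0], allergen_vecs, [0.7, 0.8, 0.1], **thresholds)
neither = discriminate([0.0, 0.0, 0.0], allergen_vecs, [0.1, 0.2, 0.0], **thresholds)
```

The three calls cover the cases in which the similarity condition alone is met, the first quantity condition alone is met, and neither condition is met.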
[0152] Preferably, 500 allergens for which the four existing software programs (ALLERGENFP, ALLERTOP, ALLERMATCH, SORTALLER) did not give a unified prediction result and 500 randomly selected non-allergens were selected as a validation set. The allergen discrimination system based on allergen family characteristic peptide groups of this invention was used for discrimination, and the accuracy (ACC), specificity, sensitivity, and Matthews correlation coefficient (MCC) of the discrimination were recorded as shown in the table below:
[0153] As shown in the table above, the allergen discrimination system based on allergen family characteristic peptide groups of the present invention achieves an accuracy of over 99% in experimental data, with a Matthews correlation coefficient exceeding 99%, significantly outperforming existing technologies. This indicates that the allergen discrimination system based on allergen family characteristic peptide groups of the present invention exhibits excellent generalization performance in identifying novel or mutated allergens.
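For reference, the four evaluation metrics named above can be computed from a confusion matrix using their standard definitions, as in the following sketch. The counts used in the example are purely hypothetical and are not the results of the present invention.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "sensitivity": sensitivity,
            "specificity": specificity, "MCC": mcc}

# Hypothetical counts for illustration only (not the patent's data).
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

MCC is conventionally expressed on a scale from -1 to 1, so a value of 0.99 on that scale corresponds to the 99% figure quoted above.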
[0154] The above are preferred embodiments of the present invention. It should be noted that, for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications are also considered to be within the scope of protection of the present invention.
Claims
1. An allergen discrimination system based on an allergen family characteristic peptide group, characterized in that it comprises: an allergen characteristic peptide group determination module, a distribution feature acquisition module, an allergen identification feature set construction module, an unknown allergen feature vector generation module, and a discrimination result determination module. The allergen characteristic peptide group determination module is used to obtain the allergen characteristic peptide group based on a preset allergen database and a preset protein database. The distribution feature acquisition module is used to determine the enriched allergen characteristic peptide group, the enriched allergen, the functional features corresponding to the enriched allergen, and the enrichment degree and frequency of the allergen based on the allergen and the allergen characteristic peptide group, and to obtain the distribution feature of the enriched allergen based on the enrichment degree, frequency and functional features. The allergen identification feature set construction module is used to determine a stable allergen feature peptide set with high stability and a multi-targeting feature based on the enriched allergen feature peptide set, and to combine the multi-targeting feature, the distribution feature and the allergen feature peptide set to obtain the allergen identification feature set. The unknown allergen feature vector generation module is used to obtain the unknown allergen feature vector based on the unknown allergen sequence to be identified and the stable allergen feature peptide group. The discrimination result determination module is used to input the unknown allergen feature vector into a preset allergen discrimination model trained based on the allergen identification feature set, so that the preset allergen discrimination model determines the discrimination result of the unknown allergen sequence based on the unknown allergen feature vector, the allergen identification feature set, and the stable allergen feature peptide set.
2. The allergen discrimination system based on the allergen family characteristic peptide group according to claim 1, wherein the step of obtaining allergen characteristic peptide groups based on a preset allergen database and a preset protein database includes: Allergen sequences are extracted from the preset allergen database; Non-allergen sequences are extracted from the preset protein database according to preset keywords, and duplicates are removed from the non-allergen sequences to obtain a first simplified non-allergen sequence; Calculate the first identity between the allergen sequence and the first simplified non-allergen sequence, retain the first simplified non-allergen sequence whose first identity is less than a preset first identity threshold, and obtain the second simplified non-allergen sequence; The second simplified non-allergen sequence is divided into several amino acid sequences according to a preset number of amino acids; The second identity of each amino acid sequence with the allergen sequence is calculated; The amino acid sequence with a second identity less than a preset second identity threshold and which does not contain a preset number of consecutive amino acid sequences is determined as the third simplified non-allergen sequence; The allergen characteristic peptide group is obtained based on the allergen sequence and the third simplified non-allergen sequence.
3. The allergen discrimination system based on the allergen family characteristic peptide group according to claim 2, wherein the step of obtaining the allergen characteristic peptide set based on the allergen sequence and the third simplified non-allergen sequence includes: The allergen sequence is segmented according to a preset base length using a sliding window method to obtain several allergen peptide segments; Calculate the similarity between each of the allergen peptide segments and each of the third simplified non-allergen sequences, and retain the allergen peptide segments with similarity not lower than a preset similarity threshold to obtain highly distinguishable allergen peptide segments; Highly distinguishable allergen peptide segments that are adjacent to each other on the same allergen sequence are spliced together to obtain the first allergen characteristic peptide; Repeat the sequence alignment operation until several third allergen characteristic peptides are generated; The allergen characteristic peptide group is obtained based on the third allergen characteristic peptide. The sequence alignment operation includes: Obtain the current second allergen characteristic peptide, wherein the initial second allergen characteristic peptide is the first allergen characteristic peptide; Calculate the third identity between each current second allergen characteristic peptide and each of the said non-allergen sequences; The current third identity is compared with the preset first identity threshold; If the current third identity is greater than the preset first identity threshold, the second allergen feature peptide corresponding to the current third identity is removed, and the second allergen feature peptides that are adjacent to each other on the allergen sequence are spliced together to obtain the updated second allergen feature peptide; If the third identity of all the current second allergen characteristic peptides is not greater than the preset first identity threshold, the current second allergen characteristic peptide is spliced with another second allergen characteristic peptide that is adjacent to it on the allergen sequence to obtain several third allergen characteristic peptides.
4. The allergen discrimination system based on the allergen family characteristic peptide group according to claim 3, wherein the process of obtaining the allergen characteristic peptide group based on the third allergen characteristic peptide includes: Based on the sequence characteristics of each third allergen characteristic peptide, a similarity matrix is constructed between each pair of third allergen characteristic peptides; The similarity scores between each pair of third allergen characteristic peptides are calculated based on the similarity matrix; The third allergen characteristic peptides with similarity scores not less than a preset similarity score threshold are clustered to obtain the allergen characteristic peptide group.
5. The allergen discrimination system based on the allergen family characteristic peptide group according to claim 4, wherein determining, based on the allergen and the allergen characteristic peptide set, the enriched allergen characteristic peptide set, the enriched allergen, the functional characteristics corresponding to the enriched allergen, and the enrichment degree and frequency of the allergen includes: Calculate the frequency and enrichment of each type of allergen in each of the allergen characteristic peptide groups; The allergen characteristic peptide group with an enrichment degree greater than a preset enrichment degree threshold is identified as the first enriched allergen characteristic peptide group; The allergens with an enrichment degree greater than the preset enrichment degree threshold are used as enriched allergens, and the functional characteristics corresponding to the enriched allergens are obtained; The enriched allergen characteristic peptide group is generated by adding the functional features of the corresponding enriched allergens to the first enriched allergen characteristic peptide group.
6. The allergen discrimination system based on the allergen family characteristic peptide group according to claim 5, wherein determining, based on the enriched allergen characteristic peptide group, a stable allergen characteristic peptide group with high stability and multi-targeting characteristics includes: Calculate the correlation weight between different enriched allergen characteristic peptide groups, and according to the correlation weight and the functional characteristics, merge enriched allergen characteristic peptide groups with correlation weight not less than a preset correlation weight value or with the same functional characteristics to obtain functional allergen characteristic peptide groups; Obtain the number of enriched allergens in the functional allergen characteristic peptide group and the number of the third allergen characteristic peptides; The functional allergen characteristic peptide group that has an enriched allergen number not less than a preset enriched allergen number threshold and a third allergen characteristic peptide number not less than a preset third allergen characteristic peptide number threshold is defined as the stable allergen characteristic peptide group with high stability; Determine the number of different stable allergen characteristic peptide groups corresponding to the same enriched allergen; If the quantity is not less than a preset quantity threshold, then the corresponding enriched allergen is determined to be a multi-targeted allergen; The multi-targeting characteristics are obtained based on the multi-targeted allergens.
7. The allergen discrimination system based on the allergen family characteristic peptide group according to claim 6, wherein the process of obtaining the multi-targeting characteristics based on the multi-targeted allergen includes: Obtain the enrichment, sequence characteristics, physicochemical characteristics, and structural characteristics of the multi-targeted allergens; Principal component features are obtained based on principal component analysis, the enrichment degree, the sequence features, the physicochemical features, and the structural features, and these principal component features are used as the multi-targeting features.
8. The allergen discrimination system based on the allergen family characteristic peptide group according to claim 7, wherein the step of obtaining the unknown allergen feature vector based on the unknown allergen sequence to be identified and the stable allergen characteristic peptide group includes: The unknown allergen sequence is segmented according to the preset base length using the sliding window method to obtain several unknown allergen peptide segments; The unknown allergen peptide segments are mapped to the stable allergen feature peptide group to extract the first distribution feature and the first multi-targeting feature of the unknown allergen peptide segments in each stable allergen feature peptide group; Then, the unknown allergen feature vector is generated based on the first distribution feature and the first multi-targeting feature.
9. The allergen discrimination system based on the allergen family characteristic peptide group according to claim 8, wherein the preset allergen discrimination model determining the discrimination result of the unknown allergen sequence based on the unknown allergen feature vector, the allergen identification feature set, and the stable allergen feature peptide set includes: Extract the second distribution feature and the second multi-target feature corresponding to the same allergen from the allergen identification feature set, and generate an allergen feature vector based on the second distribution feature and the second multi-target feature; Based on the feature vector of the unknown allergen and the allergen feature vectors corresponding to different allergens, the feature similarity is calculated; Calculate the third enrichment degree of the unknown allergen feature vector in each of the stable allergen feature peptide groups; The first quantity of stable allergen characteristic peptide groups with a third enrichment degree greater than a preset third enrichment degree is counted; The discrimination result of the unknown allergen sequence is determined based on the feature similarity and the first quantity.
10. The allergen discrimination system based on the allergen family characteristic peptide group according to claim 9, wherein the step of determining the discrimination result of the unknown allergen sequence based on the feature similarity and the first quantity includes: If the feature similarity is greater than a preset feature similarity threshold, or if the first quantity is greater than a preset first quantity threshold, the unknown allergen sequence is determined to be an allergen.