Multi-source domain adaptive acetylation site prediction method and system

A Siamese hybrid neural network, trained with a loss function designed for multi-source domain adaptive learning, addresses the inability of existing techniques to make effective use of multi-species data and thereby improves the performance of species-specific acetylation site prediction models.

CN118486377B · Active · Publication Date: 2026-06-26 · ANHUI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ANHUI UNIV
Filing Date
2024-05-24
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies cannot effectively utilize data from different species, which limits the predictive performance of models. Furthermore, sequence feature extraction relies on manual methods and cannot fully explore species-specific sequence patterns.

Method used

A Siamese hybrid neural network is used for multi-source domain adaptive learning. Through the Siamese hybrid neural network and the corresponding loss function, knowledge transfer is performed using acetylation data from multiple species. The KL divergence of features between each source domain and the target domain is calculated, the training weights are adjusted, and the local and long-range dependencies of protein sequences are extracted.

Benefits of technology

It achieves efficient species-specific acetylation site prediction, improves the model's generalization ability and prediction performance, alleviates negative transfer, and improves the prediction of acetylation post-translational modifications.


Abstract

The application provides a multi-source domain adaptive acetylation site prediction method and system. The method comprises the following steps: collecting and preprocessing data to construct a data set; constructing a Siamese hybrid neural network that includes a sequence feature extractor and a sequence feature classifier; training with a multi-source domain adaptive method; and evaluating the performance of the model, with the model updated by minimizing the total loss until it converges on the validation set. The application solves the technical problems that model prediction performance is restricted by the inability to utilize data from different species, and that sequence feature extraction depends on manual work.

Description

Technical Field

[0001] This invention relates to the field of biomedical engineering, specifically to a multi-source domain adaptive acetylation site prediction method and system.

Background Technology

[0002] Protein acetylation is the process by which an acetyl group from a donor is transferred to the ε-amino side chain of a lysine residue at the N-terminus of a protein under the action of lysine acetyltransferase. This modification was first discovered on the lysine residues of histones in eukaryotes. It is a reversible and highly regulated post-translational modification that is crucial for regulating protein structure and function. For example, studies have shown that histone acetylation inhibits the promoter activity, mRNA, and protein expression of ADRB2, providing new insights for the prevention and treatment of clinical asthma [Hao Zhongfen, Du Juan, Hang Yun, et al. Histone acetylation inhibits the expression of the asthma susceptibility gene ADRB2 [J]. Journal of Medical Research, 2024, 53(01):126-130.]. In addition, acetylation modification participates in the nucleation immune response of Pinctada martensii, and lysine deacetylase inhibitors can act as immunomodulators to regulate this response. Further research on acetylation modification can provide theoretical support for the rational control of the immune response of pearl oysters [Lu Jinzhao, Zhang Bin, Liang Haiying, et al. Effect of acetylation modification on nucleation immunity of Pinctada martensii [J/OL]. Journal of Fisheries of China, 1-10 [2024-03-18].].

[0003] Currently, there are various computation-based methods for predicting acetylation sites both domestically and internationally. For example, Wang et al. [Wang D, Liang Y, Xu D. Capsule network for protein post-translational modification site prediction[J]. Bioinformatics, 2019, 35(14):2386-2394.] predicted lysine acetylation sites using CapsNet-PTM, outperforming traditional machine learning methods and standard convolutional neural networks. Chen et al. proposed a method called MUscADEL [Chen Z, Liu X, Li F, et al. Large-scale comparative assessment of computational predictors for lysine post-translational modification sites[J]. Briefings in Bioinformatics, 2019, 20(6):2267-2290.] based on bidirectional long short-term memory networks, which can predict various post-translational modification sites, including acetylation, methylation, and ubiquitination. In addition, Chen et al. [Chen G, Cao M, Luo K, et al. ProAcePred: prokaryote lysine acetylation sites prediction based on elastic net feature optimization[J]. Bioinformatics, 2018, 34(23):3999-4006.] proposed a species-specific lysine acetylation site prediction method called ProAcePred, which uses an elastic net to optimize features, thereby significantly improving prediction performance. Liu et al. [Y. Liu, Q. Wang, and J. Xi, "DeepDA-Ace: A Novel Domain Adaptation Method for Species-Specific Acetylation Site," vol. 10, p. 2364, 2022] proposed an acetylation site prediction method called DeepDA-Ace, which uses transfer learning to predict acetylation sites in species with small datasets.

[0004] However, these methods still have the following drawbacks: MUscADEL and CapsNet-PTM are both based on data from multiple species to build general prediction models, failing to fully mine species-specific sequence motifs, resulting in limited prediction performance. Furthermore, although the ProAcePred method establishes a species-specific prediction model, it requires manual extraction of sequence features, making it difficult to effectively mine protein sequence patterns in large-scale data samples. While the DeepDA-Ace method utilizes transfer learning techniques to transfer relevant knowledge from large amounts of human data to assist in predicting acetylation sites in other species, improving the prediction performance of species-specific acetylation sites, this method only uses human data as the source domain for knowledge transfer and fails to effectively utilize data from other species, resulting in limited prediction capabilities. Therefore, it is necessary to develop multi-source domain adaptive learning methods to transfer knowledge learned from multiple species to the target species, thereby establishing more efficient species-specific prediction models.

[0005] The existing invention patent application document CN114724629A, entitled "A Method and System for Predicting Species-Specific Protein Post-Translation Modification Sites," describes a method comprising: obtaining post-translational modification sites and training samples from different species through data preprocessing; constructing a sequence feature extraction network and setting up a classifier and a domain category discriminator; processing the training samples from different species using a semantic adversarial strategy to obtain sample pairing groups; training the sequence feature extraction network and classifier using the human post-translational modification samples, training the domain category discriminator based on the sample pairing groups to distinguish the group information of input sample pairs, and alternately training the sequence feature extraction network and the domain category discriminator so that the sequence feature extraction network, the classifier, and the domain category discriminator learn a domain-invariant discriminative feature space; and evaluating the performance of the deep neural network. The existing invention patent application document with publication number CN116631498A, entitled "Semi-supervised method and system for predicting post-translational modification sites of lysine in proteins", describes a method comprising: collecting and preprocessing data to construct a dataset; designing a network structure to construct a multi-scale sequence feature extraction network; training the model using an unlabeled sample self-distillation method; and evaluating the prediction results using indicators such as accuracy and recall.

[0006] The technical solutions disclosed in the two aforementioned patent documents both employ CNN structures to extract protein sequence features and fail to effectively extract long-range dependencies within protein sequences, which limits the discriminative power of the extracted features. Furthermore, both documents use only the human species as the source domain for knowledge transfer. Since the sequences of some target species differ significantly from those of humans, this can easily lead to negative transfer and degrade prediction performance.

[0007] In summary, existing technologies suffer from limitations in model prediction performance due to the inability to utilize data from different species, and from the reliance on manual extraction of sequence features.

Summary of the Invention

[0008] The technical problem to be solved by this invention is how to overcome the limited model prediction performance caused by the inability of the prior art to utilize data from different species, and the dependence of sequence feature extraction on manual work.

[0009] This invention solves the above-mentioned technical problems by employing the following technical solution: a multi-source domain adaptive acetylation site prediction method comprising:

[0010] S1. Collect acetylation modification information to extract labeled positive samples and select negative samples based on the modification sites. Use preset tools to remove proteins with similarity greater than a preset threshold and randomly select an independent test set from the labeled positive samples.

[0011] S2. Construct a Siamese hybrid neural network, wherein the Siamese hybrid neural network consists of a sequence feature extractor and a sequence feature classifier; wherein, based on the number of source domains and target domains, assuming there are n source domains and 1 target domain, the number of sequence feature classifiers and sequence feature extractors is set; the sequence feature extractor uses one-hot encoded sequence features as input and performs a one-dimensional convolution operation to increase the number of channels of the sequence features; the sequence feature classifier predicts the classification result by performing a nonlinear transformation on the sequence features of a preset dimension;

[0012] S3. Perform multi-source domain adaptive learning. Use post-translationally modified samples from all species as input. During the training of the Siamese hybrid neural network, calculate the KL divergence between the sample features of each source domain and the sample features of the target domain. Determine the training weight of each source domain for the Siamese hybrid neural network based on the similarity between that source domain and the target domain. Minimize the distance between the source and target domain features to perform feature adaptation between the source and target domains and obtain an applicable Siamese hybrid neural network.

[0013] S4. Evaluate the performance of the applicable Siamese hybrid neural network based on the independent test set.

[0014] This invention utilizes multi-source domain adaptive learning and leverages acetylation data from multiple species for knowledge transfer, enabling efficient prediction of species-specific acetylation sites. It effectively utilizes data from multiple other species, enhancing domain adaptability, mitigating negative transfer, and improving the prediction performance of species-specific acetylation sites. This invention demonstrates excellent predictive performance for acetylation post-translational modifications, providing a crucial foundation for studying the biological functions of acetylation modifications.

[0015] In a more specific technical solution, S1 includes:

[0016] S11. Collect acetylation modification information from a pre-set database, and extract local sequences based on the extraction length of the modification sites to serve as labeled positive samples.

[0017] S12. From proteins containing verified acetylation sites, select unacetylated lysine sites as negative samples.

[0018] S13. For labeled positive and negative samples, use the CD-HIT tool to remove proteins with similarity greater than a preset threshold.

[0019] S14. According to the preset ratio, obtain the independent test set from the labeled positive sample.
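The hold-out step in S14 can be sketched as follows. This is a minimal illustration only: the function name `split_independent_test`, the fixed random seed, and the default ratio of 0.1 are assumptions made for the example, not details stated in the claims.

```python
import random

def split_independent_test(samples, test_ratio=0.1, seed=0):
    """Hold out a random fraction of labeled samples as an independent test set."""
    rng = random.Random(seed)            # fixed seed so the split is reproducible
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_test = int(round(len(samples) * test_ratio))
    test_idx = set(indices[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test

# Toy labeled samples: (sequence identifier, label)
samples = [("SEQ%03d" % i, i % 2) for i in range(50)]
train_set, test_set = split_independent_test(samples, test_ratio=0.1)
print(len(train_set), len(test_set))
```

The homology filtering in S13 would normally run before this split, so that near-identical proteins cannot end up on both sides of it.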

[0020] This invention uses the CD-HIT tool to remove proteins with a similarity of more than 40% from all data, so that homology between samples does not inflate the measured performance.

[0021] In a more specific technical solution, S2 uses a long short-term memory (LSTM) network to extract global contextual information from sequence features to obtain sequence features of a preset dimension. The LSTM network includes hidden layers. The sequence feature extractor includes a convolutional neural network and an LSTM network.

[0022] In a more specific technical solution, in S2, the sequence feature classifier performs a linear transformation on the sequence features of a preset dimension to reduce the feature dimension; a dropout (random deactivation) operation is performed according to a preset probability parameter, the ReLU function is used to learn nonlinear features, and a second linear transformation reduces the dimension to obtain 2-dimensional sequence features; softmax activation is then used to calculate the binary classification probability and obtain the classification result.

[0023] In a more specific technical solution, S2 utilizes a convolutional neural network to extract local features of protein sequences and a long short-term memory (LSTM) network to extract long-distance dependencies within protein sequences.

[0024] In a more specific technical solution, in S2, the sequence feature classifier includes a fully connected neural network, wherein the last layer of the fully connected neural network outputs no less than two predicted scores.

[0025] In a more specific technical solution, S3 includes:

[0026] S31. Assume that multi-source domain adaptation involves n source domains and one target domain.

[0027] S32. Input the data of the n different source domains and of the target domain X_T into the Siamese hybrid neural network. A shared feature extraction network produces n source domain features and the target domain feature E_T; the corresponding classifiers then produce the source domain output features and the target domain output feature H_T.

[0028] S33. Normalize the output of the sequence feature extractor with the softmax function to obtain the normalized source domain features and the normalized target domain feature F_T. Calculate the KL divergence between each normalized source domain feature and the normalized target domain feature; the KL divergence is obtained using the following logic:

[0029]

[0030] In the formula, M is the feature dimension, and the two quantities compared are the i-th dimension of the normalized target domain feature F_T and the i-th dimension of the normalized feature of the k-th source domain.

[0031] S34. Based on the KL divergence, calculate the weight coefficients to serve as the training weights for each source domain on the Siamese hybrid neural network.

[0032]

[0033] S35. Calculate the training loss of the Siamese hybrid neural network on the data from each source domain.

[0034] S36. Determine the KL divergence loss function, which is used to constrain the similarity between the source and target domains.

[0035] S37. Determine the total loss function.

[0036] Compared with existing lysine modification site prediction methods, this invention can effectively utilize acetylation data from multiple other species, improve the generalization ability and prediction effect of species-specific prediction models, and is an important foundation for studying the biological functions of lysine modifications.

[0037] This invention uses post-translationally modified samples from all species as input. During training, it calculates the KL divergence between the features of each source domain sample and the target domain sample, and determines the weight of each source domain data for model training based on the similarity between the source and target domains. Simultaneously, it minimizes the distance between the features of each source and target domain, achieving feature adaptation between the source and target domains.

[0038] In a more specific technical solution, within S35, the training loss is determined using the following logic:

[0039]

[0040] Determine the classification loss function using the following logic.

[0041]

[0042] In the formula, the predicted term is the classification score of the i-th sample in the k-th source domain, y_{k,i} is the label of the i-th sample, and N_k is the total number of training samples for the k-th source domain.

[0043] In a more specific technical solution, S37 uses the following logic to determine the total loss function:

[0044]

[0045] The classification loss function for the target domain data is determined using the following logic:

[0046]

[0047] In the formula, the predicted term is the classification score of the i-th sample in the target domain, y_i is the label of the i-th sample, and N is the total number of training samples in the target domain.

[0048] This invention uses multiple species as source domains and designs a Siamese hybrid neural network and corresponding loss function, so that the model can adaptively adjust the training weights according to the similarity between each source domain and the target species.

[0049] In a more specific technical solution, the multi-source domain adaptive acetylation site prediction system includes:

[0050] The data collection module is used to collect acetylation modification information, extract labeled positive samples based on the modification site, select negative samples, remove proteins with similarity greater than a preset threshold using preset tools, and randomly select an independent test set from the labeled positive samples.

[0051] The network construction module is used to build a Siamese hybrid neural network, which includes a sequence feature extractor and a sequence feature classifier. The number of sequence feature classifiers and sequence feature extractors is determined based on the number of source and target domains (assuming n source domains and one target domain). The sequence feature extractor uses one-hot encoded sequence features as input and performs a one-dimensional convolution operation to increase the number of channels in the sequence features. The sequence feature classifier predicts the classification result by performing a nonlinear transformation on the sequence features of a preset dimension. The network construction module is connected to the data collection module.

[0052] The network adaptive learning module is used for multi-source domain adaptive learning. It uses post-translationally modified samples from all species as input. During the training of the Siamese hybrid neural network, it calculates the KL divergence between the sample features of each source domain and the sample features of the target domain. Based on the similarity between each source domain and the target domain, it determines the training weight of that source domain's data for the Siamese hybrid neural network. It minimizes the distance between the features of each source domain and the features of the target domain to perform feature adaptation between the source and target domains, and obtains an applicable Siamese hybrid neural network. The network adaptive learning module is connected to the network construction module.

[0053] The performance evaluation module is used to evaluate the performance of the applicable Siamese hybrid neural network based on an independent test set. The performance evaluation module is connected to the network adaptive learning module.

[0054] The present invention has the following advantages over the prior art:

[0055] This invention utilizes multi-source domain adaptive learning and leverages acetylation data from multiple species for knowledge transfer, enabling efficient prediction of species-specific acetylation sites. It effectively utilizes data from multiple other species, enhancing domain adaptability, mitigating negative transfer, and improving the prediction performance of species-specific acetylation sites. This invention demonstrates excellent predictive performance for acetylation post-translational modifications, providing a crucial foundation for studying the biological functions of acetylation modifications.

[0056] This invention uses the CD-HIT tool to remove proteins with a similarity of more than 40% from all data, so that homology between samples does not inflate the measured performance.

[0057] Compared with existing lysine modification site prediction methods, this invention can effectively utilize acetylation data from multiple other species, improve the generalization ability and prediction effect of species-specific prediction models, and is an important foundation for studying the biological functions of lysine modifications.

[0058] This invention uses post-translationally modified samples from all species as input. During training, it calculates the KL divergence between the features of each source domain sample and the target domain sample, and determines the weight of each source domain data for model training based on the similarity between the source and target domains. Simultaneously, it minimizes the distance between the features of each source and target domain, achieving feature adaptation between the source and target domains.

[0059] This invention uses multiple species as source domains and designs a Siamese hybrid neural network and corresponding loss function, so that the model can adaptively adjust the training weights according to the similarity between each source domain and the target species.

[0060] This invention solves the technical problems in the prior art, such as the inability to utilize data from different species, which limits the predictive performance of the model, and the reliance on manual extraction of sequence features.

Description of the Drawings

[0061] Figure 1 is a schematic diagram of the basic steps of the multi-source domain adaptive acetylation site prediction method in Embodiment 1 of the present invention;

[0062] Figure 2 is a schematic diagram of the model structure of the multi-source domain adaptive acetylation site prediction method in Embodiment 1 of the present invention;

[0063] Figure 3a is a schematic diagram showing the comparison of ROC curves of the present invention and other existing methods on rats in Embodiment 2 of the present invention;

[0064] Figure 3b is a schematic diagram showing the comparison of ROC curves of the present invention and other existing methods on yeast in Embodiment 2 of the present invention;

[0065] Figure 3c is a schematic diagram showing the comparison of ROC curves of the present invention and other existing methods on Bacillus in Embodiment 2 of the present invention.

Detailed Implementation

[0066] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0067] Embodiment 1

[0068] As shown in Figure 1 and Figure 2, the multi-source domain adaptive acetylation site prediction method provided by this invention includes the following basic steps:

[0069] Step S1: Data collection and preprocessing;

[0070] In this embodiment, acetylation modification information is collected from a public database, and a local sequence of length 31 centered on the modification site is extracted as a labeled positive sample; from proteins containing at least one experimentally verified acetylation site, unacetylated lysine sites are selected as negative samples; for all data, the CD-HIT tool is used to remove proteins with similarity exceeding 40%, so that homology does not inflate the measured performance; 10% of the samples are randomly selected from the labeled data as an independent test set.

[0071] When constructing the dataset in this embodiment, a large amount of acetylation modification information is collected from public databases. Fifteen amino acids are taken on each side of the acetylation site (upstream and downstream) to form a protein sequence fragment of length 31 as a positive sample. From proteins containing at least one experimentally verified acetylation site, lysines without any acetylation annotation are selected as negative samples.
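The window extraction described above can be sketched as follows. The source does not say how sites near the protein termini are handled; padding with a placeholder symbol `X` is an assumption made here, as are the function name `extract_windows` and the toy protein sequence.

```python
HALF_WINDOW = 15   # residues taken on each side of the central lysine
PAD = "X"          # assumed padding symbol for positions beyond the termini

def extract_windows(protein, acetylated_positions):
    """Return (window, label) pairs for every lysine in one protein.

    acetylated_positions holds the 0-based indices of experimentally
    verified acetylated lysines; every other lysine becomes a negative sample.
    """
    padded = PAD * HALF_WINDOW + protein + PAD * HALF_WINDOW
    samples = []
    for pos, residue in enumerate(protein):
        if residue != "K":
            continue
        # Position pos in the protein sits at pos + HALF_WINDOW in the padded string,
        # so the slice below is centered on the lysine and has length 31.
        window = padded[pos: pos + 2 * HALF_WINDOW + 1]
        label = 1 if pos in acetylated_positions else 0
        samples.append((window, label))
    return samples

# Toy protein with lysines at indices 1, 7 and 15; only index 7 is marked acetylated.
toy_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
toy_samples = extract_windows(toy_protein, {7})
for window, label in toy_samples:
    print(window, label)
```

The function is meant to be applied only to proteins that carry at least one verified acetylation site, matching the negative-sample rule in the text.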

[0072] Step S2: Construct a Siamese hybrid neural network;

[0073] In this embodiment, the Siamese hybrid neural network includes, but is not limited to, a sequence feature extractor and a sequence feature classifier. Assuming there are n source domains and one target domain, the network has n+1 identical sequence feature classifiers and one sequence feature extractor. The sequence feature extractor takes one-hot encoded sequence features as input. First, it performs a one-dimensional convolution on the input data of size 31×21 using 128 kernels of size 3×1, increasing the number of feature channels from 21 to 128. Then, an LSTM layer with 128 hidden units is used to further extract the global context information of the sequence, ultimately obtaining sequence features of dimension 128×31.
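The 31×21 input can be illustrated with a small one-hot encoder. The text gives only the input size; the interpretation of the 21 channels as 20 standard amino acids plus one padding/unknown symbol `X`, the ordering of `ALPHABET`, and the name `one_hot_encode` are assumptions for this sketch.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"   # 20 amino acids + assumed padding/unknown symbol

def one_hot_encode(window):
    """Encode a sequence window as a list of rows, one 21-dimensional row per residue."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    encoded = []
    for residue in window:
        row = [0] * len(ALPHABET)
        # Residues outside the alphabet fall back to the padding/unknown column.
        row[index.get(residue, index["X"])] = 1
        encoded.append(row)
    return encoded

window = "X" * 14 + "MKTAYIAKQRQISFVKS"   # 31 residues, lysine at the center
matrix = one_hot_encode(window)
print(len(matrix), len(matrix[0]))
```

Under the dimensions stated above, a convolution with 128 kernels would map this 31×21 matrix to 31 positions with 128 channels each, and the LSTM would keep 128 features per position.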

[0074] In this embodiment, the sequence feature classifier predicts the classification result by performing a nonlinear transformation on the data output by the feature extractor. Specifically, the sequence feature classifier performs a linear transformation on the features output by the feature extractor, reducing the feature dimension to 64; it then applies dropout (random deactivation) with a probability of 0.2 and learns nonlinear features using the ReLU function, followed by a second linear transformation that reduces the features to 2 dimensions; finally, it uses softmax activation to calculate the binary classification probabilities, which are the model's output.
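The classifier head described above (linear, dropout, ReLU, linear, softmax) can be sketched in plain Python as a forward pass. Several details are assumptions for illustration: the extractor output is taken to be flattened to 128×31 values, the weights are random rather than trained, and dropout uses the common inverted scaling and is switched off at inference.

```python
import math
import random

def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def dropout(x, p, rng, training):
    """Zero each value with probability p during training (inverted scaling assumed)."""
    if not training:
        return list(x)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    m = max(x)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, params, rng, training=False, p_drop=0.2):
    h = linear(features, params["w1"], params["b1"])    # flattened features -> 64
    h = dropout(h, p_drop, rng, training)
    h = relu(h)
    logits = linear(h, params["w2"], params["b2"])      # 64 -> 2
    return softmax(logits)

rng = random.Random(0)
in_dim, hidden_dim = 128 * 31, 64
params = {
    "w1": [[rng.uniform(-0.01, 0.01) for _ in range(in_dim)] for _ in range(hidden_dim)],
    "b1": [0.0] * hidden_dim,
    "w2": [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(2)],
    "b2": [0.0, 0.0],
}
features = [rng.random() for _ in range(in_dim)]   # stand-in for extractor output
probs = classify(features, params, rng)
print(probs)
```

With n source domains and one target domain, the embodiment would hold n+1 such heads with identical structure, each fed by the shared extractor.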

[0075] In the network structure design of this embodiment, the sequence feature extractor includes, but is not limited to, a convolutional neural network (CNN) and a long short-term memory (LSTM) network. The CNN is used to extract local features of the protein sequence, and the LSTM is used to extract long-distance dependencies within the protein sequence. Combining the CNN and the LSTM helps extract highly discriminative post-translational modification features, which are ultimately used for site prediction. The sequence feature classifier includes, but is not limited to, a fully connected neural network whose last layer outputs two predicted scores.

[0076] Step S3: Train using a multi-source domain adaptive method;

[0077] In this embodiment, multi-source domain adaptation is assumed to involve n source domains and one target domain. The Siamese hybrid neural network described above takes the data of the n different source domains and of the target domain X_T as input. A shared feature extraction network produces n source domain features and the target domain feature E_T, which the corresponding classifiers then process to obtain the source domain outputs and the target domain output H_T.

[0078] In this embodiment, the output of the feature extractor is normalized with the softmax function to obtain the normalized source domain features and the normalized target domain feature F_T. The KL divergence between the features of each source domain and the features of the target domain is then calculated.

[0079] In this embodiment, the formula for calculating the KL divergence is as follows:

[0080]

[0081] Here, M is the feature dimension, and the two quantities compared are the i-th dimension of the normalized target domain feature F_T and the i-th dimension of the normalized feature of the k-th source domain.
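The softmax normalization and KL divergence step can be sketched as follows. The exact formula is not reproduced in this text, so two points are assumptions: the divergence is taken in the direction KL(target || source), i.e. the sum over the M dimensions of p_i · log(p_i / q_i) with p the target feature, and a single feature vector stands in for each domain. The feature values are made up for the example.

```python
import math

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over M dimensions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical extractor outputs (M = 4) for the target and two source domains.
target_feature = [0.2, 1.0, -0.5, 0.3]
source_features = {
    "source_1": [0.1, 0.9, -0.4, 0.2],   # close to the target
    "source_2": [2.0, -1.0, 0.5, 0.0],   # far from the target
}

f_t = softmax(target_feature)
kl = {name: kl_divergence(f_t, softmax(f)) for name, f in source_features.items()}
for name, value in kl.items():
    print(name, round(value, 4))
```

A source domain whose normalized features resemble the target's yields a smaller divergence, which is the quantity the weighting step relies on.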

[0082] A weight coefficient is calculated based on the KL divergence between each source domain and the target domain, which serves as the weight for training the model for each source domain.

[0083]
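The weight formula itself is not reproduced in this text. One plausible form, shown purely as an assumption, is a softmax over the negative KL divergences, so that source domains closer to the target receive larger training weights and the weights sum to 1. The name `source_weights` and the sample divergence values are illustrative.

```python
import math

def source_weights(kl_values):
    """Assumed weighting: w_k = exp(-KL_k) / sum_j exp(-KL_j)."""
    scores = [math.exp(-kl) for kl in kl_values]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical KL divergences of three source domains to the target domain.
weights = source_weights([0.05, 0.40, 1.20])
print([round(w, 3) for w in weights])
```

Any monotonically decreasing mapping from divergence to weight would serve the stated goal of down-weighting dissimilar source domains; the patent's own formula may differ from this one.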

[0084] In this embodiment, the training loss of the model for each source domain data is as follows:

[0085]

[0086] Here, the loss comprises a classification loss function, in which the predicted term is the classification score of the i-th sample in the k-th source domain, y_{k,i} is the label of the i-th sample, and N_k is the total number of training samples for the k-th source domain, and a KL divergence loss function, which is used to constrain the similarity between the source and target domains.
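A per-source-domain loss of this shape can be sketched as below. The formulas are not reproduced in this text, so the sketch assumes a mean binary cross-entropy for the classification term and a simple sum with the KL term; the trade-off coefficient `lam` is hypothetical and does not appear in the source.

```python
import math

def cross_entropy(scores, labels, eps=1e-12):
    """Mean binary cross-entropy; scores are predicted positive-class probabilities."""
    n = len(labels)
    total = 0.0
    for s, y in zip(scores, labels):
        total += y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps)
    return -total / n

def source_loss(scores, labels, kl_value, lam=1.0):
    """Assumed per-source loss: classification loss plus a KL similarity penalty."""
    return cross_entropy(scores, labels) + lam * kl_value

# Two samples from one hypothetical source domain, with KL divergence 0.2 to the target.
cls = cross_entropy([0.9, 0.1], [1, 0])
loss_k = source_loss([0.9, 0.1], [1, 0], kl_value=0.2)
print(round(cls, 4), round(loss_k, 4))
```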

[0087] In this embodiment, the total loss function is:

[0088]

[0089] Here, the total loss includes the classification loss function for the target domain data, in which the predicted term is the classification score of the i-th sample in the target domain, y_i is the label of the i-th sample, and N is the total number of training samples in the target domain.
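How the pieces might combine into a total loss is sketched below. The total loss formula is not reproduced in this text; adding the target classification loss to a weighted sum of the per-source losses is an assumption consistent with the surrounding description, and the numbers are illustrative.

```python
def total_loss(target_cls_loss, source_losses, weights):
    """Assumed total: target classification loss + sum_k w_k * (loss of source domain k)."""
    if len(source_losses) != len(weights):
        raise ValueError("one weight is required per source domain")
    weighted = sum(w * loss for w, loss in zip(weights, source_losses))
    return target_cls_loss + weighted

# Hypothetical values: target loss 0.5, two source domains with weights 0.75 and 0.25.
loss = total_loss(0.5, [0.4, 0.8], [0.75, 0.25])
print(round(loss, 4))
```

Under this reading, a dissimilar source domain contributes little gradient to the shared extractor, which is how the weighting would limit negative transfer.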

[0090] This embodiment employs a multi-source domain adaptive method. To fully utilize post-translational modification data from different species, it uses multi-source domain adaptive learning to incorporate data from different species into the training of the target species prediction model. This invention uses post-translationally modified samples from all species as input. During training, it calculates the KL divergence between the features of each source domain sample and the target domain sample, determining the weight of each source domain's data in model training based on the similarity between the source and target domains. Simultaneously, it minimizes the distance between the features of each source and target domain, achieving feature adaptation between the source and target domains.

[0091] Step S4: Evaluate the performance of the model;

[0092] In this embodiment, the model is updated by minimizing the total loss until it converges on the validation set. AUC (area under the ROC curve), a commonly used metric in protein post-translational modification site prediction tasks, is used for performance evaluation. The AUC value lies in the range [0, 1], with a higher value indicating better performance.

[0093] In the performance evaluation process of this embodiment, the receiver operating characteristic curve ROC and the area under the curve AUC are used to evaluate the prediction results.
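The AUC metric can be computed directly from its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below uses that definition; the function name `roc_auc` and the toy labels and scores are illustrative only.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 4))   # 5 of the 6 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the number of samples; it is kept here for clarity, and a sort-based rank computation would be preferable on a full test set.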

[0094] Embodiment 2

[0095] The performance of the method of this invention was comprehensively evaluated. Evaluation metrics were calculated on lysine acetylation test sets for three species and compared with those of other existing prediction tools and methods. As shown in Figure 2, the present invention, MDLep-Ace, not only outperforms the existing general acetylation site prediction methods CapsNet-PTM and PAIL, but also surpasses the species-specific acetylation site prediction method DeepDA-Ace. For example, compared to CapsNet-PTM and PAIL, the present invention shows a significant improvement in AUC values on rats, yeast, and Bacillus, demonstrating the necessity of establishing species-specific models for accurately predicting acetylation sites in different species. Furthermore, the performance indicators of the present invention on multiple species also improve to some degree over the species-specific acetylation prediction method DeepDA-Ace, indicating that using acetylation data from multiple species for transfer learning can further enhance the model's generalization ability and improve prediction performance.

[0096] The above results demonstrate that the multi-source domain adaptive transfer learning method proposed in this invention can effectively utilize post-translational modification data from multiple species, thereby improving the predictive performance of species-specific models.

[0097] In summary, this invention utilizes multi-source domain adaptive learning and leverages acetylation data from multiple species for knowledge transfer, enabling efficient prediction of species-specific acetylation sites. This invention effectively utilizes data from multiple other species, enhancing domain adaptability, mitigating negative transfer, and improving the prediction performance of species-specific acetylation sites. This invention demonstrates excellent predictive performance for acetylation post-translational modifications and provides an important foundation for studying the biological functions of acetylation modifications.

[0098] This invention uses the CD-HIT tool to remove, from all data, proteins with a sequence similarity of more than 40%, in order to avoid overestimating performance because of homology.

[0099] Compared with existing lysine modification site prediction methods, this invention can effectively utilize acetylation data from multiple other species, improve the generalization ability and prediction effect of species-specific prediction models, and is an important foundation for studying the biological functions of lysine modifications.

[0100] This invention uses post-translationally modified samples from all species as input. During training, it calculates the KL divergence between the features of each source domain sample and the target domain sample, and determines the weight of each source domain data for model training based on the similarity between the source and target domains. Simultaneously, it minimizes the distance between the features of each source and target domain, achieving feature adaptation between the source and target domains.

[0101] This invention uses multiple species as source domains and designs a Siamese hybrid neural network and a corresponding loss function, so that the model can adaptively adjust the training weights according to the similarity between each source domain and the target species.
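For illustration, one way such a loss function could be assembled is sketched below. The exact formulas are not reproduced in this text, so the sketch rests on assumptions: the classification loss is taken to be the mean binary cross-entropy over a domain's samples, and the total loss is taken to be the target-domain classification loss, plus the weighted source-domain classification losses, plus a KL penalty scaled by a hypothetical coefficient `kl_coefficient`. The names `cross_entropy` and `total_loss` are likewise illustrative.

```python
import math

def cross_entropy(probabilities, labels, eps=1e-12):
    """Mean binary cross-entropy over the samples of one domain."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)

def total_loss(source_batches, target_batch, weights, divergences, kl_coefficient=1.0):
    """Assumed combination: target loss + weighted source losses + KL penalty."""
    target_term = cross_entropy(*target_batch)
    source_term = sum(
        w * cross_entropy(p, y) for w, (p, y) in zip(weights, source_batches)
    )
    kl_term = kl_coefficient * sum(divergences)
    return target_term + source_term + kl_term

# Hypothetical predicted probabilities and labels for two sources and one target.
sources = [([0.9, 0.2], [1, 0]), ([0.6, 0.7], [1, 0])]
target = ([0.8, 0.3], [1, 0])
weights = [0.7, 0.3]          # e.g. derived from the KL divergences
divergences = [0.01, 0.40]    # KL divergence of each source from the target
loss = total_loss(sources, target, weights, divergences)
```

Under these assumptions, a source domain with a small weight contributes little to the gradient even when its classification loss is large, which is how negative transfer from dissimilar species would be limited.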

[0102] This invention solves the technical problems in the prior art, such as the inability to utilize data from different species, which limits the predictive performance of the model, and the reliance on manual extraction of sequence features.

[0103] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A multi-source domain adaptive acetylation site prediction method, characterized in that the method includes:
S1. Collect acetylation modification information, extract labeled positive samples and select negative samples based on the modification sites, use a preset tool to remove proteins with similarity greater than a preset threshold, and randomly select an independent test set from the labeled positive samples;
S2. Construct a Siamese hybrid neural network comprising a sequence feature extractor and a sequence feature classifier, wherein the numbers of sequence feature classifiers and sequence feature extractors are set according to the number of source domains and target domains, assuming n source domains and 1 target domain; the sequence feature extractor takes the one-hot encoded sequence features as input and performs a one-dimensional convolution operation to increase the number of channels of the sequence features; a long short-term memory (LSTM) network, which includes hidden layers, extracts global context information from the sequence features to obtain sequence features of a preset dimension; and the sequence feature classifier predicts the classification result by performing a nonlinear transformation on the sequence features of the preset dimension;
S3. Perform multi-source domain adaptive learning: using post-translational modification samples from all species as input, calculate the KL divergence between the source domain sample features and the target domain sample features during training of the Siamese hybrid neural network; determine the training weight of each source domain's data for the Siamese hybrid neural network based on the similarity between the source and target domains; and minimize the distance between each source domain's features and the target domain's features to perform feature adaptation on the source and target domains, thereby obtaining an applicable Siamese hybrid neural network; wherein S3 includes:
S31. Assume that the multi-source domain adaptation has n source domains and one target domain;
S32. Input the data of the n different source domains and of the target domain into the Siamese hybrid neural network; after passing through a shared feature extraction network, n source domain features and the target domain features are obtained, and the source domain output features and the target domain output features are obtained by processing with the corresponding classifiers;
S33. Normalize the output data of the sequence feature extractor using the softmax function to obtain normalized source domain features and normalized target domain features, and calculate the KL divergence between each normalized source domain feature and the normalized target domain feature; wherein, in the KL divergence formula, M represents the dimension of the feature, and the remaining symbols denote the i-th dimensional feature of the target domain and the i-th dimensional feature of the k-th source domain;
S34. Calculate weight coefficients based on the KL divergence to serve as the training weights of the source domains in the Siamese hybrid neural network;
S35. Calculate, from the data of each source domain, the training loss of the Siamese hybrid neural network; wherein the training loss is determined by a classification loss function whose symbols denote the classification score of the i-th sample in the k-th source domain, the label of the i-th sample, and the total number of training samples in the k-th source domain;
S36. Determine the KL divergence loss function, wherein the similarity between the source domain and the target domain is constrained using the divergence loss function;
S37. Determine the total loss function; wherein the classification loss function for the target domain data is determined by a formula whose symbols denote the classification score of the i-th sample in the target domain, the label of the i-th sample, and the total number of training samples in the target domain;
S4. Evaluate the performance of the applicable Siamese hybrid neural network based on the independent test set.

2. The multi-source domain adaptive acetylation site prediction method according to claim 1, characterized in that S1 includes: S11. Collect the acetylation modification information from a preset database, and extract local sequences around the modification sites according to a preset extraction length to serve as the labeled positive samples; S12. From the proteins containing verified acetylation sites, select unacetylated lysine sites as the labeled negative samples; S13. For the labeled positive samples and the labeled negative samples, use the CD-HIT tool to remove the proteins whose similarity is greater than the preset threshold; S14. Obtain the independent test set from the labeled positive samples according to a preset ratio.
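As an illustrative sketch of steps S11 and S12, the following extracts fixed-length windows centered on lysine (K) residues, labeling windows at verified acetylation sites as positive and windows at the remaining lysines of the same protein as negative. The flank length, the padding character `X` used at the sequence ends, the 0-based site indexing, and the name `extract_samples` are assumptions for this example; the claim only requires a preset extraction length.

```python
def extract_samples(sequence, acetylated_positions, flank=15, pad="X"):
    """Return (positives, negatives): windows centered on lysine residues.

    acetylated_positions holds 0-based indices of verified acetylation sites.
    """
    padded = pad * flank + sequence + pad * flank
    positives, negatives = [], []
    for index, residue in enumerate(sequence):
        if residue != "K":
            continue
        # padded[index + flank] is sequence[index], so this slice is centered on it
        window = padded[index:index + 2 * flank + 1]
        if index in acetylated_positions:
            positives.append(window)
        else:
            negatives.append(window)
    return positives, negatives

# Hypothetical protein with lysines at indices 1, 7 and 10; only index 1 is acetylated.
positives, negatives = extract_samples("MKTAYIAKQRK", {1}, flank=3)
```

Redundancy removal with CD-HIT (S13) and the split into training and independent test sets (S14) would follow on the resulting samples and are not shown here.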

3. The multi-source domain adaptive acetylation site prediction method according to claim 1, characterized in that, in S2, the sequence feature extractor includes a convolutional neural network and a long short-term memory (LSTM) network.

4. The multi-source domain adaptive acetylation site prediction method according to claim 1, characterized in that, in step S2, the sequence feature classifier performs a linear transformation on the sequence features of the preset dimension to reduce the feature dimension; performs a random deactivation (dropout) operation according to a threshold probability parameter; uses the ReLU function to learn nonlinear features; performs a second linear transformation to reduce the dimension and obtain 2-dimensional sequence features; and calculates the binary classification probability using softmax activation to obtain the classification result.
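The sequence of operations recited in claim 4 can be illustrated with a minimal forward pass. The layer sizes, weight values, dropout probability, and the use of inverted dropout (rescaling kept units during training) are assumptions for this sketch, and all function names are illustrative.

```python
import math
import random

def linear(vector, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

def dropout(vector, probability, training, rng):
    """Inverted dropout: zero units with the given probability, in training only."""
    if not training or probability == 0.0:
        return list(vector)
    keep = 1.0 - probability
    return [x / keep if rng.random() < keep else 0.0 for x in vector]

def relu(vector):
    return [max(0.0, x) for x in vector]

def softmax(vector):
    peak = max(vector)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in vector]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, w1, b1, w2, b2, drop_probability=0.5, training=False, rng=None):
    """Linear -> dropout -> ReLU -> linear -> softmax, in the order of claim 4."""
    rng = rng or random.Random(0)
    hidden = linear(features, w1, b1)       # first linear transformation (dimension reduction)
    hidden = dropout(hidden, drop_probability, training, rng)
    hidden = relu(hidden)                   # nonlinear features
    logits = linear(hidden, w2, b2)         # second linear transformation to 2 scores
    return softmax(logits)                  # binary classification probabilities

# Hypothetical 4-dimensional input feature and small weight matrices.
features = [0.2, -0.4, 0.7, 0.1]
w1 = [[0.5, -0.2, 0.1, 0.0], [0.3, 0.8, -0.5, 0.2], [-0.6, 0.1, 0.4, 0.9]]
b1 = [0.0, 0.1, -0.1]
w2 = [[0.7, -0.3, 0.2], [-0.4, 0.6, 0.5]]
b2 = [0.0, 0.0]
probabilities = classify(features, w1, b1, w2, b2)
```

With `training=False` the dropout step is the identity, so repeated calls on the same input return the same two probabilities.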

5. The multi-source domain adaptive acetylation site prediction method according to claim 1, characterized in that, in step S2, the convolutional neural network is used to extract local features of the protein sequence, and the long short-term memory (LSTM) network is used to extract long-range dependencies within the protein sequence.
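As a sketch of the local feature extraction only, the following one-hot encodes a peptide and applies a one-dimensional convolution that raises the channel count from 20 to 32. The 20-letter alphabet, the kernel width of 3, the 32 output channels, the random kernel values, and the absence of padding are all assumptions for this example; the LSTM stage is omitted.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence):
    """Encode each residue as a 20-dimensional indicator; unknown residues stay all-zero."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        if residue in index:
            row[index[residue]] = 1.0
        encoded.append(row)
    return encoded

def conv1d(encoded, kernels):
    """Unpadded 1D convolution: kernels[c][offset][channel] -> output[position][c]."""
    width = len(kernels[0])
    output = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        output.append([
            sum(k[o][ch] * window[o][ch]
                for o in range(width) for ch in range(len(window[o])))
            for k in kernels
        ])
    return output

# Hypothetical setup: 32 random kernels of width 3 over the 20 input channels.
rng = random.Random(0)
kernels = [[[rng.uniform(-0.1, 0.1) for _ in range(20)] for _ in range(3)]
           for _ in range(32)]
local_features = conv1d(one_hot("YIAKQRK"), kernels)  # 5 positions x 32 channels
```

Each output position summarizes a window of three neighboring residues, which is the sense in which the convolution captures local features; a recurrent layer over these positions would then model longer-range dependencies.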

6. The multi-source domain adaptive acetylation site prediction method according to claim 1, characterized in that, in S2, the sequence feature classifier includes a fully connected neural network, wherein the last layer of the fully connected neural network outputs no less than two predicted scores.

7. A multi-source domain adaptive acetylation site prediction system, characterized in that the system includes:
A data collection module, used to collect acetylation modification information, extract labeled positive samples and select negative samples based on the modification sites, remove proteins with similarity greater than a preset threshold using a preset tool, and randomly select an independent test set from the labeled positive samples;
A network construction module, used to construct a Siamese hybrid neural network comprising a sequence feature extractor and a sequence feature classifier, wherein the numbers of sequence feature classifiers and sequence feature extractors are set according to the number of source domains and target domains, assuming n source domains and 1 target domain; the sequence feature extractor takes one-hot encoded sequence features as input and performs a one-dimensional convolution operation to increase the number of channels of the sequence features; a long short-term memory (LSTM) network, which includes a hidden layer, extracts global context information from the sequence features to obtain sequence features of a preset dimension; the sequence feature classifier predicts the classification result by performing a nonlinear transformation on the sequence features of the preset dimension; and the network construction module is connected to the data collection module;
A network adaptive learning module, used for multi-source domain adaptive learning, which takes post-translational modification samples from all species as input; during training of the Siamese hybrid neural network, calculates the KL divergence between the features of each source domain and the features of the target domain; determines the training weight of each source domain's data for the Siamese hybrid neural network based on the similarity between the source and target domains; and minimizes the distance between the features of each source domain and those of the target domain to perform feature adaptation on the source and target domains, thereby obtaining an applicable Siamese hybrid neural network; the network adaptive learning module is connected to the network construction module; wherein:
it is assumed that the multi-source domain adaptation has n source domains and one target domain;
the data of the n different source domains and of the target domain are input into the Siamese hybrid neural network; after passing through a shared feature extraction network, n source domain features and the target domain features are obtained, and the source domain output features and the target domain output features are obtained by processing with the corresponding classifiers;
the output data of the sequence feature extractor are normalized using the softmax function to obtain normalized source domain features and normalized target domain features, and the KL divergence between each normalized source domain feature and the normalized target domain feature is calculated; wherein, in the KL divergence formula, M represents the dimension of the feature, and the remaining symbols denote the i-th dimensional feature of the target domain and the i-th dimensional feature of the k-th source domain;
weight coefficients are obtained from the KL divergence to serve as the training weights of the source domains in the Siamese hybrid neural network;
the training loss of the Siamese hybrid neural network is calculated from the data of each source domain; wherein the training loss is determined by a classification loss function whose symbols denote the classification score of the i-th sample in the k-th source domain, the label of the i-th sample, and the total number of training samples in the k-th source domain;
the KL divergence loss function is determined, wherein the similarity between the source domain and the target domain is constrained using the divergence loss function;
the total loss function is determined; wherein the classification loss function for the target domain data is determined by a formula whose symbols denote the classification score of the i-th sample in the target domain, the label of the i-th sample, and the total number of training samples in the target domain;
A performance evaluation module, used to evaluate the performance of the applicable Siamese hybrid neural network based on the independent test set, the performance evaluation module being connected to the network adaptive learning module.