Protein sequence classification positioning system and method based on attention reinforcement network

CN118212992B · Active · Publication Date: 2026-06-19 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI JIAOTONG UNIV
Filing Date
2024-03-28
Publication Date
2026-06-19


Abstract

A protein sequence classification and localization system and method based on an attention enhancement network, the system including: a sequence preprocessing module, a sequence representation embedding module, an attention enhancement network, an important fragment extractor, an important site visualization module, a fragment screening and extraction module, and a classification predictor module. The invention exploits the advantages of attention-based networks, namely ease of visualization and strong interpretability. Combined with a large protein language model, it makes full use of the biological knowledge the language model learned during pre-training, and addresses the problem in related technologies of how to effectively combine large language models both to predict the biological characteristics of protein sequences and to identify and mine key regions and important pattern features on those sequences.

Description

Technical Field

[0001] This invention relates to a technology in the field of genetic engineering, specifically a protein sequence classification and localization system and method based on attention enhancement networks.

Background Technology

[0002] With the rapid development of genomics and proteomics research, databases have accumulated a vast number of DNA and protein sequences. Computational biologists therefore need practical tools to extract relevant biological information from these sequences for functional annotation. However, protein function is often determined not by the overall sequence but by certain segments carrying important patterns. For example, the nuclear localization signal is a fragment of 5-40 amino acid residues within the protein sequence, and its presence or absence largely determines whether the protein can enter the cell nucleus and perform its biological function. While numerous predictive models can provide general analyses of protein function and global attributes, they lack further explanation: the models cannot reproduce their decision-making process and cannot tell which specific segment or site within the sequence plays the crucial role.

Summary of the Invention

[0003] To address the aforementioned shortcomings of existing technologies, this invention proposes a protein sequence classification and localization system and method based on attention enhancement networks. This system fully leverages the advantages of attention-based networks, such as ease of visualization and strong interpretability. By combining it with a large protein language model, it can fully utilize the biological knowledge learned by the large language model during pre-training. This addresses the problem in related technologies of effectively combining large language models to predict the biological characteristics of protein sequences and to identify and mine key regions and important pattern features on protein sequences.

[0004] This invention is achieved through the following technical solution:

[0005] This invention relates to a protein sequence classification and localization system based on an attention enhancement network, comprising: a sequence preprocessing module, a sequence representation embedding module, an attention enhancement network, an important fragment extractor, an important site visualization module, a fragment selection and extraction module, and a classification predictor module. Specifically: the sequence preprocessing module uses sequence length information to segment sequences that exceed the maximum input length of the pre-trained model, obtaining sequence fragments that fit the model's input; the sequence representation embedding module embeds the input sequence fragments with the protein pre-trained model esm1b, obtaining a vector representation matrix of the sequence fragments; the attention enhancement network generates an attention distribution by multiplying its attention mechanism with the representations, then multiplies the protein fragment representation by the attention weight at each position, obtaining an attention-adjusted protein fragment representation; the important fragment extractor selects sites with high weights in the attention distribution obtained from the attention enhancement network and fuses them to obtain recommended fragments; the important site visualization module renders the attention weight of the amino acid at each site as the height of its amino acid letter, based on the attention distribution obtained from the attention enhancement network, giving a visual display of the input sequence; the fragment selection and extraction module extracts the representations of the recommended fragments, scores them, and obtains scoring results for the recommended fragments; the attention enhancement network stores the protein representations generated at each layer and passes them to the classification predictor module as a hierarchical representation of the protein sequence; the classification predictor module obtains the protein subcellular localization classification results based on the overall sequence representation, including: conserved regions, signal peptide fragments, nuclear localization signals, and protein binding sites.

[0006] This invention relates to a protein sequence classification and localization method based on attention enhancement networks. The method involves preprocessing the original sequence to accommodate the input length requirements of a large model, obtaining a vector space representation of each amino acid in the protein sequence using a large language model of the protein sequence as a preliminary sequence representation, using an attention enhancement network to progressively enhance the amino acid representations at different positions, and integrating all amino acid representations to obtain the overall sequence representation, generating a sequence attention distribution. This is then visualized to obtain a hierarchical sequence attention distribution map. A classification predictor module is used to predict the subcellular localization of the protein. An important fragment extractor is used to obtain important fragment recommendations, and a fragment selection and extraction module is used to score and rank the recommended important fragments.

[0007] Technical effect

[0008] This invention uses an attention enhancement network to feed back the attention distribution while deriving the model's prediction of protein subcellular localization. The fragment generation transmitter then passes the attention distribution to the important fragment extractor to obtain important fragment recommendations, and the fragment selection and extraction module scores and ranks the recommended fragments. Subcellular localization is thus predicted at the same time as the sequence fragments important for that localization are identified. Compared to existing technologies, this invention can obtain the fragments important for subcellular localization while predicting it. A recommendation system trained on a nuclear localization signal dataset can predict nuclear localization signals and also mine potential nuclear localization signals.

Description of the Drawings

[0009] Figure 1 is a schematic diagram of the system of the present invention;

[0010] Figure 2 is a schematic diagram of the attention enhancement network;

[0011] Figure 3 is a schematic diagram of a basic attention unit;

[0012] Figure 4 is a schematic diagram of the important site visualization module;

[0013] Figure 5 is a schematic diagram of the important fragment extractor;

[0014] Figure 6 is a flowchart of an implementation example;

[0015] Figure 7 is a schematic diagram illustrating the performance of DANLS on the INSP hybrid dataset;

[0016] Figure 8 is a schematic diagram of the fragment screening and extraction module.

Detailed Implementation

[0017] As shown in Figure 1, this embodiment relates to a protein sequence classification and localization system based on an attention enhancement network, which includes: a sequence preprocessing module, a sequence representation embedding module, an attention enhancement network, an important fragment extractor, an important site visualization module, a fragment screening and extraction module, and a classification predictor module.

[0018] The classification predictor module described above adopts an MLP, LSTM+MLP, or transformer+MLP structure.

[0019] As shown in Figure 4, the important site visualization module includes: a row attention extraction unit, an attention synthesis unit, and a visualization representation unit, wherein: the row attention extraction unit extracts the attention information $a_{tl}$ of each deep attention layer in the attention enhancement network; the attention synthesis unit sums the attention of all layers to obtain the attention synthesis result of all layers, $a_{cT} = \sum_{L} a_{tl}^{L}$, where $L$ represents the layer number; and the visualization representation unit visualizes $a_{cT}$ to obtain the visualization result of important sites, namely a protein sequence attention distribution hierarchy map.
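The summation and letter-height rendering described in [0019] can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, and scaling the heights by the largest combined weight is an assumption, since the text does not say how attention weights map to letter heights.

```python
def combine_layer_attention(layer_attention):
    """Sum the per-layer attention a_tl over all layers to get a_cT."""
    if not layer_attention:
        return []
    length = len(layer_attention[0])
    if any(len(layer) != length for layer in layer_attention):
        raise ValueError("every layer must cover the same sequence positions")
    return [sum(layer[pos] for layer in layer_attention) for pos in range(length)]


def letter_heights(sequence, combined, max_height=1.0):
    """Pair each residue with a letter height proportional to its attention.

    Assumption: heights are scaled so the strongest site reaches max_height.
    """
    peak = max(combined)
    if peak <= 0:
        return [(aa, 0.0) for aa in sequence]
    return [(aa, max_height * w / peak) for aa, w in zip(sequence, combined)]


# Two toy layers over a four-residue sequence.
layers = [
    [0.25, 0.25, 0.5, 0.0],
    [0.25, 0.0, 0.5, 0.25],
]
a_cT = combine_layer_attention(layers)
heights = letter_heights("MLKR", a_cT)
print(a_cT)     # [0.5, 0.25, 1.0, 0.25]
print(heights)  # [('M', 0.5), ('L', 0.25), ('K', 1.0), ('R', 0.25)]
```

A plotting library could then draw each letter at its computed height to produce the hierarchy map.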

[0020] As shown in Figure 5, the important fragment extractor includes: an attention extraction unit, a weight sorting unit, and a fusion judgment unit, wherein: the attention extraction unit obtains the comprehensive attention $a_{cT}$ from the fragment generation transmitter and passes it to the weight sorting unit; the weight sorting unit sorts all sites by the magnitude of their comprehensive attention weight in $a_{cT}$ and selects the top N sites as the pre-selected key site set; the fusion judgment unit traverses the pre-selected key site set and decides whether to fuse two key sites, together with the amino acids between them, according to whether the distance between the two sites is less than L. The resulting set of important fragments after fusion is the important fragment recommendation.

[0021] N is a pre-defined hyperparameter.
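The top-N selection and fusion rule of [0020] and [0021] can be sketched as follows. The function and parameter names (`recommend_fragments`, `top_n`, `max_gap`) are hypothetical, and comparing only key sites that are adjacent in sequence order is an assumption about how the fusion judgment unit traverses the set.

```python
def recommend_fragments(attention, top_n, max_gap):
    """Return inclusive (start, end) index pairs of recommended fragments.

    attention: combined attention weight a_cT, one value per residue
    top_n:     number of key sites to pre-select (N in the text)
    max_gap:   fuse two key sites when their distance is below this (L in the text)
    """
    if top_n <= 0:
        return []
    # Weight sorting unit: rank sites by attention weight, keep the top N.
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    key_sites = sorted(ranked[:top_n])
    if not key_sites:
        return []
    # Fusion judgment unit: merge neighbouring key sites closer than max_gap,
    # which also pulls in the residues lying between them.
    fragments = []
    start = end = key_sites[0]
    for site in key_sites[1:]:
        if site - end < max_gap:
            end = site
        else:
            fragments.append((start, end))
            start = end = site
    fragments.append((start, end))
    return fragments


weights = [0.1, 0.9, 0.2, 0.8, 0.1, 0.1, 0.1, 0.1, 0.7, 0.1]
print(recommend_fragments(weights, top_n=3, max_gap=3))  # [(1, 3), (8, 8)]
```

In this toy input, sites 1 and 3 are two positions apart and are fused into one fragment that includes residue 2, while site 8 is too far away and remains a single-residue fragment.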

[0022] As shown in Figure 8, the fragment selection and extraction module includes: an important fragment representation extraction unit and an MLP predictor unit. The important fragment representation extraction unit uses the attention enhancement network to enhance and extract the representations of the recommended important fragments and outputs them to the MLP predictor unit. The MLP predictor unit computes a recommendation score for each fragment and ranks the fragments by score, yielding a set of recommended fragments ordered by recommendation degree.

[0023] Depending on the downstream task, key segments can be locally extracted for further prediction and classification.

[0024] As shown in Figure 6, this embodiment illustrates a protein sequence classification and localization method based on the attention enhancement network of the aforementioned system. For a protein sequence $S_l$ of length $l$, whose amino acid composition is written in uppercase letters such as M and A, for example $S_l$ = MLLLLLLLLLLPPLVLRVAA…, the method specifically includes:

[0025] Step 1) Preprocess the original sequence to match the input requirements of the large language model. Specifically, split the sequence or select from it proportionally, depending on the length of the original sequence, so that the sequence length satisfies the large language model's requirement of at most 1024. Here, cut() and split() denote the proportional selection and the splitting of the sequence, respectively.

[0026] The splitting process mentioned above refers to: when the sequence length $l$ is greater than 2048, the sequence $S_l$ is divided into multiple segments of length 1024, giving a set of segments $\{S_{l1}, S_{l2}, \ldots\}$.

[0027] The aforementioned proportional selection process refers to: when the sequence length $l$ is greater than 1024 and less than 2048, $S_l$ is sampled proportionally, specifically: select a sequence of length 1024×25% from the head, a sequence of length 1024×25% from the tail, and a sequence of length 1024×50% from the middle. Finally, the selected sequences are concatenated in the order head, middle, tail to obtain the final protein sequence of length 1024.
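The length handling of Step 1 can be sketched as follows. The names `cut` and `split` follow the text; `preprocess` and `MAX_LEN` are hypothetical. Two points are assumptions because the text leaves them open: the 50% middle window is taken centred on the midpoint of the sequence, and a sequence of exactly 2048 residues is split rather than sampled.

```python
MAX_LEN = 1024  # longest input accepted by the language model, per the text


def split(seq, size=MAX_LEN):
    """Divide a long sequence into consecutive fragments of at most `size` residues."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def cut(seq, size=MAX_LEN):
    """Keep 25% from the head, 50% from the middle and 25% from the tail."""
    head_len = tail_len = size // 4
    mid_len = size - head_len - tail_len
    mid_start = (len(seq) - mid_len) // 2  # assumption: window centred on the midpoint
    return seq[:head_len] + seq[mid_start:mid_start + mid_len] + seq[-tail_len:]


def preprocess(seq):
    """Return the list of fragments to embed for one protein sequence."""
    if len(seq) <= MAX_LEN:
        return [seq]
    if len(seq) < 2 * MAX_LEN:
        return [cut(seq)]
    return split(seq)  # assumption: a length of exactly 2048 is also split


print([len(f) for f in preprocess("A" * 500)])   # [500]
print([len(f) for f in preprocess("A" * 1500)])  # [1024]
print([len(f) for f in preprocess("A" * 3000)])  # [1024, 1024, 952]
```

Note that with splitting, the last fragment is generally shorter than 1024 residues.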

[0028] Step 2) Embed the sequence representation using ESM-1b (Evolutionary Scale Modeling), a model based on the transformer architecture. This embedding lets the representation carry the knowledge learned by the large model during massive pre-training, and transforms the sequence into a preliminary representation in matrix form. Specifically, a representation is generated for each segment, $\{E_{l1}, E_{l2}, \ldots\} = \{LAM(S_{l1}), LAM(S_{l2}), \ldots\}$, giving the final representation of the sequence $E = Concat(E_{l1}, E_{l2}, \ldots)$, where Concat() concatenates the generated vector matrices along the sequence length dimension. The final shape of the protein representation $E$ is $l \times H$, where $H$ is the embedding dimension of the large language model's representation and $l$ is the length of the protein sequence input to the large language model.
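The per-segment embedding and concatenation of Step 2 can be sketched as follows. The embedder here, `toy_lam`, is a hypothetical stand-in that only mimics the shape of a language model output; it is not ESM-1b, and `embed_fragments` is likewise an illustrative name.

```python
def embed_fragments(fragments, lam):
    """Embed each fragment with `lam` and concatenate along the length axis."""
    rows = []
    for fragment in fragments:
        matrix = lam(fragment)  # expected: one vector of H numbers per residue
        if len(matrix) != len(fragment):
            raise ValueError("embedder must return one vector per residue")
        rows.extend(matrix)
    return rows  # shape: (total length l) x H


def toy_lam(fragment, hidden=4):
    """Hypothetical stand-in for LAM(); produces a dummy H-dimensional vector per residue."""
    return [[float(ord(aa) % 7)] * hidden for aa in fragment]


E = embed_fragments(["MLL", "PV"], toy_lam)
print(len(E), len(E[0]))  # 5 4
```

In practice `lam` would wrap the pre-trained model, and H would be that model's embedding dimension.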

[0029] The large language model described above is trained in a supervised manner, using the sequences in the training set together with their labels (e.g., located at different subcellular sites) to update network parameters through gradient backpropagation and an optimization algorithm (e.g., Adam).

[0030] Preferably, in this embodiment, the parameters of the language model are kept fixed to save computing resources while still ensuring good performance on downstream tasks. Users can choose whether to update the language model as well, depending on the computing resources actually available.

[0031] Step 3) Pass the preliminary sequence representation through the attention enhancement network to generate the sequence attention distribution, and visualize it to obtain the sequence attention distribution hierarchy map.

[0032] As shown in Figure 2, the attention enhancement network includes: a deep attention layer, a layer residual connection network, a classifier information transmission module, and a fragment generation transmitter, wherein: the deep attention layer obtains the layer attention-enhanced representation $E_L$ based on the input overall protein representation; the layer residual connection network adds the sequence representation matrix $E$ to the layer attention-enhanced representation $E_L$ to obtain the output of each layer, $E_{OL} = (E + E_L)$; the classifier information transmission module multiplies the vector $q_c$ with the amino acid representation at each position of the output $E_{O\sim final}$ of the last deep attention layer to obtain the final attention distribution $a_{final}$, and generates the weights at each position to obtain the overall representation of the sequence; the fragment generation transmitter extracts the attention synthesis result of all layers, $a_{cT}$, where 1 is a column vector with all elements being 1.

[0033] As shown in Figure 2, the deep attention layer includes: several basic attention units and a layer neuron information synthesizer, wherein: the row attention distribution of the i-th basic attention unit is $a_{li}$; the layer neuron information synthesizer adds the row attention distributions of the basic attention units to obtain the comprehensive attention distribution $a_{tl} = \sum_i a_{li}$; all basic attention units share a length projection vector representation $v_l$ and a column attention representation $v_a$, from which the total attention enhancement matrix $A_r = (v_l \odot v_a) * a_{tl}$ is formed; the Hadamard product of this attention enhancement matrix with the sequence's representation vector matrix $E$ gives the layer attention-enhanced representation $E_L = (E \odot A_r)$.

[0034] As shown in Figure 3, the basic attention unit includes a row attention unit and a column attention unit, wherein: the row attention unit $q$ forms the attention distribution over the amino acids by matrix multiplication with the sequence representation vector matrix $E$, $a_l = E^T q$, and forms a judgment vector based on the representation vector matrix $E$ and the attention focus, where 1 represents a column vector with all elements equal to 1; the column attention unit learns a column attention representation $v_a$ with the same dimension as the vector and, based on the attention focus distribution $a_l$ formed by the row attention, forms the attention enhancement matrix $A_r = v_a * a_l$; taking the Hadamard product of the attention enhancement matrix $A_r$ with the representation vector matrix $E$ of the sequence yields the enhanced representation after the basic attention unit, $E_r = A_r \odot E$.

[0035] Each column in the characterization vector matrix E represents the amino acid characterization at the corresponding position.
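The three formulas of a single basic attention unit, $a_l = E^T q$, $A_r = v_a * a_l$ and $E_r = A_r \odot E$, can be traced on a toy example as follows. This sketch assumes $E$ has shape H×l with one column per residue, as stated in [0035], and reads $v_a * a_l$ as an outer product so that $A_r$ has the same shape as $E$. The helper names and the numeric values of `q` and `v_a` are hypothetical; in the real network they would be learned.

```python
def row_attention(E, q):
    """a_l = E^T q: one attention weight per sequence position."""
    hidden, length = len(E), len(E[0])
    return [sum(E[h][p] * q[h] for h in range(hidden)) for p in range(length)]


def enhancement_matrix(v_a, a_l):
    """A_r = v_a * a_l, read here as an outer product of shape H x l."""
    return [[v * a for a in a_l] for v in v_a]


def hadamard(A, B):
    """Element-wise product of two equally shaped matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


# Toy sizes: H = 2 features, l = 3 positions; each column of E is one residue.
E = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 2.0]]
q = [1.0, 0.5]    # row attention unit (illustrative values)
v_a = [2.0, 1.0]  # column attention representation (illustrative values)

a_l = row_attention(E, q)
A_r = enhancement_matrix(v_a, a_l)
E_r = hadamard(A_r, E)
print(a_l)  # [1.0, 2.5, 1.0]
print(E_r)  # [[2.0, 10.0, 0.0], [0.0, 2.5, 2.0]]
```

The second position receives the largest attention weight, so its column is amplified most in the enhanced representation.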

[0036] The classifier information transmission module includes: a vector representation extraction unit, an MLP vector dimension transformation unit, and a concatenation unit. The vector representation extraction unit extracts the vector representation formed by each basic attention unit and concatenates all vectors of a layer into a single vector, which is the overall judgment vector $v_t$ of that attention layer for the protein representation. The MLP vector dimension transformation unit transforms the overall judgment vector $v_t$ to the specified dimension. The concatenation unit then concatenates the overall judgment vectors from all levels to form the final judgment vector $v_{ut}$, so that the classification predictor module can see both deep-level information and information from the initial layers, specifically: $v_{ut} = concat(v_{t1}, v_{t2}, v_{t3}, \ldots)$.

[0037] Step 4) Use all experimentally validated protein sequences located in the cell nucleus from the SWISS-PROT database as positive samples and randomly select the same number of other protein sequences not located in the cell nucleus as negative samples to train the attention enhancement network in Step 3. The training process uses the mean squared error between the probability prediction result of the protein sequence being located in the cell nucleus and the label as the loss function, and updates the network parameters through gradient backpropagation.

[0038] The training process preferably uses a base learning rate of 0.1, which is repeatedly multiplied by 0.1. The representations of the different input sequences are fed into the attention enhancement network, which obtains the prediction results through the classification predictor module. The loss function is calculated from the labels of the training set sequences and the prediction results.
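The loss and learning-rate settings of [0037] and [0038] can be sketched as follows. The text does not say when each multiplication by 0.1 is applied, so the `stage` counter below is a hypothetical stand-in for whatever decay trigger is used; the function names are illustrative as well.

```python
def mse_loss(predictions, labels):
    """Mean squared error between predicted nuclear probabilities and 0/1 labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


def learning_rate(stage, base=0.1, factor=0.1):
    """Base rate 0.1, multiplied by 0.1 once per decay stage (assumed trigger)."""
    return base * factor ** stage


# Two nuclear (label 1) and one non-nuclear (label 0) toy predictions.
loss = mse_loss([0.9, 0.2, 0.6], [1, 0, 1])
rates = [learning_rate(stage) for stage in range(3)]
print(round(loss, 4))                  # 0.07
print([round(r, 6) for r in rates])    # [0.1, 0.01, 0.001]
```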

[0039] In practical experiments under the specific environment settings of INSP, with the model trained on the same training set, the experimental data obtained are shown in Figure 7. The table below compares accuracy, recall, and F1 score with other predictors in the field.

[0040] Table 1

[0041]

[0042] Compared with existing technologies, this method achieves a recall of 0.99, surpassing the previous best result in the field, 0.678 by PSORTII. It achieves an accuracy of 0.64, ranking second in the field, 0.027 below the best result of 0.667. Its F1 score reaches 0.78, an improvement of 0.203 over the previous best result, 0.577 by INSP.
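The figures quoted in [0042] can be cross-checked with a few lines of arithmetic. The sketch below assumes that the value reported as "accuracy" is the precision that enters the F1 score; under that reading the three numbers are mutually consistent.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.64, 0.99)
print(round(f1, 2))            # 0.78
print(round(0.667 - 0.64, 3))  # 0.027, gap to the best accuracy
print(round(0.78 - 0.577, 3))  # 0.203, F1 gain over INSP
```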

[0043] The above-described specific implementations can be partially adjusted by those skilled in the art in different ways without departing from the principles and purpose of the present invention. The scope of protection of the present invention is defined by the claims and is not limited to the above-described specific implementations. All implementation schemes within the scope of the claims are bound by the present invention.

Claims

1. A protein sequence classification and localization system based on an attention enhancement network, characterized in that it comprises: a sequence preprocessing module, a sequence representation embedding module, an attention enhancement network, an important fragment extractor, an important site visualization module, a fragment selection and extraction module, and a classification predictor module, wherein: the sequence preprocessing module uses sequence length information to segment sequences that exceed the maximum input length of the pre-trained model, obtaining sequence fragments that fit the model's input; the sequence representation embedding module embeds the input sequence fragments with the protein pre-trained model esm1b, obtaining a vector representation matrix of the sequence fragments; the attention enhancement network generates an attention distribution by multiplying its attention mechanism with the representation and multiplies the protein fragment representation by that distribution, obtaining an attention-adjusted protein fragment representation; the important fragment extractor selects sites with high weights in the attention distribution obtained from the attention enhancement network and fuses them to obtain recommended fragments; the important site visualization module renders the attention weight of the amino acid at each site as the height of its amino acid letter, based on the attention distribution obtained from the attention enhancement network, giving a visual display of the input sequence; the fragment selection and extraction module extracts the representations of the recommended fragments, scores them, and obtains the recommended fragment scores; the attention enhancement network stores the protein representations generated at each layer and passes them to the classification predictor module as a hierarchical representation of the protein sequence; the classification predictor module obtains the protein subcellular localization classification results based on the overall sequence representation;

the attention enhancement network includes: a deep attention layer, a layer residual connection network, a classifier information transmission module, and a fragment generation transmitter, wherein: the deep attention layer obtains the layer attention-enhanced representation $E_L$ based on the input overall protein representation; the layer residual connection network adds the sequence representation matrix $E$ to the layer attention-enhanced representation $E_L$ to obtain the output of each layer, $E_{OL} = (E + E_L)$; the classifier information transmission module multiplies the vector $q_c$ with the amino acid representation at each position of the output $E_{O\sim final}$ of the last deep attention layer to obtain the final attention distribution $a_{final}$, and generates the weights at each position to obtain the overall representation of the sequence; the fragment generation transmitter extracts the attention synthesis result of all layers, $a_{cT}$, where 1 is a column vector with all elements being 1;

the deep attention layer comprises: several basic attention units and a layer neuron information synthesizer, wherein: the row attention distribution of the i-th basic attention unit is $a_{li}$; the layer neuron information synthesizer adds the row attention distributions of the basic attention units to obtain the comprehensive attention distribution $a_{tl} = \sum_i a_{li}$; all basic attention units share a length projection vector representation $v_l$ and a column attention representation $v_a$, from which the total attention enhancement matrix $A_r = (v_l \odot v_a) * a_{tl}$ is formed; the Hadamard product of this attention enhancement matrix with the representation vector matrix $E$ of the sequence gives the layer attention-enhanced representation $E_L = (E \odot A_r)$;

the basic attention unit includes: a row attention unit and a column attention unit, wherein: the row attention unit $q$ forms the attention distribution over the amino acids by matrix multiplication with the sequence's representation vector matrix $E$, $a_l = E^T q$, and forms a judgment vector based on the representation vector matrix $E$ and the attention focus, where 1 represents a column vector with all elements equal to 1; the column attention unit learns a column attention representation $v_a$ with the same dimension as the vector and, based on the attention focus distribution $a_l$ formed by the row attention, forms the attention enhancement matrix $A_r = v_a * a_l$; the Hadamard product of the attention enhancement matrix $A_r$ with the sequence representation vector matrix $E$ yields the enhanced representation after the basic attention unit, $E_r = A_r \odot E$;

the classifier information transmission module includes: a vector representation extraction unit, an MLP vector dimension transformation unit, and a concatenation unit, wherein: the vector representation extraction unit extracts the vector representation formed by each basic attention unit and concatenates all vectors of a layer into a single vector, which is the overall judgment vector $v_t$ of that attention layer for the protein representation; the MLP vector dimension transformation unit transforms the overall judgment vector $v_t$ to the specified dimension; the concatenation unit then concatenates the overall judgment vectors from all levels to form the final judgment vector $v_{ut}$, so that the classification predictor module can see both deep-level information and information from the initial layers, specifically: the judgment vector $v_{ut} = concat(v_{t1}, v_{t2}, v_{t3}, \ldots)$.

2. The protein sequence classification and localization system based on attention enhancement networks according to claim 1, characterized in that the important site visualization module includes: a row attention extraction unit, an attention synthesis unit, and a visualization representation unit, wherein: the row attention extraction unit extracts the attention information $a_{tl}$ of each deep attention layer in the attention enhancement network; the attention of all layers is summed to obtain the attention synthesis result of all layers, $a_{cT} = \sum_{L} a_{tl}^{L}$, where $L$ represents the layer number; and the visualization representation unit visualizes $a_{cT}$ to obtain the visualization result of important sites, namely a protein sequence attention distribution hierarchy map.

3. The protein sequence classification and localization system based on attention enhancement networks according to claim 1, characterized in that the important fragment extractor includes: an attention extraction unit, a weight sorting unit, and a fusion judgment unit, wherein: the attention extraction unit obtains the comprehensive attention $a_{cT}$ from the fragment generation transmitter and passes it to the weight sorting unit; the weight sorting unit sorts all sites by the magnitude of their comprehensive attention weight in $a_{cT}$ and selects the top N sites as the pre-selected key site set; the fusion judgment unit traverses the pre-selected key site set and decides whether to fuse two key sites, together with the amino acids between them, according to whether the distance between the two sites is less than L; the resulting set of important fragments after fusion constitutes the fragments that are more important to the prediction result.

4. The protein sequence classification and localization system based on attention enhancement networks according to claim 1, characterized in that the fragment selection and extraction module includes an important fragment representation extraction unit and an MLP predictor unit, wherein: the important fragment representation extraction unit extracts, after enhancement by the attention enhancement network, the representations of the fragments generated by the important fragment extractor that are important to the final classification predictor, and inputs them into the MLP predictor unit; the MLP predictor unit obtains the recommendation score of each fragment and ranks the fragments by score, yielding a set of recommended fragments ordered by recommendation degree.

5. A protein sequence classification and localization method based on the system of any one of claims 1-4, comprising: preprocessing the original sequence to fit the input length requirements of the large model; obtaining the vector space representation of each amino acid in the protein sequence through a large protein language model as a preliminary sequence representation; using the attention enhancement network to enhance the representations of amino acids at different positions layer by layer and integrating all amino acid representations to obtain the overall sequence representation, generating a sequence attention distribution, and obtaining a sequence attention distribution hierarchy map through visualization; using the classification predictor module to predict the subcellular localization of the protein; using the important fragment extractor to generate, from the attention obtained from the attention enhancement network, the sequence fragments that are more important to the prediction results; and using the fragment screening and extraction module to score and rank them, obtaining the sequence fragments that are important to the prediction results.