A method for predicting lncRNA-protein interaction relationships based on clustering

By using a cluster-based prediction method, segmented k-mer encoding and joint prediction by multiple cluster centers, the problem of low prediction accuracy caused by positive and negative sample imbalance in existing technologies is solved, and higher accuracy and performance in predicting lncRNA-protein interactions are achieved.

CN117095738B · Active · Publication Date: 2026-06-23 · ANHUI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ANHUI UNIV
Filing Date
2023-08-18
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing methods for predicting lncRNA-protein interactions generally employ randomization or computational scoring to address the imbalance between positive and negative samples, resulting in low prediction accuracy and poor performance.

Method used

A cluster-based prediction method is adopted, which obtains the features of lncRNA and protein through segmented k-mer encoding, and uses multiple cluster centers for joint prediction. Independent autoencoders are used to map different features of the samples to different feature spaces, and multiple feature spaces are used to jointly constrain the classification of lncRNA-protein pairs.

Benefits of technology

It improves the accuracy of lncRNA-protein interaction prediction, better uncovers potential interaction pairs, adapts to the actual situation of imbalanced data, avoids the loss of key information, and improves the model's adaptability and prediction performance.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a cluster-based method for predicting lncRNA-protein interactions, belonging to the technical field of bioinformatics. The method includes the following steps: obtaining a dataset of the lncRNAs and proteins to be predicted; obtaining the features of the lncRNAs and proteins using segmented k-mer encoding; and inputting these features into a trained lncRNA-protein interaction prediction model to obtain the interaction prediction result. Beneficial effects: the method addresses the low accuracy of LPI prediction when positive and negative samples are imbalanced. Experimental results show that, compared with other methods, the method achieves better results on multiple datasets. The model is also better at uncovering potential lncRNA-protein interaction pairs, so researchers are less likely to miss possible interactions; prediction accuracy is higher and prediction performance is better.

Description

Technical Field

[0001] This invention relates to the field of bioinformatics technology, specifically to a cluster-based method for predicting lncRNA-protein interactions.

Background Technology

[0002] Noncoding RNAs (ncRNAs) are RNAs that do not encode proteins, and typically include small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs), microRNAs (miRNAs), and long non-coding RNAs (lncRNAs).

[0003] Long noncoding RNAs (lncRNAs) are ncRNAs longer than 200 nucleotides. LncRNAs regulate gene expression by influencing chromatin modification, transcriptional regulation, and post-transcriptional regulation; they are also associated with diseases such as cancer. As one way lncRNAs exert their biological effects, studying the interactions between lncRNAs and proteins can better explore the roles of lncRNAs in biological processes.

[0004] To study the interactions between lncRNAs and proteins (lncRNA-protein interactions, LPIs), numerous experiments have been conducted using laboratory techniques such as RNAcompete, RIP-Chip, HITS-CLIP, and PAR-CLIP. These experiments are extremely time-consuming and expensive. Therefore, many more convenient and efficient computational methods have been proposed.

[0005] The LPI prediction problem can be viewed as a classification problem: given a lncRNA feature and a protein feature, determine whether their feature combinations interact. Earlier, several methods based on traditional machine learning emerged for predicting lncRNA-protein interactions. Bellucci et al. proposed the CatRAPID method, which uses the physicochemical properties of RNA and proteins, such as secondary structure and hydrogen bonds, to assess the interaction tendencies between lncRNAs and proteins. Muppirala et al. proposed a model called RPIseq, which uses only the sequence information of lncRNAs and proteins, and then feeds it into support vector machines and random forests for classification. Lu et al. proposed the lncPro method, which encodes RNA and protein sequences into numerical vectors and uses matrix multiplication to score lncRNA-protein pairs; the scores are used to judge interactions.

[0006] With the development of computer technology, using deep learning to solve the problem of lncRNA-protein interaction (LPI) prediction is becoming increasingly popular. Traditional machine learning methods rely on manually extracted features, which directly affects the performance of the model. Deep learning, on the other hand, does not require manual feature construction; it can automatically extract hidden, abstract features based on the input-output relationship of the model. IPMiner, proposed by Pan et al., is a classic LPI prediction method. IPMiner uses stacked autoencoders to mine hidden interaction patterns from the sequence features of proteins and RNAs, and then uses ensemble learning to combine multiple classifiers. PLRPIM, proposed by Wekesa et al., is a method combining deep learning and machine learning for plant LPI prediction. It uses stacked sparse autoencoders to extract high-level abstract features based on sequences, and uses a fusion of random forest (RF) and light gradient boosting machine (LGBM) to build a prediction model. Zhang et al. proposed the method LPI-CNNCP, which uses copy-padding techniques to convert variable-length lncRNA and protein sequences into fixed-length sequences, then uses high-order one-hot encoding to encode the lncRNA or protein sequences, and finally uses a CNN to predict whether there is an interaction between lncRNA and protein pairs. Huang et al. proposed LGFC-CNN, which uses a CNN model and designs global, local, and handcrafted features based on sequence information. Furthermore, LGFC-CNN utilizes protein-protein similarity to calculate the interaction score between lncRNAs and proteins to select appropriate negative samples. Shen et al. proposed NPI-GNN, which uses graph neural networks to handle LPI prediction problems, employing node2vec features and k-mer features, and integrating GraphSAGE and top-k pooling on the SEAL framework to construct the GNN model.

[0007] For deep learning models, dataset selection is crucial. For example, LPI-BLS, NPI-GNN, PREI-SC, and PLRPIM randomly select lncRNA-protein pairs with no verified interaction to construct a negative sample set equal in size to the positive sample set. LPI-CNNCP and LGFC-CNN, on the other hand, select equal numbers of positive and negative samples based on similarity scores. Typically, in LPI prediction problems, the distribution of positive and negative samples is unbalanced, with the number of negative samples often far exceeding the number of positive samples. Ignoring or discarding some negative samples can lead to the loss of crucial information, affecting the model's learning performance. Manually selecting samples, however, can cause the model to perform poorly in more general situations.

[0008] Chinese patent application CN115547407A discloses a method for predicting lncRNA-protein interactions based on a deep autoencoder. This method obtains initial features of the lncRNA and protein to be predicted, and then inputs these initial features into a trained lncRNA-protein interaction prediction model to obtain the interaction prediction results. The patent utilizes marginal Fisher analysis to learn the optimal classification features of lncRNA-protein interaction samples, improving the accuracy of lncRNA-protein interaction prediction. However, this patent does not address how to improve prediction accuracy in cases of imbalanced positive and negative samples.

Summary of the Invention

[0009] The technical problem to be solved by this invention is that existing lncRNA-protein interaction prediction methods generally handle the imbalance between positive and negative samples with random or score-based selection, which leaves a portion of the negative samples unused or discarded and results in low prediction accuracy and poor prediction performance.

[0010] The present invention solves the above-mentioned technical problems through the following technical means:

[0011] A cluster-based method for predicting lncRNA-protein interactions includes the following steps:

[0012] Obtain the dataset of lncRNAs and proteins to be predicted, use segmented k-mer encoding to obtain the features of lncRNAs and proteins, input the features of lncRNAs and proteins into the trained lncRNA-protein interaction prediction model, and obtain the interaction prediction results.

[0013] The trained lncRNA-protein interaction prediction model was obtained through the following methods:

[0014] (1) Pre-training process:

[0015] S1: Obtain a dataset containing lncRNA sequences, protein sequences, and the interaction relationships between lncRNA and protein sequences;

[0016] S2: Segment the lncRNA sequence and protein sequence into k-mer encodings to obtain m sets of features;

[0017] Explanation: A lncRNA-protein sample pair is represented by m sets of features, including m-1 local features and one global feature;

[0018] S3: Input m sets of features into m autoencoders respectively, calculate the deviation between the input and output of the autoencoders as the loss of the pre-training process, and continuously train to gradually reduce the deviation. Finally, the pre-training ends and the pre-trained autoencoder model is obtained.

[0019] The m autoencoders are independent of each other, and the loss of each autoencoder is calculated separately, as shown in the following formula:

[0020]

[0021] Where $x_i$ denotes the encoder input, $x_o$ denotes the decoder output, $d$ denotes the dimension of $x_i$, and $n$ is the number of samples;
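The formula of paragraph [0020] is not reproduced in this text, so the following is only a minimal sketch of one plausible form of the pre-training loss. It assumes a squared deviation averaged over the d dimensions and the n samples; the function name `reconstruction_loss` and the exact normalization are assumptions, not taken from the patent.

```python
from typing import Sequence


def reconstruction_loss(inputs: Sequence[Sequence[float]],
                        outputs: Sequence[Sequence[float]]) -> float:
    """Deviation between encoder inputs x_i and decoder outputs x_o.

    Assumption: squared error, averaged over d dimensions and n samples.
    """
    n = len(inputs)
    if n == 0 or n != len(outputs):
        raise ValueError("inputs and outputs must be non-empty and equal in length")
    total = 0.0
    for x_in, x_out in zip(inputs, outputs):
        d = len(x_in)
        if d == 0 or d != len(x_out):
            raise ValueError("each input/output pair must share a non-zero dimension")
        total += sum((a - b) ** 2 for a, b in zip(x_in, x_out)) / d
    return total / n
```

Each of the m autoencoders would evaluate such a loss on its own feature set, since the losses are computed separately.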

[0022] (2) Training process:

[0023] S4: Calculate cluster centers using the pre-trained autoencoder model from step S3:

[0024] Each of the m features of each sample in the dataset is input into m autoencoders. If there are n samples in the dataset, each autoencoder will produce n output vectors. The average of these vectors is used as the cluster center, resulting in a total of m cluster centers.

[0025] S5: Discard the decoder part of the pre-trained autoencoder model and keep only the encoder part;

[0026] Explanation: Specifically, m encoder models are reinitialized, and the parameters of the encoder parts of the m autoencoders are transferred to the initialized encoder models. Then, these encoders are trained on the basis of the pre-trained parameter weights, and the structure of these encoders is consistent with the encoder structure in the pre-training process.

[0027] S6: Input the m sets of features of each sample into m encoders to obtain m output vectors; calculate the deviation values between the m output vectors and the corresponding m cluster centers, and use them as the training loss. The formula for calculating the training loss of the j-th encoder among the m encoders is as follows:

[0028]

[0029] Where $n$ represents the number of training samples, $x_{ij}$ represents the output of the i-th sample in the j-th encoder, $c_j$ represents the j-th cluster center, and $y_i$ is the weight of the i-th sample, which is 1 for samples without interaction and -1 for samples with interaction; each loss is calculated independently.

[0030] Explanation: The purpose of the loss function during training is to continuously decrease the distance between non-interacting samples and cluster centers, and continuously increase the distance between interacting samples and cluster centers, thereby achieving clustering. The training processes of different encoders are independent of each other. (Here, distance refers to the deviation between the output feature vector and the cluster centers.)
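The formula of paragraph [0028] is not reproduced in this text. A minimal sketch of the signed clustering loss it describes might look as follows, assuming the "deviation" is a mean squared difference between the encoder output and the cluster center. The function names and the choice of deviation are hypothetical.

```python
from typing import Sequence


def deviation(x: Sequence[float], center: Sequence[float]) -> float:
    """Assumed deviation: mean squared difference between x_ij and c_j."""
    return sum((a - b) ** 2 for a, b in zip(x, center)) / len(center)


def cluster_loss(outputs: Sequence[Sequence[float]],
                 center: Sequence[float],
                 interacts: Sequence[bool]) -> float:
    """Training loss of one encoder j over n samples.

    y_i = 1 for samples without interaction (pulled toward the center),
    y_i = -1 for samples with interaction (pushed away from the center).
    """
    n = len(outputs)
    total = 0.0
    for x, has_interaction in zip(outputs, interacts):
        y = -1.0 if has_interaction else 1.0
        total += y * deviation(x, center)
    return total / n
```

Minimizing this value shrinks the deviation of non-interacting samples and grows the deviation of interacting ones, which matches the stated purpose of the loss. Each of the m encoders would use its own outputs and its own center.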

[0031] (3) Prediction process:

[0032] S7: The features of lncRNA and protein are concatenated and fed into m trained encoders to obtain m output vectors;

[0033] S8: Calculate the distance between each of the m output vectors and its corresponding cluster center to obtain m predicted distance values; sum these m predicted distance values to obtain the final predicted distance (i.e., the total predicted distance). For a lncRNA-protein sample, the formula for calculating the total distance is as follows:

[0034]

[0035] Where $x_{ij}$ represents the output feature of the i-th sample in the j-th encoder, and $c_j$ represents the j-th cluster center;

[0036] S9: Compare the total predicted distance with the cluster radius. If the total predicted distance is greater than the cluster radius, it is classified as having an interaction; if the total predicted distance is less than the cluster radius, it is classified as having no interaction.
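Steps S7 to S9 can be sketched as below. This is an illustration under stated assumptions: the distance is taken to be a mean squared difference (the formula of paragraph [0034] is not reproduced here), the default radius of 0.5 is the preferred value given in the text, and a total exactly equal to the radius is treated as "no interaction" because the text does not cover that case. All function names are hypothetical.

```python
from typing import Sequence


def distance(x: Sequence[float], center: Sequence[float]) -> float:
    """Assumed distance: mean squared difference between output and center."""
    return sum((a - b) ** 2 for a, b in zip(x, center)) / len(center)


def total_distance(outputs: Sequence[Sequence[float]],
                   centers: Sequence[Sequence[float]]) -> float:
    """Sum of the m per-encoder distances for one lncRNA-protein sample."""
    if len(outputs) != len(centers):
        raise ValueError("expected one output vector per cluster center")
    return sum(distance(x, c) for x, c in zip(outputs, centers))


def predict_interaction(outputs: Sequence[Sequence[float]],
                        centers: Sequence[Sequence[float]],
                        radius: float = 0.5) -> bool:
    """True (interaction) when the total distance exceeds the cluster radius."""
    return total_distance(outputs, centers) > radius
```

Here `outputs[j]` stands for the output of the j-th trained encoder for the sample and `centers[j]` for the j-th cluster center.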

[0037] Beneficial Effects: This invention introduces segmented k-mer encoding to obtain features with both global and local characteristics. Furthermore, we employ a method of joint prediction using multiple cluster centers, mapping different features of a sample to different feature spaces, and utilizing multiple feature spaces to jointly constrain the classification of lncRNA-protein pairs. Experimental results show that our method achieves better results on multiple datasets compared to other methods. Simultaneously, our model is better at uncovering potential lncRNA-protein interaction pairs, while achieving higher prediction accuracy and better prediction performance.

[0038] Preferably, the segmented k-mer encodings all include encodings of global and local sequences, so that the feature representations of lncRNAs and proteins have both local and global characteristics.

[0039] Preferably, the segmented k-mer encoding is as follows: a sequence is divided into m-1 equal segments, and a k-mer encoding is calculated for each segment; the segment encodings are named k-mer(segment 1), k-mer(segment 2)...k-mer(segment m-1) in order, and the global k-mer encoding is represented by k-mer(global).

[0040] Preferably, in step S1, the sequence information of lncRNA is obtained from the Gencode v29 database, the sequence information of protein is obtained from the UniProt database, and the interaction relationship between lncRNA and protein is obtained from the NPInter v3.0 database.

[0041] Preferably, the pre-training process in step (1) uses the Adam optimizer, and the batch size, learning rate, dropout parameters and number of pre-training generations are set as needed.

[0042] Preferably, the pre-training process in step (1) uses the Adam optimizer, sets the batch size to 64, the learning rate to 0.001, the dropout to 0.2, and the number of pre-training generations to 15.

[0043] Preferably, the training process in step (2) uses the Adam optimizer, sets the batch size, learning rate, dropout parameters and number of training generations as needed, and uses a learning rate reduction strategy.

[0044] Preferably, the training process in step (2) uses the Adam optimizer, sets the batch size to 64, the dropout to 0.2, sets the learning rate to 0.001, and uses a learning rate reduction strategy to make the training process of the model more stable. The number of training generations is set to 20, and the learning rate is reduced to 0.1 times the previous value at generations 15 and 18.
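The learning rate reduction strategy in the preferred setting can be written as a small step schedule. The sketch below assumes generations (epochs) are counted from 1 and that the reduction takes effect at the start of generations 15 and 18; the text does not state either detail, and the function name is hypothetical.

```python
from typing import Sequence


def learning_rate(epoch: int,
                  base_lr: float = 0.001,
                  milestones: Sequence[int] = (15, 18),
                  factor: float = 0.1) -> float:
    """Learning rate for a 1-based epoch under a step-decay schedule.

    Assumption: the rate is multiplied by `factor` once for every
    milestone that has been reached.
    """
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * factor ** drops


if __name__ == "__main__":
    for epoch in range(1, 21):
        print(epoch, learning_rate(epoch))
```

The source does not say which framework was used. In PyTorch, a comparable step schedule is available as `torch.optim.lr_scheduler.MultiStepLR` with `gamma=0.1`.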

[0045] Preferably, the autoencoder in step S3 includes an encoder part and a decoder part; the encoder part includes three layers of one-dimensional convolutional neural networks and three layers of fully connected layers, wherein the kernel sizes of the convolutional neural networks are 6, 6, and 5, respectively; the number of channels are 16, 64, and 16, respectively; and the number of neurons in the fully connected layers are 256, 128, and 64, respectively; the decoder part includes three layers of one-dimensional transposed convolutional networks and three layers of fully connected layers, wherein the kernel sizes of the three layers of one-dimensional transposed convolutional networks are 5, 6, and 6, respectively; the number of channels are 64, 16, and 1, respectively; and the number of neurons in the fully connected layers are 128, 256, and 2272, respectively.

[0046] Preferably, the clustering radius in step S9 is 0.5.

[0047] The advantages of this invention are:

[0048] 1. This invention proposes a cluster-based method for predicting lncRNA-protein interactions. Using complete negative sample data means that no crucial information will be mistakenly discarded. This also better reflects the actual sample distribution of LPIs data in real-world scenarios.

[0049] 2. This invention introduces segmented k-mer encoding to obtain features with both global and local characteristics. Furthermore, we employ a method of joint prediction using multiple cluster centers to map different features of a sample to different feature spaces, utilizing these multiple feature spaces to jointly constrain the classification of lncRNA-protein pairs. Experimental results show that, compared to other methods, our method achieves better results on multiple datasets with higher prediction accuracy. Simultaneously, the experimental results also demonstrate that our model is better at uncovering potential lncRNA-protein interaction pairs, making it less likely for researchers to miss them.

[0050] 3. The prediction method of this invention has strong adaptability to data. We expect that the prediction method of this invention will play a beneficial role in future tasks of predicting lncRNA-protein interactions, and we also hope that our method can be applied to more similar fields.

Description of the Drawings

[0051] Figure 1 is a schematic diagram of the segmented k-mer feature encoding method of the present invention;

[0052] Figure 2 is a schematic diagram of the autoencoder structure of the present invention;

[0053] Figure 3 is a network structure diagram of the encoder portion of the autoencoder of the present invention;

[0054] Figure 4 is a schematic diagram of the prediction process of the present invention;

[0055] Figure 5 is a graph showing the ratio of positive to negative predictions in the experiments of the present invention.

Detailed Implementation

[0056] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0057] Example 1:

[0058] Obtain a trained lncRNA-protein interaction prediction model:

[0059] (1) Pre-training process:

[0060] 1. First, we used the public RPI7317 dataset collected by Fan and Zhang [1], a dataset of human lncRNA-protein interactions. The interactions between lncRNAs and proteins are taken from the NPInter v3.0 database (the RPI7317 public dataset is derived from NPInter v3.0). The lncRNA sequence information comes from the Gencode v29 database, and the protein sequence information comes from the UniProt database. ([1] Fan XN, Zhang SW. LPI-BLS: Predicting lncRNA–protein interactions with a broad learning system-based stacked ensemble classifier. Neurocomputing. 2019 Dec 22;370:88-93.)

[0061] The information for the RPI7317 dataset is shown in the table below. The number of positive samples with interactions is 7317, and the number of negative samples without interactions is 213815.

[0062]
Dataset | lncRNAs | Proteins | Positive samples | Negative samples
RPI7317 | 1874 | 118 | 7317 | 213815

[0063] Table 1

[0064] 2. Our method uses the complete set of positive and negative samples. Furthermore, the dataset is divided into training, validation, and test sets in a 7:1:2 ratio, and 5-fold cross-validation is implemented during the experiment.

[0065] 3. Apply a novel segmented k-mer encoding to the lncRNA and protein sequences. As shown in Figure 1, segmented k-mer encoding calculates k-mer codes for both the complete sequence and the segmented sequences, and then concatenates the corresponding segmented k-mer codes for the lncRNA and protein. Segmented k-mer encoding allows the feature encoding to include both global and local features of the sequence. Using segmented k-mer encoding, m sets of features are ultimately obtained to represent a lncRNA-protein sample pair. In our method, m = 6.

[0066] Specifically:

[0067] RNA contains four bases: A, U, G, and C. Based on the dipole moment and side chain volume of the amino acids, the 20 amino acids that make up proteins are divided into seven groups, as shown in Table 2 below.

[0068]
Group | Amino acids
S1 | {A,G,V}
S2 | {I,L,F,P}
S3 | {Y,M,T,S}
S4 | {H,N,Q,W}
S5 | {R,K}
S6 | {D,E}
S7 | {C}

[0069] Table 2

[0070] Segmented k-mer encoding divides a sequence into m-1 equal segments and calculates the k-mer code for each segment. For lncRNAs, k=4 is set, meaning the frequencies of (AAAA, AAAU, AAAG...CCCC) are calculated separately, resulting in a vector of length 4^4 = 256. The segmented encoding results are named k-mer(segment 1), k-mer(segment 2),...k-mer(segment m-1) in sequence, with the global k-mer code represented by k-mer(global). For proteins, k=3 is set over the seven amino acid groups, resulting in a vector of length 7^3 = 343. Then, the features of the corresponding segments of lncRNAs and proteins are concatenated, resulting in a vector of length 256 + 343 = 599. A total of m features represent a lncRNA-protein pair, including m-1 local features and one global feature.
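The encoding described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes m-1 local segments plus one global vector (consistent with the stated m-1 local features and one global feature), normalizes k-mer counts to relative frequencies, and splits sequences at integer boundaries when the length is not divisible. All function and constant names are hypothetical.

```python
from itertools import product
from typing import Dict, List

RNA_ALPHABET = "AUGC"
GROUP_ALPHABET = "1234567"
# Seven amino-acid groups from Table 2 (S1..S7), keyed by a one-character label.
AA_GROUPS: Dict[str, str] = {
    "1": "AGV", "2": "ILFP", "3": "YMTS", "4": "HNQW",
    "5": "RK", "6": "DE", "7": "C",
}
AA_TO_GROUP = {aa: label for label, members in AA_GROUPS.items() for aa in members}


def kmer_frequencies(seq: str, alphabet: str, k: int) -> List[float]:
    """Relative frequency of every possible k-mer over `alphabet` in `seq`."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0.0] * len(index)
    valid = 0
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # windows containing unknown symbols are skipped
            counts[pos] += 1.0
            valid += 1
    return [c / valid for c in counts] if valid else counts


def split_equal(seq: str, parts: int) -> List[str]:
    """Split `seq` into `parts` near-equal contiguous segments (assumed rule)."""
    bounds = [i * len(seq) // parts for i in range(parts + 1)]
    return [seq[bounds[i]:bounds[i + 1]] for i in range(parts)]


def segmented_kmer(seq: str, alphabet: str, k: int, m: int) -> List[List[float]]:
    """Return m vectors: m-1 local (one per segment) followed by one global."""
    local = [kmer_frequencies(s, alphabet, k) for s in split_equal(seq, m - 1)]
    return local + [kmer_frequencies(seq, alphabet, k)]


def encode_pair(rna: str, protein: str, m: int = 6) -> List[List[float]]:
    """Concatenate matching lncRNA (k=4) and protein (k=3) vectors: 256 + 343 = 599."""
    reduced = "".join(AA_TO_GROUP.get(aa, "X") for aa in protein)
    rna_vecs = segmented_kmer(rna, RNA_ALPHABET, 4, m)
    protein_vecs = segmented_kmer(reduced, GROUP_ALPHABET, 3, m)
    return [r + p for r, p in zip(rna_vecs, protein_vecs)]
```

Calling `encode_pair(rna, protein)` with the default m = 6 yields six vectors of length 599, one per autoencoder.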

[0071] 4. Corresponding to the number of segmented coding features, the number of autoencoders is also set to m. The m sets of features are input into the m autoencoders, which are independent of each other. As shown in Figure 2, the autoencoder consists of two parts: an encoder and a decoder. The autoencoder first compresses the dimensions of the input features and then restores them to their original dimensions. This process is equivalent to transforming the original features into a feature space.

[0072] Specifically:

[0073] The pre-training process uses the Adam optimizer, with a batch size of 64, a learning rate of 0.001, dropout of 0.2, and 15 pre-training generations.

[0074] The structure of the encoder is shown in Figure 3. The encoder mainly consists of three 1D convolutional neural network layers and three fully connected layers. The kernel sizes of the convolutional neural networks are 6, 6, and 5, and the number of channels are 16, 64, and 16, respectively. The number of neurons in the fully connected layers are 256, 128, and 64, respectively. The decoder structure is similar to the encoder, containing three 1D transposed convolutional layers and three fully connected layers. The kernel sizes of the three 1D transposed convolutional layers are 5, 6, and 6, and the number of channels are 64, 16, and 1, respectively. The number of neurons in the fully connected layers are 128, 256, and 2272, respectively.

[0075] 5. Calculate the deviation between the input and output of the autoencoder and continuously train the model to gradually reduce the deviation. Finally, the pre-training is completed, and the pre-trained model is obtained. There are m autoencoders, and the loss of each autoencoder is calculated separately, using the following formula:

[0076]

[0077] Where $x_i$ denotes the encoder input, $x_o$ denotes the decoder output, $d$ denotes the dimension of $x_i$, and $n$ is the number of samples.

[0078] (2) The training process steps are as follows:

[0079] 1. Calculate cluster centers using a pre-trained autoencoder model.

[0080] Specifically, the m features of each sample in the training set are input into m autoencoders. Assuming there are n samples in the training set, each autoencoder will produce n output vectors. The average of these vectors is used as the cluster center, resulting in a total of m cluster centers.
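The cluster-center computation is an element-wise mean of the n output vectors of each autoencoder. A minimal sketch follows; the function names are hypothetical, and which output of the pre-trained model is averaged is an assumption (presumably the encoder's representation, since later distances are measured on encoder outputs).

```python
from typing import List, Sequence


def cluster_center(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the n output vectors of one autoencoder."""
    n = len(vectors)
    if n == 0:
        raise ValueError("at least one output vector is required")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def cluster_centers(outputs_per_autoencoder: Sequence[Sequence[Sequence[float]]]
                    ) -> List[List[float]]:
    """One center per autoencoder, giving m centers in total."""
    return [cluster_center(vectors) for vectors in outputs_per_autoencoder]
```

Here `outputs_per_autoencoder[j]` stands for the n vectors produced by the j-th pre-trained model on the training set.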

[0081] 2. Discard the decoder part of the pre-trained autoencoder model and keep only the encoder part.

[0082] Specifically, we reinitialize m encoder models, transfer the parameters of the encoder parts of the m autoencoders to the initialized encoder models, and then these encoders continue to be trained based on the pre-trained parameter weights. At the same time, the structure of these encoders is consistent with the encoder structure in the pre-training process.

[0083] 3. A novel method for joint prediction using multiple cluster centers is applied. Each sample's m sets of features are input into m encoders, resulting in m output vectors. The encoders transform the dimensionality of the features, mapping them to different feature spaces. We utilize multiple encoders to map different features of a sample to different feature spaces, and then use multiple cluster centers for joint prediction of a single sample.

[0084] Specifically: The training process uses the Adam optimizer, with a batch size of 64 and dropout set to 0.2. The learning rate is set to 0.001, and a learning rate reduction strategy is used to make the model training process more stable. The number of training generations is set to 20, and the learning rate is reduced to 0.1 times the previous value at generations 15 and 18.

[0085] 4. Calculate the loss using each of the m output vectors and their corresponding m cluster centers. The formula for calculating the training loss of the j-th encoder among the m encoders is as follows:

[0086]

[0087] Where $n$ represents the number of training samples, $x_{ij}$ represents the output of the i-th sample in the j-th encoder, $c_j$ represents the j-th cluster center, and $y_i$ is the weight of the i-th sample, which is 1 for samples without interaction and -1 for samples with interaction. The calculation of each loss is independent.

[0088] 5. The purpose of the loss function during training is to continuously decrease the distance between non-interacting samples and cluster centers, and continuously increase the distance between interacting samples and cluster centers, thereby achieving clustering. The training processes of different encoders are independent of each other.

[0089] (3) Prediction process steps:

[0090] 1. The prediction process is shown in Figure 4. The previously calculated segmented k-mer features of the sample are concatenated and then fed into the m trained encoders to obtain m output vectors.

[0091] 2. Calculate the distance between each of the m output features and the m cluster centers to obtain m predicted distance values.

[0092] 3. Sum these m distances to obtain the final predicted distance. For a single lncRNA-protein sample, the formula for calculating the total distance is as follows:

[0093]

[0094] Where $x_{ij}$ represents the output feature of the i-th sample in the j-th encoder, and $c_j$ represents the j-th cluster center.

[0095] 4. Compare the total predicted distance with the cluster radius. If the total predicted distance is greater than the cluster radius, the sample is classified as having an interaction; if it is less than the cluster radius, the sample is classified as having no interaction. The cluster radius is set to 0.5.

[0096] The performance metrics of the model are evaluated using accuracy (ACC), sensitivity (SN), specificity (SP), Matthews correlation coefficient (MCC), and F1 score (F1), as shown in the following formulas:

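[0097] $ACC = \frac{TP + TN}{TP + TN + FP + FN}$

[0098] $SN = \frac{TP}{TP + FN}$

[0099] $SP = \frac{TN}{TN + FP}$

[0100] $MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

[0101] $F1 = \frac{2 \times TP}{2 \times TP + FP + FN}$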

[0102] TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.

[0103] (Note: Each sample has a label. After inputting the sample into the trained model, a classification is obtained. When both the label and the predicted value are positive, the classification result is a true positive (TP); when both the label and the predicted value are negative, the classification result is a true negative (TN); when the label is positive and the predicted value is negative, the classification result is a false negative (FN); when the label is negative and the predicted value is positive, the classification result is a false positive (FP).)
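The five metrics can be computed directly from the confusion-matrix counts using their standard definitions. The sketch below is illustrative; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """ACC, SN, SP, MCC and F1 from true/false positive/negative counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "SN": ratio(tp, tp + fn),
        "SP": ratio(tn, tn + fp),
        "MCC": ratio(tp * tn - fp * fn, mcc_denominator),
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
    }
```

On a heavily imbalanced set such as RPI7317, ACC alone can look high even when few interactions are found, which is why SN, MCC, and F1 are reported alongside it.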

[0104] To verify the effectiveness of our invention in predicting lncRNA-protein interactions, we compared our method with several other existing methods on the public dataset RPI7317, including IPMiner, PLRPIM, and LGFC-CNN. We used 5-fold cross-validation, and the average of the 5-fold cross-validation results was taken as the final result (see Table 3 below). Both our method and the comparison methods used the complete set of positive and negative samples, i.e., an imbalanced set of positive and negative samples.

[0105]

[0106] Table 3

[0107] Experimental results show that our method outperforms several comparative methods in ACC, SN, MCC, AUC, and F1 scores when dealing with imbalanced positive and negative samples, indicating that our method is more effective in predicting lncRNA-protein interactions. Furthermore, we compared the method using six cluster centers with the method using a single cluster center; the results show that the model's performance metrics are superior when using multiple cluster centers. This also demonstrates the effectiveness of our multi-cluster center method.

[0108] In addition, we compared the confusion matrix results of our method with those of several other methods. We compared the sum of TP and FP with the sum of TN and FN, that is, the proportion of predicted interactions to predicted non-interactions. As shown in Figure 5, while improving prediction accuracy, our model has a higher proportion of positive samples in its prediction results. Experimental results show that, while maintaining overall accuracy, our model is more inclined to classify samples as having interactions. In other words, our model is better at discovering potential lncRNA-protein interaction pairs, making it less likely for researchers to miss possible lncRNA-protein interactions.

[0109] Our method employs a joint prediction approach using multiple cluster centers. For a sample to be classified as negative, each of its m features must lie close to its corresponding cluster center at the same time, so that the summed distance stays within the cluster radius. Compared to the constraint of a single cluster center, applying multiple cluster centers significantly tightens the condition for classifying a sample as negative. Samples with a clearly defined classification are typically either very close to or very far from the cluster centers, so summing the predicted distances across multiple feature spaces does not change their classification. However, for samples with ambiguous classifications, whose predicted distances are close to the cluster radius, the sum of the predicted distances to multiple cluster centers increases significantly, making these samples more likely to be classified as positive samples with interactions. While ensuring classification accuracy, classifying more samples as positive aligns with the needs of lncRNA-protein interaction prediction, making our model better at uncovering potential lncRNA-protein interaction pairs.
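The effect described in this paragraph can be illustrated with made-up numbers. The per-space distances below are hypothetical, not experimental values, and the sketch assumes the same radius of 0.5 is used for both the single-center and the six-center variant.

```python
from typing import Sequence

RADIUS = 0.5  # cluster radius from the text


def classify(distances: Sequence[float], radius: float = RADIUS) -> str:
    """Label a sample from the sum of its per-space distances."""
    return "interaction" if sum(distances) > radius else "no interaction"


# Hypothetical per-space distances for three kinds of samples (m = 6).
samples = {
    "clearly negative": [0.01] * 6,
    "ambiguous": [0.1] * 6,
    "clearly positive": [2.0] * 6,
}

for name, dists in samples.items():
    single = classify(dists[:1])  # one cluster center only
    joint = classify(dists)       # six cluster centers summed
    print(f"{name}: single center -> {single}; six centers -> {joint}")
```

In this toy setting only the ambiguous sample changes label when the six distances are summed, while the clearly negative and clearly positive samples keep theirs.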

[0110] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A cluster-based method for predicting lncRNA-protein interactions, characterized in that it includes the following steps: obtaining the dataset of lncRNAs and proteins to be predicted; using segmented k-mer encoding to obtain the features of the lncRNAs and proteins; and inputting the features of the lncRNAs and proteins into the trained lncRNA-protein interaction prediction model to obtain the interaction prediction results. The trained lncRNA-protein interaction prediction model is obtained as follows:
(1) Pre-training process:
S1: Obtain a dataset containing lncRNA sequences, protein sequences, and the interaction relationships between the lncRNA and protein sequences; the lncRNA sequence information in S1 is obtained from the Gencode v29 database, the protein sequence information is obtained from the UniProt database, and the interaction relationships between lncRNAs and proteins are obtained from the NPInter v3.0 database;
S2: Apply segmented k-mer encoding to the lncRNA sequences and protein sequences to obtain m sets of features;
S3: Input the m sets of features into m autoencoders respectively, calculate the deviation between the input and the output of each autoencoder as the loss of the pre-training process, and train continuously to gradually reduce the deviation; when pre-training ends, the pre-trained autoencoder model is obtained. The m autoencoders are independent of each other, and the loss of each autoencoder is calculated separately according to a formula in which $x_i$ denotes the encoder input, $x_o$ denotes the decoder output, $d$ denotes the dimension of $x_i$, and $n$ is the number of samples;
(2) Training process:
S4: Calculate the cluster centers using the pre-trained autoencoder model from step S3: the m sets of features of each sample in the dataset are input into the m autoencoders respectively. If there are n samples in the dataset, each autoencoder produces n output vectors. The average of these vectors is used as the cluster center, resulting in a total of m cluster centers;
S5: Discard the decoder part of the pre-trained autoencoder model and keep only the encoder part;
S6: Input the m sets of features of each sample into the m encoders to obtain m output vectors; calculate the deviation values between the m output vectors and the corresponding m cluster centers, and use them as the training loss. The training loss of the j-th encoder among the m encoders is calculated according to a formula in which $n$ represents the number of training samples, $x_{ij}$ represents the output of the i-th sample in the j-th autoencoder, $c_j$ represents the j-th cluster center, and $y_i$ is the weight of the i-th sample, equal to 1 for samples without interaction and -1 for samples with interaction; each loss is calculated independently;
(3) Prediction process:
S7: The features of the lncRNA and the protein are concatenated and fed into the m trained encoders to obtain m output vectors;
S8: Calculate the distances between the m output vectors and the m cluster centers to obtain m predicted distance values, and sum these m predicted distance values to obtain the total predicted distance; for a lncRNA-protein sample, the total distance is calculated according to a formula in which $x_{ij}$ represents the output feature of the i-th sample in the j-th encoder and $c_j$ represents the j-th cluster center;
S9: Compare the total predicted distance with the cluster radius. If the total predicted distance is greater than the cluster radius, the sample is classified as having an interaction; if the total predicted distance is less than the cluster radius, it is classified as having no interaction.
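Steps S4 and S6 can be sketched as follows. The cluster center is the mean of the output vectors, as stated in the claim. The form of the deviation is not reproduced in this text, so the sketch assumes a squared Euclidean distance averaged over the n samples and weighted by the sample weight (1 for non-interacting, -1 for interacting). The function names and the toy vectors are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def cluster_center(outputs: Sequence[Vector]) -> List[float]:
    """Step S4: average the n output vectors of one autoencoder."""
    n = len(outputs)
    if n == 0:
        raise ValueError("no output vectors given")
    dim = len(outputs[0])
    return [sum(vec[k] for vec in outputs) / n for k in range(dim)]


def squared_distance(x: Vector, c: Vector) -> float:
    """Squared Euclidean distance (assumed deviation measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def encoder_loss(outputs: Sequence[Vector], center: Vector,
                 weights: Sequence[int]) -> float:
    """Step S6 (assumed form): weighted mean deviation from the cluster center.

    weights[i] is 1 for a sample without interaction and -1 for a sample
    with interaction, so minimizing the loss pulls non-interacting samples
    toward the center and pushes interacting samples away from it.
    """
    n = len(outputs)
    return sum(y * squared_distance(x, center)
               for x, y in zip(outputs, weights)) / n


# Toy output vectors of one encoder for n = 3 samples (hypothetical values).
outputs = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
weights = [1, 1, -1]  # two non-interacting samples, one interacting sample

center = cluster_center(outputs)
loss = encoder_loss(outputs, center, weights)
print(center, loss)
```

In the claimed method this calculation would be repeated independently for each of the m encoders, each with its own cluster center.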

2. The prediction method according to claim 1, characterized in that the segmented k-mer encoding includes encodings of both the global sequence and local sequence segments, so that the feature representations of lncRNAs and proteins possess both local and global characteristics.

3. The prediction method according to claim 1, characterized in that the segmented k-mer encoding is as follows: a sequence is divided into m equal segments, and a k-mer code is calculated for each segment; the segment codes are named k-mer(segment 1), k-mer(segment 2), ..., k-mer(segment m-1) in order, and the global k-mer code is denoted k-mer(global).
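The segmented k-mer encoding can be sketched as below. The sketch makes several assumptions that the claim does not state: k-mer codes are normalized frequencies, the alphabet is ACGU, the value of k is chosen freely, and any remainder of an uneven split is added to the last segment. The function names and the `num_segments` parameter are hypothetical; the result contains one vector per segment plus one global vector.

```python
from itertools import product
from typing import Dict, List


def kmer_frequencies(seq: str, k: int, alphabet: str) -> List[float]:
    """Normalized frequency of every possible k-mer over the given alphabet."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # skip k-mers containing unknown symbols
            counts[pos] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(kmers)


def segmented_kmer(seq: str, num_segments: int, k: int,
                   alphabet: str = "ACGU") -> Dict[str, List[float]]:
    """k-mer code for each segment plus a code for the whole sequence."""
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    seg_len = len(seq) // num_segments
    features: Dict[str, List[float]] = {}
    for s in range(num_segments):
        start = s * seg_len
        # Assumption: the last segment absorbs any remainder.
        end = (s + 1) * seg_len if s < num_segments - 1 else len(seq)
        features[f"k-mer(segment {s + 1})"] = kmer_frequencies(seq[start:end], k, alphabet)
    features["k-mer(global)"] = kmer_frequencies(seq, k, alphabet)
    return features


# Toy RNA sequence, two segments, k = 2 (hypothetical values).
features = segmented_kmer("ACGUACGU", num_segments=2, k=2)
for name, vector in features.items():
    print(name, [round(v, 3) for v in vector])
```

For proteins, the same functions could be called with an amino-acid alphabet; the segment vectors capture local composition and the global vector captures the whole sequence.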

4. The prediction method according to claim 1, characterized in that the pre-training process in step (1) uses the Adam optimizer, and the batch size, learning rate, dropout rate, and number of pre-training epochs are set as needed.

5. The prediction method according to claim 4, characterized in that the pre-training process in step (1) uses the Adam optimizer, with the batch size set to 64, the learning rate to 0.001, the dropout rate to 0.2, and the number of pre-training epochs to 15.

6. The prediction method according to claim 1, characterized in that the training process in step (2) uses the Adam optimizer, the batch size, learning rate, dropout rate, and number of training epochs are set as needed, and a learning rate reduction strategy is used.

7. The prediction method according to claim 6, characterized in that the training process in step (2) uses the Adam optimizer, with the batch size set to 64, the dropout rate to 0.2, and the learning rate to 0.001. A learning rate reduction strategy is used to make the training process of the model more stable. The number of training epochs is set to 20. At the 15th and 18th epochs, the learning rate is reduced to 0.1 times its previous value.
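The learning rate reduction strategy of claim 7 can be sketched as a simple step schedule. The sketch assumes that training passes are numbered from 1 and that each reduction takes effect at the start of the 15th and 18th passes; the function name is hypothetical.

```python
from typing import List, Sequence


def learning_rate(epoch: int, base_lr: float = 0.001,
                  milestones: Sequence[int] = (15, 18),
                  gamma: float = 0.1) -> float:
    """Learning rate for a 1-indexed training pass under a step schedule."""
    if epoch < 1:
        raise ValueError("epoch numbering starts at 1")
    # Each milestone already reached multiplies the rate by gamma once.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Schedule for the 20 training passes described in the claim.
schedule: List[float] = [learning_rate(e) for e in range(1, 21)]
for e, lr in enumerate(schedule, start=1):
    print(f"pass {e:2d}: lr = {lr:.0e}")
```

Under these assumptions the rate stays at 0.001 for passes 1 to 14, drops to 0.0001 for passes 15 to 17, and to 0.00001 for passes 18 to 20.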

8. The prediction method according to claim 1, characterized in that the autoencoder in step S3 includes an encoder part and a decoder part. The encoder part includes three one-dimensional convolutional layers and three fully connected layers. The kernel sizes of the convolutional layers are 6, 6, and 5, respectively; the numbers of channels are 16, 64, and 16, respectively; and the numbers of neurons in the fully connected layers are 256, 128, and 64, respectively. The decoder part includes three one-dimensional transposed convolutional layers and three fully connected layers. The kernel sizes of the three one-dimensional transposed convolutional layers are 5, 6, and 6, respectively; the numbers of channels are 64, 16, and 1, respectively; and the numbers of neurons in the fully connected layers are 128, 256, and 2272, respectively.
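To make the encoder dimensions of claim 8 concrete, the sketch below traces the tensor shapes through the three convolutional layers and the three fully connected layers. The claim does not state the stride, padding, or input length, so the sketch assumes stride 1, no padding, and a single-channel input of length 2272 (inferred from the size of the last fully connected layer of the decoder). The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def conv1d_out_len(length: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a 1-D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def trace_encoder(input_len: int = 2272,
                  kernels: Sequence[int] = (6, 6, 5),
                  channels: Sequence[int] = (16, 64, 16),
                  fc_sizes: Sequence[int] = (256, 128, 64)
                  ) -> Tuple[List[Tuple[str, int, int]], int, List[int]]:
    """Return (conv shapes as (name, channels, length), flattened size, FC sizes)."""
    shapes: List[Tuple[str, int, int]] = [("input", 1, input_len)]
    length = input_len
    for i, (k, c) in enumerate(zip(kernels, channels), start=1):
        length = conv1d_out_len(length, k)  # assumed stride 1, no padding
        shapes.append((f"conv{i}", c, length))
    flattened = channels[-1] * length  # size fed to the first FC layer
    return shapes, flattened, list(fc_sizes)


shapes, flattened, fc_sizes = trace_encoder()
for name, ch, length in shapes:
    print(f"{name}: channels={ch}, length={length}")
print(f"flattened={flattened}, fully connected={fc_sizes}")
```

With different stride or padding choices the intermediate lengths would change, but the final encoder output would still be the 64-dimensional vector produced by the last fully connected layer.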

9. The prediction method according to claim 1, characterized in that, in step S9, the cluster radius is 0.5.