Protein virtual screening method, apparatus, device, and storage medium

By using protein sequence-based Transformer and BiLSTM network models, the problems of high computational cost and uncertainty in existing virtual screening methods are solved, enabling efficient and accurate screening of biological targets lacking three-dimensional structural information, reducing computational costs and improving the accuracy of simulating flexible proteins.

CN115620816B · Active · Publication Date: 2026-06-12 · HUIYI KEJI (SHANGHAI) LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUIYI KEJI (SHANGHAI) LTD
Filing Date
2022-09-28
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing virtual screening methods based on protein crystal structures are computationally expensive and have high uncertainty, and their application to biological targets lacking three-dimensional structural information is limited.

Method used

A protein sequence-based approach was adopted, utilizing the Transformer model and BiLSTM network model. Through unsupervised pre-training and multidimensional symmetric matrix coupling, the classification and regression values of protein-small molecule interactions were fitted to establish a virtual protein screening model.

Benefits of technology

It significantly improves the accuracy and speed of virtual screening, expands the application of biological targets lacking three-dimensional structural information, reduces computational costs, and enhances the accuracy of simulating flexible proteins.



Abstract

The application provides a protein virtual screening method, device and equipment and a storage medium, and belongs to the field of drug discovery. The method comprises the following steps: obtaining a training sample set, wherein the training sample set comprises source data and sample data corresponding to the source data; performing unsupervised pre-training on a Transformer model by taking the source data as input and the sample data as verification, and generating one-dimensional or multi-dimensional symmetric matrices for protein sequences and ligand sequences respectively; coupling the two matrices of the protein sequences and the ligand sequences into a multi-dimensional symmetric matrix, and taking the multi-dimensional symmetric matrix as the input of a hidden layer of a BiLSTM network model; fitting experimental measurement classification and regression values of the interaction between proteins and small molecules by using the BiLSTM network model to obtain a trained screening model; and performing prediction on different protein prediction tasks by using the screening model and outputting a prediction result. According to the processing scheme, the interaction relationship between proteins and small molecule drugs can be efficiently, quickly and accurately predicted.

Description

Technical Field

[0001] This invention relates to the field of drug discovery, and more specifically to a method, apparatus, device, and storage medium for virtual screening of proteins.

Background Technology

[0002] Drug discovery has long been a time-consuming and costly process. With the development of computer technology, computational methods have been widely applied in drug development, and virtual drug screening is one of the most valuable technologies in this field. Among them, artificial intelligence methods for virtual screening based on protein crystal structures use the three-dimensional structural information of protein targets to predict the binding of proteins to small drug molecules.

[0003] Common virtual screening methods based on protein crystal structures often employ molecular docking techniques (see [M. Drwal and R. Griffith. Combination of ligand- and structure-based methods in virtual screening], [S. Ghosh, A. Nie, J. An, and Z. Huang. Structure-based virtual screening of chemical libraries for drug discovery.], [J. Gilmer, S. Schoenholz, P. Riley, O. Vinyals, and G. Dahl. Neural message passing for quantum chemistry.]). These methods use molecular dynamics and quantum chemistry to simulate the binding of small molecules to proteins. This involves searching for binding sites and computing the spatial position of the small molecule with heuristic functions, which consumes significant computational resources and, because multiple outcomes are possible, introduces a degree of uncertainty. Furthermore, molecular docking-based methods rely on the three-dimensional structural information of proteins, which is unknown for most biological proteins, limiting their widespread use in the pharmaceutical industry.

[0004] Therefore, it is necessary to establish a set of protein sequence-based artificial intelligence technologies for virtual screening that go beyond crystal structure-based methods, in order to expand the scope of artificial intelligence training data and improve the accuracy and speed of virtual screening.

Summary of the Invention

[0005] Therefore, in order to overcome the shortcomings of the prior art, the present invention provides a protein virtual screening method, apparatus, device and storage medium for predicting the interaction between proteins and small drug molecules efficiently, rapidly and accurately based on protein sequences.

[0006] To achieve the above objectives, this invention provides a virtual protein screening method, comprising: acquiring a training sample set, wherein the training sample set includes source data and sample data corresponding to the source data, the source data containing a protein sequence and a ligand sequence of a small molecule compound that can bind to the protein, and the sample data containing subsequences of the binding sites of the protein and the small molecule compound; performing unsupervised pre-training on a Transformer model using the source data as input and the sample data as validation, generating one-dimensional or multi-dimensional symmetric matrices for the protein sequence and the ligand sequence respectively; coupling the two matrices of the protein sequence and the ligand sequence into a multi-dimensional symmetric matrix, and using the multi-dimensional symmetric matrix as input to the hidden layer of a BiLSTM network model; fitting experimental measurement classification and regression values of protein-small molecule interactions using the BiLSTM network model to obtain a trained screening model; and using the screening model to perform predictions for different protein prediction tasks and outputting the prediction results.

[0007] In one embodiment, obtaining the training sample set includes: executing a masking strategy based on a BERT-style masking language model to randomly mask the binding sites of protein targets, obtaining discontinuous protein sequence representations, and using the discontinuous protein sequence representations as sample data; storing the source data and the sample data in correspondence to obtain the training sample set.

[0008] In one embodiment, the step of unsupervised pre-training of the Transformer model using the source data as input and the sample data as validation, and generating one-dimensional or multi-dimensional symmetric matrices for the protein sequence and the ligand sequence respectively, includes: inputting the training sample sets of the protein sequence and the ligand sequence into the input embedding layer of the Transformer model respectively; performing embedding processing on the training sample sets through the input embedding layer of the Transformer model; inputting the embedded training sample sets into the intermediate hidden layer of the Transformer model; learning the feature representation of the embedded training sample sets through the intermediate hidden layer of the Transformer model; and outputting the learned feature representation, which is a one-dimensional or multi-dimensional symmetric matrix, through the output prediction layer of the Transformer model.

[0009] In one embodiment, coupling the two matrices of the protein sequence and the ligand sequence into a multidimensional symmetric matrix includes: coupling the two matrices of the protein sequence and the ligand sequence using a QSAR model to obtain a multidimensional symmetric matrix.

[0010] In one embodiment, the step of using a BiLSTM network model to fit experimental measurement classification and regression values of protein-small molecule interactions to obtain a trained screening model includes: obtaining the protein sequence $X = (x_1, \ldots, x_L)$, $x_i \in \mathcal{X}$, $i = 1, \ldots, L$, where $\mathcal{X}$ represents all amino acids; $x_i$ represents a point mutation at position $i$, and the sequence context of the mutation is $x_{[L]\setminus\{i\}} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_L)$; encoding the sequence context using the vector $z_i = f_e(x_{[L]\setminus\{i\}})$, where $f_e$ is an embedding function that maps discrete sequences to a $D$-dimensional continuous space; the embedding function is instantiated by a bidirectional LSTM neural network, and the outputs of the final LSTM layers are concatenated to form the embedding vector, giving $z_i = [\mathrm{LSTM}_f(g_f(x_1, \ldots, x_{i-1})), \mathrm{LSTM}_r(g_r(x_{i+1}, \ldots, x_L))]$, where $g_f$ is the output of the first several layers, which serves as the input to $\mathrm{LSTM}_f$, the last layer of the forward LSTM; $g_r$ is defined similarly to $g_f$ but in the opposite direction, and $\mathrm{LSTM}_r$ is defined similarly to $\mathrm{LSTM}_f$ but in the opposite direction; and fitting the experimentally measured classification and regression values of protein-small molecule interactions from the embedding vector $z_i$ through a learned transformation and a softmax function, to obtain a trained screening model. The softmax function is $p(x_i \mid x_{[L]\setminus\{i\}}) = p(x_i \mid z_i) = \mathrm{softmax}(W z_i + b)$, where $W$ and $b$ are learnable parameters.

[0011] A virtual protein screening device includes: a sample set processing module for acquiring a training sample set, the training sample set including source data and sample data corresponding to the source data, the source data including a protein sequence and a ligand sequence of a small molecule compound that can bind to the protein, and the sample data including sub-sequences of the binding sites of the protein and the small molecule compound; an unsupervised pre-training module for performing unsupervised pre-training on a Transformer model using the source data as input and the sample data as validation, generating one-dimensional or multi-dimensional symmetric matrices for the protein sequence and the ligand sequence, respectively; an input setting module for coupling the two matrices of the protein sequence and the ligand sequence into a multi-dimensional symmetric matrix, and using the multi-dimensional symmetric matrix as input to the hidden layer of a BiLSTM network model; a model training module for fitting experimental measurement classification and regression values of protein-small molecule interactions using the BiLSTM network model to obtain a trained screening model; and a prediction module for performing predictions on different protein prediction tasks using the screening model and outputting prediction results.

[0012] A computer device includes a memory and a processor, the memory storing a computer program, characterized in that the processor executes the computer program to implement the steps of the above-described method.

[0013] A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the above-described method.

[0014] Compared with existing technologies, the advantages of this invention are as follows. It predicts the interaction between proteins and small molecules directly from protein sequence information, thereby greatly expanding the application of virtual screening technology in drug design for different biological targets, especially those lacking three-dimensional structural information. Furthermore, by utilizing big data and deep learning methods, it reduces computational cost by a factor of more than 10,000 compared to molecular dynamics, quantum mechanics, and quantum chemistry, significantly improving computational speed. In addition, the simulation method using elastic graph neural networks can better adapt to simulating highly flexible proteins, thereby reducing the flexibility gap between the simulated protein conformation and the conformation of real physiological proteins, thus reducing simulation errors and significantly improving the accuracy of protein-small molecule binding prediction. This application also establishes a standard virtual screening method based on protein sequences, laying the foundation for the future development of such methods.

Attached Figure Description

[0015] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a schematic flowchart of the protein virtual screening method in an embodiment of the present invention;

[0017] Figure 2 is a flowchart illustrating the protein virtual screening step in an embodiment of the present invention;

[0018] Figure 3 is a structural block diagram of the protein virtual screening device in an embodiment of the present invention;

[0019] Figure 4 is an internal structural diagram of a computer device in an embodiment of the present invention.

Detailed Implementation

[0020] The embodiments of this disclosure will now be described in detail with reference to the accompanying drawings.

[0021] The following specific examples illustrate the implementation of this disclosure. Those skilled in the art can easily understand other advantages and effects of this disclosure from the content disclosed in this specification. Obviously, the described embodiments are only a part of the embodiments of this disclosure, and not all of them. This disclosure can also be implemented or applied through other different specific embodiments, and the details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of this disclosure. It should be noted that, in the absence of conflict, the following embodiments and features in the embodiments can be combined with each other. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0022] It should be noted that various aspects of embodiments within the scope of the appended claims are described below. It will be apparent that the aspects described herein can be embodied in a wide variety of forms, and any particular structure and / or function described herein is merely illustrative. Based on this disclosure, those skilled in the art will understand that one aspect described herein can be implemented independently of any other aspect, and two or more of these aspects can be combined in various ways. For example, any number of aspects set forth herein can be used to implement the device and / or practice the method. Additionally, this device and / or method can be implemented using other structures and / or functionalities besides one or more of the aspects set forth herein.

[0023] It should also be noted that the illustrations provided in the following embodiments are only schematic representations of the basic concept of this disclosure. The drawings only show the components related to this disclosure and are not drawn according to the number, shape and size of the components in actual implementation. In actual implementation, the form, quantity and proportion of each component can be arbitrarily changed, and the layout of the components may also be more complex.

[0024] Furthermore, specific details are provided in the following description to facilitate a thorough understanding of the examples. However, those skilled in the art will understand that the described aspects can be practiced without these specific details.

[0025] As shown in Figure 1, this disclosure provides a virtual protein screening method that can be applied to a terminal or a server. The terminal can be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable smart devices. The server can be a standalone server or a server cluster consisting of multiple servers. The method includes the following steps:

[0026] Step 101: Obtain the training sample set. The training sample set includes source data and sample data corresponding to the source data. The source data includes protein sequences and ligand sequences of small molecule compounds that can bind to the protein. The sample data includes subsequences of the binding sites of the protein and the small molecule compounds.

[0027] The server can retrieve the training sample set directly from a database, or perform secondary processing on the training sample set. The small molecule compounds are those having experimental enzyme-activity records for proteins in the Protein Data Bank (PDB).

[0028] Step 102: Using the source data as input and the sample data as validation, perform unsupervised pre-training on the Transformer model to generate one-dimensional or multi-dimensional symmetric matrices for the protein sequence and ligand sequence, respectively.

[0029] The server takes source data as input and sample data as validation to perform unsupervised pre-training on the Transformer model, generating one-dimensional or multi-dimensional symmetric matrices for protein and ligand sequences, respectively. Based on the Transformer model architecture, the server can utilize massive amounts of unlabeled source data and apply a BERT-style Masked Language Model for unsupervised pre-training, ultimately obtaining a protein pre-trained model with powerful representation capabilities. The final output of this model is a one-dimensional or multi-dimensional symmetric matrix. The Transformer model consists of sequentially connected input embedding layers, intermediate hidden layers, and output prediction layers. The intermediate hidden layers consist of N Transformer modules, each of which includes, connected in sequence, a multi-head attention layer, a first Dropout layer, a first Add&Norm layer, a feed-forward layer, a second Dropout layer, and a second Add&Norm layer. Here, Add stands for residual connection, used to prevent network degradation; Norm stands for layer normalization, used to normalize the activation values of each layer. Each Transformer module thus contains two sub-layers, a multi-head attention layer and a feed-forward layer, and each sub-layer is followed by Dropout, residual connection (Add), and layer normalization (Norm) operations to learn the feature representation of the binding sites of the masked protein and small molecule sequences. After passing through all Transformer modules, the Transformer model has learned a high-dimensional feature representation of the masked binding site sequence. Finally, the learned feature representation is input into the output prediction layer to predict the binding site at the masked location.
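The layer order described for one Transformer module can be written compactly as follows. This is a sketch under assumptions, not the patent's own notation: $H$ (the module input), $A$ (the output of the first Add&Norm layer), and $H'$ (the module output) are symbols introduced here for illustration.

```latex
\begin{aligned}
A  &= \mathrm{LayerNorm}\bigl(H + \mathrm{Dropout}(\mathrm{MultiHead}(H))\bigr) \\
H' &= \mathrm{LayerNorm}\bigl(A + \mathrm{Dropout}(\mathrm{FFN}(A))\bigr)
\end{aligned}
```

The $H + \cdot$ and $A + \cdot$ terms are the residual connections (Add), and stacking $N$ such modules gives the intermediate hidden layers.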

[0030] Step 103: Couple the two matrices of protein sequence and ligand sequence into a multidimensional symmetric matrix, and use the multidimensional symmetric matrix as the input of the hidden layer of the BiLSTM network model.

[0031] The server couples the protein sequence and ligand sequence matrices into a multidimensional symmetric matrix, which is then used as the input to the hidden layer of the BiLSTM network model. The server feeds the same input sequence into two LSTMs, one forward and one backward, and then connects the hidden layers of the two networks together, feeding them into the output layer for prediction.
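The forward/backward arrangement can be sketched in plain Python as below. This is a minimal illustration, not the patented implementation: the `LSTMCell` class, its random weights, and the choice to treat each row of the coupled matrix as one position's feature vector are all assumptions made here.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

class LSTMCell:
    """A single LSTM cell with randomly initialised weights (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(hidden_size)] for g in "ifog"}
        self.b = {g: [0.0] * hidden_size for g in "ifog"}

    def _gate(self, g, xh, act):
        return [act(sum(w * v for w, v in zip(row, xh)) + bias)
                for row, bias in zip(self.W[g], self.b[g])]

    def step(self, x, h, c):
        xh = x + h  # concatenate the input with the previous hidden state
        i = self._gate("i", xh, sigmoid)
        f = self._gate("f", xh, sigmoid)
        o = self._gate("o", xh, sigmoid)
        g = self._gate("g", xh, math.tanh)
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

def run_lstm(cell, inputs):
    """Run one LSTM over the inputs and return the hidden state at each step."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    outputs = []
    for x in inputs:
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs

def bilstm(inputs, forward_cell, backward_cell):
    """Feed the same sequence to a forward and a backward LSTM and concatenate."""
    fwd = run_lstm(forward_cell, inputs)
    bwd = run_lstm(backward_cell, inputs[::-1])[::-1]  # re-align to input order
    return [f + b for f, b in zip(fwd, bwd)]
```

Each output vector has twice the hidden size, because the forward and backward hidden states for the same position are joined before being passed to the output layer.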

[0032] Step 104: Use a BiLSTM network model to fit the experimental measurement classification and regression values of protein-small molecule interactions to obtain a trained screening model.

[0033] The server uses a BiLSTM network model to fit the experimental measurement classification and regression values of protein-small molecule interactions to obtain a trained screening model.
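The text does not state which loss is used to fit the classification and regression values. One common choice, shown here purely as an assumption, is a weighted sum of binary cross-entropy (for a binds / does-not-bind label) and squared error (for a measured activity value). The function names and the weight `alpha` are hypothetical.

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy between predicted probability p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def joint_loss(pred_prob, pred_value, label, value, alpha=0.5):
    """Weighted sum of a classification loss and a regression loss.

    alpha is a hypothetical weighting factor, not a value from the patent.
    """
    cls = binary_cross_entropy(pred_prob, label)
    reg = (pred_value - value) ** 2
    return alpha * cls + (1 - alpha) * reg
```

Setting `alpha` to 1.0 or 0.0 reduces the sketch to a pure classification or pure regression objective, respectively.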

[0034] Step 105: Use the screening model to perform different protein prediction tasks and output the prediction results.

[0035] The server uses a screening model to perform different protein prediction tasks and outputs the prediction results. The prediction results can be protein sequences, or, as needed, the presence or absence of the protein sequence.

[0036] The aforementioned method predicts the interaction between proteins and small molecules directly from protein sequence information, thus greatly expanding the application of virtual screening technology in drug design for various biological targets, especially those lacking three-dimensional structural information. Furthermore, by utilizing big data and deep learning methods, it reduces computational cost by a factor of more than 10,000 compared to molecular dynamics, quantum mechanics, and quantum chemistry, significantly improving computational speed. In addition, the simulation method using elastic graph neural networks can better adapt to simulating highly flexible proteins, thereby reducing the flexibility gap between the simulated protein conformation and the conformation of real physiological proteins, thus reducing simulation errors and significantly improving the accuracy of protein-small molecule binding prediction. This application also establishes a standard virtual screening method based on protein sequences, laying the foundation for the future development of such methods.

[0037] In one embodiment, obtaining a training sample set includes the following steps: based on a BERT-style masking language model, executing a masking strategy to randomly mask the binding sites of protein targets to obtain discontinuous protein sequence representations, and using the discontinuous protein sequence representations as sample data; storing the source data and sample data in correspondence to obtain the training sample set.

[0038] The server, based on a BERT-style masking language model, executes a masking strategy to randomly mask the binding sites of protein targets, resulting in discontinuous protein sequence representations. These discontinuous protein sequence representations are then used as sample data. The server can randomly mask k binding sites out of n binding sites. The masking strategy may include: the k masked binding sites comprise 30-40% of the n binding sites; 80% of the k masked binding sites are directly masked; another 10% are replaced with other proteins; and the remaining 10% remain unchanged.
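The masking strategy can be sketched as follows. This is an illustration under stated assumptions: the `<mask>` token, the function name, and the default `mask_fraction` are hypothetical; "replaced with other proteins" is read here as substituting a different amino acid residue; and the 80/10/10 split is applied as a per-site probability rather than an exact partition of the k sites.

```python
import random

MASK = "<mask>"                       # hypothetical mask token
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def mask_binding_sites(sequence, binding_sites, mask_fraction=0.35, seed=0):
    """Return (tokens, labels) for one protein sequence.

    sequence:      string of one-letter amino acid codes
    binding_sites: 0-based indices of the n binding-site residues
    mask_fraction: share k / n of binding sites selected (0.30-0.40 in the text)
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    k = max(1, round(mask_fraction * len(binding_sites)))
    selected = rng.sample(sorted(binding_sites), k)
    labels = {}
    for pos in selected:
        labels[pos] = sequence[pos]   # the original residue is the target
        roll = rng.random()
        if roll < 0.8:                # about 80%: replace with the mask token
            tokens[pos] = MASK
        elif roll < 0.9:              # about 10%: replace with another residue
            tokens[pos] = rng.choice(
                [a for a in AMINO_ACIDS if a != sequence[pos]])
        # remaining about 10%: leave the residue unchanged
    return tokens, labels
```

The returned `tokens` play the role of the discontinuous sequence representation, and `labels` records which positions the model is asked to recover.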

[0039] The server stores the source data and sample data in correspondence to obtain the training sample set.

[0040] As shown in Figure 2, in one embodiment, performing unsupervised pre-training on the Transformer model using the source data as input and the sample data as validation, and generating one-dimensional or multi-dimensional symmetric matrices for the protein sequence and the ligand sequence respectively, includes:

[0041] Step 201: Input the training sample sets of protein sequences and ligand sequences into the input embedding layer of the Transformer model, respectively.

[0042] The server can represent the source data $X = \{x_1, x_2, \ldots, x_n\}$ as an embedding matrix $E = [e_1, e_2, \ldots, e_n]$, where each column $e_i$ of the matrix is the embedding vector of the corresponding term. The protein sequence and the ligand sequence have different embedding matrices; accordingly, each of them can be a one-dimensional or multi-dimensional symmetric matrix. The protein sequence matrix is obtained by vector encoding of amino acids; for example, the $i$-th amino acid $x_i$ in the protein can be represented by a single-point term together with coupling terms with paired amino acids $x_j$, forming a vector $e_{ij}(x_i; x_j)$, and combining the vectors $e_{ij}(x_i; x_j)$ of all amino acids in the protein yields the multidimensional symmetric matrix of the protein sequence. Similarly, ligand sequences can be composed of single-point terms representing themselves and coupling terms representing paired amino acids to obtain ligand vectors.
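A symmetric matrix built from single-point and coupling terms can be sketched as below. The text does not define either term, so `single_term` and `coupling_term` are hypothetical placeholders chosen only to make the construction concrete; in the method described they would be learned embeddings, not fixed numbers.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def single_term(a):
    # Hypothetical single-point term: index of the residue in the alphabet.
    return float(AMINO_ACIDS.index(a))

def coupling_term(a, b):
    # Hypothetical coupling term; symmetric in (a, b) by construction.
    return single_term(a) * single_term(b) / len(AMINO_ACIDS)

def symmetric_matrix(sequence):
    """Diagonal holds single-point terms, off-diagonal holds coupling terms."""
    n = len(sequence)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = single_term(sequence[i])
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = coupling_term(sequence[i], sequence[j])
    return m
```

Because each off-diagonal entry is written to both `m[i][j]` and `m[j][i]`, the result is symmetric for any sequence.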

[0043] Step 202: The training sample set is embedded through the input embedding layer of the Transformer model.

[0044] The server embeds the training sample set through the input embedding layer of the Transformer model. The server can then use dimensionality reduction techniques to project the matrix of all protein sequences in the training sample set to a lower dimension. Dimensionality reduction techniques are common data reduction methods that can eliminate redundancy between data and determine the coding structure and stability of proteins.

[0045] Step 203: Input the embedded training sample set into the intermediate hidden layer of the Transformer model.

[0046] The server feeds the embedded training sample set into the intermediate hidden layers of the Transformer model. The Transformer model learns the probability of a particular amino acid appearing at a given location by using all other amino acids surrounding it as context. During training, the Transformer model gradually modifies its internal dynamics (encoded as hidden state vectors) to maximize prediction accuracy. The server also uses the hidden state vectors of the Transformer model as an alternative protein sequence representation for the prediction model to capture global protein sequence context, complementing the local evolutionary context representation.
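The training target described here matches the standard BERT-style masked language modelling objective, which can be written as follows. The patent does not give the loss explicitly, so this is the conventional form rather than a quoted formula; $M$ (the set of masked positions), $x_{\setminus M}$ (the sequence with those positions masked), and $\theta$ (the model parameters) are symbols introduced here.

```latex
\mathcal{L}_{\mathrm{MLM}}(\theta) = -\sum_{i \in M} \log p_\theta\bigl(x_i \mid x_{\setminus M}\bigr)
```

Minimising this quantity raises the probability the model assigns to the true amino acid at each masked position given the surrounding residues.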

[0047] Step 204: Learn the feature representation of the training sample set after embedding through the intermediate hidden layer of the Transformer model.

[0048] The server learns feature representations of the training sample set after embedding through intermediate hidden layers of the Transformer model. These intermediate hidden layers (e.g., recurrent neural networks) take these sequence representations as input and learn the relationship between the sequences and functions.

[0049] Step 205: The learned feature representation is output through the output prediction layer of the Transformer model. The feature representation is a one-dimensional or multi-dimensional symmetric matrix.

[0050] The server outputs the learned feature representation through the output prediction layer of the Transformer model; the feature representation is a one-dimensional or multi-dimensional symmetric matrix.

[0051] In one embodiment, the two matrices representing the protein sequence and the ligand sequence are coupled into a multidimensional symmetric matrix. This involves using a QSAR model to couple the two matrices to obtain the multidimensional symmetric matrix. The QSAR model uses a mathematical model to describe the relationship between molecular structure and a molecule's biological activity. The basic assumption of QSAR is that the molecular structure of a compound contains information that determines its physical, chemical, and biological properties, and these physicochemical properties further determine the compound's biological activity; consequently, the molecular structure properties of a compound should also be correlated to some extent with its biological activity.

[0052] In one embodiment, using a BiLSTM network model to fit the experimental measurement classification and regression values of protein-small molecule interactions to obtain a trained screening model includes: obtaining the protein sequence $X = (x_1, \ldots, x_L)$, $x_i \in \mathcal{X}$, $i = 1, \ldots, L$, where $\mathcal{X}$ represents all amino acids; $x_i$ represents a point mutation at position $i$, and the sequence context of the mutation is $x_{[L]\setminus\{i\}} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_L)$;

[0053] Encoding the sequence context using the vector $z_i = f_e(x_{[L]\setminus\{i\}})$, where $f_e$ is an embedding function that maps discrete sequences to a $D$-dimensional continuous space. The embedding function is instantiated by a bidirectional LSTM neural network, and the outputs of the final LSTM layers are concatenated to form the embedding vector, giving $z_i = [\mathrm{LSTM}_f(g_f(x_1, \ldots, x_{i-1})), \mathrm{LSTM}_r(g_r(x_{i+1}, \ldots, x_L))]$, where $g_f$ is the output of the first several layers, which serves as the input to $\mathrm{LSTM}_f$, the last layer of the forward LSTM; $g_r$ is defined similarly to $g_f$ but in the opposite direction, and $\mathrm{LSTM}_r$ is defined similarly to $\mathrm{LSTM}_f$ but in the opposite direction;

[0054] Fitting the experimentally measured classification and regression values of protein-small molecule interactions from the embedding vector $z_i$ through a learned transformation and a softmax function, to obtain a trained screening model.

[0055] The softmax function is $p(x_i \mid x_{[L]\setminus\{i\}}) = p(x_i \mid z_i) = \mathrm{softmax}(W z_i + b)$, where $W$ and $b$ are learnable parameters.
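Two pieces of this embodiment are easy to make concrete: removing position i to form the sequence context, and the output head softmax(W z_i + b). The sketch below assumes 0-based indexing (the text counts from 1) and takes the embedding vector z_i as given, since computing it requires the bidirectional LSTM; the function names are hypothetical.

```python
import math

def sequence_context(x, i):
    """Return the sequence x with the residue at 0-based position i removed."""
    return x[:i] + x[i + 1:]

def softmax(v):
    """Numerically stable softmax over a list of logits."""
    m = max(v)
    e = [math.exp(t - m) for t in v]
    s = sum(e)
    return [t / s for t in e]

def predict_residue(z, W, b):
    """Compute softmax(W z + b) for an embedding vector z.

    W is a list of rows (one per output class) and b the matching bias list.
    """
    logits = [sum(w * t for w, t in zip(row, z)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)
```

For a 20-letter amino acid alphabet, `W` would have 20 rows of length D, so that the output is a probability distribution over the residue at position i.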

[0056] In one embodiment, as shown in Figure 3, a protein virtual screening device is provided, which includes a sample set processing module 301, an unsupervised pre-training module 302, an input setting module 303, a model training module 304, and a prediction module 305.

[0057] The sample set processing module 301 is used to acquire a training sample set, which includes source data and sample data corresponding to the source data. The source data includes protein sequences and ligand sequences of small molecule compounds that can bind to the protein. The sample data includes subsequences of the binding sites of the protein and the small molecule compounds.

[0058] The unsupervised pre-training module 302 is used to perform unsupervised pre-training of the Transformer model with source data as input and sample data as validation, generating one-dimensional or multi-dimensional symmetric matrices for protein sequences and ligand sequences, respectively.

[0059] The input setting module 303 is used to couple the two matrices of protein sequence and ligand sequence into a multidimensional symmetric matrix, and use the multidimensional symmetric matrix as the input of the hidden layer of the BiLSTM network model.

[0060] The model training module 304 is used to fit experimental measurement classification and regression values of protein-small molecule interactions using a BiLSTM network model to obtain a trained screening model.

[0061] The prediction module 305 is used to perform predictions for different protein prediction tasks using the screening model and output the prediction results.

[0062] In one embodiment, the sample set processing module includes:

[0063] The masking unit is used to execute a masking strategy based on the BERT-style masking language model, randomly masking the binding sites of protein targets to obtain discontinuous protein sequence representations, which are then used as sample data.

[0064] The storage unit is used to store the source data and sample data in correspondence to obtain the training sample set.

[0065] In one embodiment, the unsupervised pre-training module includes:

[0066] The input unit is used to input the training sample sets of protein sequences and ligand sequences into the input embedding layer of the Transformer model, respectively.

[0067] The embedding unit is used to embed the training sample set through the input embedding layer of the Transformer model.

[0068] The hidden unit is used to input the embedded training sample set into the intermediate hidden layers of the Transformer model.

[0069] The representation unit is used to learn the feature representation of the training sample set after embedding through the intermediate hidden layers of the Transformer model.

[0070] The output unit is used to output the learned feature representation through the output prediction layer of the Transformer model. The feature representation is a one-dimensional or multi-dimensional symmetric matrix.
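The path from input embedding through a hidden layer to a symmetric output can be sketched in a few lines. This is not the trained Transformer of the application: the embedding is a fixed untrained function, the hidden layer is a single self-attention step with Q = K = V, and the pairwise dot-product (Gram) matrix is only one assumed way of obtaining a symmetric feature matrix. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def embed(seq: str, dim: int = 4) -> Matrix:
    """Toy input embedding: a fixed, untrained vector per residue (assumption)."""
    return [
        [math.sin((ALPHABET.index(aa) + 1) * (k + 1)) for k in range(dim)]
        for aa in seq
    ]


def self_attention(x: Matrix) -> Matrix:
    """One untrained self-attention step with Q = K = V = x (assumption)."""
    dim = len(x[0])
    out = []
    for query in x:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in x
        ]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w / total * v[d] for w, v in zip(weights, x)) for d in range(dim)]
        )
    return out


def gram(hidden: Matrix) -> Matrix:
    """Pairwise dot products of hidden vectors; symmetric by construction."""
    return [[sum(a * b for a, b in zip(u, v)) for v in hidden] for u in hidden]


features = gram(self_attention(embed("MKVLA")))
```

The resulting matrix has one row and one column per residue, so its size follows the sequence length rather than the embedding dimension.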

[0071] In one embodiment, the model training module includes:

[0072] Sequence acquisition unit, used to acquire the protein sequence $X = (x_1, \ldots, x_L)$, $i = 1, \ldots, L$, where $X$ represents all amino acids; a point mutation occurs at position $i$, and the mutation sequence is ...; the sequence context is $x_{[L]\setminus\{i\}} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_L)$.

[0073] Encoding unit, used to encode the sequence context as the vector $z_i = f_e(x_{[L]\setminus\{i\}})$, where $f_e$ is an embedding function that maps discrete sequences to a $D$-dimensional continuous space. The embedding function is instantiated by a bidirectional LSTM neural network, and the outputs of the final LSTM layers are concatenated to form the embedding vector, giving $z_i = [\mathrm{LSTM}_f(g_f(x_1, \ldots, x_{i-1})), \mathrm{LSTM}_r(g_r(x_{i+1}, \ldots, x_L))]$, where $g_f$ is the output of the preceding forward layers, which serves as the input to $\mathrm{LSTM}_f$; $\mathrm{LSTM}_f$ is the last layer of the forward LSTM; and $g_r$ and $\mathrm{LSTM}_r$ are defined in the same way as $g_f$ and $\mathrm{LSTM}_f$ but in the reverse direction.

[0074] Training unit, used to pass the embedding vector $z_i$ through a learned transformation and a softmax function to fit the experimentally measured classification and regression values of protein-small molecule interactions, thereby obtaining a trained screening model.

[0075] The softmax function is $p(x_i \mid x_{[L]\setminus\{i\}}) = p(x_i \mid z_i) = \mathrm{softmax}(W z_i + b)$, where $W$ and $b$ are learned parameters.
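The three steps of the model training module (extracting the sequence context around position i, concatenating a forward and a reverse encoding into one vector, and applying a softmax to a learned linear transformation) can be sketched as follows. The direction encoder here is a deliberately trivial placeholder and not an LSTM, the weights are arbitrary fixed numbers rather than learned ones, and all function names are hypothetical.

```python
import math
from typing import List, Tuple


def sequence_context(x: str, i: int) -> Tuple[str, str]:
    """Left and right context of 1-based position i, with residue i removed."""
    if not 1 <= i <= len(x):
        raise IndexError("position i must satisfy 1 <= i <= L")
    return x[: i - 1], x[i:]


def toy_direction_encoder(context: str) -> List[float]:
    # Placeholder for one direction of the BiLSTM: NOT an LSTM, just two
    # summary features (context length, fraction of hydrophobic residues).
    n = len(context)
    hydrophobic = sum(c in "AVLIMFWC" for c in context)
    return [float(n), (hydrophobic / n) if n else 0.0]


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_distribution(
    z: List[float], W: List[List[float]], b: List[float]
) -> List[float]:
    """softmax(W z + b): one probability per output class."""
    logits = [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]
    return softmax(logits)


left, right = sequence_context("MKVLA", 3)
# Concatenate forward and reverse encodings; the reverse side reads the
# right-hand context from the end of the sequence backwards.
z = toy_direction_encoder(left) + toy_direction_encoder(right[::-1])

# Arbitrary fixed weights standing in for learned W and b (20 output classes,
# one per standard amino acid, is an assumption).
W = [[math.cos(r + c) for c in range(len(z))] for r in range(20)]
b = [0.0] * 20
probs = predict_distribution(z, W, b)
```

Because position i itself is excluded from both contexts, the predicted distribution depends only on the neighbouring residues, which is what the conditional $p(x_i \mid x_{[L]\setminus\{i\}})$ expresses.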

[0076] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 4. The computer device includes a processor, a memory, a network interface, and a database connected via a system bus. The processor provides computational and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores the operating system, computer programs, and the database. The internal memory provides an environment for running the operating system and the computer programs stored in the non-volatile storage medium. The database stores protein sequences, training sample sets, and the like. The network interface communicates with external terminals via a network. When executed by the processor, the computer program implements a protein virtual screening method.

[0077] Those skilled in the art will understand that the structure shown in Figure 4 is merely a block diagram of the part of the structure related to the present application and does not limit the computer device to which the present application is applied. A specific computer device may include more or fewer components than those shown in the figure, combine certain components, or arrange the components differently.

[0078] A computer device includes a memory and one or more processors. The memory stores a computer program, which, when executed by the processor, implements the steps of the protein virtual screening method provided in any embodiment of this application.

[0079] One or more computer-readable storage media store a computer program which, when executed by one or more processors, causes the one or more processors to implement the steps of the protein virtual screening method provided in any embodiment of this application.

[0080] The above are merely specific embodiments of this disclosure, but the scope of protection of this disclosure is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this disclosure should be included within the scope of protection of this disclosure. Therefore, the scope of protection of this disclosure should be determined by the scope of the claims.

Claims

1. A method for virtual protein screening, characterized in that the method includes: obtaining a training sample set, wherein the training sample set includes source data and sample data corresponding to the source data, the source data includes protein sequences and ligand sequences of small molecule compounds that can bind to the protein, and the sample data includes subsequences of the binding sites of the protein and the small molecule compounds; performing unsupervised pre-training on a Transformer model with the source data as input and the sample data as validation, and generating one-dimensional or multi-dimensional symmetric matrices for the protein sequence and the ligand sequence, respectively; coupling the two matrices of the protein sequence and the ligand sequence into a multidimensional symmetric matrix, and using the multidimensional symmetric matrix as the input of the hidden layer of a BiLSTM network model; fitting the experimentally measured classification and regression values of protein-small molecule interactions using the BiLSTM network model to obtain a trained screening model; and performing prediction for different protein prediction tasks using the screening model and outputting the prediction results; wherein obtaining the training sample set includes: executing a masking strategy based on a BERT-style masked language model to randomly mask the binding sites of protein targets, obtaining discontinuous protein sequence representations that are used as the sample data; and storing the source data and the sample data in correspondence to obtain the training sample set.

2. The method according to claim 1, characterized in that the step of performing unsupervised pre-training on the Transformer model with the source data as input and the sample data as validation, and generating one-dimensional or multi-dimensional symmetric matrices for the protein sequence and the ligand sequence, respectively, includes: inputting the training sample sets of the protein sequence and the ligand sequence into the input embedding layer of the Transformer model, respectively; embedding the training sample set through the input embedding layer of the Transformer model; inputting the embedded training sample set into the intermediate hidden layer of the Transformer model; learning the feature representation of the embedded training sample set through the intermediate hidden layer of the Transformer model; and outputting the learned feature representation through the output prediction layer of the Transformer model, wherein the feature representation is a one-dimensional or multi-dimensional symmetric matrix.

3. The method according to claim 1, characterized in that coupling the two matrices of the protein sequence and the ligand sequence into a multidimensional symmetric matrix includes: using a QSAR model to couple the two matrices of the protein sequence and the ligand sequence to obtain the multidimensional symmetric matrix.

4. The method according to claim 1, characterized in that fitting the experimentally measured classification and regression values of protein-small molecule interactions using the BiLSTM network model to obtain a trained screening model includes: obtaining the protein sequence $X = (x_1, \ldots, x_L)$, $i = 1, \ldots, L$, where $X$ is all amino acids; a point mutation at position $i$ gives a mutation sequence in which $x_1, \ldots, x_{i-1}$ and $x_{i+1}, \ldots, x_L$ are unchanged, and the sequence context is $x_{[L]\setminus\{i\}} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_L)$; encoding the sequence context as the vector $z_i = f_e(x_{[L]\setminus\{i\}})$, where $f_e$ is an embedding function that maps discrete sequences to a $D$-dimensional continuous space, the embedding function is instantiated by a bidirectional LSTM neural network, and the outputs of the final LSTM layers are concatenated to form the embedding vector, giving $z_i = [\mathrm{LSTM}_f(g_f(x_1, \ldots, x_{i-1})), \mathrm{LSTM}_r(g_r(x_{i+1}, \ldots, x_L))]$, where $g_f$ is the output of the preceding forward layers, which serves as the input to $\mathrm{LSTM}_f$, $\mathrm{LSTM}_f$ is the last layer of the forward LSTM, and $g_r$ and $\mathrm{LSTM}_r$ are defined in the same way as $g_f$ and $\mathrm{LSTM}_f$ but in the reverse direction; and passing the embedding vector $z_i$ through a learned transformation and a softmax function to fit the experimentally measured classification and regression values of protein-small molecule interactions, thereby obtaining a trained screening model; wherein the softmax function is $p(x_i \mid x_{[L]\setminus\{i\}}) = p(x_i \mid z_i) = \mathrm{softmax}(W z_i + b)$, where $W$ and $b$ are learned parameters.

5. A virtual protein screening device, characterized in that the device includes: a sample set processing module, used to acquire a training sample set, wherein the training sample set includes source data and sample data corresponding to the source data, the source data includes a protein sequence and a ligand sequence of a small molecule compound that can bind to the protein, and the sample data includes subsequences of the binding sites of the protein and the small molecule compound; an unsupervised pre-training module, used to perform unsupervised pre-training on a Transformer model with the source data as input and the sample data as validation, and to generate one-dimensional or multi-dimensional symmetric matrices for the protein sequence and the ligand sequence, respectively; an input setting module, used to couple the two matrices of the protein sequence and the ligand sequence into a multidimensional symmetric matrix, and to use the multidimensional symmetric matrix as the input of the hidden layer of a BiLSTM network model; a model training module, used to fit the experimentally measured classification and regression values of protein-small molecule interactions using the BiLSTM network model to obtain a trained screening model; and a prediction module, used to perform prediction for different protein prediction tasks using the screening model and to output the prediction results; wherein acquiring the training sample set includes: executing a masking strategy based on a BERT-style masked language model to randomly mask the binding sites of protein targets, obtaining discontinuous protein sequence representations that are used as the sample data; and storing the source data and the sample data in correspondence to obtain the training sample set.

6. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 4.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 4.