A molecular generation method based on protein structure using a variational flow model

By combining a protein structure-based variational flow model with a GraphSF generative model, ligand molecules are generated using protein receptor information, which solves the problem of unsatisfactory generated structures in existing models and achieves more reasonable molecular generation results.

CN118155755B · Active · Publication Date: 2026-06-30 · CHINA PHARM UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA PHARM UNIV
Filing Date
2024-04-19
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing deep learning models only target small molecule ligands and do not utilize information from protein receptors, resulting in suboptimal molecular structures.

Method used

A variational flow model based on protein structure is adopted to obtain the atomic type and spatial location features of protein receptors. Combined with the pre-trained molecular generation model GraphSF, ligand molecules are generated, including a protein-ligand complex encoder, a reference ligand structure encoder, a local focal atom predictor, and an atom autoregressive flow model. The prior distribution of the structure is encoded using the spherical coordinate message passing method of SphereNet, and molecules are generated through Markov sampling.

Benefits of technology

The generated molecular structure is more rational, improving the distribution of atomic types, interatomic distances, bond angles, and dihedral angles, thus solving the problem of unsatisfactory molecular generation in existing methods.



Abstract

This invention discloses a molecular generation method based on protein structure using a variational flow model, comprising the following steps: Step 1: Obtain the PDB format file of the protein receptor for which ligand molecules need to be generated; Step 2: Assuming the protein receptor comprises p atoms, extract the atom type features and atom spatial position features of the protein receptor; the atom type features are feature vectors of size p×1, and the atom spatial position features are feature vectors of size p×3; the atom spatial position includes: distance, bond angle, and dihedral angle; Step 3: Input the extracted atom type features and atom spatial position features of the protein receptor into a pre-trained molecular generation model GraphSF to generate several atoms; Step 4: Add bonds according to the interatomic distances, and finally output the corresponding ligand molecule in SDF format.

Description

Technical Field

[0001] This invention belongs to the technical field of protein structure-based drug design, and specifically relates to a molecular generation method based on protein structure using a variational flow model.

Background Technology

[0002] The process of developing a lead compound into a drug candidate is time-consuming, labor-intensive, and carries significant risks. In recent years, with the continuous development of artificial intelligence, deep learning, with its data-driven nature, has been able to generate novel compound molecules with similar properties and unique structures by learning the distribution of drug molecule data.

[0003] Currently, most models in the field of molecular generation target only small molecule ligands and do not utilize information about protein receptors. The molecular generation methods that are based on protein receptor structures rarely consider the three-dimensional structural information of proteins and their ligands, so the generated molecules have less than ideal structures.

Summary of the Invention

[0004] Purpose of the invention: To address the problems that existing deep learning models target only small molecule ligands without utilizing protein receptor information, and that existing molecular generation methods produce unsatisfactory molecular structures, this invention proposes a molecular generation method based on protein structure using a variational flow model. By sampling from the prior distribution of ligand structures, the generated molecular structures become more reasonable.

[0005] Technical solution: A molecular generation method based on protein structure using a variational flow model, comprising the following steps:

[0006] Step 1: Obtain the pdb format file of the protein receptor for which ligand molecules need to be generated;

[0007] Step 2: Assuming the protein receptor for which ligand molecules need to be generated consists of p atoms, extract the atom type features and atom spatial position features of the protein receptor; the atom type features are feature vectors of size p×1, and the atom spatial position features are feature vectors of size p×3; the atom spatial position includes: distance, bond angle, and dihedral angle;

[0008] Step 3: Input the extracted atom type features and atom spatial position features of the protein receptor into the pre-trained molecular generation model GraphSF to generate several atoms;

[0009] Step 4: Add bonds based on the interatomic distances, and finally output the corresponding ligand molecule in SDF format.
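Step 4 can be illustrated with a minimal sketch. It assumes a simple rule that the source does not state: two atoms are bonded when their distance is no greater than the sum of their covalent radii plus a tolerance. The function name `infer_bonds`, the radii table, and the 0.45 Å tolerance are illustrative assumptions, not part of the disclosed method.

```python
import math

# Approximate single-bond covalent radii in angstroms (illustrative assumption).
COVALENT_RADII = {"C": 0.76, "N": 0.71, "O": 0.66, "P": 1.07, "S": 1.05, "Cl": 1.02}


def infer_bonds(elements, coords, tolerance=0.45):
    """Return index pairs (i, j), i < j, of atoms close enough to be bonded."""
    bonds = []
    for i in range(len(elements)):
        for j in range(i + 1, len(elements)):
            cutoff = COVALENT_RADII[elements[i]] + COVALENT_RADII[elements[j]] + tolerance
            if math.dist(coords[i], coords[j]) <= cutoff:
                bonds.append((i, j))
    return bonds


# A C-C-O chain: the oxygen is too far from the first carbon to bond with it.
elements = ["C", "C", "O"]
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.9, 0.0, 0.0)]
bonds = infer_bonds(elements, coords)
```

In practice the resulting bond list would then be written to an SDF file with a cheminformatics toolkit; that step is omitted here.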

[0010] Furthermore, the molecular generation model GraphSF includes: a protein-ligand complex encoder, a reference ligand structure encoder, a local focal atom predictor, and an atom autoregressive flow model;

[0011] The protein-ligand complex encoder is a variant of SchNet, mainly comprising: an embedding layer and an interaction layer;

[0012] Suppose that, for a protein-ligand interaction pair, the protein receptor has p atoms and the ligand molecule has m atoms. If, at any time, the Euclidean distance between two atoms is less than a set cutoff distance, an edge is determined to exist between those two atoms.
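The cutoff rule above can be sketched as follows. The function name `build_radius_graph` and the example coordinates are hypothetical; a real implementation would use a spatial index rather than the quadratic loop shown here.

```python
import math


def build_radius_graph(coords, cutoff):
    """Return directed edges (i, j) for all distinct atom pairs closer than `cutoff`."""
    edges = []
    for i, r_i in enumerate(coords):
        for j, r_j in enumerate(coords):
            if i != j and math.dist(r_i, r_j) < cutoff:
                edges.append((i, j))
    return edges


# Atoms 0 and 1 are 1 unit apart; atom 2 is beyond the cutoff from both.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
edges = build_radius_graph(coords, cutoff=2.0)
```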

[0013] Assume that, when generating an atom at time t, the atoms already present form the following 3D radius graph:

[0014] $G_{p+t-1} = \{a_1, r_1, a_2, r_2, a_3, r_3, \ldots, a_{p+t-1}, r_{p+t-1}\}$

[0015] In the formula, $a_1, a_2, \ldots, a_{p+t-1}$ denote the atom types of the atoms already generated at time t, and $r_1, r_2, \ldots, r_{p+t-1}$ denote the spatial positions of those atoms;

[0016] After the embedding layer, the atom types are embedded into the vector space, represented as:

[0017] $h_1, h_2, h_3, \ldots, h_{p+t-1} = \mathrm{Embedding}(a_1, a_2, a_3, \ldots, a_{p+t-1})$

[0018] In the formula, $h_{p+t-1}$ represents the embedding vector of the (p+t-1)-th atom in the protein-ligand complex at time t;

[0019] The Interaction layer sequentially includes a first atom-wise layer, a continuous filtering convolutional layer, a second atom-wise layer, an activation layer, and a third atom-wise layer;

[0020] In the continuous filtering convolutional layer, the edge length $e_{weight}$ in the 3D radius graph is first calculated:

[0021] $e_{weight} = \|r_i - r_j\|$

[0022] In the formula, $r_i$ and $r_j$ represent the spatial positions of the i-th atom and the j-th atom, respectively.

[0023] Then, the edge length $e_{weight}$ is embedded using Gaussian radial basis functions and propagated forward through a fully connected layer:

[0024] $W^{l} = \mathrm{MLP}^{l}(e_{RBF}(e_{weight}))$

[0025] In the formula, $\mathrm{MLP}^{l}$ represents the l-th fully connected layer, and $e_{RBF}(e_{weight})$ represents the high-dimensional vector representation of the edge length $e_{weight}$;
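A minimal sketch of a Gaussian radial basis expansion of an edge length is given below. The source does not specify the number of centers, their spacing, or the width parameter, so `num_centers`, `cutoff`, and `gamma` are assumptions chosen for illustration.

```python
import math


def gaussian_rbf(distance, cutoff=10.0, num_centers=8, gamma=10.0):
    """Expand a scalar distance into a vector of Gaussian radial basis values.

    Centers are spaced evenly on [0, cutoff]; gamma controls the Gaussian width.
    """
    step = cutoff / (num_centers - 1)
    centers = [k * step for k in range(num_centers)]
    return [math.exp(-gamma * (distance - mu) ** 2) for mu in centers]


# With centers at 0, 1, ..., 7, a distance of 0 activates the first center most.
features = gaussian_rbf(0.0, cutoff=7.0, num_centers=8, gamma=1.0)
```

The resulting vector would be the input to the fully connected layer that produces the filter weights.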

[0026] Then, message passing and message updates are performed on the hidden variables of each atom, as follows:

[0027]

[0028] In the formula, the neighborhood set denotes the neighboring atoms of atom k in the 3D radius graph, that is, all atoms whose distance from atom k is less than the cutoff distance; ⊙ represents the Hadamard product between matrices; and the input hidden variable is the one obtained after passing through the Embedding layer and then through l-1 Interaction layers;
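The update formula itself is not reproduced in the text above. A plausible form, assuming the standard SchNet continuous-filter convolution, is sketched below. The symbols $\mathcal{N}(k)$ (neighbor set of atom k), $h_j^{l-1}$ (hidden variable of atom j after l-1 Interaction layers), and $W^{l}_{kj}$ (filter for the edge between atoms k and j) are notation assumed here for illustration.

```latex
h_k^{l} = \sum_{j \in \mathcal{N}(k)} h_j^{l-1} \odot W^{l}_{kj},
\qquad
W^{l}_{kj} = \mathrm{MLP}^{l}\!\big(e_{RBF}(\lVert r_k - r_j \rVert)\big)
```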

[0029] The reference ligand structure encoder is a variant of SphereNet, used to encode the structural information of ligand molecules in protein-ligand interaction pairs into a structural prior distribution using SphereNet's spherical coordinate message passing method.

[0030] The local focal atom predictor performs binary classification of atoms to predict which atom serves as the focal atom of the local spherical coordinate system in which the new atom is placed when it is generated at time t.

[0031] The process of a protein receptor generating its new ligand molecules is modeled as a Markov sampling process. The atom autoregressive flow model includes four flow models with non-shared weights: a flow model for generating atom types, a flow model for generating distances, a flow model for generating bond angles, and a flow model for generating dihedral angles. In the stage of generating new atoms, the flow model samples from the prior structural distribution obtained from the reference ligand structure encoder and combines it with the encoded representation of local focal atoms to generate the corresponding atom types and atom spatial positions.

[0032] Furthermore, for the reference ligand structure encoder, the method of using SphereNet's spherical coordinate message passing to encode the structural information of the ligand molecule in the protein-ligand interaction pair into a structural prior distribution includes the following specific operations:

[0033] A ligand molecule with m atoms is represented as a 3D radius graph:

[0034] $G_m = \{(a_1, r_1), (a_2, r_2), (a_3, r_3), \ldots, (a_m, r_m)\}$

[0035] In the formula, $(a_m, r_m)$ represents the element type and spatial position of the m-th atom, respectively;

[0036] The 3D radius graph of the reference ligand is encoded using SphereNet:

[0037] $u_G = \mathrm{SphereNet}(G_m)$

[0038] Using SphereNet's spherical coordinate message passing method, the edge features and atom features on the 3D radius map of the reference ligand are updated:

[0039]

[0040] $v'_i = u^{v}(v_i, \rho^{e \to v}(E_i))$

[0041] In the formula, the two functions applied in the edge update and the atom update are message update functions; $\rho^{p \to e}(\cdot)$ and $\rho^{e \to v}(\cdot)$ are message aggregation functions; $e_k$ is the feature of the k-th edge and $v_i$ is the feature of atom i; $r_k$ and $s_k$ are the outgoing and incoming atoms of the k-th edge, respectively, and $v_{r_k}$ and $v_{s_k}$ are their features; the neighborhood term denotes all neighboring atoms of atom $s_k$; $E_{s_k}$ and $E_i$ denote the features associated with atom $s_k$ and atom i, respectively; and $e'_k$ and $v'_i$ represent the updated feature of the k-th edge and the updated feature of atom i, respectively.

[0042] Summing the tensors of all nodes yields the overall embedding tensor of the molecular graph:

[0043] $u_G = \sum_{i=1}^{m} v'_i$

[0044] In the formula, $v'_i$ represents the feature tensor of the i-th atom in the reference ligand, and $u_G$ represents the feature tensor of the entire molecular graph;

[0045] Finally, the mean and variance of the structural prior distribution are predicted using a multilayer perceptron. The obtained structural prior distribution is as follows:

[0046] $(\mu_{structure}, \Sigma_{structure}) = \mathrm{MLP}(u_G)$

[0047] In the formula, $\mu_{structure}$ and $\Sigma_{structure}$ are the mean and variance of the structural prior distribution, respectively.

[0048] Furthermore, the local spherical coordinate system is composed of the focal atom, the atom closest to the focal atom, and the atom second closest to the focal atom. Let the three atoms constituting the local spherical coordinate system be $(f_i, f_j, f_k)$, and let $f_t$ be a new atom generated in the local spherical coordinate system. The position of the new atom is represented as $(d, \theta, \varphi)$, where d is the distance between $f_t$ and $f_i$, θ is the bond angle between $(f_i, f_j)$ and $(f_i, f_t)$, and φ is the dihedral angle formed between $(f_i, f_j, f_k)$ and $(f_i, f_j, f_t)$.
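The distance, bond angle, and dihedral angle of a new atom relative to three reference atoms can be computed as in the sketch below. The function name `local_spherical_coords` and the sign convention of the dihedral angle (positive for a counterclockwise rotation about the axis from the focal atom to its closest atom) are assumptions, since the source does not fix them.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def local_spherical_coords(f_i, f_j, f_k, f_t):
    """Return (d, theta, phi) of atom f_t relative to reference atoms (f_i, f_j, f_k)."""
    u = _sub(f_t, f_i)   # focal atom -> new atom
    v = _sub(f_j, f_i)   # focal atom -> closest atom
    w = _sub(f_k, f_i)   # focal atom -> second-closest atom
    d = _norm(u)
    theta = math.acos(_dot(u, v) / (d * _norm(v)))
    n1 = _cross(v, w)    # normal of plane (f_i, f_j, f_k)
    n2 = _cross(v, u)    # normal of plane (f_i, f_j, f_t)
    # Signed angle between the two plane normals, measured about the v axis.
    phi = math.atan2(_dot(_cross(n1, n2), v) / _norm(v), _dot(n1, n2))
    return d, theta, phi


d, theta, phi = local_spherical_coords(
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 2.0))
```

In this example the new atom sits 2 units from the focal atom, perpendicular to both reference directions, so both angles are a right angle.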

[0049] When the first atom is generated, the focal atom is selected from the protein receptor; when ligand atoms have already been generated, the focal atom is selected from the already generated atoms.

[0050] Furthermore, regarding the atom autoregressive flow model, in the stage of generating new atoms, the flow model samples from the prior structural distribution obtained from the reference ligand structure encoder and combines it with the encoded representation of the local focal atoms to generate the corresponding atom type and atom spatial location. Specific operations include:

[0051] The flow models sample from the structural prior distribution and, after a series of affine transformations, generate the corresponding atom type, distance, bond angle, and dihedral angle:

[0052]

[0053] where the conditioning term is the encoded representation of the local focal atom.

[0054] Furthermore, the pre-trained molecular generation model GraphSF is obtained according to the following training steps:

[0055] Construct training and testing sets for protein-ligand interaction pairs. Each protein-ligand interaction pair corresponds to three files, including a pdb file of the protein receptor, an sdf file of the ligand molecule interacting with the protein receptor, and a pdb file of the protein binding pocket.

[0056] Obtain the order of atoms in the ligand molecule and the local focal atom corresponding to each atom;

[0057] The following features were extracted from protein-ligand interaction pairs in the training set:

[0058] The atomic type characteristics of the protein receptor, the atomic type characteristics of the ligand, the coordinates of each atom in the Cartesian coordinate system, the atom closest to the ligand and the atom farthest from the ligand on the protein receptor, the three atomic information of the local spherical coordinate system constructed when placing each atom, and the positional characteristics of each atom in the local spherical coordinate system; the positional characteristics include distance, bond angle and dihedral angle;

[0059] The process of ligand molecule generation is modeled as a sequential decision-making process. At the start of training, the input is the 3D radius graph of the protein receptor. In each generation step, the molecular generation model GraphSF adds the atom type and spatial information of a new atom based on the current 3D radius graph.

[0060] The loss function during training is expressed as:

[0061]

[0062] In the formula, the terms are: the loss function of the flow model for generating atom types, the loss function of the flow model for generating distances, the loss function of the flow model for generating bond angles, the loss function of the flow model for generating dihedral angles, the loss function of the local focal atom predictor, and the KL divergence between the distribution of the reference ligand encoded by the geometric structure encoder and the standard Gaussian distribution; β is the coefficient of the KL divergence loss.
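As a hedged reading of the description above, the total loss can be written as the sum of the four flow losses, the focal atom loss, and the β-weighted KL term. The symbol names and the unit weights on the first five terms are assumptions, not taken from the source.

```latex
\mathcal{L} = \mathcal{L}_{a} + \mathcal{L}_{d} + \mathcal{L}_{\theta} + \mathcal{L}_{\varphi}
            + \mathcal{L}_{focal} + \beta\,\mathcal{L}_{KL}
```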

[0063]

[0064]

[0065]

[0066]

[0067] In the formula, the determinant term denotes the determinant of the Jacobian matrix;

[0068] The loss function of the local focal atom predictor is represented as:

[0069]

[0070]

[0071] In the formula, the two quantities are the true class and the predicted class of the j-th node, respectively;

[0072] The KL divergence between the distribution of the reference ligand encoded by the geometric structure encoder and the standard Gaussian distribution is represented as:

[0073]

[0074] Furthermore, the specific operations for constructing the training and test sets for protein-ligand interaction pairs include:

[0075] Obtain the dataset crossdock2020;

[0076] The crossdock2020 dataset is filtered, which includes removing entries whose binding conformation RMSD is greater than a set value; the data is then clustered using mmseqs2 at 30% sequence identity to obtain a protein-ligand interaction pair dataset.

[0077] A portion of the protein-ligand interaction pairs dataset was randomly selected as the training set, and another portion was selected as the test set.

[0078] Beneficial effects: Compared with the prior art, the present invention has the following advantages:

[0079] (1) The method of the present invention solves the problem that current molecular generation methods generate molecules with undesirable structures. It establishes a protein-ligand complex dataset; establishes a molecular generation model GraphSF based on protein structure and variational flow; trains that model; and generates target molecules with the trained GraphSF model, resulting in more reasonable molecular structures.

[0080] (2) The method of the present invention uses the geometric structure of the ligand molecule as a basis for sampling in the generation model, which improves the rationality of the distributions of atom types, interatomic distances, bond angles, and dihedral angles of each molecule in the generation process.

Attached Figure Description

[0081] Figure 1 is a flowchart of the present invention;

[0082] Figure 2 is a structural diagram of GraphSF, the molecular generation model proposed in this invention.

Detailed Implementation

[0083] The technical solution of the present invention will now be further described in conjunction with the accompanying drawings and embodiments.

[0084] Example 1:

[0085] As shown in Figure 1, this embodiment discloses a molecular generation method based on protein structure using a variational flow model, which mainly includes the following steps:

[0086] Step 1: Establish a protein-ligand complex dataset; including:

[0087] Obtain the protein-ligand complex dataset crossdock2020;

[0088] The protein-ligand complex dataset crossdock2020 is filtered to obtain a more refined dataset. The filtering includes removing entries from the crossdock2020 dataset whose binding conformation RMSD is greater than a set value. The data is then clustered using mmseqs2 at 30% sequence identity. Finally, 100,000 protein-ligand complex pairs are randomly selected from the clusters as the training set, and 100 protein-ligand complex pairs are selected from the remaining clusters as the test set.
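The idea of drawing the test set from clusters not used for training can be sketched as follows. The mapping `cluster_of` stands in for the output of the clustering step; its contents, the function name `split_by_cluster`, and the seed are hypothetical.

```python
import random


def split_by_cluster(cluster_of, n_test_clusters, seed=0):
    """Split complex IDs so that no cluster contributes to both train and test."""
    clusters = sorted(set(cluster_of.values()))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    test_clusters = set(clusters[:n_test_clusters])
    train = [cid for cid, cl in cluster_of.items() if cl not in test_clusters]
    test = [cid for cid, cl in cluster_of.items() if cl in test_clusters]
    return train, test


# Hypothetical complex IDs mapped to hypothetical sequence clusters.
cluster_of = {"cplx1": "A", "cplx2": "A", "cplx3": "B", "cplx4": "C", "cplx5": "C"}
train_ids, test_ids = split_by_cluster(cluster_of, n_test_clusters=1)
```

Splitting at the cluster level, rather than at the complex level, keeps proteins with similar sequences from appearing on both sides of the split.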

[0089] Step 2: Preprocess the established protein-ligand complex dataset, including: using the Biopython and RDKit tools to obtain information from the protein PDB file and the SDF file of its interacting ligand, respectively. This information includes the atom types and Cartesian coordinates of the protein and the atom types and Cartesian coordinates of the ligand. After processing, the atom types in the protein were determined to be: carbon, nitrogen, oxygen, sodium, magnesium, phosphorus, sulfur, chlorine, potassium, calcium, manganese, cobalt, copper, zinc, selenium, cadmium, iodine, cesium, mercury, and barium; the atom types in the ligands were determined to be: carbon, nitrogen, oxygen, phosphorus, sulfur, and chlorine.

[0090] Prim's algorithm is used to obtain the order of atoms in the ligand molecule and the local focal atom corresponding to each atom.
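A minimal sketch of how Prim's algorithm can yield both an atom order and a focal atom for each atom is shown below. It assumes that edge weights are Euclidean distances over all atom pairs and that the focal atom is the tree parent of each newly added atom; the source does not state the edge weighting, and the function name `prim_atom_order` is hypothetical.

```python
import heapq
import math


def prim_atom_order(coords, start=0):
    """Return [(atom_index, focal_index)] in the order atoms join the spanning tree.

    The focal index is the already-placed atom that the new atom attaches to;
    the first atom has focal index None.
    """
    order = [(start, None)]
    visited = {start}
    heap = [(math.dist(coords[start], coords[j]), start, j)
            for j in range(len(coords)) if j != start]
    heapq.heapify(heap)
    while heap and len(visited) < len(coords):
        _, parent, atom = heapq.heappop(heap)
        if atom in visited:
            continue
        visited.add(atom)
        order.append((atom, parent))
        for j in range(len(coords)):
            if j not in visited:
                heapq.heappush(heap, (math.dist(coords[atom], coords[j]), atom, j))
    return order


coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 1.5, 0.0)]
order = prim_atom_order(coords)
```

Here atom 2 joins first because it is nearest to the start atom, then atom 1 attaches to atom 2, and atom 3 attaches to atom 1.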

[0091] The protein-ligand complex is decomposed into a process of gradually placing ligand atoms in the protein space. Assuming that the ligand in a certain protein-ligand complex data consists of M atoms, then the protein-ligand complex data can be represented as the generation steps of M atoms. Each step uses protein atoms, currently generated ligand atoms, local focal atoms, the two atoms closest and second closest to the local focal atom, the distance of the new atom in the local spherical coordinate system, bond angles, and dihedral angles.

[0092] Step 3: Preprocess the ligands in the established protein-ligand complex dataset to obtain the atom type features (i.e., the atom types in the protein and ligands mentioned in Step 2), Cartesian coordinate features (the Cartesian coordinates are the coordinates of each atom on the x, y, and z axes in the Cartesian coordinate system), the atom closest to the ligand and the atom farthest from the ligand on the protein receptor, the three atom information in the local spherical coordinate system constructed when placing each atom, the distance of each atom in the local spherical coordinate system, and the bond angle and dihedral angle features of each atom in the local spherical coordinate system.

[0093] Step 4: As shown in Figure 2, a molecular generative model GraphSF based on protein structure and variational flow is established. The molecular generative model GraphSF includes a protein-ligand complex encoder, a reference ligand structure encoder, a focal-atom classifier, and an atom flow model.

[0094] The protein-ligand complex encoder uses a variant of SchNet to extract the geometric features of the protein-ligand complex. This encoder includes an embedding layer and an interaction layer.

[0095] For a protein-ligand complex, assuming the protein has p atoms and the ligand has m atoms, the complex can be represented as a 3D radius graph. A cutoff distance r is set. At time t, the Euclidean distance between every pair of atoms is computed, and atoms whose Euclidean distance is less than r are considered to have an edge relationship.

[0096] This embodiment uses an autoregressive model to generate atoms step by step. When generating atoms at time t, the generated atoms are represented as a 3D radius map:

[0097] $G_{p+t-1} = \{a_1, r_1, a_2, r_2, a_3, r_3, \ldots, a_{p+t-1}, r_{p+t-1}\}$

[0098] In the formula, $a_1, a_2, \ldots, a_{p+t-1}$ denote the atom types of the atoms already generated at time t, and $r_1, r_2, \ldots, r_{p+t-1}$ denote the spatial positions of those atoms.

[0099] After the embedding layer, the element type of each atom is embedded into the vector space:

[0100] $h_1, h_2, h_3, \ldots, h_{p+t-1} = \mathrm{Embedding}(a_1, a_2, a_3, \ldots, a_{p+t-1})$

[0101] In the formula, $h_{p+t-1}$ represents the embedding vector of the (p+t-1)-th atom in the protein-ligand complex at time t.

[0102] After passing through the Interaction layer, the atomic representation of the molecule is updated using geometric coordinates. The Interaction layer contains three atom-wise layers, one shifted softplus activation layer, and one continuous filtering convolutional layer (cfconv). The order is: atom-wise layer, cfconv layer, atom-wise layer, shifted softplus layer, and atom-wise layer.

[0103] In a cfconv layer, the edge length $e_{weight}$ in the radius graph is first calculated:

[0104] $e_{weight} = \|r_i - r_j\|$

[0105] In the formula, $r_i$ and $r_j$ represent the spatial positions of the i-th atom and the j-th atom, respectively.

[0106] Then, the edge lengths are embedded using Gaussian radial basis functions and forward propagated through a fully connected layer:

[0107] $W^{l} = \mathrm{MLP}^{l}(e_{RBF}(e_{weight}))$

[0108] In the formula, $\mathrm{MLP}^{l}$ represents the l-th fully connected layer, and $e_{RBF}(e_{weight})$ represents the high-dimensional vector representation of the edge length $e_{weight}$;

[0109] Then, message passing and message updates are performed on the hidden variables of each node.

[0110]

[0111] In the formula, the neighborhood set denotes the neighboring nodes of node k in the graph, that is, all nodes whose distance from node k is less than the cutoff distance; ⊙ represents the Hadamard product between matrices; and the input hidden variable is the one obtained after passing through the Embedding layer and then through l-1 Interaction layers.

[0112] The reference ligand structure encoder enables embedding of the reference ligand structure. Here, the reference ligand refers to the ligand in the protein-ligand complex. A variant of SphereNet is used to extract the geometric structural features of the reference ligand involving bond lengths, bond angles, and dihedral angles. Simultaneously, SphereNet's spherical coordinate message passing method is used to encode the structural information of the reference ligand into a structural prior distribution. A reference ligand with m atoms is represented as a 3D radius map.

[0113] $G_m = \{(a_1, r_1), (a_2, r_2), (a_3, r_3), \ldots, (a_m, r_m)\}$

[0114] In the formula, $(a_m, r_m)$ represents the element type and spatial position of the m-th atom, respectively.

[0115] This step involves encoding the 3D radius graph of the reference ligand using SphereNet:

[0116] $u_G = \mathrm{SphereNet}(G_m)$

[0117] The detailed process is as follows: for this 3D radius graph, let $e_k$ denote the feature of the k-th edge and $v_i$ denote the feature of node i. SphereNet uses spherical message passing to update them to $e'_k$ and $v'_i$:

[0118]

[0119]

[0120] In the formula, the two functions applied in the edge update and the node update are message update functions; $\rho^{p \to e}(\cdot)$ and $\rho^{e \to v}(\cdot)$ are message aggregation functions; $r_k$ and $s_k$ are the outgoing and incoming nodes of the k-th edge, respectively; the neighborhood term denotes all neighboring nodes of node $s_k$; and $E_{s_k}$ and $E_i$ denote the features associated with node $s_k$ and node i, respectively.

[0121] SphereNet is an invariant model that uses three kinds of geometric structural information during message passing: interatomic distances, bond angles, and dihedral angles. These three quantities do not change under translation, rotation, or mirroring of the molecule. After spherical message passing, the features of each atom node in the reference ligand are updated to include the geometric structural relationships with surrounding nodes. Then, the tensors of all nodes are summed to obtain the overall embedding tensor of the molecular graph:

[0122] $u_G = \sum_{i=1}^{m} v'_i$

[0123] In the formula, $v'_i$ represents the feature tensor of the i-th atom in the reference ligand, and $u_G$ represents the feature tensor of the entire molecular graph;

[0124] Finally, a multilayer perceptron is used to predict the mean and variance of the structural prior distribution:

[0125] $(\mu_{structure}, \Sigma_{structure}) = \mathrm{MLP}(u_G)$

[0126] In the formula, $\mu_{structure}$ and $\Sigma_{structure}$ are the mean and variance of the structural prior distribution, respectively. This distribution is used to constrain the sampling space of the atom autoregressive flow model during model training.

[0127] The goal of the local focal atom predictor is to perform binary classification of atoms, determining which atom serves as the focal atom of the local spherical coordinate system in which the new atom is placed when it is generated at time t.

[0128] There are two cases when predicting the focal atom: t = 1 and t ≥ 2. When t = 1, the model is generating the first atom of the ligand, and the focal atom is selected from the protein receptor. When t ≥ 2, the model has already generated t-1 ligand atoms, and the focal atom is selected from those t-1 atoms. To construct the local spherical coordinate system, two additional atoms are needed besides the focal atom; therefore, the atoms closest and second closest to the focal atom are selected. Let the three atoms constituting the local spherical coordinate system be $(f_i, f_j, f_k)$, and let $f_t$ be the new atom generated in that coordinate system. The position of the new atom is represented as $(d, \theta, \varphi)$, where d is the distance between $f_t$ and $f_i$, θ is the bond angle between $(f_i, f_j)$ and $(f_i, f_t)$, and φ is the dihedral angle formed between $(f_i, f_j, f_k)$ and $(f_i, f_j, f_t)$.
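Once a distance, bond angle, and dihedral angle have been generated, they must be converted back to Cartesian coordinates. The sketch below shows one way to do this; the function name `place_atom` and the orientation convention for the dihedral angle (zero when the new atom lies in the reference plane on the same side as the second-closest atom) are assumptions rather than details given in the source.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    length = math.sqrt(sum(x * x for x in a))
    return tuple(x / length for x in a)


def place_atom(f_i, f_j, f_k, d, theta, phi):
    """Convert local spherical coordinates (d, theta, phi) to a Cartesian position."""
    x_axis = _unit(_sub(f_j, f_i))                   # focal atom -> closest atom
    z_axis = _unit(_cross(x_axis, _sub(f_k, f_i)))   # normal of the reference plane
    y_axis = _cross(z_axis, x_axis)                  # in-plane axis toward f_k
    local = (d * math.cos(theta),
             d * math.sin(theta) * math.cos(phi),
             d * math.sin(theta) * math.sin(phi))
    return tuple(f_i[n] + local[0] * x_axis[n] + local[1] * y_axis[n] + local[2] * z_axis[n]
                 for n in range(3))


new_pos = place_atom((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
                     d=2.0, theta=math.pi / 2, phi=math.pi / 2)
```

With both angles at a right angle, the new atom lands 2 units along the normal of the reference plane.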

[0129] The atom autoregressive flow model realizes atom generation based on autoregressive flow. It models the process of generating a new ligand molecule for a given protein as a Markov sampling process, and mainly includes four flow models with non-shared weights: the flow model for generating atom types, the flow model for generating distances, the flow model for generating bond angles, and the flow model for generating dihedral angles. During the model training phase, a series of reversible transformations $f^{-1}$ is used to transform the data distribution into the latent space. During the model generation phase, new data is generated by sampling from the prior distribution and applying $f$, the inverse of the training-phase function.

[0130] The detailed steps are as follows:

[0131] The flow models sample from the structural prior distribution and, after a series of affine transformations, generate the corresponding atom type, distance, bond angle, and dihedral angle, namely:

[0132]

[0133] where the conditioning term is the encoded representation of the local focal atom.

[0134] The detailed steps are as follows:

[0135] In the process of sampling an atom, a local spherical coordinate system is first constructed using the predicted focal atom and the two atoms closest and second closest to the focal atom, denoted (f, s, n). In the spherical coordinate system, the four flow models sample random variables $z_t$ from the prior distribution and apply the corresponding transformations to $z_t$ to generate the type $a_t$ of the new atom and its spatial coordinates. The atom type $a_t$, the distance $d_t$, the bond angle $\theta_t$, and the dihedral angle are generated sequentially, and the results already generated are incorporated into each subsequent step. For example, when generating $\theta_t$, the input to the model is the three-dimensional graph $G_{p+t-1}$, the random variable, and the $a_t$ and $d_t$ generated in the first two steps. The generation process is as follows:

[0136]

[0137]

[0138]

[0139]

[0140] During training, observed variables x are mapped to latent variables, and their likelihoods are maximized. The atom type $a_t$ is a discrete variable, which the flow model cannot handle directly, so a dequantization method is used: noise $u$ is added to the atom type $a_t$ to transform it into a continuous variable:

[0141]

[0142] $u \sim U(0,1)^{p}$

[0143] The generation of the discrete variable $a_t$ can thus be transformed into the generation of a continuous variable, which is mapped through a series of affine transformations:

[0144]

[0145]

[0146]

[0147] After the continuous variable is obtained, an argmax operation is performed on it to obtain $a_t$.
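The dequantization and argmax steps can be sketched as follows. The sketch assumes that the atom type is one-hot encoded before uniform noise in [0, 1) is added to each component; the encoding, the function names, and the number of atom types are assumptions for illustration.

```python
import random


def dequantize(atom_index, num_types, rng):
    """One-hot encode a discrete atom type and add uniform noise in [0, 1)."""
    one_hot = [1.0 if k == atom_index else 0.0 for k in range(num_types)]
    return [x + rng.random() for x in one_hot]


def quantize(continuous):
    """Recover the discrete type as the index of the largest component (argmax)."""
    return max(range(len(continuous)), key=lambda k: continuous[k])


rng = random.Random(0)
a_tilde = dequantize(atom_index=2, num_types=6, rng=rng)
recovered = quantize(a_tilde)
```

Under this encoding the true component always lies in [1, 2) and every other component in [0, 1), so the argmax recovers the original type regardless of the noise drawn.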

[0148] The generation of spatial coordinates is similar to the generation of atom types:

[0149]

[0150]

[0151]

[0152] where the scale factors and the translation factors all depend on the corresponding conditional information. To integrate the conditional information at each step, a multilayer perceptron takes the conditional information as input and predicts these factors. The specific formulas are as follows:

[0153]

[0154]

[0155]

[0156]

[0157]

[0158]

[0159]

[0160]

[0161]

[0162] Wherein, RBF is the radial basis function, used for high-dimensional embedding of distances between nodes; CBF is the circular basis function, used for high-dimensional embedding of distances and angles between nodes.
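The role of the scale and translation factors can be illustrated with a scalar affine flow. In this sketch the factors are fixed numbers rather than outputs of a multilayer perceptron, and the prior is a standard normal distribution rather than the learned structural prior; both simplifications, and all function names, are assumptions.

```python
import math


def flow_forward(z, scale, shift):
    """Generation direction: map a latent sample z to a data value x."""
    return scale * z + shift


def flow_inverse(x, scale, shift):
    """Training direction: map x back to z and return log|det J| of the inverse map."""
    z = (x - shift) / scale
    log_det = -math.log(abs(scale))
    return z, log_det


def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))


# In the method these factors would be predicted from the conditional information.
scale, shift = 2.0, 1.0
x = flow_forward(0.5, scale, shift)
z, log_det = flow_inverse(x, scale, shift)
log_likelihood = standard_normal_logpdf(z) + log_det
```

The log-likelihood of x is the prior log-density of z plus the log-determinant term, which is the quantity a flow model maximizes during training.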

[0163] Step 5: Using the data obtained in Steps 2 and 3, train the molecular generative model GraphSF constructed in Step 4. Specific operations include:

[0164] The process of molecule generation is modeled as a sequential decision-making process. For each protein-ligand complex in the dataset, the input at the beginning of training is the 3D radius graph of the protein. In each generation step, the model adds the atom type and spatial information of the new node based on the current 3D radius graph.

[0165] The loss function during training is expressed as:

[0166]

[0167] The loss function includes the following terms: the loss function of the flow model for generating atom types, the loss function of the flow model for generating distances, the loss function of the flow model for generating bond angles, the loss function of the flow model for generating dihedral angles, and the loss function of the local focal atom predictor.

[0168]

[0169]

[0170]

[0171]

[0172] In the formula, Let be the determinant of the Jacobian matrix.

[0173] Focus Atom Classification Loss It's a binary cross-entropy loss, which is divided into two cases during training. In the first case, we use protein atoms as a condition to generate the first ligand molecule. In this case, we define the atom closest to the ligand on the protein as a positive sample and the farthest atom as a negative sample. In the second case, at least one ligand molecule has already been generated. If all bonding atoms of an atom have been generated, we define it as a negative sample; otherwise, it's a positive sample. The specific calculation method of the loss function is as follows:

[0174]

[0175]

[0176] In the formulas, the two class symbols denote the true class and the predicted class of the j-th node, respectively;
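The second labeling case and the binary cross-entropy can be sketched as follows. The sketch assumes that ligand atoms are indexed in generation order and that the loss is averaged over nodes; both choices, the function names and the toy chain ligand are assumptions made for illustration.

```python
import math

def focal_labels(bonds, num_generated):
    """Label generated atoms 0..num_generated-1.

    An atom is positive (1) if at least one of its bonded atoms has not
    been generated yet, and negative (0) otherwise.
    """
    labels = []
    for atom in range(num_generated):
        neighbors = [b for a, b in bonds if a == atom]
        neighbors += [a for a, b in bonds if b == atom]
        labels.append(1 if any(n >= num_generated for n in neighbors) else 0)
    return labels

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

# Hypothetical chain ligand 0-1-2-3 with the first three atoms generated.
bonds = [(0, 1), (1, 2), (2, 3)]
labels = focal_labels(bonds, num_generated=3)
loss = binary_cross_entropy(labels, [0.1, 0.2, 0.9])
```

Only atom 2 still has an ungenerated bonding partner, so it is the single positive sample.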

[0177] To make the geometric structure prior distribution as similar as possible to the standard Gaussian prior distribution, a regularization loss term is added, where β is a hyperparameter that adjusts the weight of this term in the total loss. The regularization term is the KL divergence between the distribution of the reference ligand encoded by the geometric structure encoder and the standard Gaussian distribution, expressed by the following formula:

[0178]
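For a diagonal Gaussian with predicted mean and variance, the KL divergence to a standard Gaussian has a well-known closed form, sketched below. Treating the encoded distribution as a diagonal Gaussian and combining the loss terms by a plain sum with a β-weighted KL term are assumptions, as are the function names.

```python
import math

def kl_to_standard_normal(mean, variance):
    """KL( N(mean, diag(variance)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        m * m + v - math.log(v) - 1.0 for m, v in zip(mean, variance)
    )

def total_loss(flow_losses, focal_loss, kl, beta):
    """Assumed combination: flow losses + focal-atom loss + beta * KL."""
    return sum(flow_losses) + focal_loss + beta * kl

kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
kl_shifted = kl_to_standard_normal([1.0], [1.0])
```

The divergence is zero exactly when the encoded distribution already equals the standard Gaussian, so β controls how strongly the encoder is pulled toward that prior.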

[0179] Step 6: Using the molecular generation model GraphSF trained in Step 5, input the protein PDB format file to generate the structures of the corresponding ligand molecules. The validity of each ligand molecule is verified using the RDKit toolkit, invalid ligand molecules are removed, and the remaining ligand molecule structures are retained for evaluation. The evaluation metrics are novelty, diversity, validity, synthetic accessibility score (SA), quantitative estimate of drug-likeness (QED), Lipinski's rule of five, and the n-octanol-water partition coefficient (logP).
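The exact metric definitions are not given here, so the following sketch uses common conventions as assumptions: novelty is the fraction of generated molecules absent from the training set, diversity is one minus the mean pairwise Tanimoto similarity, and validity is the fraction of generated molecules that pass the validity check. Molecules are assumed to be canonical SMILES strings and fingerprints are assumed to be sets of on-bits, both of which would come from a cheminformatics toolkit in practice.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def diversity(fingerprints):
    """One minus the mean pairwise Tanimoto similarity."""
    n = len(fingerprints)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    mean_sim = sum(
        tanimoto(fingerprints[i], fingerprints[j]) for i, j in pairs
    ) / len(pairs)
    return 1.0 - mean_sim

def novelty(generated, training):
    """Fraction of generated molecules that do not occur in the training set."""
    known = set(training)
    return sum(1 for s in generated if s not in known) / len(generated)

def validity(num_valid, num_generated):
    """Fraction of generated molecules that passed the validity check."""
    return num_valid / num_generated

nov = novelty(["CCO", "CCN", "c1ccccc1"], ["CCO"])
div = diversity([{1, 2}, {2, 3}])
```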

[0180] Example 2:

[0181] This embodiment proposes a molecular generation method based on protein structure using a variational flow model, which mainly includes the following steps:

[0182] Step 1: Download the Crossdock2020 protein-ligand complex dataset, which contains approximately 22,500,000 protein-ligand interaction pairs. Each protein-ligand complex in the dataset corresponds to three files: a pdb file for the protein receptor, an sdf file for the small molecule ligand, and a pdb file for the protein binding pocket.

[0183] Step 2: Process the Crossdock2020 dataset to obtain 100,000 protein-ligand complex pairs as the training set and 100 protein-ligand complex pairs as the test set. The training set and the test set do not overlap.

[0184] Step 3: Extract features from the protein-ligand complex and the reference ligand respectively, and model the ligand in the protein-ligand complex as a series of sequentially placed atoms. Taking the protein receptor with pdbid=4YHJ and its binding ligand in the dataset as an example, the protein receptor has 8101 atoms and the binding ligand has 27 atoms, so the protein-ligand complex is represented as 27 atom generation steps, each conditioned on the protein receptor and the atoms already generated. The atom types of the protein-ligand complex are represented as a feature vector of size (8101+(8101+1)+(8101+2)+......+(8101+27-1))×1, and the atom spatial positions as a feature vector of size (8101+(8101+1)+(8101+2)+......+(8101+27-1))×3. For each placed atom, three related atoms construct a local spherical coordinate system, in which the distance, bond angle, and dihedral angle are calculated. The distance calculation uses the feature vector of the focal atom, of size (8101+(8101+1)+(8101+2)+......+(8101+27-1))×1. The bond angle calculation uses the feature vectors of the focal atom and the nearest atom, of size (8101+(8101+1)+(8101+2)+......+(8101+27-1))×2. The dihedral angle calculation uses the feature vectors of the focal atom, the nearest atom, and the second-nearest atom, of size (8101+(8101+1)+(8101+2)+......+(8101+27-1))×3. At the same time, the atom types of the ligand are represented by a 27×1 feature vector, and the atom spatial positions by a 27×3 feature vector.
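The row count in the size expressions above, and the three local coordinates of a placed atom, can be worked through with the short sketch below. The function names and the toy coordinates are hypothetical, and the dihedral sign convention is one of several in use, so only the magnitudes should be read as meaningful.

```python
import math

def stacked_rows(num_protein_atoms, num_ligand_atoms):
    """Rows when step t (0-based) sees the protein plus t placed atoms."""
    return sum(num_protein_atoms + t for t in range(num_ligand_atoms))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def local_coordinates(new, focal, nearest, second):
    """Distance, bond angle and dihedral of `new` relative to three atoms."""
    u, v = sub(new, focal), sub(nearest, focal)
    distance = norm(u)
    cos_theta = max(-1.0, min(1.0, dot(u, v) / (norm(u) * norm(v))))
    theta = math.acos(cos_theta)
    # Dihedral along the chain new -> focal -> nearest -> second.
    b1, b2, b3 = sub(focal, new), sub(nearest, focal), sub(second, nearest)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    phi = math.atan2(dot(cross(n1, n2), b2) / norm(b2), dot(n1, n2))
    return distance, theta, phi

rows = stacked_rows(8101, 27)  # 219078 for the 4YHJ example
d, theta, phi = local_coordinates(
    (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)
)
```

For the 4YHJ example the stacked feature vectors therefore have 219078 rows, which is 27 × 8101 plus 0 + 1 + ... + 26.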

[0185] Step 4: Set the hyperparameters of the model, input the extracted protein-ligand features and reference ligand features into the molecular generative model GraphSF for training, and iterate as required until GraphSF reaches its best performance.

[0186] Step 5: Save the training weights of the GraphSF molecular generative model. Obtain the PDB file of the protein for which active ligands are to be predicted. Assume that the protein receptor contains p atoms. Extract the atom type features (p×1) and atom spatial position features (p×3) of the protein receptor and input them into the GraphSF molecular generative model, which generates a number of atoms. Bonds are then added algorithmically according to the interatomic distances, and the 3D molecule is output as an SDF file.
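The bond-adding algorithm is not specified here. One common heuristic, shown below purely as an assumption, connects two atoms when their distance does not exceed the sum of their covalent radii plus a tolerance. The radii are approximate single-bond values, the 0.45 angstrom tolerance is illustrative, and bond orders are not assigned.

```python
import math

# Approximate single-bond covalent radii in angstroms (illustrative values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

def infer_bonds(elements, positions, tolerance=0.45):
    """Return index pairs (i, j), i < j, judged to be bonded by distance."""
    bonds = []
    for i in range(len(elements)):
        for j in range(i + 1, len(elements)):
            limit = (COVALENT_RADIUS[elements[i]]
                     + COVALENT_RADIUS[elements[j]]
                     + tolerance)
            if math.dist(positions[i], positions[j]) <= limit:
                bonds.append((i, j))
    return bonds

# Hypothetical generated atoms: C and O at bonding distance, N far away.
elements = ["C", "O", "N"]
positions = [(0.0, 0.0, 0.0), (1.43, 0.0, 0.0), (5.0, 0.0, 0.0)]
bonds = infer_bonds(elements, positions)
```

Only the carbon-oxygen pair falls within its distance limit, so a single bond between atoms 0 and 1 is reported.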

[0187] Step 6: Visualize the results. Use PyMOL to open the protein PDB file and the generated 3D molecule SDF file and display the spatial interaction between the generated molecule and the protein receptor.

[0188] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0189] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. A method for molecular generation using a variational flow model based on protein structure, characterized in that it includes the following steps: Step 1: Obtain the pdb format file of the protein receptor for which ligand molecules need to be generated; Step 2: Assuming that the protein receptor for which ligand molecules need to be generated includes p atoms, extract the atom type features and atom spatial position features of the protein receptor; The atom type feature is a feature vector of size p×1; The atom spatial position feature is a feature vector of size p×3; The atom spatial position includes: distance, bond angle and dihedral angle; Step 3: Input the extracted atom type features and atom spatial position features of the protein receptor into the pre-trained molecular generation model GraphSF to generate several atoms; Step 4: Add bonds based on the interatomic distances, and finally output the corresponding ligand molecule in SDF format; The molecular generation model GraphSF includes: a protein-ligand complex encoder, a reference ligand structure encoder, a local focal atom predictor, and an atom autoregressive flow model; The protein-ligand complex encoder is a variant of SchNet, comprising: an embedding layer and an interaction layer; Assume that, for a protein-ligand interaction pair, the protein receptor has p atoms and the ligand molecule has m atoms; At any given time, if the Euclidean distance between two atoms is less than a set cutoff distance, the two atoms are considered to be connected by an edge; 
Assume that, when generating the atom at time t, the atoms already generated form the following 3D radius map: ; In the formula, the two terms respectively represent the type and the spatial position of the atom generated at time t; After the embedding layer, the atom types are embedded into a vector space, represented as: ; wherein the last term represents the embedding vector of the (p + t - 1)-th atom in the protein-ligand complex at time t; The interaction layer sequentially includes a first atom-wise layer, a continuous-filter convolutional layer, a second atom-wise layer, an activation layer, and a third atom-wise layer; wherein, in the continuous-filter convolutional layer, the edge lengths in the 3D radius map are first calculated: ; wherein the two position terms respectively represent the spatial positions of the i-th atom and the j-th atom; The edge length is then embedded using Gaussian radial basis functions and propagated forward through a fully connected layer: ; In the formula, the two terms denote the fully connected layer and the high-dimensional vector representation of the edge length; Then, message passing and message updates are performed on the hidden variables of each atom, as follows: ; In the formula, the neighbor term represents the neighboring atoms of atom k in the 3D radius map, where any atom whose distance from atom k is less than the cutoff distance is considered a neighboring atom; the product operator represents the Hadamard product between matrices; and the result represents the hidden variables obtained after passing through the embedding layer and then through the interaction layers; The reference ligand structure encoder is a variant of SphereNet, used to encode the structural information of ligand molecules in protein-ligand interaction pairs into a structural prior distribution using SphereNet's spherical coordinate message passing method. 
The local focal atom predictor is used to perform binary classification of atoms, and to predict which focal atom constitutes the local spherical coordinate system in which the atom is located when the atom is generated at time t. The process of a protein receptor generating its new ligand molecules is modeled as a Markov sampling process. The atom autoregressive flow model includes four flow models with non-shared weights: a flow model for generating atom types, a flow model for generating distances, a flow model for generating bond angles, and a flow model for generating dihedral angles. In the stage of generating new atoms, the flow model samples from the prior structural distribution obtained from the reference ligand structure encoder and combines it with the encoded representation of local focal atoms to generate the corresponding atom types and atom spatial positions.

2. The method of claim 1, wherein: For the reference ligand structure encoder, the spherical coordinate message passing method of SphereNet encodes the structural information of the ligand molecule in the protein-ligand interaction pair into a structural prior distribution. Specific operations include: Representing a ligand molecule with m atoms as a 3D radius map: ; In the formula, the last pair of terms represents the element type and spatial position of the m-th atom; Encoding the 3D radius map of the reference ligand using SphereNet: ; Using the spherical coordinate message passing method of SphereNet, updating the edge features and atom features on the 3D radius map of the reference ligand: ; ; In the formulas, two of the functions are message update functions and two are message aggregation functions; the remaining terms denote the features of the k-th edge, the features of atom i, the outgoing and incoming atoms of the k-th edge, the features of those outgoing and incoming atoms, the set of all neighboring atoms of the referenced atom, the atom features associated with atom i, and the updated features of the k-th edge and of atom i; Summing the tensors of all nodes yields the overall embedding tensor of the molecular graph: ; wherein the summand denotes the feature tensor of the i-th atom in the reference ligand and the result denotes the feature tensor of the whole molecular graph; Finally, a multilayer perceptron is used to predict the mean and variance of the structural prior distribution: ; wherein the two outputs are the mean and variance of the structural prior distribution, respectively.

3. The method of claim 1, wherein: The local spherical coordinate system is composed of the focal atom, the atom closest to the focal atom, and the atom second closest to the focal atom; let the three atoms constituting the local spherical coordinate system be: , ; For a new atom generated in the local spherical coordinate system, the position of the new atom is represented as: , wherein the three coordinates are the distance between the new atom and the focal atom, the bond angle formed with the focal atom and its closest atom, and the dihedral angle formed with all three atoms of the local spherical coordinate system; When the first atom is generated, the focal atom is selected from the protein receptor; when generated atoms already exist, the focal atom is selected from the atoms already generated.

4. The molecular generation method based on protein structure using a variational flow model according to claim 2, characterized in that: For the atom autoregressive flow model, in the stage of generating new atoms, the flow model samples from the structural prior distribution obtained from the reference ligand structure encoder and combines it with the encoded representation of the local focal atoms to generate the corresponding atom type and atom spatial position. Specific operations include: The flow model samples from the structural prior distribution and, after a series of affine transformations, generates the corresponding atom types, distances, bond angles, and dihedral angles: ; wherein the conditioning term represents the encoded representation of the local focal atom.

5. The method of claim 1, wherein: The pre-trained molecular generation model GraphSF is obtained according to the following training steps: Construct training and test sets of protein-ligand interaction pairs, where each protein-ligand interaction pair corresponds to three files: a pdb file of the protein receptor, an sdf file of the ligand molecule interacting with the protein receptor, and a pdb file of the protein binding pocket; Obtain the order of atoms in the ligand molecule and the local focal atom corresponding to each atom; Extract the following features from the protein-ligand interaction pairs in the training set: the atom type features of the protein receptor, the atom type features of the ligand, the coordinates of each atom in the Cartesian coordinate system, the atoms on the protein receptor closest to and farthest from the ligand, the three atoms of the local spherical coordinate system constructed when placing each atom, and the position features of each atom in the local spherical coordinate system, the position features including distance, bond angle and dihedral angle; The process of ligand molecule generation is modeled as a sequential decision-making process: at the start of training, the input is the 3D radius map of the protein receptor, and in each generation step the molecular generation model GraphSF adds the atom type and spatial information of a new atom based on the current 3D radius map; The loss function during training is expressed as: ; In the formula, the terms are, respectively, the loss of the flow model that generates atom types, the loss of the flow model that generates distances, the loss of the flow model that generates bond angles, the loss of the flow model that generates dihedral angles, the loss of the local focal atom predictor, the KL divergence between the distribution of the reference ligand encoded by the geometric structure encoder and the standard Gaussian distribution, and the coefficient β of the KL divergence loss; ; ; ; ; wherein the determinant term is the determinant of the Jacobian matrix; The loss function of the local focal atom predictor is expressed as: ; ; In the formula, the two class symbols represent the true class and the predicted class of the j-th node; The KL divergence between the distribution of the reference ligand encoded by the geometric structure encoder and the standard Gaussian distribution is expressed as: .

6. The method of claim 5, wherein: The specific operations for constructing the training and test sets of protein-ligand interaction pairs include: Obtain the dataset crossdock2020; Filter the crossdock2020 dataset, including removing data whose binding conformation RMSD is greater than 1 Å; Cluster the filtered data with MMseqs2 at 30% sequence identity to obtain the protein-ligand interaction pair dataset; Randomly select one portion of the protein-ligand interaction pair dataset as the training set and another portion as the test set.