Method, device and computer program product for conformation generation optimization
By integrating a conformation generation model with a physical model for energy-weighted training and all-atom diffusion, the method addresses the limitations of existing methods, achieving improved protein conformation generation for diverse applications.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- BEIJING YOUZHUJU NETWORK TECH CO LTD
- Filing Date
- 2024-12-20
- Publication Date
- 2026-06-25
AI Technical Summary
Current methods for protein conformation generation, such as molecular dynamics simulations and deep-learning models, face challenges in accuracy and computational cost, and lack a scalable approach to incorporate physical information during training, limiting the ability to generate diverse and accurate protein conformations.
A method that integrates a conformation generation model with a physical model to generate and train on self-generated sample conformations with physical information, using energy-weighted training and all-atom diffusion processes to improve model performance.
This approach allows for more accurate and efficient protein conformation generation without relying on extensive simulated data, enhancing applications in structural biology, drug development, and bioengineering by leveraging deep generative models.
Description
METHOD, DEVICE AND COMPUTER PROGRAM PRODUCT FOR CONFORMATION GENERATION OPTIMIZATION
FIELD
[0001] The present disclosure generally relates to the field of computers, and more particularly, to a method, device, and computer program product for conformation generation optimization.
BACKGROUND
[0002] Proteins are dynamic molecules that often exhibit multiple three-dimensional structures, known as conformations, in physiological environments. This flexibility allows proteins to undergo significant conformational changes to perform various functions, such as binding small molecules, transporting substances, and catalyzing chemical reactions. A comprehensive understanding of protein conformations facilitates the elucidation of biological reaction mechanisms, thereby empowering researchers to design targeted inhibitors and therapeutic agents with improved specificity and efficacy.
SUMMARY
[0003] In a first aspect of the present disclosure, there is provided a method of conformation generation optimization. The method includes generating, by using a conformation generation model, a first sample conformation of a target protein; generating, by using a physical model, a second sample conformation with physical information of the target protein based on the first sample conformation, wherein the physical information indicates a reference distribution of atoms in the target protein; and training the conformation generation model with the second sample conformation.
[0004] In a second aspect of the present disclosure, there is provided an electronic device. The electronic device includes: a computer processor coupled to a computer-readable memory unit, the memory unit including instructions that, when executed by the computer processor, implement a method according to the first aspect of the present disclosure.
[0005] In a third aspect of the present disclosure, there is provided a computer program product, the computer program product including a computer readable storage medium having program instructions embodied therewith, the program instructions executable by an electronic device to cause the electronic device to perform a method according to the first aspect of the present disclosure.
[0006] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] Through the more detailed description of some embodiments of the present disclosure in the accompanying drawings, the above and other objects, features, and advantages of the present disclosure will become more apparent, wherein the same reference generally refers to the same components in the embodiments of the present disclosure.
[0008] FIG. 1 illustrates an example environment in which example embodiments of the present disclosure can be implemented;
[0009] FIG. 2A illustrates a schematic diagram of an example architecture for conformation generation optimization according to some embodiments of the present disclosure;
[0010] FIG. 2B illustrates a schematic diagram of an example process of conformation generation optimization according to some embodiments of the present disclosure;
[0011] FIG. 3 illustrates an example flowchart of a method of conformation generation optimization according to some embodiments of the present disclosure; and
[0012] FIG. 4 illustrates a block diagram of an electronic device in which various embodiments of the present disclosure can be implemented.
DETAILED DESCRIPTION
[0013] Principles of the present disclosure will now be described with reference to some embodiments. It is to be understood that these embodiments are described only for the purpose of illustration and to help those skilled in the art understand and implement the present disclosure, without suggesting any limitation as to the scope of the disclosure. The disclosure described herein can be implemented in various manners other than the ones described below.
[0014] In the following description and claims, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
[0015] References in the present disclosure to “one embodiment,” “an embodiment,” “an example embodiment,” and the like indicate that the embodiment described may include a particular feature, structure, or characteristic, but it is not necessary that every embodiment includes the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an example embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
[0016] It shall be understood that although the terms “first” and “second” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and similarly, a second element could be termed a first element, without departing from the scope of example embodiments. As used herein, the term “and/or” includes any and all combinations of one or more of the listed terms.
[0017] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “has,” “having,” “includes” and/or “including,” when used herein, specify the presence of stated features, elements, and/or components etc., but do not preclude the presence or addition of one or more other features, elements, components and/or combinations thereof.
[0019] It may be understood that data involved in the present technical solution (including but not limited to the data itself, the acquisition or use of the data) should comply with requirements of corresponding laws and regulations and relevant rules.
[0020] It may be understood that, before using the technical solutions disclosed in various embodiments of the present disclosure, the user should be informed of the type, scope of use, and use scenario of the information involved in the present disclosure in an appropriate manner in accordance with relevant laws and regulations, and the user’s authorization should be obtained.
[0021] For example, in response to receiving an active request from the user, prompt information is sent to the user to explicitly inform the user that the requested operation will need to acquire and use the user’s information. Therefore, the user may independently choose, according to the prompt information, whether to provide the information to software or hardware such as electronic devices, applications, servers, or storage media that perform operations of the technical solutions of the present disclosure.
[0022] As an optional but non-limiting implementation, in response to receiving an active request from the user, the way of sending prompt information to the user, for example, may include a pop-up window, and the prompt information may be presented in the form of text in the pop-up window. In addition, the pop-up window may also carry a selection control for the user to choose “agree” or “disagree” to provide the information to the electronic device.
[0023] It may be understood that the above process of notifying and obtaining the user authorization is only illustrative and does not limit the implementation of the present disclosure. Other methods that satisfy relevant laws and regulations are also applicable to the implementation of the present disclosure.
[0024] As used herein, the term “model” refers to an association between an input and an output learned from training data, such that a corresponding output may be generated for a given input after the training. The generation of the model may be based on a machine learning technique. In general, a machine learning model may be built, which receives input information and makes predictions based on the input information. For example, a classification model may predict a class of input information among a predetermined set of classes. As used herein, “model” may also be referred to as “machine learning model,” “learning model,” “machine learning network,” or “learning network,” which are used interchangeably herein.
[0025] As mentioned above, understanding conformational dynamics is crucial for studying the structural and functional properties of proteins. Traditional approaches, such as molecular dynamics (MD) simulations, are considered the “gold standard” for studying protein conformational changes at atomic resolution. In MD simulations, a carefully designed physical model (also referred to as force field) is used to describe atomic interactions based on Newtonian mechanics, and a numerical integrator simulates structural movements to sample conformations through Langevin dynamics. However, applying MD simulations for conformational analysis in biological and therapeutic contexts faces several challenges. For example, accurate force field models require complex parameter tuning and expertise to accommodate different molecular types. The need for femtosecond time steps for numerical stability makes simulating biologically relevant processes, which occur over microseconds to seconds, computationally expensive and time-consuming. Achieving equilibrium conformational sampling via Langevin dynamics demands long mixing times, which is challenging with current computational resources in real-world applications.
[0026] Some solutions utilize deep-learning folding models, which can accurately predict protein structures and interactions at the atomic level yet are not tailored for sampling diverse conformations. Previous works have extended folding models using diffusion-based or flow-based generative models to achieve conformational sampling. However, their effectiveness is limited by the scarcity of conformational data. These models primarily rely on databases, such as the Protein Data Bank (PDB), which contain static atomic structures derived from experimental methods. While such data enables high-accuracy predictions of folded structures, it falls short in capturing the full range of conformational states or dynamic changes.
[0027] Given that, traditional simulation methods using physical models are limited by accuracy and computational cost, and deep-learning models have not performed well on protein conformation prediction. Additionally, protein conformation data is difficult to obtain. For example, one existing dataset contains 1,390 diverse proteins with triplicate 100 ns simulations each. Another initiative attempted to scale up simulations to 12.6K proteins with up to 1 microsecond of simulation. While these datasets offer valuable insights into protein conformations, the short simulation times and limited protein count may not suffice for models to fully capture protein dynamics from data alone.
[0028] Given the current limitations in protein conformation data, there is growing interest in incorporating physical information during model training, particularly for diffusion and flow models, to enhance performance. Several solutions have emerged. For example, in one solution, a diffusion model is trained with regularization derived from the Fokker-Planck equation to better align the learned score with atomic forces at small diffusion times. However, this solution requires costly calculations of the score function’s divergence, approximated using Hutchinson’s trace estimator. Additionally, directly supervising with error-prone “force values” introduces significant optimization challenges, requiring several approximations for stable training. In another solution, intermediate energies or forces are estimated during diffusion and physical guidance is applied at sampling time. However, it requires additional neural networks to approximate the intermediate energies or forces in order to avoid Monte Carlo-based estimation during sampling. So far, there is no simple and scalable method to train diffusion models with physical information such as potential energy.
[0029] In view of the above, there are several technical problems to be solved. First, current diffusion and flow models for protein conformation generation rely on training on protein structural data, and there is no easy and scalable approach to include physical information in the model training process to improve the model's ability to generate conformations. Second, accurate evaluation of protein energy requires predicting protein structures in fine-grained detail, including backbone atoms and side-chain atoms of a protein molecule. Current methods generate backbone atoms first and then predict the coordinates of side-chain atoms through regression (non-generative). This approach lacks a "full generation" capability, in that only one side-chain structure is predicted for given backbone coordinates. Third, current models rely on existing experimental (PDB) or simulation (MD) datasets, which limits the capability of the models to further improve performance.
[0030] Embodiments of the present disclosure propose a solution for conformation generation optimization. The solution aims to at least partly solve the above problems by using data generated by the model itself with evaluation from physical models. In the solution, a first sample conformation of a target protein is generated by using a conformation generation model. A second sample conformation with physical information of the target protein is generated, by using a physical model, based on the first sample conformation. The physical information indicates a reference distribution of atoms in the target protein. The conformation generation model is trained with the second sample conformation.
[0031] In this way, protein conformation generation models are allowed to directly get feedback from physical models, avoiding the need for large amounts of simulated conformation data, improving the training of protein conformation generation models and benefiting multiple domains of research involving protein conformations, such as structural biology, drug development, and bioengineering.
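The generate, evaluate, and retrain loop described above can be illustrated with a deliberately small toy. The following is only a sketch under stated assumptions, not the disclosed implementation: the "conformation" is a single number, the generation model is a unit-variance Gaussian with one trainable mean, the physical model is a harmonic potential, and "training" refits the mean as an energy-weighted average. All function names and parameter values are hypothetical.

```python
import math
import random


def generate_conformations(mean, n, rng):
    """Toy stand-in for the conformation generation model (first sample conformations)."""
    return [rng.gauss(mean, 1.0) for _ in range(n)]


def potential_energy(x, k=1.0):
    """Toy stand-in for the physical model: harmonic energy with its minimum at x = 0."""
    return 0.5 * k * x * x


def self_training_round(mean, n, beta, rng):
    """One generate -> evaluate -> retrain round; returns the updated model parameter."""
    first = generate_conformations(mean, n, rng)
    # Second sample conformations: each conformation paired with its energy.
    second = [(x, potential_energy(x)) for x in first]
    e_min = min(e for _, e in second)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for _, e in second]
    total = sum(weights)
    # Toy "training": refit the mean as the energy-weighted average of the samples.
    return sum(w * x for w, (x, _) in zip(weights, second)) / total


rng = random.Random(0)
mean = 3.0
history = [mean]
for _ in range(5):
    mean = self_training_round(mean, n=2000, beta=1.0, rng=rng)
    history.append(mean)
print([round(m, 2) for m in history])
```

In this toy setting each round moves the model toward the low-energy region: reweighting a unit-variance Gaussian by exp(-x²/2) yields a Gaussian centred at half the original mean, so the mean roughly halves per round, up to sampling noise.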
[0032] Theoretical foundation of the present disclosure and example embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
[0033] FIG. 1 illustrates a block diagram of an example environment 100 in which various embodiments of the present disclosure may be implemented. In the environment 100 of FIG. 1, a conformation generation model 110 is deployed in an electronic device 120. The electronic device 120 receives a sample protein sequence (e.g., chemical composition sequence) 101 of a target protein. The electronic device 120 uses the conformation generation model 110 to generate a target conformation 102 of the target protein based on the sample protein sequence 101.
[0034] To predict the target conformation 102, the conformation generation model 110 generates an initial conformation of the target protein based on the sample protein sequence 101. The initial conformation may be a noisy conformation, for example, one sampled from a Gaussian distribution. A diffusion process is performed on the initial conformation, and the target conformation 102 of the target protein may be determined based on a result of the diffusion process.
[0035] In the environment 100 of FIG. 1, the electronic device 120 may include any computing system with computing capability, such as various computing devices/systems, terminal devices, servers, etc. Terminal devices may include any type of mobile terminals, fixed terminals, or portable terminals, including mobile phones, desktop computers, laptops, netbooks, tablets, media computers, multimedia tablets, or any combination of the aforementioned, including accessories and peripherals of these devices or any combination thereof. Servers include but are not limited to mainframes, edge computing nodes, computing devices in cloud environments, etc.
[0036] Reference is now made to FIG. 2A, which illustrates a schematic diagram of an example architecture 200A for conformation generation optimization according to some embodiments of the present disclosure. The architecture 200A includes the conformation generation model 110 and a physical model 210. The conformation generation model 110 is configured to generate a first sample conformation 201 of a target protein. The conformation generation model 110 may include any suitable pretrained protein conformation generation model.
[0037] The physical model 210 is configured to generate a second sample conformation 202 with physical information of the target protein based on the first sample conformation 201. The physical model 210 may include any suitable protein physical model, such as an MD force field model. The physical information of the target protein indicates a reference distribution of atoms in the target protein. In some embodiments, the physical information may include energy information of the target protein. For example, the physical model 210 may be applied to evaluate the energy of the first sample conformation 201. In some embodiments, the physical model 210 may derive, as the second sample conformation 202, a specific conformation from the first sample conformation 201. Alternatively, in some embodiments, the physical model 210 may evaluate the physical information (for example, energy) of the first sample conformation 201 and combine the first sample conformation 201 and the physical information as the second sample conformation 202.
[0038] By using the conformation generation model 110 and the physical model 210, the electronic device 120 obtains the second sample conformation, thereby training the conformation generation model 110 with the second sample conformation to obtain the improved conformation generation model 110. FIG. 2B illustrates a schematic diagram of an example process 200B of conformation generation optimization according to some embodiments of the present disclosure. The process 200B involves a self-training process to further improve model performance by integrating the benefits, such as classifier-free-based conformation exploration, energy-weighted training, and model distillation. The following will describe the process 200B in detail.
[0039] In some embodiments, the electronic device 120 may generate a first score value by using a protein sequence of the target protein as a guidance condition and generate a second score value in the absence of a guidance condition. Then, the electronic device 120 may generate the first sample conformation 201 based on the first score value and the second score value. In this way, the diversity and accuracy of the generated protein sample conformations can be balanced.
[0040] In some embodiments, the first score value may be generated by using the conformation generation model 110, and the second score value may be generated by using a further conformation generation model obtained based on the conformation generation model 110.
[0041] As an example, the conformation generation model 110 is represented as $S_\theta(X_t, t, c)$, which is pretrained with classifier-free guidance. Here, $\theta$ represents the parameters of the conformation generation model 110, $X_t$ represents the input of the conformation generation model 110, and $t$ represents a time step in the diffusion process, indicating the transition of data from a completely noisy state gradually back to its original data state. $c$ represents a guidance condition used to guide the generation process. For example, $c$ may indicate the protein sequence (e.g., chemical composition sequence) of the target protein.
[0042] In detail, the electronic device 120 generates the first score value by using $\gamma S_\theta(X_t, t, c)$. That is, the output of the model is generated at time step $t$ given the input $X_t$ and the guidance condition $c$. The electronic device 120 generates the second score value by using $(1-\gamma) S_\theta(X_t, t, \varnothing)$. That is, the output of the model is generated at time step $t$ given the input $X_t$ without any guidance condition $c$. $\gamma$ represents a condition strength, which is a parameter between 0 and 1 used to control the strength of guided sampling. As $\gamma$ approaches 1, sampling depends more on the sequence-conditioned output; as $\gamma$ approaches 0, sampling approaches unconditional generation.
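How the two score values might be combined can be sketched as follows. This is a minimal illustration that assumes the common classifier-free form, in which the conditional score is scaled by γ and the unconditional score by (1 − γ). The function name `guided_score` and the example numbers are hypothetical, and scores are plain lists of floats rather than model outputs.

```python
def guided_score(score_cond, score_uncond, gamma):
    """Blend conditional and unconditional scores element-wise.

    Assumed form: gamma * s(x, t, c) + (1 - gamma) * s(x, t, no condition).
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return [gamma * sc + (1.0 - gamma) * su
            for sc, su in zip(score_cond, score_uncond)]


# Hypothetical score values for two coordinates.
conditional = [1.0, 2.0]
unconditional = [3.0, 4.0]
for gamma in (0.0, 0.5, 1.0):
    print(gamma, guided_score(conditional, unconditional, gamma))
```

Sweeping γ over several values, as in the loop above, is one simple way to trade off fidelity to the sequence condition against sample diversity during conformation exploration.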
[0043] By using a variable condition strength $\gamma$ for conformation exploration, sample conformations may be drawn from the conformation generation model 110. The electronic device 120 may then combine the first score value and the second score value to obtain $N$ first sample conformations 201, which are represented as $\{X_0^i\}_{i=1}^{N}$, where $X_0^i$ indicates the coordinates, or conformation, of the $i$-th sample.
[0044] Taking energy information as an example of the physical information, if $N$ first sample conformations 201 are provided to the physical model 210, $N$ second sample conformations 202 are generated, respectively. The second sample conformations 202 with potential energy may be represented as $\{(X_0^i, E^i)\}_{i=1}^{N}$. That is, the $i$-th second sample conformation with energy information is represented by a pair $(X_0^i, E^i)$, where $X_0^i$ indicates the coordinates, or conformation, and $E^i$ represents the potential energy of the $i$-th second sample conformation. The $N$ second sample conformations 202 are provided to the conformation generation model 110 for training.
[0045] In some embodiments, the electronic device 120 may determine a physical information weighted factor based on the physical information, and determine a loss function based on the physical information weighted factor and noise added to the second sample conformation. Taking energy information as an example of the physical information, the electronic device 120 may determine an energy weighted factor $\omega(X_0) \propto \exp(-\beta E(X_0))$ with tuning parameter $\beta$ and a loss function $\mathcal{L}_{\mathrm{EDSM}}$. Then, the electronic device 120 may update the conformation generation model 110 based on the loss function $\mathcal{L}_{\mathrm{EDSM}}$.
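The energy weighted factor can be computed as in the sketch below. The text only states proportionality, so normalizing the weights to sum to one over the batch is an assumption made here for illustration. Subtracting the minimum energy before exponentiating is a standard numerical safeguard that does not change the normalized result. The function name and the energy values are hypothetical.

```python
import math


def energy_weights(energies, beta):
    """Normalized weights w_i proportional to exp(-beta * E_i), computed stably."""
    e_min = min(energies)  # shifting by a constant cancels after normalization
    unnormalized = [math.exp(-beta * (e - e_min)) for e in energies]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical potential energies for three sampled conformations.
energies = [0.0, 1.0, 2.0]
print(energy_weights(energies, beta=1.0))   # lower energy receives a larger weight
print(energy_weights(energies, beta=0.0))   # beta = 0 recovers uniform weights
```

The tuning parameter β controls how sharply high-energy samples are down-weighted: larger values concentrate the weight on the lowest-energy conformations, while β = 0 reduces to unweighted training.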
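[0046] For example, diffusion models are commonly trained using a denoising score-matching (DSM) loss, without using energy information. The loss function may be defined as:
$$\mathcal{L}_{\mathrm{DSM}}(\theta) = \mathbb{E}_{X_0 \sim \mathcal{D},\; t \in [t_{\min}, 1],\; X_t \sim p(X_t \mid X_0)} \left[ \lambda(t) \left\| s(X_t, t; \theta) - \nabla_{X_t} \log p(X_t \mid X_0) \right\|^2 \right]$$
where $\mathcal{L}_{\mathrm{DSM}}$ represents the loss function to minimize, $X_0$ represents a sample conformation from a dataset $\mathcal{D}$, $t$ represents the diffusion time in $[t_{\min}, 1]$, $X_t$ represents a noisy structure sampled from the forward transition kernel $p(X_t \mid X_0)$, $\nabla_{X_t} \log p(X_t \mid X_0)$ represents the conditional score from the forward diffusion, $s(X_t, t; \theta)$ represents the score model parameterized by a neural network with weights $\theta$, and $\lambda(t)$ represents a prescribed weight schedule.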
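[0047] In some embodiments, the loss function may include a denoising score-matching loss function weighted based on energy, which is represented as $\mathcal{L}_{\mathrm{EDSM}}$. Given that, the loss function with energy information through energy-weighted denoising score-matching (EDSM) may be defined as:
$$\mathcal{L}_{\mathrm{EDSM}}(\theta) = \mathbb{E}_{X_0,\; t,\; X_t \sim p(X_t \mid X_0)} \left[ \omega(X_0)\, \lambda(t) \left\| s(X_t, t; \theta) - \nabla_{X_t} \log p(X_t \mid X_0) \right\|^2 \right]$$
where $\omega(X_0) \propto \exp(-\beta E(X_0))$ represents an energy weighted factor with tuning parameter $\beta$. In this way, through the energy-weighted training loss function, the physical information can be integrated into protein conformation generation model training.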
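A Monte Carlo estimate of an energy-weighted denoising score-matching loss can be sketched on a one-dimensional toy. Several choices below are assumptions made only for illustration and are not taken from the disclosure: the forward kernel is a plain Gaussian $N(X_0, \sigma_t^2)$, the weight schedule is $\lambda(t) = \sigma_t^2$, the energy weights are normalized over the batch, and one diffusion time is drawn per sample. The names `edsm_loss` and `sigma_fn` are hypothetical.

```python
import math
import random


def edsm_loss(score_fn, samples, energies, beta, sigma_fn, rng, t_min=0.01):
    """Monte Carlo estimate of an energy-weighted DSM loss on 1-D toy samples."""
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    total = sum(weights)
    loss = 0.0
    for x0, w in zip(samples, weights):
        t = rng.uniform(t_min, 1.0)               # diffusion time in [t_min, 1]
        sigma = sigma_fn(t)
        xt = x0 + sigma * rng.gauss(0.0, 1.0)     # X_t ~ N(x0, sigma^2), assumed kernel
        target = -(xt - x0) / sigma ** 2          # conditional score of that kernel
        lam = sigma ** 2                          # assumed weight schedule lambda(t)
        loss += (w / total) * lam * (score_fn(xt, t) - target) ** 2
    return loss


samples = [0.0, 0.0, 0.0, 0.0]        # all toy conformations coincide at x0 = 0
energies = [1.0, 2.0, 0.5, 3.0]       # hypothetical potential energies
sigma_fn = lambda t: t                # hypothetical noise schedule
exact = edsm_loss(lambda x, t: -x / t ** 2, samples, energies, 1.0, sigma_fn, random.Random(0))
zero = edsm_loss(lambda x, t: 0.0, samples, energies, 1.0, sigma_fn, random.Random(0))
print(exact, zero)
```

Because every toy sample sits at the same point, the score of the noised distribution equals the conditional score, so the exact score function attains zero loss while the all-zero score function does not. In a real setting `score_fn` would be the neural score model and the loss would be minimized by gradient descent on its weights.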
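[0048] Referring to FIG. 2B, the second sample conformations 202 with potential energy are represented as $\{(X_0^i, E^i)\}_{i=1}^{N}$, and the denoising score-matching loss function may then be further represented as:
$$\mathcal{L}_{\mathrm{EDSM}}(\theta) \approx \sum_{i=1}^{N} \omega(X_0^i)\, \mathbb{E}_{t,\; X_t^i \sim p(X_t^i \mid X_0^i)} \left[ \lambda(t) \left\| s(X_t^i, t; \theta) - \nabla_{X_t^i} \log p(X_t^i \mid X_0^i) \right\|^2 \right], \qquad \omega(X_0^i) \propto \exp(-\beta E^i)$$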
[0049] By using the proposed energy-weighted denoising score-matching technique, the conformation generation model 110 may be trained with self-generated sample conformations with energy information from the physical model 210, to obtain the improved conformation generation model. In this way, through classifier-free guidance, the physical model, and energy-weighted training, the proposed self-training process leverages self-generated conformations for protein conformation generation model training without using real-world data. Thus, the present disclosure has benefits for protein conformation generation models. For example, effectively integrating energy information into training allows models to learn an optimized distribution $q(X) \propto p(X)\exp(-\beta E(X))$, in which sample conformations with higher energy (less stable) are penalized.
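Why energy weighting leads to the distribution $q(X) \propto p(X)\exp(-\beta E(X))$ can be seen from a short importance-weighting argument. The derivation below is a sketch that assumes $p$ denotes the distribution of the self-generated sample conformations, $\ell_\theta(X_0)$ denotes the per-sample denoising score-matching loss (the expectation over $t$ and $X_t$), and the weight is written as $\omega(X_0) = \exp(-\beta E(X_0)) / C$ for some positive constant $C$. The symbols $\ell_\theta$, $C$, and $Z$ are introduced here for illustration only.

```latex
\begin{aligned}
\mathbb{E}_{X_0 \sim p}\!\left[\omega(X_0)\,\ell_\theta(X_0)\right]
  &= \int p(X_0)\,\frac{\exp(-\beta E(X_0))}{C}\,\ell_\theta(X_0)\,\mathrm{d}X_0 \\
  &= \frac{Z}{C}\int \frac{p(X_0)\exp(-\beta E(X_0))}{Z}\,\ell_\theta(X_0)\,\mathrm{d}X_0 \\
  &= \frac{Z}{C}\,\mathbb{E}_{X_0 \sim q}\!\left[\ell_\theta(X_0)\right],
\qquad
q(X_0) = \frac{p(X_0)\exp(-\beta E(X_0))}{Z},
\quad
Z = \int p(X)\exp(-\beta E(X))\,\mathrm{d}X .
\end{aligned}
```

Since $Z/C$ is a positive constant that does not depend on $\theta$, minimizing the energy-weighted loss over samples from $p$ has the same minimizer as the unweighted denoising score-matching loss over samples from $q$.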
[0050] Additionally, the present disclosure further proposes a novel all-atom diffusion process. In some embodiments, the first sample conformation 201 may include respective structures of backbone atoms of the target protein and respective structures of side-chain atoms of the target protein. In some embodiments, the conformation generation model 110 may be based on a diffusion process, and the diffusion process may be performed based on a change in: respective translations of backbone atoms of the target protein, respective rotations of the backbone atoms of the target protein, and respective torsional angles of side-chain atoms of the target protein.
[0051] For example, the all-atom diffusion process is performed in the joint space of SE(3) and SO(2) for generating backbone atoms and side-chain atoms for the target protein. The detailed protein conformation representation is given as $X_0 = [T_0, R_0, \Omega_0]$, where $T_0 \in \mathbb{R}^{L \times 3}$ represents translations of backbone atoms of the target protein, $R_0 \in SO(3)^{L}$ represents rotations of the backbone atoms of the target protein, and $\Omega_0 \in SO(2)^{L \times 5}$ represents torsional angles of side-chain atoms of the target protein. The forward diffusion process may be defined as: [equations missing in source] where $t$ represents the diffusion time variable in $[0, 1]$, $\beta_t$, $\sigma_t$, and [symbol missing in source] are predefined time-dependent noise schedules, $P$ represents a projection operator removing the center of mass, and [symbol missing in source] represents the standard Wiener process in SE(3), SO(3), and SO(2) space. The forward transition kernels may be defined as: [equations missing in source]
[0052] The associated reverse diffusion processes may be defined as: [equations missing in source]
[0053] By applying the diffusion process described above, the coordinates of both backbone atoms and side-chain atoms can be sampled through diffusion generative models, thereby obtaining the first sample conformation 201 including respective structures of backbone atoms of the target protein and respective structures of side-chain atoms of the target protein. In this way, SE(3)-torsional diffusion models for all-atom conformation generation can be implemented, and protein energy can be evaluated accurately based on protein structures predicted in fine-grained detail.
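Two ingredients of the all-atom process described above, the projection that removes the center of mass and the periodic nature of torsional angles, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the disclosed diffusion. All atoms are treated as having equal mass, the noise is plain Gaussian noise of scale `sigma` rather than the actual transition kernels, torsional angles are wrapped into the interval from −π to π, and rotational noise on SO(3) is omitted. All function names are hypothetical.

```python
import math
import random


def remove_center_of_mass(coords):
    """Projection P (assumed equal masses): subtract the centroid from every position."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    return [[c[k] - center[k] for k in range(3)] for c in coords]


def wrap_angle(theta):
    """Map an angle in radians onto the interval from -pi to pi."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def noise_translations(coords, sigma, rng):
    """Add Gaussian noise to translations, then re-center (simplified forward step)."""
    noisy = [[x + sigma * rng.gauss(0.0, 1.0) for x in c] for c in coords]
    return remove_center_of_mass(noisy)


def noise_torsions(torsions, sigma, rng):
    """Add Gaussian noise to torsional angles and wrap the result back onto the circle."""
    return [wrap_angle(a + sigma * rng.gauss(0.0, 1.0)) for a in torsions]


rng = random.Random(0)
translations = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [4.0, 3.0, 0.0]]  # hypothetical positions
torsions = [3.0, -3.0, 0.1, 1.5, -1.5]                             # five hypothetical angles
print(noise_translations(translations, 0.5, rng))
print(noise_torsions(torsions, 0.5, rng))
```

Re-centering after adding noise keeps the translational state free of a global offset, and wrapping keeps every torsional angle on the circle regardless of how much noise is added.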
[0054] In view of the above, according to the embodiments of the present disclosure, protein conformation generation models are allowed to directly get feedback from physical models, avoiding the need for large amounts of simulated conformation data, improving the training of protein conformation generation models and benefiting multiple domains of research involving protein conformations, such as structural biology, drug development, and bioengineering. Further, improving the current model performance on protein conformation generation allows deep generative models to be leveraged for protein conformation study in place of expensive protein dynamics simulation. Additional learning modules are not required to achieve energy-guided training. The present disclosure may use any sample conformations that can be generated from MD simulation or from deep learning models, and may be applied to any physical energy, including but not limited to potential energy calculated by any physical model.
[0055] FIG. 3 illustrates a flowchart of a method 300 of conformation generation optimization in accordance with some example implementations of the present disclosure. The method 300 may be implemented at the electronic device 120 as illustrated in FIG. 1. At a block 310, a first sample conformation of a target protein is generated by using a conformation generation model. At a block 320, a second sample conformation with physical information of the target protein is generated, by using a physical model, based on the first sample conformation. The physical information indicates a reference distribution of atoms in the target protein. At a block 330, the conformation generation model is trained with the second sample conformation.
[0056] In some embodiments, the method includes generating a first score value by using a protein sequence of the target protein as a guidance condition; generating a second score value in absence of a guidance condition; and generating the first sample conformation of the target protein based on the first score value and the second score value.
[0057] In some embodiments, the first score value is generated by using the conformation generation model, and the second score value is generated by using a further conformation generation model obtained based on the conformation generation model.
[0058] In some embodiments, the first sample conformation includes respective structures of backbone atoms of the target protein and respective structures of side-chain atoms of the target protein.
[0059] In some embodiments, the conformation generation model is based on a diffusion process, and the diffusion process is performed based on a change in: respective translations of backbone atoms of the target protein, respective rotations of the backbone atoms of the target protein, and respective torsional angles of side-chain atoms of the target protein.
[0060] In some embodiments, the method 300 includes determining a physical information weighted factor based on the physical information; determining a loss function based on the physical information weighted factor and noise added to the second sample conformation; and updating the conformation generation model based on the loss function.
[0061] In some embodiments, the loss function includes a denoising score-matching loss function.
[0062] In some embodiments, the physical information includes energy information of the target protein.
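By way of example only, an energy-weighted denoising score-matching loss could be sketched as below. The sketch assumes Gaussian noising of flat coordinate lists, for which the denoising target is -eps / sigma, and assumes the physical information weighted factor is a batch-normalized Boltzmann weight exp(-E / kT). Neither the weighting form nor the names boltzmann_weights, weighted_dsm_loss, score_fn, and kT are taken from the disclosure.

```python
import math
import random

def boltzmann_weights(energies, kT=1.0):
    """Assumed weighted factor: normalized exp(-E / kT) over the batch."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    raw = [math.exp(-(e - e_min) / kT) for e in energies]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_dsm_loss(score_fn, samples, energies, sigma, rng, kT=1.0):
    """Energy-weighted denoising score-matching loss for Gaussian noise.

    score_fn(xt, sigma) is a hypothetical model returning one score per coordinate.
    """
    weights = boltzmann_weights(energies, kT)
    loss = 0.0
    for w, x0 in zip(weights, samples):
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        xt = [x + sigma * e for x, e in zip(x0, eps)]  # noised sample
        target = [-e / sigma for e in eps]             # score of the Gaussian kernel
        pred = score_fn(xt, sigma)
        loss += w * sum((p - t) ** 2 for p, t in zip(pred, target))
    return loss
```

Under these assumptions, lower-energy sample conformations contribute more to the loss, so the model update is steered toward physically favorable regions without a separate learned guidance module.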
[0063] FIG. 4 illustrates a block diagram of an electronic device 400 in which various embodiments of the present disclosure can be implemented. It would be appreciated that the electronic device 400 shown in FIG. 4 is merely for purpose of illustration, without suggesting any limitation to the functions and scopes of the present disclosure in any manner. The electronic device 400 may be used to implement the above method 300. As shown in FIG. 4, the electronic device 400 may be a general-purpose electronic device. The electronic device 400 may at least include one or more processors or processing units 410, a memory 420, a storage unit 430, one or more communication units 440, one or more input devices 450, and one or more output devices 460.
[0064] The processing unit 410 may be a physical or virtual processor and can implement various processes based on programs 425 stored in the memory 420. In a multi-processor system, multiple processing units execute computer executable instructions in parallel so as to improve the parallel processing capability of the electronic device 400. The processing unit 410 may also be referred to as a central processing unit (CPU), a microprocessor, a controller, or a microcontroller.
[0065] The electronic device 400 typically includes various computer storage media. Such media can be any media accessible by the electronic device 400, including, but not limited to, volatile and non-volatile media, or detachable and non-detachable media. The memory 420 can be a volatile memory (for example, a register, cache, Random Access Memory (RAM)), a non-volatile memory (such as a Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), or a flash memory), or any combination thereof. The storage unit 430 may be any detachable or non-detachable medium and may include a machine-readable medium such as a memory, flash memory drive, magnetic disk, or any other media, which can be used for storing information and / or data and can be accessed in the electronic device 400.
[0066] The electronic device 400 may further include additional detachable / non-detachable, volatile / non-volatile memory medium. Although not shown in FIG. 4, it is possible to provide a magnetic disk drive for reading from and / or writing into a detachable and non-volatile magnetic disk and an optical disk drive for reading from and / or writing into a detachable non-volatile optical disk. In such cases, each drive may be connected to a bus (not shown) via one or more data medium interfaces.
[0067] The communication unit 440 communicates with a further electronic device via a communication medium. In addition, the functions of the components in the electronic device 400 can be implemented by a single computing cluster or multiple computing machines that can communicate via communication connections. Therefore, the electronic device 400 can operate in a networked environment using a logical connection with one or more other servers, networked personal computers (PCs) or further general network nodes.
[0068] The input device 450 may be one or more of a variety of input devices, such as a mouse, keyboard, tracking ball, voice-input device, and the like. The output device 460 may be one or more of a variety of output devices, such as a display, loudspeaker, printer, and the like. By means of the communication unit 440, the electronic device 400 can further communicate with one or more external devices (not shown) such as the storage devices and display device, with one or more devices enabling the user to interact with the electronic device 400, or any devices (such as a network card, a modem, and the like) enabling the electronic device 400 to communicate with one or more other electronic devices, if required. Such communication can be performed via input / output (I / O) interfaces (not shown).
[0069] In some embodiments, instead of being integrated in a single device, some or all components of the electronic device 400 may also be arranged in a cloud computing architecture. In the cloud computing architecture, the components may be provided remotely and work together to implement the functionalities described in the present disclosure. In some embodiments, cloud computing provides computing, software, data access and storage services, which do not require end users to be aware of the physical locations or configurations of the systems or hardware providing these services. In various embodiments, the cloud computing provides the services via a wide area network (such as the Internet) using suitable protocols. For example, a cloud computing provider provides applications over the wide area network, which can be accessed through a web browser or any other computing components. The software or components of the cloud computing architecture and corresponding data may be stored on a server at a remote position. The computing resources in the cloud computing environment may be merged or distributed at locations in a remote data center. Cloud computing infrastructures may provide the services through a shared data center, though they behave as a single access point for the users. Therefore, the cloud computing architectures may be used to provide the components and functionalities described herein from a service provider at a remote location. Alternatively, they may be provided from a conventional server or installed directly or otherwise on a client device.
[0070] The functionalities described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
[0071] Program code for carrying out the methods of the subject matter described herein may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely or partly on a machine, executed as a stand-alone software package partly on the machine, partly on a remote machine, or entirely on the remote machine or server.
[0072] In the context of this disclosure, a machine-readable medium may be any tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
[0073] Further, while operations are illustrated in a particular order, this should not be understood as requiring that such operations are performed in the particular order shown or in sequential order, or that all illustrated operations are performed to achieve the desired results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are contained in the above discussions, these should not be construed as limitations on the scope of the subject matter described herein, but rather as descriptions of features that may be specific to particular embodiments. Certain features that are described in the context of separate embodiments may also be implemented in combination in a single implementation. Conversely, various features described in a single implementation may also be implemented in multiple embodiments separately or in any suitable sub-combination.
[0074] Although the subject matter has been described in language specific to structural features and / or methodological acts, it is to be understood that the subject matter specified in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
[0075] From the foregoing, it will be appreciated that specific embodiments of the presently disclosed technology have been described herein for purposes of illustration, but that various modifications may be made without deviating from the scope of the disclosure. Accordingly, the presently disclosed technology is not limited except as by the appended claims.
[0076] Embodiments of the subject matter and the functional operations described in the present disclosure can be implemented in various systems, digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible and non-transitory computer readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term “data processing unit” or “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
[0077] A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
[0078] Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media, and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
[0079] It is intended that the specification, together with the drawings, be considered exemplary only, where exemplary means an example. As used herein, the use of “or” is intended to include “and / or”, unless the context clearly indicates otherwise.
[0080] While the present disclosure contains many specifics, these should not be construed as limitations on the scope of any disclosure or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular disclosures. Certain features that are described in the present disclosure in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.
[0081] Similarly, while operations are illustrated in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in the present disclosure should not be understood as requiring such separation in all embodiments. Only a few embodiments and examples are described, and other embodiments, enhancements and variations can be made based on what is described and illustrated in the present disclosure.
Claims
1. A method of conformation generation optimization, comprising:
generating, by using a conformation generation model, a first sample conformation of a target protein;
generating, by using a physical model, a second sample conformation with physical information of the target protein based on the first sample conformation, wherein the physical information indicates a reference distribution of atoms in the target protein; and
training the conformation generation model with the second sample conformation.
2. The method of claim 1, wherein generating, by using the conformation generation model, the first sample conformation of the target protein comprises:
generating a first score value by using a protein sequence of the target protein as a guidance condition;
generating a second score value in absence of a guidance condition; and
generating the first sample conformation of the target protein based on the first score value and the second score value.
3. The method of claim 2, wherein the first score value is generated by using the conformation generation model, and the second score value is generated by using a further conformation generation model obtained based on the conformation generation model.
4. The method of claim 1, wherein the first sample conformation comprises respective structures of backbone atoms of the target protein and respective structures of side-chain atoms of the target protein.
5. The method of claim 1, wherein the conformation generation model is based on a diffusion process, and the diffusion process is performed based on a change in: respective translations of backbone atoms of the target protein, respective rotations of the backbone atoms of the target protein, and respective torsional angles of side-chain atoms of the target protein.
6. The method of claim 1, wherein training the conformation generation model with the second sample conformation comprises:
determining a physical information weighted factor based on the physical information;
determining a loss function based on the physical information weighted factor and a noising added to the second conformation sample; and
updating the conformation generation model based on the loss function.
7. The method of claim 6, wherein the loss function comprises a denoising score-matching loss function.
8. The method of claim 1, wherein the physical information comprises energy information of the target protein.
9. An electronic device, comprising a computer processor coupled to a computer-readable memory unit, the memory unit comprising instructions that when executed by the computer processor implement acts of conformation generation optimization, the acts comprising:
generating, by using a conformation generation model, a first sample conformation of a target protein;
generating, by using a physical model, a second sample conformation with physical information of the target protein based on the first sample conformation, wherein the physical information indicates a reference distribution of atoms in the target protein; and
training the conformation generation model with the second sample conformation.
10. The electronic device of claim 9, wherein generating, by using the conformation generation model, the first sample conformation of the target protein comprises:
generating a first score value by using a protein sequence of the target protein as a guidance condition;
generating a second score value in absence of a guidance condition; and
generating the first sample conformation of the target protein based on the first score value and the second score value.
11. The electronic device of claim 10, wherein the first score value is generated by using the conformation generation model, and the second score value is generated by using a further conformation generation model obtained based on the conformation generation model.
12. The electronic device of claim 9, wherein the first sample conformation comprises respective structures of backbone atoms of the target protein and respective structures of side-chain atoms of the target protein.
13. The electronic device of claim 9, wherein the conformation generation model is based on a diffusion process, and the diffusion process is performed based on a change in: respective translations of backbone atoms of the target protein, respective rotations of the backbone atoms of the target protein, and respective torsional angles of side-chain atoms of the target protein.
14. The electronic device of claim 9, wherein training the conformation generation model with the second sample conformation comprises:
determining a physical information weighted factor based on the physical information;
determining a loss function based on the physical information weighted factor and a noising added to the second conformation sample; and
updating the conformation generation model based on the loss function.
15. The electronic device of claim 14, wherein the loss function comprises a denoising score-matching loss function.
16. The electronic device of claim 9, wherein the physical information comprises energy information of the target protein.
17. A computer program product, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by an electronic device to cause the electronic device to perform acts of conformation generation optimization, the acts comprising:
generating, by using a conformation generation model, a first sample conformation of a target protein;
generating, by using a physical model, a second sample conformation with physical information of the target protein based on the first sample conformation, wherein the physical information indicates a reference distribution of atoms in the target protein; and
training the conformation generation model with the second sample conformation.
18. The computer program product of claim 17, wherein generating, by using the conformation generation model, the first sample conformation of the target protein comprises:
generating a first score value by using a protein sequence of the target protein as a guidance condition;
generating a second score value in absence of a guidance condition; and
generating the first sample conformation of the target protein based on the first score value and the second score value.
19. The computer program product of claim 17, wherein training the conformation generation model with the second sample conformation comprises:
determining a physical information weighted factor based on the physical information;
determining a loss function based on the physical information weighted factor and a noising added to the second conformation sample; and
updating the conformation generation model based on the loss function.
20. The computer program product of claim 17, wherein the physical information comprises energy information of the target protein.