Protein structure prediction

By extracting structural features of amino acid residues from a fragment library and optimizing protein structure prediction using a neural network model, the problem of ineffective use of fragment library information in existing technologies is solved, achieving more efficient and accurate protein structure prediction.

CN114694756B · Active · Publication Date: 2026-06-12 · MICROSOFT TECHNOLOGY LICENSING LLC

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: MICROSOFT TECHNOLOGY LICENSING LLC
Filing Date: 2020-12-31
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing technologies fail to effectively utilize the structural information of fragment libraries in protein structure prediction, resulting in insufficient prediction accuracy, and the Monte Carlo simulation process is time-consuming.

Method used

Structural features of fragments, each comprising multiple amino acid residues, are extracted from a fragment library, and a neural network model uses these features to predict the structure and structural properties of a protein. Gradient descent is then used to optimize the predicted structure, so that the fragment library supplements and refines the information used in protein structure prediction.

Benefits of technology

It improves the accuracy and efficiency of protein structure prediction, reduces computation time, and provides more realistic protein structure prediction results.



Abstract

According to implementations of the present disclosure, a scheme for protein structure prediction is proposed. In the scheme, from a fragment library for a target protein, a plurality of fragments are determined for each of a plurality of residue positions of the target protein. Each fragment includes a plurality of amino acid residues. Then, for each residue position, a feature representation of structures of the plurality of fragments is generated. Next, based on the respective feature representations generated for the plurality of residue positions, a prediction of at least one of a structure and a structural property of the target protein is determined. The scheme is able to utilize structural information of the fragment library to complement and refine the information used in protein structure prediction, thereby improving the accuracy of protein structure prediction.

Description

Background Technology

[0001] Proteins are biomolecules or macromolecules composed of long chains of amino acid residues. Proteins perform many vital life activities within organisms, and their functions are primarily determined by their three-dimensional (3D) structure. Knowing protein structure is crucial for the fields of medicine and biotechnology. For example, if a protein plays a key role in a disease, its structure can be used to design drug molecules to treat that disease. However, determining protein structure experimentally is extremely time-consuming, and the number of proteins whose structures can be experimentally determined is very small. Therefore, low-cost, high-yield protein structure prediction has become an important tool for protein structure research.

Summary of the Invention

[0002] According to an implementation of this disclosure, a scheme for protein structure prediction is proposed. In this scheme, for each of multiple residue positions of a target protein, multiple fragments are determined from a fragment library for the target protein. Each fragment includes multiple amino acid residues. Then, for each residue position, a feature representation of the structures of the multiple fragments is generated. Next, based on the corresponding feature representations generated for the multiple residue positions, a prediction of at least one of the structure and structural properties of the target protein is determined. In some implementations, the structure of the target protein can be predicted. In such implementations, structural information from the fragment library can facilitate the search for more realistic protein structures. In some implementations, the structural properties of the target protein can be predicted. In such implementations, structural information from the fragment library can improve the accuracy of predicting protein structural properties. In this way, the scheme can utilize the structural information from the fragment library to supplement and refine the information used in protein structure prediction, thereby improving the accuracy of protein structure prediction.

[0003] The summary section is provided to introduce a selection of concepts in a simplified form, which will be further described in the detailed implementations below. The summary section is not intended to identify key or principal features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.

Attached Figure Description

[0004] Figure 1 shows a block diagram of a computing device capable of implementing multiple implementations of the present disclosure;

[0005] Figure 2 shows a schematic diagram of the structural properties of a protein;

[0006] Figure 3 shows a schematic diagram of a process for predicting the structure of a protein using structural information from a fragment library, according to some implementations of this disclosure;

[0007] Figure 4 shows a schematic diagram of a process for predicting the structural properties of a protein using structural information from a fragment library, according to some implementations of this disclosure;

[0008] Figure 5 shows a schematic diagram of a process for encoding structural information of a fragment library using a feature encoder, according to some implementations of this disclosure;

[0009] Figure 6 shows a schematic diagram of a process for predicting the structural properties of a protein using a property predictor, according to some implementations of this disclosure; and

[0010] Figure 7 shows a flowchart of a method for protein structure prediction according to an implementation of this disclosure.

[0011] In these accompanying figures, the same or similar reference symbols are used to denote the same or similar elements.

Detailed Implementation

[0012] This disclosure will now be discussed with reference to several example implementations. It should be understood that these implementations are discussed only to enable those skilled in the art to better understand and thus implement this disclosure, and not to imply any limitation on the scope of this disclosure.

[0013] As used herein, the term "comprising" and its variations are to be interpreted as open-ended terms meaning "including but not limited to". The term "based on" is to be interpreted as "at least partially based on". The terms "one implementation" and "an implementation" are to be interpreted as "at least one implementation". The term "another implementation" is to be interpreted as "at least one other implementation". The terms "first", "second", etc., may refer to different or the same objects. Other explicit and implicit definitions may also be included below.

[0014] As used herein, a "neural network" is capable of processing input and providing corresponding output. It typically includes an input layer, an output layer, and one or more hidden layers between the input and output layers. Neural networks used in deep learning applications often include many hidden layers, thus extending the network's depth. The layers of a neural network are connected sequentially, so that the output of the previous layer is provided as the input to the next layer, where the input layer receives the input to the neural network, and the output layer's output serves as the final output of the neural network. Each layer of a neural network includes one or more nodes (also called processing nodes or neurons), each of which processes the input from the previous layer. A convolutional neural network (CNN) is a type of neural network that includes one or more convolutional layers for performing convolution operations on their respective inputs. CNNs can be used in a variety of scenarios and are particularly well-suited for processing image or video data. Herein, the terms "neural network," "network," and "neural network model" are used interchangeably.

[0015] Protein structure is typically divided into multiple levels, including primary, secondary, and tertiary structures. Primary structure refers to the sequence of amino acid residues, or amino acid sequence. Secondary structure refers to the specific conformations formed by the backbone atoms along certain axes, including α-helices, β-sheets, and random coils. Tertiary structure refers to the three-dimensional spatial structure formed by further coiling and folding of the protein based on its secondary structure. A protein fragment (also simply called a "fragment") consists of a continuous segment of amino acid residues arranged in a three-dimensional spatial structure.

[0016] As mentioned earlier, a protein's function is primarily determined by its structure, and protein structure prediction has become an important tool for studying protein structure. Fragment assembly is one method for protein structure prediction, and the quality of the fragment library is a crucial factor affecting the accuracy of fragment assembly. Fragment libraries are constructed based on fragments of proteins with known structures (e.g., native fragments, near-native fragments). For the target protein to be predicted, different fragment library construction algorithms can select as many native or near-native fragments as possible for each residue position (also called "position") of the target protein.

[0017] Fragment libraries contain rich structural information, including but not limited to secondary structures, torsion angles, interatomic distances, and orientations. Although fragment libraries are used in fragment assembly for protein structure prediction, the structural information contained within them has not yet been analyzed and utilized. Furthermore, fragment assembly-based structure prediction is a Monte Carlo simulation process, which is extremely time-consuming.

[0018] Gradient descent is another method for predicting protein structures. In this method, the protein structure is folded by optimizing a potential energy derived from the predicted structural properties. Predicted structural properties can include, for example, the distance between C and N atoms in the main chain, and the torsion angles. Since the potential energy is primarily derived from the predicted structural properties, the accuracy of these predictions largely determines the quality of the final predicted structure.
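The folding idea in this paragraph can be illustrated with a minimal sketch. It is not the potential used in the disclosure: it assumes a simple quadratic penalty on pairwise distances, a fixed step size, and a random starting structure. The names `distance_potential`, `potential_gradient`, and `fold` are hypothetical.

```python
import numpy as np

def distance_potential(coords, target):
    """Sum of squared deviations between current and predicted pairwise distances."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(coords), k=1)  # count each residue pair once
    return float(np.sum((dist[iu] - target[iu]) ** 2))

def potential_gradient(coords, target):
    """Analytic gradient of distance_potential with respect to the coordinates."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, 1.0)             # avoid dividing by zero on the diagonal
    coeff = 2.0 * (dist - target) / dist
    np.fill_diagonal(coeff, 0.0)            # a residue exerts no force on itself
    return np.sum(coeff[:, :, None] * diff, axis=1)

def fold(target, steps=500, lr=0.01, seed=0):
    """Plain gradient descent from a random start (assumed hyperparameters)."""
    rng = np.random.default_rng(seed)
    coords = rng.normal(size=(len(target), 3))
    for _ in range(steps):
        coords -= lr * potential_gradient(coords, target)
    return coords
```

Because the potential is built entirely from the predicted distances, any error in those predictions is carried directly into the folded coordinates, which is the dependency the paragraph describes.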

[0019] Currently, the most widely used features for predicting protein structural properties are those derived from the protein's amino acid sequence. In other words, this method utilizes only the amino acid sequence information and does not take advantage of the structural information contained in fragment libraries.

[0020] In view of the above, according to an implementation of this disclosure, a scheme for protein structure prediction is provided, aiming to solve one or more of the above-mentioned problems and other potential problems. In this scheme, for each of multiple residue positions of a target protein, multiple fragments are determined from a fragment library for the target protein. Each fragment includes multiple amino acid residues. Then, for each residue position, a feature representation of the structures of the multiple fragments is generated. Next, based on the corresponding feature representations generated for the multiple residue positions, a prediction of at least one of the structure and structural properties of the target protein is determined. In this way, the scheme can utilize the structural information of the fragment library to supplement and improve the information used in protein structure prediction, thereby improving the accuracy of protein structure prediction.

[0021] The following section provides a detailed description of various example implementations of this scheme, with reference to the accompanying drawings.

[0022] Example Environment

[0023] Figure 1 shows a block diagram of a computing device 100 capable of implementing multiple implementations of the present disclosure. It should be understood that the computing device 100 shown in Figure 1 is merely exemplary and should not constitute any limitation on the functionality and scope of the implementations described in this disclosure. As shown in Figure 1, computing device 100 takes the form of a general-purpose computing device. Components of computing device 100 may include, but are not limited to, one or more processors or processing units 110, memory 120, storage device 130, one or more communication units 140, one or more input devices 150, and one or more output devices 160.

[0024] In some implementations, computing device 100 can be implemented as various user terminals or service terminals with computing capabilities. Service terminals can be servers, large computing devices, etc., provided by various service providers. User terminals can be any type of mobile terminal, fixed terminal, or portable terminal, including mobile phones, sites, units, devices, multimedia computers, multimedia tablets, internet nodes, communicators, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, personal communication system (PCS) devices, personal navigation devices, personal digital assistants (PDAs), audio / video players, digital cameras / camcorders, positioning devices, television receivers, radio receivers, e-book devices, gaming devices, or any combination thereof, including accessories and peripherals of these devices, or any combination thereof. It is also foreseeable that computing device 100 can support any type of user-facing interface (such as "wearable" circuitry).

[0025] Processing unit 110 can be a physical or virtual processor and is capable of performing various processes according to programs stored in memory 120. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of computing device 100. Processing unit 110 may also be referred to as a central processing unit (CPU), microprocessor, controller, or microcontroller.

[0026] Computing device 100 typically includes multiple computer storage media. Such media can be any available media accessible to computing device 100, including but not limited to volatile and non-volatile media, removable and non-removable media. Memory 120 can be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. Memory 120 may include prediction modules 122, which are configured to perform the functions of the various implementations described herein. Prediction modules 122 can be accessed and executed by processing unit 110 to implement the corresponding functions.

[0027] Storage device 130 may be a removable or non-removable medium and may include machine-readable media capable of storing information and / or data and accessible within computing device 100. Computing device 100 may further include additional removable / non-removable, volatile / non-volatile storage media. Although not shown in Figure 1, a disk drive for reading from or writing to a removable, non-volatile disk and an optical disc drive for reading from or writing to a removable, non-volatile optical disc can be provided. In these cases, each drive can be connected to a bus (not shown) via one or more data media interfaces.

[0028] The communication unit 140 enables communication with other computing devices via a communication medium. Additionally, the functionality of the components of the computing device 100 can be implemented as a single computing cluster or multiple computing machines capable of communicating via communication connections. Therefore, the computing device 100 can operate in a networked environment using logical connections to one or more other servers, personal computers (PCs), or another general network node.

[0029] Input device 150 can be one or more various input devices, such as a mouse, keyboard, trackball, voice input device, etc. Output device 160 can be one or more output devices, such as a monitor, speaker, printer, etc. Computing device 100 can also communicate as needed with one or more external devices (not shown) via communication unit 140. These external devices include storage devices, display devices, etc., and can communicate with one or more devices that enable user interaction with computing device 100, or with any device that enables computing device 100 to communicate with one or more other computing devices (e.g., network card, modem, etc.). Such communication can be performed via input / output (I / O) interface (not shown).

[0030] In some implementations, in addition to being integrated into a single device, some or all of the components of computing device 100 may be configured in the form of a cloud computing architecture. In a cloud computing architecture, these components can be remotely deployed and can work together to achieve the functionality described herein. In some implementations, cloud computing provides computing, software, data access, and storage services without requiring end users to know the physical location or configuration of the systems or hardware providing these services. In various implementations, cloud computing provides services over a wide area network (such as the Internet) using appropriate protocols. For example, cloud computing providers offer applications over a wide area network, and these applications can be accessed through a web browser or any other computing component. The software or components of the cloud computing architecture, along with the corresponding data, may be stored on servers at remote locations. Computing resources in a cloud computing environment may be consolidated at remote data center locations or they may be distributed. Cloud computing infrastructure can provide services through shared data centers, even if they appear as a single access point for users. Therefore, the components and functionality described herein can be provided from service providers at remote locations using a cloud computing architecture. Alternatively, they may also be provided from conventional servers, or they may be installed directly or otherwise on client devices.

[0031] The computing device 100 can be used to implement protein structure prediction in various implementations of this disclosure. For example, as shown in Figure 1, computing device 100 can receive input information 170 related to the target protein to be predicted via input device 150. Input information 170 may include the amino acid sequence 171 of the target protein, indicating the types and order of the amino acids that make up the target protein. Input information 170 may also include a fragment library 172 for the target protein. Fragment library 172 may assign multiple fragments with known structures, such as fragment 176, to each residue position of the target protein. As used herein, a residue position of the target protein (or simply "position") corresponds to an amino acid residue in the target protein. A fragment assigned to a residue position by the fragment library is also called a "template fragment". Such template fragments typically consist of multiple amino acid residues (such as 7 to 15 amino acid residues), thereby containing structural information about these amino acid residues.

[0032] The fragments assigned by fragment library 172 are selected, using a fragment library construction algorithm, from a large number of fragments obtained by cutting proteins with known structures. Fragment library 172 can be constructed for the target protein based on any suitable fragment library construction algorithm. Suitable fragment library construction algorithms can include, but are not limited to, NNMake, LRFragLib, Flib-Coevo, and DeepFragLib. In some implementations, fragment library 172 can be an initial fragment library constructed for the target protein using a fragment library construction algorithm, for example, the fragment library 310 shown in Figure 3. In some implementations, fragment library 172 can be a processed fragment library derived from an initial fragment library, for example, the processed fragment library 320 shown in Figure 3.

[0033] In some implementations, reference proteins with known structures can be used to evaluate different fragment library construction algorithms. The algorithm for constructing fragment library 172 can then be selected from among the different fragment library construction algorithms based on the evaluation, as will be described in detail below.

[0034] The computing device 100 (e.g., prediction module 122) can extract structural information from the fragment library 172, such as one or more structural properties of the assigned fragments. The computing device 100 can then provide a prediction 180 related to the structure of the target protein based on the extracted structural information. In some implementations, the prediction 180 may include a prediction 181 of the target protein's structure, such as a spatial coordinate representation of the major atoms in the target protein. Alternatively or additionally, in some implementations, the prediction 180 may include a prediction 182 of the target protein's structural properties, such as predictions of the torsion angles φ, ψ, and ω.

[0035] Although in the example of Figure 1 computing device 100 receives input information 170 from input device 150 and provides prediction results 180 via output device 160, this is merely illustrative and not intended to limit the scope of this disclosure. Computing device 100 may also receive input information 170 from other devices (not shown) via communication unit 140, and / or provide prediction results 180 externally via communication unit 140. Furthermore, in some implementations, instead of acquiring a pre-constructed fragment library, computing device 100 may utilize a fragment library construction algorithm to construct a fragment library 172 for the target protein.

[0036] Structural properties of proteins and fragments

[0037] As mentioned above, structural information, such as various structural properties of the fragments, is extracted from the fragment library 172 in the implementations of this disclosure. Additionally, in some implementations of this disclosure, the structural properties of the target protein can be predicted. For a better understanding of the implementations of this disclosure, the structural properties of proteins are described below with reference to Figure 2. The fragment 200 shown in Figure 2 includes residues 210, 220, and 230. Each residue includes N, Cα, and C atoms on the main chain and Cβ and O atoms on the side chain.

[0038] The structural properties of proteins can include the distances between multiple residues. These distances can include the distance between atoms of the same type in two residues, such as the Cα-Cα distance and the Cβ-Cβ distance. The Cα-Cα distance refers to the distance between a pair of Cα atoms (also known as the inter-residue Cα distance). The Cα-Cα distance can include the distance between adjacent pairs of Cα atoms or the distance between any non-adjacent pairs of Cα atoms, for example, the distance between any two of the Cα atoms 211, 221, and 231 in Figure 2. The Cβ-Cβ distance refers to the distance between a pair of Cβ atoms (also known as the inter-residue Cβ distance). The Cβ-Cβ distance can include the distance between adjacent pairs of Cβ atoms or the distance between any non-adjacent pairs of Cβ atoms, for example, the distance between any two of the Cβ atoms 212, 222, and 232 in Figure 2.

[0039] The structural properties of proteins can also include the orientation between multiple residues. Residue orientation can include the angles between multiple atoms in two residues, for example, the torsion angles φ and ω and the principal angles θ and τ shown in Figure 2. The torsion angle φ refers to the dihedral angle about the N-Cα bond. The torsion angle ω refers to the dihedral angle about the C-N bond. For example, for residues 220 and 210, the torsion angle φ is the dihedral angle about the chemical bond between N atom 224 and Cα atom 221. For residues 220 and 230, the torsion angle ω is the dihedral angle about the chemical bond between C atom 223 and N atom 234. The principal angle θ is the angle formed by the Cα-Cα-Cα atoms of adjacent residues. The principal angle τ is the dihedral angle about the Cα-Cα line between adjacent residues. For example, for residue 220, the principal angle θ is the angle at Cα atom 221 formed by Cα atom 221 and the Cα atoms 211 and 231 in adjacent residues 210 and 230, and the principal angle τ is the dihedral angle about the line connecting Cα atom 221 and Cα atom 231 (or 211).

[0040] The structural properties of proteins can also include other orientations between the protein's atoms. For example, structural properties can also include the torsion angle ψ within a residue, as shown in Figure 2. The torsion angle ψ refers to the dihedral angle about the Cα-C chemical bond within the residue. For example, for residue 220, the torsion angle ψ is the dihedral angle about the chemical bond between Cα atom 221 and C atom 223. Furthermore, structural properties of proteins can also include bond lengths and bond angles between consecutive atoms in the main chain. Bond lengths can include the N-Cα bond length within a residue, the Cα-C bond length within a residue, the C-N bond length between residues, and so on. Bond angles can include the N-Cα-C bond angle within a residue, the Cα-C-N bond angle between residues, the C-N-Cα bond angle between residues, and so on. Among the structural properties described above, the torsion angles φ, ψ, and ω represent angles between atoms of different types, while the principal angles θ and τ represent angles between atoms of the same type.
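As a hedged illustration of the angle definitions above, the sketch below computes a dihedral angle from four atom positions and a planar angle from three. The atom orderings noted in the comments for φ, ψ, ω, θ, and τ follow the common backbone convention and are an assumption here, since the text names only the central bond of each angle. The function names are hypothetical.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))

def bond_angle(a, b, c):
    """Planar angle in degrees at atom b formed by atoms a, b, c."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    u, v = a - b, c - b
    cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))

# Assumed atom orderings for residue i (common convention, not stated in the text):
#   phi   = dihedral(C[i-1], N[i],  CA[i],   C[i])      about the N-CA bond
#   psi   = dihedral(N[i],   CA[i], C[i],    N[i+1])    about the CA-C bond
#   omega = dihedral(CA[i],  C[i],  N[i+1],  CA[i+1])   about the C-N bond
#   theta = bond_angle(CA[i-1], CA[i], CA[i+1])
#   tau   = dihedral over four consecutive CA atoms
```

Using `arctan2` rather than `arccos` for the dihedral keeps the sign of the angle, so values fall in the range from -180 to 180 degrees.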

[0041] The structural properties described above are defined at the amino acid residue level. As mentioned above, a fragment comprises a continuous segment of amino acid residues arranged in a three-dimensional spatial structure. Therefore, fragments can also possess the structural properties described above, such as the Cα-Cα distance, the Cβ-Cβ distance, the torsion angles φ, ψ, and ω, and the principal angles θ and τ.

[0042] In addition to the structural properties described above, the structural properties of a fragment can also include secondary structure. The secondary structure of a fragment can be classified into four categories: predominantly helix (called H), predominantly β-sheet (called E), predominantly coil (called C), and other (called O). If more than half of the residues in a fragment have the corresponding secondary structure (H, E, or C), then the secondary structure of the fragment is defined as H, E, or C accordingly. Otherwise, the secondary structure of the fragment is defined as O.
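The majority rule above can be sketched as follows. The per-residue labels are assumed to be supplied already as a string or list of H, E, and C characters; how they are assigned to residues is outside this sketch, and the function name is hypothetical.

```python
from collections import Counter

def fragment_secondary_structure(residue_ss):
    """Classify a fragment as H, E, C, or O from its per-residue labels."""
    counts = Counter(residue_ss)
    for label in ("H", "E", "C"):
        # A class wins only if strictly more than half of the residues carry it.
        if counts[label] > len(residue_ss) / 2:
            return label
    return "O"
```

For a 7-residue fragment labelled `"HHHHEEC"`, four of seven residues are helical, so the fragment is classified as H; `"HHHEEEC"` has no majority class and is classified as O.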

[0043] In some implementations, computing device 100 can extract one or more of the structural properties described above from fragments assigned by fragment library 172 to predict the structure of a target protein, as will be described below with reference to Figure 3. In some implementations, computing device 100 can use structural properties extracted from fragment library 172 to predict one or more of the structural properties of a target protein, as will be described below with reference to Figures 4 to 6.

[0044] Evaluation of the fragment library

[0045] Fragment libraries constructed using different fragment library construction algorithms (hereinafter referred to as "algorithms") may exhibit varying performance. In some implementations, evaluation metrics can be used to assess the performance of fragment libraries constructed by different algorithms. Specifically, multiple reference fragment libraries can be constructed, using different algorithms, for a reference protein whose structure is known. Then, for each reference fragment library, the attribute values (also called "reference attribute values") of a structural property of the multiple reference fragments assigned by that library to a reference residue position of the reference protein can be determined, as well as the attribute value (also called the "actual attribute value") of the same structural property of the reference protein at that reference residue position. The difference between the reference attribute value and the actual attribute value for the same structural property can be used as an evaluation metric.

[0046] Evaluation metrics used to assess fragment libraries constructed using different algorithms typically include precision and coverage. Precision refers to the proportion of good fragments in the entire fragment library, while coverage is the proportion of residue positions spanned by at least one good fragment, where a good fragment is defined as a fragment whose root mean square deviation (RMSD) from the true fragment at that position is below a predetermined RMSD. In other words, a good fragment is one whose similarity to the true fragment at that position exceeds a threshold similarity.
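A minimal sketch of these two classic metrics follows. It assumes each template fragment is summarized by its start position, its length, and a precomputed RMSD to the true fragment; the `TemplateFragment` type, the function name, and the 1.5 cutoff are illustrative assumptions rather than values from the disclosure.

```python
from typing import NamedTuple

class TemplateFragment(NamedTuple):
    start: int    # first residue position covered (0-based)
    length: int   # number of residues in the fragment
    rmsd: float   # RMSD to the true fragment at the same position

def precision_and_coverage(fragments, n_residues, rmsd_cutoff=1.5):
    """Return (precision, coverage) for a fragment library."""
    good = [f for f in fragments if f.rmsd < rmsd_cutoff]
    # Precision: share of good fragments among all fragments in the library.
    precision = len(good) / len(fragments)
    # Coverage: share of residue positions spanned by at least one good fragment.
    covered = set()
    for f in good:
        covered.update(range(f.start, min(f.start + f.length, n_residues)))
    coverage = len(covered) / n_residues
    return precision, coverage
```

Both metrics depend only on the RMSD threshold, which is why they say nothing about how accurate the angles, distances, or secondary structure of the fragments are.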

[0047] Precision and coverage, as classic metrics, cannot reflect the accuracy of fragment structural properties. Therefore, in some implementations of this disclosure, evaluation metrics related to structural properties can also be used to comprehensively evaluate the fragment library. Such structural properties may include, for example, secondary structure, the torsion angles φ, ψ, and ω, the principal angles θ and τ, and the pairwise Cα-Cα and Cβ-Cβ distances. In the implementations of this disclosure, the evaluation metric can be defined as the accuracy or error of these structural properties at the fragment level.

[0048] In some implementations, the evaluation metric may include the accuracy of the secondary structure at the fragment level. As described above, the secondary structure of a fragment can be divided into H, E, C, and O. Therefore, the accuracy of the secondary structure at the fragment level can be expressed as:

[0049] $$\mathrm{ACC}_{SS}(FL) = \mathbb{E}_{p_i \in FL}\left[\mathrm{ACC}_{SS}(p_i)\right] \tag{1}$$

[0050] $$\mathrm{ACC}_{SS}(p_i) = \mathbb{E}_{f_i \in p_i}\left[\mathbb{1}\left(SS(f_i) = SS(f^*)\right)\right] \tag{2}$$

[0051] where $FL$ denotes the fragment library, $\mathbb{E}$ denotes the mathematical expectation, $p_i$ denotes all fragments at position $i$ (i.e., all fragments assigned to position $i$ by the fragment library), $f_i$ denotes a fragment at position $i$, $f^*$ denotes the corresponding true fragment of the reference protein, $SS(f)$ denotes the secondary structure of fragment $f$, and $\mathbb{1}(\cdot)$ equals 1 when its argument holds and 0 otherwise. Therefore, the secondary structure accuracy of the entire fragment library, $\mathrm{ACC}_{SS}(FL)$, is defined as the expected accuracy over all positions, where the accuracy at each position is defined as the expected accuracy of all template fragments at that position.
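The two-level expectation described here (first over the template fragments at a position, then over positions) can be sketched as below. The input layout, a list of `(fragment_labels, true_label)` pairs with one pair per position, and the function names are assumptions made for illustration.

```python
def acc_ss_position(fragment_labels, true_label):
    """Fraction of template fragments at one position whose class matches the true fragment."""
    matches = [label == true_label for label in fragment_labels]
    return sum(matches) / len(matches)

def acc_ss_library(library):
    """Mean of the per-position accuracies over all positions in the library."""
    per_position = [acc_ss_position(labels, true) for labels, true in library]
    return sum(per_position) / len(per_position)
```

Averaging per position first means a position with many template fragments does not outweigh a position with few.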

[0052] Alternatively or additionally, in some implementations, the evaluation metric may include errors in structural properties at the fragment level, such as errors in the angles φ, ψ, ω, θ, and τ. The errors of the angles φ, ψ, ω, θ, and τ can be expressed as:

[0053] $$\mathrm{err}_{ang}(f_i, f^*) = \frac{1}{N}\sum_{j=1}^{N}\left|ang_j^{f_i} - ang_j^{f^*}\right| \tag{3}$$

[0054] $$\mathrm{err}_{ang}(FL) = \mathbb{E}_{p_i \in FL}\,\mathbb{E}_{f_i \in p_i}\left[\mathrm{err}_{ang}(f_i, f^*)\right] \tag{4}$$

[0055] where $ang$ denotes any one of the angles φ, ψ, ω, θ, and τ; $|x|$ denotes the absolute value of $x$; $ang_j^{f_i}$ denotes the angle value of residue $j$ of fragment $f_i$; $N$ denotes the number of residues in fragment $f_i$; $ang_j^{f^*}$ denotes the true angle value of the corresponding residue $j$ in the reference protein; and $\mathrm{err}_{ang}(f_i, f^*)$ denotes the mean absolute error (MAE) of the corresponding angle for fragment $f_i$. Therefore, the error of each of the angles φ, ψ, ω, θ, and τ can be defined as the expected angle error over all positions, where the angle error at each position is defined as the expected angle error of all template fragments assigned to that position.
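A hedged sketch of the angle error follows: the per-fragment MAE over residues, then the expectation over template fragments and over positions. It uses the plain absolute difference as described; wrap-around of periodic angles is not handled because the text does not specify it. The input layout and function names are assumptions.

```python
import numpy as np

def fragment_angle_error(fragment_angles, true_angles):
    """MAE between a template fragment's angles and the true angles, residue by residue."""
    frag = np.asarray(fragment_angles, dtype=float)
    true = np.asarray(true_angles, dtype=float)
    return float(np.mean(np.abs(frag - true)))

def library_angle_error(library):
    """library: one (list_of_fragment_angle_arrays, true_angle_array) pair per position."""
    per_position = [
        np.mean([fragment_angle_error(frag, true) for frag in fragments])
        for fragments, true in library
    ]
    return float(np.mean(per_position))
```

The same function can be applied separately to each angle type, giving one error value per angle for a fragment library.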

[0056] Alternatively or additionally, in some implementations, the evaluation metric may include the error in the inter-residue distance, such as the error in the Cα-Cα distance and the error in the Cβ-Cβ distance. The errors in the Cα-Cα distance and the Cβ-Cβ distance can be expressed as:

[0057]

[0058] Where err_dist(f_i, f*) represents the MAE between the Cα-Cα distances or Cβ-Cβ distances within fragment f_i and the true Cα-Cα distances or true Cβ-Cβ distances within the corresponding fragment f* of the reference protein.

[0059] Equations (1) to (5) above describe the fragment-level evaluation metrics related to structural properties, including the accuracy of the secondary structure, the errors of the angles φ, ψ, ω, θ, and τ, the error of the Cα-Cα distance, and the error of the Cβ-Cβ distance.
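The fragment-level metrics described above share one pattern: a per-fragment score is averaged over the fragments at a position, and the result is averaged over positions. The following is a minimal sketch of that pattern. The function names and data layout are hypothetical, the expectations are taken as plain unweighted means, and angle differences are compared without wrap-around handling, following the plain absolute value described in the text.

```python
from statistics import mean


def ss_accuracy(frag_ss, true_ss):
    """1.0 if the fragment's secondary-structure class (H, E, C or O) matches the truth."""
    return 1.0 if frag_ss == true_ss else 0.0


def mae(pred, true):
    """Mean absolute error between two equal-length sequences of values."""
    if len(pred) != len(true):
        raise ValueError("sequences must have the same length")
    return mean(abs(p - t) for p, t in zip(pred, true))


def library_metric(positions, per_fragment):
    """Mean over positions of the mean over the fragments assigned to each position.

    positions: list of (fragments, true_fragment) pairs, one per residue position.
    per_fragment: function (fragment, true_fragment) -> float.
    """
    return mean(
        mean(per_fragment(frag, true) for frag in frags)
        for frags, true in positions
    )
```

The same `library_metric` helper can be reused with `mae` for angle values or for inter-residue distances, and with `ss_accuracy` for secondary-structure classes.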

[0060] In some implementations, one or more of these evaluation metrics can be used to evaluate fragment libraries built by different algorithms. Fragment libraries with higher secondary structure accuracy and smaller angle or distance errors can be considered to have better performance.

[0061] In some implementations, an algorithm for constructing a fragment library for a target protein can be selected based on evaluations of fragment libraries constructed by different algorithms. For example, multiple reference fragment libraries can be constructed for a reference protein using different algorithms. Then, for each reference fragment library, reference attribute values of the structural properties of the multiple reference fragments assigned by that reference fragment library to a reference residue position of the reference protein can be determined, such as the fragment angle values in equation (3). Since the reference protein has a known structure, the true attribute values of the structural properties of the reference protein at the reference residue position can also be determined, for example, the true angle values in equation (3). Then, the difference between the reference attribute values and the true attribute values can be determined, for example, by calculating the error according to equation (4). Next, an algorithm can be selected based on the determined differences.

[0062] As an example, fragment libraries FA, FB, and FC for a reference protein can be constructed according to algorithms A, B, and C, respectively. Then, for each of the fragment libraries FA, FB, and FC, the evaluation metrics defined by equations (2), (4), and (5) can be calculated. If the fragment library FA outperforms the fragment libraries FB and FC on a number of evaluation metrics that exceeds a threshold (e.g., 3), then algorithm A can be selected to construct the fragment library 172 for the target protein.
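The selection rule in this example can be sketched as follows. This is one possible reading, in which a library "outperforms" the others on a metric when it is strictly better than every other library, and it is selected when its number of such wins strictly exceeds the threshold. The function names and the metric names used in the usage note are hypothetical.

```python
def count_wins(metrics, candidate, higher_is_better):
    """Count the metrics on which `candidate` is strictly better than every other library."""
    wins = 0
    for name, better_high in higher_is_better.items():
        mine = metrics[candidate][name]
        others = [vals[name] for lib, vals in metrics.items() if lib != candidate]
        if all((mine > o) if better_high else (mine < o) for o in others):
            wins += 1
    return wins


def select_algorithm(metrics, higher_is_better, threshold=3):
    """Return the library with the most wins if that count exceeds `threshold`, else None."""
    wins = {lib: count_wins(metrics, lib, higher_is_better) for lib in metrics}
    best = max(wins, key=wins.get)
    return best if wins[best] > threshold else None
```

Here `metrics` maps each library name to its metric values, and `higher_is_better` marks accuracy-type metrics as `True` and error-type metrics as `False`.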

[0063] In this implementation, fragment-level evaluation metrics can be used to comprehensively assess the structural information contained in the fragment library, thereby evaluating the performance of different fragment library construction algorithms. In this way, a higher-performing fragment library construction algorithm can be selected to build a fragment library for the target protein, which helps improve the accuracy of protein structure prediction or structural property prediction.

[0064] Prediction of protein structure

[0065] In some implementations, structural information from the fragment library 172 for the target protein can be used to predict the structure of the target protein. For example, for each residue position of the target protein, the prediction module 122 can determine the attribute value of a structural property, such as the angle φ, of each fragment assigned to that residue position. The prediction module 122 can then determine, for each residue position of the target protein, a feature representation of the considered structural property, such as a probability distribution. Based on these feature representations, the prediction module 122 can predict the structure of the target protein.

[0066] In the following, the angles φ, ψ, θ, τ and the Cα-Cα and Cβ-Cβ distances are used as examples of structural properties to describe an exemplary process for predicting protein structure. However, it should be understood that this is merely exemplary and not intended to limit the scope of this disclosure; other structural properties can be used to predict protein structure.

[0067] Figure 3 shows a schematic diagram of a process 300 for predicting protein structure using structural information from a fragment library, according to some implementations of this disclosure. In the example of Figure 3, prediction module 122 can extract multiple structural properties for each fragment from a fragment library for the target protein, including the angles φ, ψ, θ, τ, as well as the Cα-Cα distance and the Cβ-Cβ distance.

[0068] The initial fragment library 310 constructed by the fragment library construction algorithm can assign multiple initial fragments to each position of the target protein, such as fragments 311, 312, and 313. As shown in Figure 3, the lengths of the initial fragments may vary. The “fragment length” described herein refers to the number of amino acid residues included in the fragment. For example, fragment 311 has 9 amino acid residues and a length of 9; fragment 312 has 7 amino acid residues and a length of 7; and fragment 313 has 7 amino acid residues and a length of 7.

[0069] In some implementations, prediction module 122 can obtain a processed fragment library 320 by processing the initial fragment library 310, such that the multiple fragments assigned to the same position in the processed fragment library 320 have the same length. Prediction module 122 can generate fragments with a predetermined number of residues from the initial fragments in initial fragment library 310. As an example, prediction module 122 can perform a smoothing operation on fragments whose length exceeds a threshold. This smoothing operation can use a sliding window to slice an initial fragment into a series of fragments each including the predetermined number of residues, so that all fragments assigned to the same position have the same length. In the example of Figure 3, the sliding window of the smoothing operation has a length of 7. Accordingly, the prediction module 122 can generate fragments 321, 322, and 323 of length 7 from the initial fragment 311 of length 9. It should be understood that the lengths of the fragments in the processed fragment library 320 shown in Figure 3 are merely exemplary and are not intended to limit the scope of this disclosure. In implementations of this disclosure, fragments assigned to residue positions can be processed to have any suitable length.
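The sliding-window smoothing operation can be sketched as below. The function name is hypothetical, and the sketch assumes a window stride of one residue, which matches the example of three fragments of length 7 obtained from one fragment of length 9. How each slice is mapped back to a residue position is not covered here.

```python
def slice_fragment(residues, window=7):
    """Slice a fragment into all contiguous sub-fragments of `window` residues (stride 1)."""
    if len(residues) < window:
        return []  # too short to yield a full window; handling of this case is left open
    return [residues[start:start + window] for start in range(len(residues) - window + 1)]
```

For a 9-residue fragment and a window of 7, this yields 9 - 7 + 1 = 3 fragments, and a fragment that already has 7 residues is returned unchanged as a single slice.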

[0070] Then, prediction module 122 can, for each residue position, determine the probability distribution of a structural property at that residue position as a feature representation of the structural property, based on the structures of the multiple fragments assigned to that position. In the example of Figure 3, prediction module 122 can determine, for each residue position, the probability distributions of the angles φ, ψ, θ, τ, the Cα-Cα distance dα, and the Cβ-Cβ distance dβ.

[0071] The following describes how a Gaussian mixture model can be used to depict the probability distribution of structural properties at each residue site. However, it should be understood that this is merely exemplary and not intended to limit the scope of this disclosure, and any suitable model can be used in implementations of this disclosure to depict the probability distribution of structural properties.

[0072] Some of the fragments assigned to residue position i by fragment library 320 may be good fragments, while others may be not good enough. As mentioned earlier, RMSD can be used to assess whether a fragment is good or not. Given that each fragment assigned by fragment library 320 can have a predicted RMSD value, this predicted RMSD value can be regarded as a confidence score for that fragment. For example, prediction module 122 can assign weights to each fragment at the same residue position i according to the following formula.

[0073]

[0074] Where F represents the set of all fragments at the same residue position i, f_i represents a fragment in the fragment set F, PredRMSD(f_i) represents the predicted RMSD value of fragment f_i, and T represents temperature.
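Equation (6) is not reproduced in this text. Given the quantities it is defined over (a fragment set, predicted RMSD values, and a temperature), a Boltzmann-style normalization is one plausible form, and the sketch below assumes it: each weight is proportional to exp(-PredRMSD / T), normalized over the fragments at the position. Both this form and the function name are assumptions, and the weighting actually used may differ.

```python
import math


def fragment_weights(pred_rmsd, temperature=1.0):
    """Boltzmann-style weights: a lower predicted RMSD gives a higher weight; weights sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    lowest = min(pred_rmsd)  # shift for numerical stability; the normalized result is unchanged
    scores = [math.exp(-(r - lowest) / temperature) for r in pred_rmsd]
    total = sum(scores)
    return [s / total for s in scores]
```

Under this assumed form, a higher temperature flattens the weights toward a uniform distribution, while a lower temperature concentrates them on the fragments with the lowest predicted RMSD.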

[0075] Equation (7) shows the probability density function of the Gaussian distribution:

[0076]

[0077] Where y is the structural property value weighted according to equation (6), and μ and σ² represent the mean and variance, respectively.
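For reference, the standard form of the Gaussian probability density function in terms of y, μ, and σ² is given below. The left-hand-side notation is an assumption, since equation (7) itself is not reproduced in this text.

```latex
p(y \mid \mu, \sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\!\left( -\frac{(y - \mu)^{2}}{2\sigma^{2}} \right) \tag{7}
```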

[0078] Then, prediction module 122 can establish a weighted Gaussian mixture model (wGMM) 330 for each structural property at each residue position. This weighted Gaussian mixture model 330 can have any suitable number of components, where the number of components refers to the number of Gaussian distributions in the weighted Gaussian mixture model. In implementations of this disclosure, the weighted Gaussian mixture models established for different residue positions can have the same or different numbers of components. In the example of Figure 3, the fragments assigned to each residue position have a length of 7, that is, 7 residues. Therefore, for each residue position, the prediction module 122 can establish 7 wGMMs for each of the angles φ, ψ, θ, and τ, and 21 wGMMs for each of the Cα-Cα distance dα and the Cβ-Cβ distance dβ, resulting in a total of 4 × 7 + 2 × 21 = 70 wGMMs. The example of Figure 3 shows the Gaussian distribution 331 of angle φ, the Gaussian distribution 332 of angle ψ, the Gaussian distribution 333 of angle θ, the Gaussian distribution 334 of angle τ, and the Gaussian distribution 335 of distance d (either of the Cα-Cα distance dα and the Cβ-Cβ distance dβ).
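One way to fit a weighted Gaussian mixture to the property values of the fragments at a position is expectation-maximization with per-sample weights, sketched below using NumPy. The fitting procedure is not specified in the text, so this is an illustrative assumption. The function name, the quantile-based initialization, and the variance floor of 1e-6 are all choices made for the sketch, and angle periodicity is ignored.

```python
import numpy as np


def fit_weighted_gmm_1d(values, sample_weights, n_components=2, n_iter=100):
    """Fit a 1D Gaussian mixture by EM, scaling each sample by its (fragment) weight."""
    y = np.asarray(values, dtype=float)
    w = np.asarray(sample_weights, dtype=float)
    w = w / w.sum()
    # Deterministic initialization: spread the component means over the data quantiles.
    mu = np.quantile(y, (np.arange(n_components) + 0.5) / n_components)
    var = np.full(n_components, y.var() + 1e-6)
    mix = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        dens = np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
        resp = mix * dens
        resp = resp / (resp.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: standard EM updates with responsibilities scaled by sample weights.
        wr = w[:, None] * resp
        nk = wr.sum(axis=0) + 1e-12
        mix = nk / nk.sum()
        mu = (wr * y[:, None]).sum(axis=0) / nk
        var = (wr * (y[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return mix, mu, var
```

The function returns the mixing proportions, means, and variances of the components. With a single component, the fitted mean reduces to the weighted mean of the input values.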

[0079] In this way, prediction module 122 can determine the Gaussian distribution of the considered structural property at each residue position as a feature representation, which is also referred to herein as the "first feature representation". Then, prediction module 122 can generate a potential function corresponding to the structural property based on the Gaussian distribution at multiple residue positions of the target protein.

[0080] In some implementations, the negative log-likelihood function can be used to convert the Gaussian distribution into a potential function. Understandably, since wGMM is specific to the target protein, the potential function derived from the fragment in this way is tailored to the target protein. Equations (8) and (9) show examples of potential functions for structural properties:

[0081]

[0082]

[0083] Equation (8) is the potential function corresponding to the angle φ, and Equation (9) is the potential function corresponding to the Cβ-Cβ distance, where x represents the predicted structure of the target protein, K is the number of components in the wGMM, and w, μ, and σ are the fitted parameters of each component in the wGMM. In Equation (8), φ_i(x) is the angle φ at the i-th residue in structure x, f_i represents the fragment assigned to the i-th residue, and m represents the number of wGMMs established for the angle φ with respect to the i-th residue (e.g., the 7 mentioned above). In Equation (9), d_j1,j2(x) is the distance between the Cβ atom of the j1-th residue and the Cβ atom of the j2-th residue in f_i, and n represents the number of wGMMs established for the Cβ-Cβ distance with respect to the i-th residue (e.g., the 21 mentioned above). Potential functions corresponding to the angles ψ, θ, and τ can be defined in a manner similar to Equation (8), and the potential function corresponding to the Cα-Cα distance can be defined in a manner similar to Equation (9). In this way, when six structural properties are extracted from the fragments, a total of six potential functions can be defined, one for each structural property.
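The conversion of a fitted mixture into a potential by taking the negative log-likelihood can be illustrated as follows. This is a sketch of the general idea only: the exact indexing over residues, fragment offsets, and residue pairs in equations (8) and (9) is not reproduced in this text, so the hypothetical `total_potential` helper simply sums one term per (value, mixture) pair supplied by the caller.

```python
import math


def gmm_potential(value, weights, means, sigmas):
    """Negative log-likelihood of `value` under a 1D Gaussian mixture with K components."""
    density = sum(
        w * math.exp(-0.5 * ((value - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, mu, s in zip(weights, means, sigmas)
    )
    return -math.log(max(density, 1e-300))  # floor avoids log(0) far from every component


def total_potential(values, mixtures):
    """Sum per-term potentials; mixtures[t] is the (weights, means, sigmas) tuple of term t."""
    return sum(gmm_potential(v, *mix) for v, mix in zip(values, mixtures))
```

A structure whose angles or distances fall near the component means receives a low potential, so minimizing the potential pulls the predicted structure toward the values favored by the fragments.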

[0084] After determining the potential functions corresponding to the various structural properties, the prediction module 122 can determine the objective function for the structure prediction model 340 based on the determined potential functions. The structure prediction model 340 can be configured to predict the structure of a protein by minimizing the objective function. For example, the structure prediction model 340 can be a gradient descent-based protein folding framework.

[0085] In the case where the structural properties under consideration include the angles φ, ψ, θ, τ, the Cα-Cα distance dα, and the Cβ-Cβ distance dβ, the combined potential function L_FL(x) can be expressed as:

[0086]

[0087] Where L_FL(x) is defined as the weighted sum of six potential functions; L_φ(x), L_ψ(x), L_θ(x), L_τ(x), L_Cα(x), and L_Cβ(x) represent the potential functions for the angles φ, ψ, θ, τ, the Cα-Cα distance dα, and the Cβ-Cβ distance dβ, respectively; and w_φ, w_ψ, w_θ, w_τ, w_Cα, and w_Cβ represent the weights of the potential functions for the angles φ, ψ, θ, τ, the Cα-Cα distance dα, and the Cβ-Cβ distance dβ, respectively. The weights in Equation (10) can be considered as hyperparameters and can be tuned on a reference dataset (such as CASP12FM), which includes information about reference proteins with known structures. For example, the weights in Equation (10) can be tuned on the reference dataset by maximizing the average template modeling (TM) score of the predicted structures.
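Since the combined potential is described as a weighted sum of six potential functions, equation (10) can be written out as below. The term for the torsion angle φ is inferred from the stated count of six properties, and the subscript notation is an assumption of this sketch.

```latex
L_{FL}(x) = w_{\phi} L_{\phi}(x) + w_{\psi} L_{\psi}(x) + w_{\theta} L_{\theta}(x) + w_{\tau} L_{\tau}(x) + w_{C\alpha} L_{C\alpha}(x) + w_{C\beta} L_{C\beta}(x) \tag{10}
```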

[0088] The combined potential function shown in equation (10) can be used as part of the objective function. The objective function may also include one or more geometric potential functions to constrain the geometry of the target protein, such that the predicted structure is biophysically plausible. Thus, prediction module 122 can determine the objective function for structure prediction model 340. Next, prediction module 122 can utilize structure prediction model 340 to generate the predicted structure 350 of the target protein by minimizing the objective function. For example, prediction module 122 can compute and minimize the objective function at each step of the gradient descent process to update the structure of the target protein.
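The minimization step can be illustrated with a toy gradient descent loop. This is a sketch under strong simplifying assumptions: the "structure" is a short list of numbers, the fragment-derived term and the geometric term are simple quadratics chosen for illustration, and the gradient is estimated by central differences instead of the automatic differentiation a real folding framework would typically use. All names are hypothetical.

```python
def numerical_gradient(objective, x, eps=1e-6):
    """Central-difference gradient of `objective` at point `x` (a list of floats)."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((objective(hi) - objective(lo)) / (2.0 * eps))
    return grad


def minimize(objective, x0, learning_rate=0.1, steps=200):
    """Plain gradient descent: evaluate the gradient at each step and move against it."""
    x = list(x0)
    for _ in range(steps):
        grad = numerical_gradient(objective, x)
        x = [xi - learning_rate * gi for xi, gi in zip(x, grad)]
    return x


def toy_objective(x, targets=(1.0, -2.0), geometry_weight=0.1):
    """Stand-in objective: a fragment-like term plus a weighted geometric-like term."""
    fragment_term = sum((xi - t) ** 2 for xi, t in zip(x, targets))
    geometry_term = geometry_weight * (x[0] - x[1]) ** 2
    return fragment_term + geometry_term
```

Calling `minimize(toy_objective, [0.0, 0.0])` moves the variables toward values that balance the two terms, in the same way that the combined objective balances the fragment-derived potential against geometric constraints.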

[0089] The above describes an example implementation of predicting protein structure using structural information from a fragment library. In this implementation, the structural features of fragments in the library are explicitly represented using probability distributions of structural properties, and a protein-specific potential function is determined based on these probability distributions. This potential function derived from the fragment library can then be used in structure prediction models, such as gradient descent-based protein folding models, to predict protein structure. This method using the potential function derived from the fragment library outperforms methods that do not use the potential function derived from the fragment library in several aspects (e.g., average TM score of the best predicted structure (decoy), number of best predicted structures with TM scores greater than 0.5, etc.). Therefore, structural information from the fragment library can facilitate structure prediction models in finding more realistic structures for target proteins.

[0090] Prediction of protein structural properties

[0091] In the implementation described above, the structure of the protein is predicted using an explicit representation of the structural information from the fragment library. Alternatively or additionally, in some implementations, the structural information from the fragment library 172 for the target protein can be used to predict the structural properties of the target protein. For example, for each residue position, the prediction module 122 can determine multiple structural properties of each of the multiple fragments assigned to that residue position, such as two or more of the angles φ, ψ, ω, bond lengths, and bond angles. Then, prediction module 122 can utilize a trained feature encoder to encode the multiple structural properties determined for each of the multiple fragments, thereby determining a feature representation of the structure of those fragments. Prediction module 122 can predict the structural properties of the target protein based on the feature representation determined for each residue position and the feature representation of the amino acid sequence (also referred to herein as a “second feature representation”).

[0092] Figure 4 shows a schematic diagram of a process 400 for predicting the structural properties of a protein using structural information from a fragment library, according to some implementations of this disclosure. In the example of Figure 4, a fragment library attribute set 410 is first extracted from fragment library 172. Specifically, for each residue position of the target protein, prediction module 122 can select a predetermined number F of fragments from the fragments assigned to that residue position by fragment library 172, where F is a positive integer, such as 50. For example, prediction module 122 can select the F fragments with the lowest predicted RMSD values from the assigned fragments. For each residue position, prediction module 122 can extract multiple structural attributes for each of the F fragments, such as one-hot codes of residue secondary structures (e.g., "0001" for H, "0010" for E, "0100" for C, "1000" for O), the sine and cosine values of the torsion angles φ, ψ, and ω, the bond lengths between the C-N, N-Cα, and Cα-C atoms of each residue, and the Cα-C-N, C-N-Cα, and N-Cα-C bond angles of each residue. If the lengths of the F fragments are different, all F fragments can be padded to a length of a predetermined number R of residues, where R is a positive integer, such as 15. In this way, the prediction module 122 can determine the fragment library attribute set 410 from the fragment library 172 for the target protein. The fragment library attribute set 410 can be represented as an L×F×R×D tensor, where L represents the length of the target protein, i.e., the number of amino acid residues, and D represents the dimension of the structural attributes extracted from the fragments.
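The construction of the L×F×R×D attribute tensor can be sketched as below using NumPy. The sketch is reduced for brevity: it keeps only the secondary-structure one-hot code and the sine and cosine of three torsion angles (so D = 10) and omits bond lengths and bond angles. Zero padding, truncation of fragments longer than R, and the input layout are assumptions made for the sketch, since the text states only that fragments are padded to R residues.

```python
import math

import numpy as np

# One-hot codes as listed in the text: "0001" for H, "0010" for E, "0100" for C, "1000" for O.
SS_ONE_HOT = {"H": [0, 0, 0, 1], "E": [0, 0, 1, 0], "C": [0, 1, 0, 0], "O": [1, 0, 0, 0]}


def residue_features(ss, phi, psi, omega):
    """One-hot secondary structure plus sine and cosine of each torsion angle (radians)."""
    feats = list(SS_ONE_HOT[ss])
    for angle in (phi, psi, omega):
        feats += [math.sin(angle), math.cos(angle)]
    return feats


def build_attribute_tensor(library, num_fragments=50, max_len=15):
    """library[i] lists the fragments of position i as (pred_rmsd, [(ss, phi, psi, omega), ...])."""
    dim = len(residue_features("H", 0.0, 0.0, 0.0))
    tensor = np.zeros((len(library), num_fragments, max_len, dim))
    for i, fragments in enumerate(library):
        # Keep the fragments with the lowest predicted RMSD values.
        best = sorted(fragments, key=lambda frag: frag[0])[:num_fragments]
        for k, (_, residues) in enumerate(best):
            for r, residue in enumerate(residues[:max_len]):
                tensor[i, k, r] = residue_features(*residue)
    return tensor
```

Positions with fewer than F fragments, and fragments shorter than R residues, are left as zeros in this sketch.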

[0093] The fragment library attribute set 410 can then be input into the trained feature encoder 420. The feature encoder 420 can generate a fragment library feature set 430 by encoding the fragment library attribute set 410. The fragment library feature set 430 can include encoded structural attributes for each residue position. That is, the feature encoder 420 can obtain the structural features at each residue position based on the structural attributes of multiple fragments.

[0094] Reference is now made to Figure 5, which shows a schematic diagram of a process 500 in which structural information of a fragment library is encoded using the feature encoder 420 according to some implementations of this disclosure. As shown in Figure 5, the feature encoder 420 has a hierarchical structure comprising a three-level encoding process. First, in the convolution process 510, the fragment library attribute set 410, represented by an L×F×R×D tensor, is convolved. As an example, each building block constituting the convolutional network may include two convolutional layers for performing convolution operations along the third dimension (the dimension of the fragment length) of the input L×F×R×D tensor. Furthermore, an exponential linear unit (ELU) activation layer may be employed between the two convolutional layers. These two convolutional layers can have convolutional kernels of any suitable size and any suitable number of filters. If the number of filters used in the convolution process 510 is d, then the dimension of the tensor output by the convolution process 510 is L×F×R×d, as shown in Figure 5. The function of the convolution process 510 is to learn the interactions between adjacent residues in a fragment. For this purpose, a certain number (e.g., 8) of building blocks can be stacked using skip connections. The convolution process 510 described above is merely exemplary and is not intended to limit the scope of this disclosure. In implementations of this disclosure, the convolution process 510 can be implemented using any suitable method.

[0095] After performing the convolution process 510, the multiple structural attributes are converted into implicit representations. Next, in the selection process 520, for each fragment at each residue position, the implicit representation of one residue from that fragment can be selected. For example, given that the index of the first residue of a fragment corresponds to the residue position of the target protein, the implicit representation of the first residue of each fragment can be selected. Thus, the feature map output by the selection process 520 has dimensions of L×F×d, as shown in Figure 5.

[0096] Finally, in the averaging process 530, an output tensor of dimension L×d can be obtained as the fragment library feature set 430 by averaging all F fragments at the same residue position. In the fragment library feature set 430, the 1×d vector corresponding to each residue position can be regarded as the feature representation of the fragment determined for that residue position.
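The selection and averaging steps reduce the tensor shape from L×F×R×d to L×F×d and then to L×d. A minimal NumPy sketch of these two steps follows. The function name is hypothetical, and the sketch takes an unweighted mean over all F fragment slots without special handling of padded fragments.

```python
import numpy as np


def select_and_average(hidden, residue_index=0):
    """Reduce a (L, F, R, d) array to (L, d): pick one residue per fragment, then average."""
    selected = hidden[:, :, residue_index, :]  # selection step: shape (L, F, d)
    return selected.mean(axis=1)               # averaging step over fragments: shape (L, d)
```

With `residue_index=0`, the first residue of each fragment is selected, matching the example in which the first residue of a fragment corresponds to the residue position of the target protein.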

[0097] Continuing to refer to Figure 4, the fragment library feature set 430 of dimension L×d is input to a trained attribute predictor 440. The attribute predictor 440 also receives a sequence feature set 450 of the amino acid sequence of the target protein. The sequence feature set 450 may include at least one of the following: the primary sequence of the target protein, a position-specific scoring matrix (PSSM) of homologous proteins, and pairwise statistics derived from direct coupling analysis (DCA). For example, the fragment library feature set 430 output by the feature encoder 420, along with the one-hot encoding of the primary sequence of the target protein and the PSSM, can be converted into a two-dimensional feature representation by horizontal and vertical tiling, and then concatenated with the pairwise statistics to form the total input to the attribute predictor 440.
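Tiling per-residue features into a pairwise (two-dimensional) representation is commonly done by repeating each residue's feature vector along the rows and along the columns of an L×L grid. The NumPy sketch below assumes this reading of "horizontal and vertical tiling"; the function name and the channel ordering of the output are hypothetical.

```python
import numpy as np


def tile_and_concat(per_residue, pairwise):
    """Combine (L, c1) per-residue features with (L, L, c2) pairwise features.

    Returns an array of shape (L, L, 2 * c1 + c2).
    """
    length = per_residue.shape[0]
    rows = np.repeat(per_residue[:, None, :], length, axis=1)  # features of residue i at [i, j]
    cols = np.repeat(per_residue[None, :, :], length, axis=0)  # features of residue j at [i, j]
    return np.concatenate([rows, cols, pairwise], axis=-1)
```

In this sketch, `per_residue` would hold the concatenated fragment library features, sequence one-hot encoding, and PSSM columns for each residue, and `pairwise` would hold the DCA-derived statistics.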

[0098] The attribute predictor 440 is trained to predict the structural attributes 460 of a target protein based on the feature representations of fragments in a fragment library and the feature representations of amino acid sequences, for example, the torsion angles φ, ψ, and ω shown in Figure 4, the bond lengths between the C-N, N-Cα, and Cα-C atoms of each residue, the Cα-C-N, C-N-Cα, and N-Cα-C bond angles of each residue, and the Cα-Cα and Cβ-Cβ distances.

[0099] Reference is now made to Figure 6, which shows a schematic diagram of a process 600 for predicting the structural properties of a protein using the attribute predictor 440 according to some implementations of this disclosure. The fragment library feature set 430 and the sequence feature set 450 input to the attribute predictor 440 can first be processed by a preprocessing block 610. As an example, the preprocessing block 610 may include two-dimensional convolutional layers, batch normalization layers, ELU activation layers, and the like. Following the preprocessing block 610 is a two-dimensional residual neural network with multiple (e.g., 30) residual blocks. As an example, each residual block shown in Figure 6 includes two convolutional layers 621, 625 and two ELU activation layers 623, 627. Furthermore, to prevent overfitting, batch normalization layers 622, 626 can be used after the convolutional layers 621, 625, and a dropout layer 624, for example with a dropout rate of 0.15, can be used.

[0100] A symmetry operation 630 is performed on the output of the residual network. The output of the symmetry operation 630 is then input into two respective branches to predict different structural properties. The left branch shown in Figure 6 includes a pooling layer 640, which converts the two-dimensional feature map output by the symmetry operation 630 into a one-dimensional feature vector. This one-dimensional feature vector is then input to a fully connected layer 650, which outputs the 1D structural properties of each residue of the target protein, such as the torsion angles φ, ψ, ω, and the bond lengths and bond angles between consecutive main-chain atoms. The right branch shown in Figure 6 directly predicts the Cα-Cα and Cβ-Cβ distances using the fully connected layer 660. In this example, the attribute predictor 440 can be implemented as a multi-task predictor to simultaneously predict multiple structural properties of the target protein.
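The symmetry operation and the pooling step can be sketched with NumPy as below. The text does not specify how symmetry is enforced or which pooling is used, so averaging the map with its transpose and mean pooling over one residue axis are both assumptions chosen for illustration, as are the function names.

```python
import numpy as np


def symmetrize(feature_map):
    """Average a (L, L, c) feature map with its transpose over the two residue axes."""
    return 0.5 * (feature_map + feature_map.transpose(1, 0, 2))


def pool_to_1d(feature_map):
    """Mean-pool over the second residue axis: (L, L, c) -> (L, c)."""
    return feature_map.mean(axis=1)
```

Symmetrizing ensures that the features for the residue pair (i, j) equal those for (j, i), which suits symmetric targets such as inter-residue distances, while pooling yields one feature vector per residue for the per-residue outputs.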

[0101] Continuing to refer to Figure 4, during training, the feature encoder 420 and the attribute predictor 440 can be jointly trained using a training dataset. The sum of the MAEs of all output structural attributes can be used as the loss function.

[0102] The above describes an example implementation of predicting protein structural properties using structural information from a fragment library. In this implementation, features generated by a feature encoder are used to implicitly represent the structural features of fragments in the fragment library. These implicit features derived from the fragment library can then be fed into an attribute predictor to predict one or more structural properties of the protein. Compared to methods that do not use implicit features derived from the fragment library, this method can improve the accuracy of structural property prediction.

[0103] Example methods and example implementations

[0104] Figure 7 shows a flowchart of a method 700 for protein structure prediction according to some implementations of the present disclosure. Method 700 can be implemented by computing device 100; for example, it can be implemented at prediction module 122 in memory 120 of computing device 100.

[0105] As shown in Figure 7, at box 710, computing device 100 determines, from a fragment library for the target protein, multiple fragments for each residue position among multiple residue positions of the target protein. Each fragment includes multiple amino acid residues.

[0106] In some implementations, in order to determine multiple fragments, computing device 100 may determine an initial fragment allocated to each residue position by a fragment library; and from the initial fragment, generate fragments having a predetermined number of residues as multiple fragments.

[0107] At box 720, computing device 100 generates a first feature representation of the structure of multiple fragments for each residue position. For example, computing device 100 may determine a Gaussian distribution of structural properties at each residue position, or it may generate a fragment library feature set 430. At box 730, computing device 100 determines a prediction of at least one of the structure and structural properties of the target protein based on the corresponding first feature representations generated for the multiple residue positions.

[0108] In some implementations, to generate a first feature representation, computing device 100 may, for each residue position, determine the attribute value of the structural attribute of each fragment based on the structure of multiple fragments; and, based on the attribute values of the structural attributes of multiple fragments, determine the probability distribution of the structural attribute at each residue position as the first feature representation. In some implementations, to determine a prediction of the structure of a target protein, computing device 100 may, based on the corresponding probability distribution at multiple residue positions, generate a potential function corresponding to the structural attribute; based on the potential function, determine an objective function for a structure prediction model used to predict the protein's structure; and, using the structure prediction model, determine a prediction of the spatial structure of the target protein by minimizing the objective function.

[0109] In some implementations, structural properties may include at least one of the following: angles between atoms of different types, angles between atoms of the same type, or distances between atoms of the same type.

[0110] In some implementations, to generate a first feature representation, computing device 100 may determine multiple structural attributes for each of the multiple fragments based on the structure of the multiple fragments at each residue position; and determine the first feature representation by encoding the multiple structural attributes of each of the multiple fragments using a trained feature encoder. In some implementations, to determine a prediction of the structure of a target protein, computing device 100 may determine a second feature representation of the amino acid sequence of the target protein, the amino acid sequence indicating the residue type at each of the multiple residue positions; and determine a prediction of the structural attributes of the target protein based on the corresponding first and second feature representations determined for the multiple residue positions using a trained attribute predictor.

[0111] In some implementations, method 700 further includes: for each of a plurality of reference fragment libraries constructed for a reference protein based on different algorithms, determining reference attribute values for the structural properties of a plurality of reference fragments assigned by each reference fragment library to reference residue positions of the reference protein; determining the true attribute values for the structural properties of the reference protein at the reference residue positions; and determining the differences between the reference attribute values and the true attribute values. Method 700 also includes selecting a target algorithm from a plurality of algorithms for constructing fragment libraries for the target protein based on the corresponding differences determined for the plurality of reference fragment libraries.

[0112] As can be seen from the above description, the protein structure prediction scheme implemented according to this disclosure can utilize structural information from a fragment library to supplement and improve the information used in protein structure prediction. In this way, the accuracy of protein structure prediction can be improved.

[0113] The following are some example implementations of this disclosure.

[0114] In one aspect, this disclosure provides a computer-implemented method. The method includes: determining, from a fragment library for a target protein, a plurality of fragments for each of a plurality of residue positions of the target protein, each fragment comprising a plurality of amino acid residues; generating a first feature representation of the structure of the plurality of fragments for each residue position; and determining a prediction of the structure of the target protein based on the corresponding first feature representations generated for the plurality of residue positions.

[0115] In some implementations, generating the first feature representation includes: for each residue position, determining the attribute value of a structural attribute of each of the plurality of fragments based on the structure of the fragments; and determining the probability distribution of the structural attribute at each residue position as the first feature representation based on the attribute value of the structural attribute of the plurality of fragments.

[0116] In some implementations, determining the prediction of the structure of the target protein includes: generating a potential function corresponding to the structural properties based on the corresponding probability distributions at the plurality of residue positions; determining an objective function for a structure prediction model to predict the structure of the protein based on the potential function; and using the structure prediction model to determine the prediction of the structure of the target protein by minimizing the objective function.

[0117] In some implementations, determining the plurality of fragments includes: determining an initial fragment allocated by the fragment library to each residue position; and generating, from the initial fragment, a fragment having a predetermined number of residues as the plurality of fragments.

[0118] In some implementations, the structural properties include at least one of the following: angles between atoms of different types, angles between atoms of the same type, or distances between atoms of the same type.
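For illustration, a distance and an angle of the kind listed above can be computed from three-dimensional atom coordinates as follows. The choice of atoms and the coordinates are hypothetical; the disclosure does not specify them in this paragraph.

```python
import math


def distance(a, b):
    """Euclidean distance between two atoms given as (x, y, z) tuples."""
    return math.dist(a, b)


def angle(a, b, c):
    """Angle in degrees at vertex b formed by the atoms a, b and c."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cos_angle = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


# Hypothetical coordinates, chosen only to make the geometry easy to check.
d = distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
theta = angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```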

[0119] In some implementations, generating the first feature representation includes: determining multiple structural properties of each of the plurality of fragments for each residue position, based on the structure of the plurality of fragments; and determining the first feature representation by encoding the multiple structural properties of each of the plurality of fragments using a trained feature encoder.
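A minimal sketch of the encoding step is given below. It assumes, for illustration only, that the feature encoder is a single linear layer with a ReLU activation applied to the property vector of each fragment, followed by mean pooling over the fragments at one residue position. The architecture of the trained feature encoder is not stated in this paragraph, and the weights shown are placeholders rather than trained parameters.

```python
def encode_fragments(fragment_properties, weights, bias):
    """Encode each fragment's property vector, then mean-pool over fragments.

    fragment_properties: one property vector per fragment.
    weights, bias: parameters of a hypothetical linear layer.
    """
    encoded = []
    for props in fragment_properties:
        row = [
            max(0.0, sum(w * p for w, p in zip(w_row, props)) + b)
            for w_row, b in zip(weights, bias)
        ]
        encoded.append(row)
    n = len(encoded)
    return [sum(col) / n for col in zip(*encoded)]


# Two fragments with two structural properties each (hypothetical values).
properties = [[1.0, 2.0], [3.0, -4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weights, not trained
first_feature = encode_fragments(properties, identity, [0.0, 0.0])
```

Mean pooling makes the result independent of the order in which the fragments are listed, which is one reason it is a common choice for set-like inputs.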

[0120] In some implementations, determining a prediction of the structure of the target protein includes: determining a second feature representation of the amino acid sequence of the target protein, the amino acid sequence indicating the residue type at each of the plurality of residue positions; and using a trained attribute predictor, determining a prediction of the structural properties of the target protein based on the corresponding first feature representation and second feature representation determined for the plurality of residue positions.
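The sketch below illustrates how the two feature representations might be combined before being passed to an attribute predictor. It assumes a one-hot encoding of the 20 standard amino acids as the second feature representation and simple per-position concatenation; both are assumptions of this example, and the predictor itself is omitted. The function names and sample values are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def one_hot_sequence(sequence):
    """One-hot encode the residue type at each residue position."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[index[aa]] = 1.0
        rows.append(row)
    return rows


def predictor_inputs(first_features, sequence):
    """Concatenate fragment-derived and sequence-derived features per position."""
    second_features = one_hot_sequence(sequence)
    if len(first_features) != len(second_features):
        raise ValueError("one first feature vector is needed per residue position")
    return [f + s for f, s in zip(first_features, second_features)]


# Hypothetical 2-dimensional first feature representations for 3 positions.
inputs = predictor_inputs([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], "ACD")
```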

[0121] In some implementations, the plurality of structural properties include at least one of the following: angles between atoms of different types, angles between atoms of the same type, or distances between atoms of the same type.

[0122] In some implementations, the method further includes: for each of a plurality of reference fragment libraries constructed for a reference protein based on different algorithms, determining reference attribute values of a structural property of a plurality of reference fragments assigned by that reference fragment library to reference residue positions of the reference protein, determining true attribute values of the structural property of the reference protein at the reference residue positions, and determining differences between the reference attribute values and the true attribute values; and selecting, based on the respective differences determined for the plurality of reference fragment libraries, a target algorithm from the different algorithms for constructing the fragment library for the target protein.
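The selection step can be sketched as follows, assuming that the difference is measured as a mean absolute difference and that the algorithm with the smallest average difference is selected. The difference measure, the algorithm names `algo_a` and `algo_b`, and all values are hypothetical.

```python
def mean_abs_difference(reference_values, true_value):
    """Mean absolute difference between fragment values and the true value."""
    return sum(abs(v - true_value) for v in reference_values) / len(reference_values)


def select_algorithm(libraries, true_values):
    """Pick the algorithm whose reference fragment library is closest to the truth.

    libraries: {algorithm name: {residue position: [reference attribute values]}}
    true_values: {residue position: true attribute value}
    """
    scores = {}
    for name, per_position in libraries.items():
        diffs = [
            mean_abs_difference(values, true_values[pos])
            for pos, values in per_position.items()
        ]
        scores[name] = sum(diffs) / len(diffs)
    return min(scores, key=scores.get), scores


true_values = {0: 5.0, 1: 3.8}
libraries = {
    "algo_a": {0: [5.1, 4.9], 1: [3.8, 4.0]},
    "algo_b": {0: [6.0, 4.5], 1: [3.0, 3.5]},
}
best, scores = select_algorithm(libraries, true_values)
```

Here `best` names the algorithm that would then be used to construct the fragment library for the target protein.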

[0123] In another aspect, this disclosure provides an electronic device. The electronic device includes: a processing unit; and a memory coupled to the processing unit and containing instructions stored thereon, the instructions, when executed by the processing unit, causing the device to perform actions including: determining, from a fragment library for a target protein, a plurality of fragments for each of a plurality of residue positions of the target protein, each fragment comprising a plurality of amino acid residues; generating, for each residue position, a first feature representation of the structures of the plurality of fragments; and determining a prediction of the structure of the target protein based on the respective first feature representations generated for the plurality of residue positions.

[0124] In some implementations, generating the first feature representation includes: for each residue position, determining an attribute value of a structural property of each of the plurality of fragments based on the structures of the plurality of fragments; and determining, based on the attribute values of the structural property of the plurality of fragments, a probability distribution of the structural property at each residue position as the first feature representation.

[0125] In some implementations, determining the prediction of the structure of the target protein includes: generating, based on the respective probability distributions at the plurality of residue positions, a potential function corresponding to the structural property; determining, based on the potential function, an objective function for a structure prediction model that predicts protein structure; and determining the prediction of the structure of the target protein by using the structure prediction model to minimize the objective function.

[0126] In some implementations, determining the plurality of fragments includes: determining initial fragments assigned by the fragment library to each residue position; and generating, from the initial fragments, fragments each having a predetermined number of residues as the plurality of fragments.

[0127] In some implementations, the structural properties include at least one of the following: angles between atoms of different types, angles between atoms of the same type, or distances between atoms of the same type.

[0128] In some implementations, generating the first feature representation includes: determining multiple structural properties of each of the plurality of fragments for each residue position, based on the structure of the plurality of fragments; and determining the first feature representation by encoding the multiple structural properties of each of the plurality of fragments using a trained feature encoder.

[0129] In some implementations, determining a prediction of the structure of the target protein includes: determining a second feature representation of the amino acid sequence of the target protein, the amino acid sequence indicating the residue type at each of the plurality of residue positions; and using a trained attribute predictor, determining a prediction of the structural properties of the target protein based on the corresponding first feature representation and second feature representation determined for the plurality of residue positions.

[0130] In some implementations, the plurality of structural properties include at least one of the following: angles between atoms of different types, angles between atoms of the same type, or distances between atoms of the same type.

[0131] In some implementations, the actions further include: for each of a plurality of reference fragment libraries constructed for a reference protein based on different algorithms, determining reference attribute values of a structural property of a plurality of reference fragments assigned by that reference fragment library to reference residue positions of the reference protein, determining true attribute values of the structural property of the reference protein at the reference residue positions, and determining differences between the reference attribute values and the true attribute values; and selecting, based on the respective differences determined for the plurality of reference fragment libraries, a target algorithm from the different algorithms for constructing the fragment library for the target protein.

[0132] In another aspect, this disclosure provides a computer program product tangibly stored in a non-transitory computer storage medium and including machine-executable instructions that, when executed by a device, cause the device to perform the method described above.

[0133] In another aspect, this disclosure provides a computer-readable medium having machine-executable instructions stored thereon, which, when executed by a device, cause the device to perform the method described above.

[0134] The functions described above in this document can be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include: field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), and so on.

[0135] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may be executed entirely on a machine, partly on a machine, as a standalone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.

[0136] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0137] Furthermore, although the operations are described in a specific order, this should not be understood as requiring that such operations be performed in the specific order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired result. In certain circumstances, multitasking and parallel processing may be advantageous. Similarly, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features described in the context of a single implementation may also be implemented separately or in any suitable sub-combination in multiple implementations.

[0138] Although the subject matter has been described using language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims

1. A computer-implemented method, comprising: determining, from a fragment library for a target protein, a plurality of fragments for each residue position of a plurality of residue positions of the target protein, each fragment comprising a plurality of amino acid residues; generating, for each residue position, a first feature representation of the structures of the plurality of fragments; and determining, based on the respective first feature representations generated for the plurality of residue positions, a prediction of at least one of a structure and a structural property of the target protein; wherein generating the first feature representation comprises: determining, for each residue position, a plurality of structural properties of each of the plurality of fragments based on the structures of the plurality of fragments; and determining the first feature representation by encoding the plurality of structural properties of each of the plurality of fragments using a trained feature encoder; and wherein determining the prediction of the structure of the target protein comprises: determining a second feature representation of an amino acid sequence of the target protein, the amino acid sequence indicating a residue type at each of the plurality of residue positions; and determining, using a trained attribute predictor, a prediction of the structural property of the target protein based on the respective first feature representations and the second feature representation determined for the plurality of residue positions.

2. The method of claim 1, wherein generating the first feature representation comprises: determining, for each residue position, an attribute value of a structural property of each of the plurality of fragments based on the structures of the plurality of fragments; and determining, based on the attribute values of the structural property of the plurality of fragments, a probability distribution of the structural property at each residue position as the first feature representation.

3. The method of claim 2, wherein determining the prediction of the structure of the target protein comprises: generating, based on the respective probability distributions at the plurality of residue positions, a potential function corresponding to the structural property; determining, based on the potential function, an objective function for a structure prediction model used to predict protein structure; and determining the prediction of the structure of the target protein by using the structure prediction model to minimize the objective function.

4. The method of claim 2, wherein determining the plurality of fragments comprises: determining initial fragments assigned by the fragment library to each residue position; and generating, from the initial fragments, fragments each having a predetermined number of residues as the plurality of fragments.

5. The method of claim 2, wherein the structural property comprises at least one of: an angle between atoms of different types, an angle between atoms of the same type, or a distance between atoms of the same type.

6. The method of claim 1, further comprising: for each of a plurality of reference fragment libraries constructed for a reference protein based on different algorithms: determining reference attribute values of a structural property of a plurality of reference fragments assigned by that reference fragment library to a reference residue position of the reference protein; determining a true attribute value of the structural property of the reference protein at the reference residue position; and determining a difference between the reference attribute values and the true attribute value; and selecting, based on the respective differences determined for the plurality of reference fragment libraries, a target algorithm from the different algorithms for constructing the fragment library for the target protein.

7. An electronic device, comprising: a processing unit; and a memory coupled to the processing unit and containing instructions stored thereon, the instructions, when executed by the processing unit, causing the device to perform actions comprising: determining, from a fragment library for a target protein, a plurality of fragments for each residue position of a plurality of residue positions of the target protein, each fragment comprising a plurality of amino acid residues; generating, for each residue position, a first feature representation of the structures of the plurality of fragments; and determining, based on the respective first feature representations generated for the plurality of residue positions, a prediction of at least one of a structure and a structural property of the target protein; wherein generating the first feature representation comprises: determining, for each residue position, a plurality of structural properties of each of the plurality of fragments based on the structures of the plurality of fragments; and determining the first feature representation by encoding the plurality of structural properties of each of the plurality of fragments using a trained feature encoder; and wherein determining the prediction of the structure of the target protein comprises: determining a second feature representation of an amino acid sequence of the target protein, the amino acid sequence indicating a residue type at each of the plurality of residue positions; and determining, using a trained attribute predictor, a prediction of the structural property of the target protein based on the respective first feature representations and the second feature representation determined for the plurality of residue positions.

8. The device of claim 7, wherein generating the first feature representation comprises: determining, for each residue position, an attribute value of a structural property of each of the plurality of fragments based on the structures of the plurality of fragments; and determining, based on the attribute values of the structural property of the plurality of fragments, a probability distribution of the structural property at each residue position as the first feature representation.

9. The device of claim 8, wherein determining the prediction of the structure of the target protein comprises: generating, based on the respective probability distributions at the plurality of residue positions, a potential function corresponding to the structural property; determining, based on the potential function, an objective function for a structure prediction model used to predict protein structure; and determining the prediction of the structure of the target protein by using the structure prediction model to minimize the objective function.

10. The device of claim 8, wherein determining the plurality of fragments comprises: determining initial fragments assigned by the fragment library to each residue position; and generating, from the initial fragments, fragments each having a predetermined number of residues as the plurality of fragments.

11. The device of claim 8, wherein the structural property comprises at least one of: an angle between atoms of different types, an angle between atoms of the same type, or a distance between atoms of the same type.

12. The device of claim 7, wherein the actions further comprise: for each of a plurality of reference fragment libraries constructed for a reference protein based on different algorithms: determining reference attribute values of a structural property of a plurality of reference fragments assigned by that reference fragment library to a reference residue position of the reference protein; determining a true attribute value of the structural property of the reference protein at the reference residue position; and determining a difference between the reference attribute values and the true attribute value; and selecting, based on the respective differences determined for the plurality of reference fragment libraries, a target algorithm from the different algorithms for constructing the fragment library for the target protein.

13. A computer program product tangibly stored in a non-transitory computer storage medium and comprising machine-executable instructions that, when executed by a device, cause the device to perform actions comprising: determining, from a fragment library for a target protein, a plurality of fragments for each residue position of a plurality of residue positions of the target protein, each fragment comprising a plurality of amino acid residues; generating, for each residue position, a first feature representation of the structures of the plurality of fragments; and determining, based on the respective first feature representations generated for the plurality of residue positions, a prediction of at least one of a structure and a structural property of the target protein; wherein generating the first feature representation comprises: determining, for each residue position, a plurality of structural properties of each of the plurality of fragments based on the structures of the plurality of fragments; and determining the first feature representation by encoding the plurality of structural properties of each of the plurality of fragments using a trained feature encoder; and wherein determining the prediction of the structure of the target protein comprises: determining a second feature representation of an amino acid sequence of the target protein, the amino acid sequence indicating a residue type at each of the plurality of residue positions; and determining, using a trained attribute predictor, a prediction of the structural property of the target protein based on the respective first feature representations and the second feature representation determined for the plurality of residue positions.

14. The computer program product of claim 13, wherein generating the first feature representation comprises: determining, for each residue position, an attribute value of a structural property of each of the plurality of fragments based on the structures of the plurality of fragments; and determining, based on the attribute values of the structural property of the plurality of fragments, a probability distribution of the structural property at each residue position as the first feature representation.

15. The computer program product of claim 14, wherein determining the prediction of the structure of the target protein comprises: generating, based on the respective probability distributions at the plurality of residue positions, a potential function corresponding to the structural property; determining, based on the potential function, an objective function for a structure prediction model used to predict protein structure; and determining the prediction of the structure of the target protein by using the structure prediction model to minimize the objective function.