A similar protein structure retrieval method based on hash learning

By employing a hash-based learning approach and utilizing a graph neural network model to extract local and global information about protein structures, and representing it as a binary hash code, this method solves the problems of low retrieval efficiency and high storage overhead in large-scale protein structure databases, and achieves efficient retrieval of similar protein structures.

CN117275591B · Active · Publication Date: 2026-06-30 · NANJING UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: NANJING UNIV
Filing Date: 2023-09-27
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing methods for retrieving similar protein structures are inefficient and have high storage costs in large-scale protein structure databases. Traditional methods lack rich initial features of protein structures and have limitations in utilizing local and global information.

Method used

The method uses hash learning. It takes the initial node and edge features of each protein structure and extracts local and global information through a graph neural network model. Similar protein structures are mapped to binary hash codes that lie close to each other in Hamming distance, which reduces storage overhead and improves retrieval speed.

Benefits of technology

While maintaining retrieval accuracy, the method significantly reduces storage overhead and improves retrieval speed. Combining rich initial features with local and global information also makes the model more expressive.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method for retrieving similar protein structures based on hash learning. First, a protein structure dataset is acquired and pairwise similarity information is calculated. For each query sample, positive and negative samples are drawn. Each protein structure is modeled as a graph, node and edge features are extracted, and the feature representations of nodes and edges are updated. A quantization loss and a similarity loss are defined and combined into the final loss function, with which the model is trained. Using the trained model, each collected protein structure is represented as a binary vector, yielding a binary vector database. During retrieval, the trained model represents the new protein structure to be retrieved as a binary vector, and similar structures are retrieved either by directly calculating the Hamming distance between binary vectors or by constructing an inverted index. Optionally, the top-ranked protein structures returned by binary vector retrieval can be reordered using other real-valued vectors or more complex algorithms. This invention reduces storage overhead and improves retrieval speed.

Description

Technical Field

[0001] This invention relates to a method for retrieving similar protein structures based on hash learning, that is, a fast retrieval algorithm for similar protein structures. It is particularly suitable for efficient retrieval in large-scale protein structure databases and belongs to the fields of bioinformatics and computational biology.

Background Technology

[0002] Proteins are essential components of living organisms, playing a crucial role in maintaining the normal function of biological systems. Their biochemical functions are achieved through the binding of proteins with various ligands, making the determination of their structures extremely important. Searching for structures similar to target proteins from a large pool of proteins has numerous applications in drug design, protein function prediction, and molecular evolution.

[0003] Traditional methods for retrieving similar protein structures require prior alignment of protein sequences or structures, a time-consuming process. To improve the efficiency of searching for similar protein structures, some alignment-free methods have been proposed. These methods represent each protein structure as a real-valued vector to measure the similarity between structures. These methods can be further divided into non-learning methods and learning-based methods. Non-learning methods generate vector representations based on manually selected features, but manually selected features have limited expressive power. Learning-based methods use deep learning models to learn features from protein structures, but currently still lack rich initial features of protein structures and have limitations in utilizing local and global information.

[0004] With the widespread adoption of cryo-electron microscopy and the development of protein structure prediction technologies (such as AlphaFold 2), the number of proteins with known structures has grown rapidly. AlphaFold 2 has predicted high-confidence structures for over 200 million proteins, and existing methods for searching similar protein structures would incur significant storage overhead when searching such a large database.

Summary of the Invention

[0005] Objective: To address the problems and shortcomings of existing technologies, this invention provides a similar protein structure retrieval method based on hash learning to solve the retrieval problem in large-scale protein structure databases. The invention uses both the initial node features and the initial edge features of protein structures and employs a graph neural network model to extract local and global information about the structures. Similar protein structures are represented as binary hash codes that are close to each other, which significantly reduces storage overhead and improves retrieval speed while maintaining retrieval accuracy.

[0006] Technical solution: A method for retrieving similar protein structures based on hash learning, including the steps of training using a hash learning method and retrieving protein structure databases.

[0007] The specific steps for training using the hash learning method are as follows:

[0008] Step 100: Input the protein structure training set and calculate the similarity information between each pair of protein structures;

[0009] Step 101: For the query sample, sample one similar structure as a positive sample and k dissimilar structures as negative samples;

[0010] Step 102: Model the protein structure as a graph through a graph construction layer and extract node and edge features;

[0011] Step 103: Define the graph neural network model and update the feature representations of nodes and edges;

[0012] Step 104: Define the quantization loss of the model through the hash layer;

[0013] Step 105: Define the similarity loss of the model through the similarity layer;

[0014] Step 106: Use the quantization loss and similarity loss to form the final loss function to train the model, output and save the model;

[0015] Step 107: Based on the trained model, each collected protein structure is represented as a binary vector to obtain a binary vector database.
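The sampling in step 101 can be sketched as follows. This is a minimal illustration, not the patented procedure: the function name `sample_triplet`, the nested-list layout of `tm_scores`, and the cut-off `threshold=0.5` separating similar from dissimilar structures are all assumptions, since the text does not state how the boundary is chosen.

```python
import random


def sample_triplet(query, tm_scores, k, threshold=0.5, rng=random):
    """Pick one positive and k negatives for the structure with index `query`.

    tm_scores[query][j] is the precomputed similarity (step 100).
    threshold is an assumed cut-off, not a value taken from the text.
    """
    others = [j for j in range(len(tm_scores)) if j != query]
    positives = [j for j in others if tm_scores[query][j] >= threshold]
    negatives = [j for j in others if tm_scores[query][j] < threshold]
    if not positives or len(negatives) < k:
        raise ValueError("not enough similar or dissimilar structures to sample")
    # One similar structure, k distinct dissimilar structures.
    return rng.choice(positives), rng.sample(negatives, k)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible across training runs.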

[0016] The specific steps for searching the protein structure database are as follows:

[0017] Step 200: Use the trained model to represent the new protein structure to be retrieved as a binary vector;

[0018] Step 201: Calculate the Hamming distance between the binary vector of the protein structure to be retrieved and the binary vectors of the structures in the binary vector database, or retrieve similar structures through an inverted index built over the binary vector database;

[0019] Step 202: Based on actual needs, reorder the top protein structures returned by binary vector retrieval using other real-valued vectors or more complex algorithms.
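The direct Hamming-distance search of step 201 can be sketched as below. The helper names (`pack_bits`, `hamming`, `search`) and the dictionary-based database are illustrative assumptions; the point is only that comparing two binary codes reduces to an XOR followed by a bit count.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a single Python int."""
    code = 0
    for bit in bits:
        code = (code << 1) | (1 if bit else 0)
    return code


def hamming(a, b):
    """Number of bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")


def search(query_code, database, top_n):
    """Return the top_n (distance, key) pairs closest to query_code.

    database maps a structure identifier to its packed binary code.
    """
    scored = sorted((hamming(query_code, code), key) for key, code in database.items())
    return scored[:top_n]
```

For example, with codes `1011`, `0100`, and `1010` stored under keys `"a"`, `"b"`, and `"c"`, querying with `1011` and `top_n=2` returns `"a"` at distance 0 followed by `"c"` at distance 1.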

[0020] In step 100, calculating the similarity information between pairs of protein structures refers to using TM-align to calculate the similarity score TM-score between structures.

[0021] In step 102, modeling the protein structure as a graph means treating each amino acid as a graph node and using the distance between Cα atoms as the distance between nodes to construct a K-nearest neighbor graph, such that each node is connected to its K nearest neighbors.
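A brute-force construction of such a K-nearest neighbor graph might look like the sketch below. The function name `knn_graph` and the representation of Cα coordinates as a list of `(x, y, z)` tuples are assumptions made for illustration.

```python
import math


def knn_graph(ca_coords, k):
    """For each residue, return the indices of its k nearest residues by Cα distance.

    ca_coords is a list of (x, y, z) tuples, one per amino acid.
    """
    neighbors = []
    for i, ci in enumerate(ca_coords):
        # Distance from residue i to every other residue, closest first.
        dists = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(ca_coords) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors
```

This pairwise scan is quadratic in the number of residues, which is usually acceptable for single protein chains; a spatial index would be a natural substitute for very long chains.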

[0022] Extracting node features in step 102 refers to calculating the bond angles α, β, γ and dihedral angles φ, ψ, ω of the protein to obtain the node features {sin, cos} ∘ {α, β, γ, φ, ψ, ω}.
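Applying sine and cosine to each of the six angles gives a 12-dimensional node feature and avoids the discontinuity an angle has where it wraps around at ±π. A minimal sketch, assuming the angles are already available in radians and that the features are interleaved per angle (the ordering is an assumption, as is the name `node_features`):

```python
import math


def node_features(angles):
    """Map angles in radians to [sin a1, cos a1, sin a2, cos a2, ...]."""
    feats = []
    for a in angles:
        feats.extend((math.sin(a), math.cos(a)))
    return feats
```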

[0023] Extracting edge features in step 102 refers to calculating the distances between each pair of the atoms C, Cα, N, O, and Cβ and encoding them using Gaussian radial basis functions.
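A Gaussian radial basis encoding turns one scalar distance into a smooth vector of responses, one per basis center. The sketch below is illustrative only: the range `d_min` to `d_max`, the number of centers, and the choice of width equal to the center spacing are assumptions, because the text does not give these values.

```python
import math


def rbf_encode(distance, d_min=0.0, d_max=20.0, num_centers=16):
    """Encode a distance as Gaussian responses around evenly spaced centers.

    All default values are assumed for illustration, not taken from the text.
    """
    step = (d_max - d_min) / (num_centers - 1)
    centers = [d_min + i * step for i in range(num_centers)]
    sigma = step  # width tied to the spacing so neighboring bases overlap
    return [math.exp(-(((distance - c) / sigma) ** 2)) for c in centers]
```

The edge feature for a pair of residues would then be the concatenation of such encodings over the atom pairs considered.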

[0024] The graph neural network model in step 103 includes a node update layer, an edge update layer, and a global information layer.

[0025] More specifically, the node update layer is

[0026]

[0027]

[0028] where N(i) denotes the set of neighboring nodes of node i, |N(i)| denotes the number of nodes in N(i), h_i^l denotes the feature representation of node i in the l-th layer, e_ij^l denotes the feature representation of the edge connecting node i and node j in the l-th layer, ‖ denotes the concatenation operation, NodeMLP denotes a two-layer feedforward neural network, and Φ denotes the batchnorm function.

[0029] More specifically, the edge update layer is

[0030]

[0031] where EdgeMLP denotes a two-layer feedforward neural network and h_i^{l+1} is the output of the node update layer.

[0032] More specifically, the global information layer is

[0033]

[0034]

[0035] where U_t denotes the set of node indices contained in the protein structure t sampled in the current batch, and the update operator is a GRU unit.

[0036] The hash layer in step 104 is

[0037] b_t = sign(y_t),

[0038]

[0039] where y_t is the representation of the protein structure t, which can be obtained by aggregating all the nodes contained in t, and sign(·) is the sign function.
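Because sign(·) has zero gradient almost everywhere, hash learning methods generally train on the continuous y_t and add a penalty that pulls each component toward ±1. The sketch below shows the binarization together with one common form of such a penalty, the mean squared gap between y_t and its code. This specific penalty is an assumption and is not claimed to be the quantization loss defined in the text; the names `sign_hash` and `quantization_penalty` are likewise hypothetical.

```python
def sign_hash(y):
    """b = sign(y); sign(0) is mapped to +1 by convention (an assumption)."""
    return [1 if v >= 0 else -1 for v in y]


def quantization_penalty(y):
    """Mean squared gap between y and its binary code.

    One commonly used choice, assumed here for illustration only.
    """
    b = sign_hash(y)
    return sum((v - s) ** 2 for v, s in zip(y, b)) / len(y)
```

The penalty is zero exactly when every component of y is already +1 or -1, so minimizing it reduces the information lost when the real-valued vector is binarized.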

[0040] The similarity layer in step 105 is

[0041]

[0042] where τ is the temperature coefficient, y_q is the vector representation of the query sample, y_p is the vector representation of the positive sample drawn for the query sample, k is the number of negative samples drawn, y_i is the vector representation of the i-th negative sample, and ŷ denotes the result of applying L2 normalization to y; for example, ŷ_q is y_q after L2 normalization;
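The ingredients listed here (a temperature, one positive, k negatives, and L2-normalized vectors) match the usual contrastive softmax loss, often called InfoNCE. The sketch below implements that standard form as an assumption; the exact expression used in the text may differ, for example in whether the positive term appears in the denominator. The function names and the default `tau=0.1` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_loss(y_q, y_p, y_negs, tau=0.1):
    """Contrastive softmax loss over one positive and k negatives.

    Assumed standard form, not necessarily the loss defined in the text.
    """
    q = l2_normalize(y_q)

    def sim(v):
        # Cosine similarity with the query, scaled by the temperature.
        return sum(a * b for a, b in zip(q, l2_normalize(v))) / tau

    pos = sim(y_p)
    logits = [pos] + [sim(v) for v in y_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denom - pos
```

The loss approaches zero when the positive is far more similar to the query than every negative, and equals log(k + 1) when the positive and all k negatives are equally similar.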

[0043] The final loss function in step 106 is:

[0044]

[0045] where λ is the hyperparameter balancing the two losses, L_s denotes the similarity loss, and L_q denotes the quantization loss.

[0046] In step 202, reordering with real-valued vectors means using a non-hashing method to represent protein structures as real-valued vectors, and then measuring the similarity between protein structures by calculating the distance between the real-valued vectors in order to reorder them.

[0047] The more complex algorithm used in step 202 for reordering refers to using TM-align to calculate the similarity value TM-score between each pair of protein structures, and then reordering them based on the TM-score to measure the similarity between protein structures.

[0048] A computer device, characterized in that: the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, it implements the hash-based protein structure retrieval method described above.

[0049] Beneficial effects: Compared with the prior art, the similar protein structure retrieval method based on hash learning provided by this invention makes full use of the initial features of protein structures and incorporates both local and global information, giving the model stronger expressive power. By using hash learning to map protein structures into binary codes, it greatly reduces storage overhead and improves retrieval speed.

Attached Figure Description

[0050] Figure 1 is a flowchart illustrating the training process of the similar protein structure retrieval method according to an embodiment of the present invention.

[0051] Figure 2 is a flowchart illustrating the prediction workflow of the similar protein structure retrieval method according to an embodiment of the present invention.

Detailed Implementation

[0052] The present invention will be further illustrated below with reference to specific embodiments. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. After reading the present invention, any modifications of the present invention in various equivalent forms by those skilled in the art will fall within the scope defined by the appended claims.

[0053] The training workflow of the hash-based similar protein structure retrieval method is shown in Figure 1. First, the training data of protein structures is input (step 10), and the similarity score between each pair of protein structures is calculated (step 11). In each training iteration, one similar sample is drawn as a positive sample and k dissimilar samples are drawn as negative samples for the query sample (step 12). The protein structure is then modeled as a graph through a graph construction layer (step 13), and the feature representations of nodes and edges are updated through a graph neural network model (step 14). Next, the quantization loss is calculated through a hash layer and the similarity loss through a similarity layer, and the two are combined into the final loss function (step 15). The model is trained and its weights are updated according to the loss function (step 16). After each iteration, the training stopping condition is checked (step 17). If the stopping condition has not been met, the process returns to step 12 to sample a new round of data; otherwise, the training results are output and the model is saved (step 18). Based on the trained model, each collected protein structure is represented as a binary vector, and a binary vector database is output (step 19).

[0054] Calculating the similarity information between pairs of protein structures refers to using TM-align to calculate the similarity score TM-score between structures.

[0055] Modeling the protein structure as a graph means treating each amino acid as a graph node and using the distance between Cα atoms as the distance between nodes to construct a K-nearest neighbor graph, such that each node is connected to its K nearest neighbors.

[0056] Extracting node features refers to calculating the bond angles α, β, γ and dihedral angles φ, ψ, ω of a protein, resulting in node features {sin, cos} ∘ {α, β, γ, φ, ψ, ω}.

[0057] Extracting edge features refers to calculating the distances between each pair of the atoms C, Cα, N, O, and Cβ and encoding them using Gaussian radial basis functions.

[0058] The graph neural network model includes a node update layer, an edge update layer, and a global information layer.

[0059] The node update layer is

[0060]

[0061]

[0062] where N(i) denotes the set of neighboring nodes of node i, |N(i)| denotes the number of nodes in N(i), h_i^l denotes the feature representation of node i in the l-th layer, e_ij^l denotes the feature representation of the edge connecting node i and node j in the l-th layer, ‖ denotes the concatenation operation, NodeMLP denotes a two-layer feedforward neural network, and Φ denotes the batchnorm function.

[0063] The edge update layer is

[0064]

[0065] where EdgeMLP denotes a two-layer feedforward neural network and h_i^{l+1} is the output of the node update layer.

[0066] The global information layer is

[0067]

[0068]

[0069] where U_t denotes the set of node indices contained in the protein structure t sampled in the current batch, and the update operator is a GRU unit.

[0070] The hash layer is

[0071] b_t = sign(y_t),

[0072]

[0073] where y_t is the representation of the protein structure t, which can be obtained by aggregating all the nodes contained in t, and sign(·) is the sign function.

[0074] The similarity layer is

[0075]

[0076] where τ is the temperature coefficient, y_q is the vector representation of the query sample, y_p is the vector representation of the positive sample drawn for the query sample, k is the number of negative samples drawn, y_i is the vector representation of the i-th negative sample, and ŷ denotes the result of applying L2 normalization to y;

[0077] The final loss function is

[0078]

[0079] where λ is the hyperparameter balancing the two losses, L_s denotes the similarity loss, and L_q denotes the quantization loss.

[0080] The workflow for using the trained model to predict the similarity between a new protein and the protein structures in the target database is shown in Figure 2. First, the trained model and the binary vector database are read (step 20). Then, the new protein structure to be retrieved is read (step 21), and the model is used to encode it into a binary vector (step 22). The Hamming distance between the binary vector of the structure to be retrieved and the binary vectors of the protein structures in the database is calculated, or an inverted index built over the binary vector database is used to retrieve similar structures (step 23). The user chooses whether to reorder the results (step 24). If no reordering is needed, the most similar protein structures are output; if reordering is needed, the top-ranked protein structures returned by the binary vector retrieval are reordered using other real-valued vectors or a more complex algorithm (step 25), and the most similar protein structures are output.
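The text does not specify how the inverted index in step 23 is organized. One well-known option, shown here purely as an assumed illustration, is to split each code into m contiguous substrings and keep one lookup table per substring position. By the pigeonhole principle, any stored code within Hamming distance r < m of the query matches the query exactly on at least one substring, so only those candidates need a full distance check. Codes are written as strings of `0` and `1` for readability, and all function names are hypothetical.

```python
from collections import defaultdict


def chunks(bits, m):
    """Split a bit string into m contiguous substrings (the last takes any remainder)."""
    size = len(bits) // m
    return [bits[i * size:(i + 1) * size] if i < m - 1 else bits[i * size:]
            for i in range(m)]


def build_index(database, m):
    """Build one inverted table per substring position: substring -> list of keys."""
    index = [defaultdict(list) for _ in range(m)]
    for key, bits in database.items():
        for table, chunk in zip(index, chunks(bits, m)):
            table[chunk].append(key)
    return index


def query_index(index, database, bits, radius):
    """Return sorted (distance, key) pairs within `radius` of the query code."""
    m = len(index)
    if radius >= m:
        raise ValueError("exact-substring lookup is only complete for radius < m")
    candidates = set()
    for table, chunk in zip(index, chunks(bits, m)):
        candidates.update(table.get(chunk, ()))
    hits = []
    for key in candidates:
        # Verify each candidate with a full Hamming distance computation.
        d = sum(a != b for a, b in zip(bits, database[key]))
        if d <= radius:
            hits.append((d, key))
    return sorted(hits)
```

Compared with scanning every code, this trades extra memory for the m tables against a much smaller candidate set when the database is large and the search radius is small.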

[0081] Reordering with real-valued vectors refers to representing protein structures as real-valued vectors using a non-hashing method, and then measuring the similarity between protein structures by calculating the distance between the real-valued vectors in order to reorder them.
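A minimal sketch of this reordering step follows. It assumes the real-valued vectors are already available in a dictionary keyed by structure identifier and uses cosine distance; the text only says "distance", so the metric, like the names `cosine_distance` and `rerank`, is an assumption.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rerank(query_vec, candidates, real_vectors):
    """Reorder the candidate keys returned by binary retrieval, closest first."""
    return sorted(candidates, key=lambda key: cosine_distance(query_vec, real_vectors[key]))
```

Because only the short candidate list is rescored, the cost of the more precise comparison stays small regardless of the size of the full database.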

[0082] Reordering using a more complex algorithm refers to using TM-align to calculate the similarity value TM-score between protein structures pairwise, and then reordering them based on the TM-score to measure the similarity between protein structures.

[0083] Obviously, those skilled in the art should understand that the steps of the hash-based similar protein structure retrieval method described in the above embodiments of the present invention can be implemented using general-purpose computing devices. They can be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they can be implemented using device-executable program code, thereby storing them in a storage device for execution by the computing device. Furthermore, in some cases, the steps shown or described can be performed in a different order than presented herein, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. Thus, the embodiments of the present invention are not limited to any particular hardware and software combination.

[0084] This invention was tested on two datasets, comparing the performance of the proposed method with the best existing method. The first dataset is the SCOPe v2.07 dataset, containing 13265 protein domains. The second dataset is the ind_PDB dataset, containing 1900 protein structures. The model was trained and validated on the SCOPe dataset using five-fold cross-validation and then tested on the ind_PDB dataset. In the SCOPe five-fold cross-validation results, the proposed method outperforms the best existing method on AUPRC and Top-1 hit ratio and is close to it on AUROC, Top-5 hit ratio, and Top-10 hit ratio. In the ind_PDB test results, the proposed method outperforms the best existing method on AUROC, AUPRC, and the Top-1, Top-5, and Top-10 hit ratios. Experiments on storage space show that the proposed method reduces storage space by a factor of 27 compared with the current best method. In big data scenarios, the proposed method significantly reduces retrieval time compared with real-valued vector methods.

Claims

1. A method for retrieving similar protein structures based on hash learning, characterized in that it includes the steps of training using a hash learning method and searching a protein structure database;
The specific steps for training using the hash learning method are as follows:
Step 100: Input the protein structure training set and calculate the similarity information between each pair of protein structures;
Step 101: For the query sample, sample one similar structure as a positive sample and k dissimilar structures as negative samples;
Step 102: Model the protein structure as a graph through a graph construction layer and extract node and edge features;
Step 103: Define the graph neural network model and update the feature representations of nodes and edges;
Step 104: Define the quantization loss of the model through the hash layer;
Step 105: Define the similarity loss of the model through the similarity layer;
Step 106: Use the quantization loss and similarity loss to form the final loss function to train the model, output and save the model;
Step 107: Based on the trained model, each collected protein structure is represented as a binary vector to obtain a binary vector database;
The specific steps for searching the protein structure database are as follows:
Step 200: Use the trained model to represent the new protein structure to be retrieved as a binary vector;
Step 201: Calculate the Hamming distance between the binary vector of the protein structure to be retrieved and the binary vectors of the structures in the binary vector database, or retrieve similar structures through an inverted index built over the binary vector database;
Step 202: Based on actual needs, reorder the top-ranked protein structures returned by binary vector retrieval using other real-valued vectors;
In step 102, modeling the protein structure as a graph means treating each amino acid as a graph node and using the distance between Cα atoms as the distance between nodes to construct a K-nearest neighbor graph, such that each node is connected to its K nearest neighbors;
Extracting node features in step 102 refers to calculating the bond angles α, β, γ and dihedral angles φ, ψ, ω of the protein to obtain the node features {sin, cos} ∘ {α, β, γ, φ, ψ, ω};
Extracting edge features in step 102 refers to calculating the distances between each pair of the atoms C, Cα, N, O, and Cβ and encoding them using Gaussian radial basis functions;
The graph neural network model in step 103 includes a node update layer, an edge update layer, and a global information layer;
The node update layer is [equation], where N(i) denotes the set of neighboring nodes of node i, |N(i)| denotes the number of nodes in N(i), h_i^l denotes the feature representation of node i in the l-th layer, e_ij^l denotes the feature representation of the edge connecting node i and node j in the l-th layer, ‖ denotes the concatenation operation, NodeMLP denotes a two-layer feedforward neural network, and Φ denotes the batchnorm function;
The edge update layer is [equation], where EdgeMLP denotes a two-layer feedforward neural network and h_i^{l+1} is the output of the node update layer;
The global information layer is [equation], where U_t denotes the set of node indices contained in the protein structure t sampled in the current batch, and the update operator is a GRU unit;
The hash layer in step 104 is b_t = sign(y_t), [equation], where y_t is the representation of the protein structure t, which can be obtained by aggregating all the nodes contained in t, and sign(·) is the sign function;
The similarity layer in step 105 is [equation], where τ is the temperature coefficient, y_q is the vector representation of the query sample, y_p is the vector representation of the positive sample drawn for the query sample, k is the number of negative samples drawn, y_i is the vector representation of the i-th negative sample, and ŷ denotes the result of applying L2 normalization to y;
The final loss function in step 106 is [equation], where λ is the hyperparameter balancing the two losses, L_s denotes the similarity loss, and L_q denotes the quantization loss.

2. The method for retrieving similar protein structures based on hash learning according to claim 1, characterized in that, in step 100, calculating the similarity information between pairs of protein structures refers to using TM-align to calculate the similarity score TM-score between structures.

3. The method for retrieving similar protein structures based on hash learning according to claim 1, characterized in that, in step 202, reordering with real-valued vectors means using a non-hashing method to represent protein structures as real-valued vectors, and then measuring the similarity between protein structures by calculating the distance between the real-valued vectors in order to reorder them.

4. A computer device, characterized in that the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, it implements the hash-based protein structure retrieval method as described in any one of claims 1-3.