Protein identification method, electronic device, and storage medium
By generating target probability vectors for protein sequences using multiple self-supervised models and multiple classification models, the method overcomes the limited expressive power of manually designed features, thereby improving the efficiency and accuracy of protein sequence identification.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: SHENZHEN INST OF ADVANCED TECH
- Filing Date: 2023-05-22
- Publication Date: 2026-06-16
AI Technical Summary
In existing technologies, the manually designed features used to extract key information from protein sequences have limited expressive power, resulting in low efficiency in protein sequence identification.
Multiple self-supervised models are used to extract information from protein sequences, generating multiple embedding vectors. Target probability vectors are generated through multiple classification models, and a prediction model is used to predict protein categories, thus avoiding reliance on manually designed features.
This improves the efficiency and accuracy of protein sequence identification, more fully reflects the information within the protein sequence, and mitigates the limited expressive power of manually designed features.
Description
Technical Field
[0001] This application relates to the field of biology, and more particularly to a method for protein identification, an electronic device, and a storage medium.
Background Technology
[0002] Current protein identification methods rely on manually designed complex features to extract key information from protein sequences, and then use this extracted information to identify the protein sequence. However, manually designed features often have limited representational capabilities, making it difficult for the extracted key information to fully reflect the information contained within the protein sequence, resulting in low efficiency in protein sequence identification.
Summary of the Invention
[0003] In view of the above, it is necessary to provide a protein identification method, an electronic device, and a storage medium to solve the technical problem of low protein sequence identification efficiency caused by the limited expressive power of manually designed features.
[0004] In one aspect, this application provides a protein identification method, the method comprising: acquiring a protein sequence; extracting information from the protein sequence using multiple pre-trained self-supervised models to obtain multiple embedding vectors of the protein sequence; generating a target probability vector of the protein sequence based on the multiple embedding vectors and multiple pre-trained classification models; and predicting the target probability vector based on a pre-trained prediction model to obtain a prediction result of the protein category of the protein sequence.
[0005] As can be seen from the above technical solution, this application employs multiple self-supervised models with different network architectures to extract information from protein sequences.
[0006] Because the architectures differ, the extracted embedding vectors represent information at different levels within the protein sequence. Furthermore, this application relies on multiple pre-trained self-supervised models to extract information from the protein sequence, rather than depending on manually designed features. This solves the technical problem of limited expressive power of manually extracted features and also improves the speed of information extraction. Since this application inputs each embedding vector into each classification model, the output of each embedding vector on each classification model can be obtained. The outputs of the multiple embedding vectors on the multiple classification models are concatenated to obtain the target probability vector. This target probability vector integrates the information extracted from the protein sequence by the multiple self-supervised models and the multiple classification models, so it can fully reflect the information contained within the protein sequence. Predicting the target probability vector can therefore improve the efficiency of protein sequence identification.
[0007] In some embodiments, the method further includes: acquiring a dataset, dividing the dataset to obtain a first training set and a second training set, wherein the first training set includes multiple first protein sequences and a first category label corresponding to each first protein sequence, the second training set includes multiple second protein sequences and a second category label corresponding to each second protein sequence, both the first category label and the second category label include a preset protein category; acquiring multiple initial classifiers corresponding to the multiple classification models and a prediction learner corresponding to the prediction model; training each initial classifier based on any one of the multiple self-supervised models, the multiple first protein sequences, and the first category label corresponding to each first protein sequence to obtain a classification model corresponding to each initial classifier; combining any self-supervised model and any classification model to obtain multiple pairs of combined models; extracting information from each second protein sequence based on each pair of combined models to obtain a training probability vector corresponding to each second protein sequence; using the prediction learner to predict each training probability vector to obtain a training prediction result for each second protein sequence; and generating the prediction model based on the training prediction result, the second category label, and the prediction learner.
[0008] In some embodiments, generating the prediction model based on the training prediction results, the second category label, and the prediction learner includes: calculating a training loss value based on multiple training prediction results and the second category label corresponding to each training prediction result; adjusting the parameters of the prediction learner based on the training loss value until the training loss value meets a preset condition, thereby obtaining the prediction model.
[0009] In some embodiments, calculating the training loss value based on the plurality of training prediction results and the second category label corresponding to each training prediction result includes: determining whether each training prediction result is correct based on the second category label corresponding to each training prediction result, and calculating the training loss value based on the correct training prediction result and the number of labels of the second category label.
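As a rough illustration of the loss calculation just described, the sketch below counts the correct training prediction results and divides by the number of second category labels. The source does not give the formula, so defining the loss as one minus the fraction of correct predictions is an assumption, as are the function name `training_loss` and the sample values.

```python
def training_loss(predictions, labels):
    """Assumed loss: fraction of training predictions that do not match their label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one label is required")
    # A training prediction result is "correct" when it equals its second category label.
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return 1.0 - correct / len(labels)


# Hypothetical example: three of four predictions match their labels.
loss = training_loss([1, 0, 1, 1], [1, 0, 0, 1])
```

Under this assumed definition, a preset condition could be, for example, that the loss falls below some configured value.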
[0010] In some embodiments, generating the target probability vector of the protein sequence based on the plurality of embedding vectors and the plurality of pre-trained classification models includes: inputting each embedding vector into the plurality of classification models to obtain a plurality of initial probability vectors for each embedding vector, and concatenating all the initial probability vectors of the plurality of embedding vectors to obtain the target probability vector.
[0011] In some embodiments, the protein category includes a preset protein category and a non-preset protein category. The step of predicting the target probability vector based on a pre-trained prediction model to obtain a prediction result of the protein category of the protein sequence includes: encoding the preset protein category and the non-preset protein category respectively to obtain a category vector; classifying the target probability vector based on the category vector; determining the predicted probability that the protein sequence belongs to the preset protein category; and determining the prediction result based on the comparison result of the predicted probability and a preset threshold.
[0012] In some embodiments, determining the prediction result based on the comparison between the predicted probability and a preset threshold includes: if the predicted probability is greater than or equal to the preset threshold, determining that the predicted result is that the protein sequence belongs to the preset protein category; or, if the predicted probability is less than the preset threshold, determining that the predicted result is that the protein sequence does not belong to the preset protein category.
[0013] In some embodiments, the preset protein categories include thermophilic proteins and psychrophilic proteins.
[0014] In another aspect, this application provides a protein identification device, operating in an electronic device, the device comprising: an acquisition unit for acquiring a protein sequence; an extraction unit for extracting information from the protein sequence using multiple pre-trained self-supervised models to obtain multiple embedding vectors of the protein sequence; a generation unit for generating a target probability vector of the protein sequence based on the multiple embedding vectors and multiple pre-trained classification models; and a prediction unit for predicting the target probability vector based on a pre-trained prediction model to obtain a prediction result of the protein category of the protein sequence.
[0015] In another aspect, this application provides an electronic device, the electronic device comprising: a memory storing at least one instruction; and a processor executing the at least one instruction to implement the protein identification method.
[0016] In another aspect, this application provides a computer-readable storage medium storing at least one instruction, which is executed by a processor in an electronic device to implement the protein identification method.
Attached Figure Description
[0017] Figure 1 is a structural diagram of an electronic device provided in an embodiment of this application.
[0018] Figure 2 is a flowchart of a protein identification method provided in an embodiment of this application.
[0019] Figure 3 is a flowchart of a method for generating prediction results provided in an embodiment of this application.
[0020] Figure 4 is a flowchart of a method for generating multiple self-supervised models, multiple classification models, and a prediction model provided in an embodiment of this application.
[0021] Figure 5 is a functional block diagram of a protein identification device provided in an embodiment of this application.
Detailed Implementation
[0022] It should be noted that in this application, "at least one" means one or more, and "more than one" means two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone, where A and B can be singular or plural. The terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and drawings of this application are used to distinguish similar objects, not to describe a specific order or sequence.
[0023] In the embodiments of this application, the terms "exemplary" or "for example" are used to indicate that something is an example, illustration, or description. Any embodiment or design that is described as "exemplary" or "for example" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or design. Specifically, the use of the terms "exemplary" or "for example" is intended to present the relevant concepts in a specific manner.
[0024] Thermophilic proteins, also known as heat-resistant proteins, are a class of proteins that retain their structure and function under high-temperature conditions. They typically originate from thermophilic microorganisms, particularly archaea. Because of their high stability, retained enzyme activity, and suitability for high-temperature industrial processes, thermophilic proteins are widely used in biotechnology, food processing, and pharmaceuticals. For example, thermostable cellulases from thermophilic bacteria have been used to efficiently convert lignocellulosic biomass into biofuels. However, identifying and screening thermophilic proteins is a laborious and tedious process. To accelerate the development of related fields and better understand the mechanisms of thermophilic proteins, it is crucial to develop high-throughput screening methods for rapid identification.
[0025] Many computational methods have been developed for identifying thermophilic proteins. These methods, however, often rely on manually designed complex features to extract key information from the protein sequence, which is then used to identify the protein. Manually designed features often have limited representational power, making it difficult for the extracted key information to fully reflect the information inherent within the protein sequence, resulting in low efficiency in protein sequence identification.
[0026] To address the aforementioned problems, this application provides a protein identification method, an electronic device, and a storage medium. To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and specific embodiments.
[0027] The protein identification method provided in this application can be applied to one or more electronic devices.
[0028] Figure 1 is a structural diagram of an electronic device provided in an embodiment of this application. As shown in Figure 1, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program, such as a protein identification program, stored in the memory 12 and capable of running on the processor 13.
[0029] The electronic device 1 is a device that can automatically perform parameter value calculation and / or information processing according to pre-set or stored instructions. Its hardware includes, but is not limited to: microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
[0030] The electronic device 1 can be any electronic product that can interact with the user, such as a personal computer, server, tablet computer, smartphone, etc.
[0031] The electronic device 1 may further include network devices and / or user devices. The network devices include, but are not limited to, a single network server, a server group consisting of multiple network servers, or a cloud based on cloud computing consisting of a large number of hosts or network servers. Figure 1 is merely an example of the electronic device 1 and does not constitute a limitation on the electronic device 1. The electronic device 1 may include more or fewer components than shown, combine certain components, or have different components. For example, the electronic device 1 may also include input / output devices, network access devices, buses, etc.
[0032] The network in which the electronic device 1 is located includes, but is not limited to: the Internet, wide area network, metropolitan area network, local area network, virtual private network (VPN), etc.
[0033] Figure 2 is a flowchart of a protein identification method provided in an embodiment of this application. The order of the steps in this flowchart can be adjusted according to different needs, and some steps can be omitted. The method is executed by an electronic device, such as the electronic device 1 shown in Figure 1.
[0034] S11, obtain the protein sequence.
[0035] In one embodiment, the protein sequence refers to a protein sequence for which a category needs to be identified. The electronic device can obtain the protein sequence from a dataset. For example, the dataset can be part or all of the data from the Uniprot database, the Brenda database, and other common databases containing protein sequences.
[0036] In one embodiment, the protein sequence may be a sequence from a dataset that needs to be classified.
[0037] In another embodiment, the protein sequence may be obtained by other means, and this application does not limit the protein sequence.
[0038] S12, multiple pre-trained self-supervised models are used to extract information from the protein sequence to obtain multiple embedding vectors of the protein sequence.
[0039] In one embodiment, the plurality of self-supervised models includes, but is not limited to, the SeqVec model, the ProtCNN model, the ProtTrans model, and the CPCProt model. These plurality of self-supervised models are pre-trained to convergence.
[0040] In another embodiment, the plurality of self-supervised models may also be other models, and this application does not limit the plurality of self-supervised models.
[0041] In some embodiments, the electronic device using multiple pre-trained self-supervised models to extract information from the protein sequence and obtain multiple embedding vectors of the protein sequence includes: the electronic device inputs the protein sequence into each self-supervised model to obtain the embedding vector output by that self-supervised model.
[0042] The process by which each self-supervised model extracts information from the protein sequence can refer to the representation-learning process of that model in related technologies, which yields an embedded representation of the protein.
[0043] In this embodiment, the dimension and information content of each embedding vector are related to the self-supervised model corresponding to the embedding vector. For example, if the multiple self-supervised models are the SeqVec model, the ProtCNN model, the ProtTrans model, and the CPCProt model, there are four embedding vectors: the embedding vector extracted by the SeqVec model has a dimension of 1024, the embedding vector extracted by the ProtCNN model has a dimension of 1100, the embedding vector extracted by the ProtTrans model has a dimension of 1024, and the embedding vector extracted by the CPCProt model has a dimension of 1536.
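The data flow of step S12 can be sketched as follows: one protein sequence is passed through every encoder, and the outputs are kept as separate vectors of different dimensions. The encoders here are placeholder stubs that only reproduce the output dimensions quoted above; they are not the real SeqVec, ProtCNN, ProtTrans, or CPCProt models, and the names `stand_in_encoder` and `extract_embeddings` and the sample sequence are hypothetical.

```python
import hashlib

# Output dimensions quoted in paragraph [0043].
EMBEDDING_DIMS = {"SeqVec": 1024, "ProtCNN": 1100, "ProtTrans": 1024, "CPCProt": 1536}


def stand_in_encoder(name, dim):
    """Return a placeholder encoder: a shape-compatible stub, NOT the real model."""
    def encode(sequence):
        # Deterministic pseudo-features derived from hashes of the model name and sequence.
        values = []
        counter = 0
        while len(values) < dim:
            digest = hashlib.sha256(f"{name}:{sequence}:{counter}".encode()).digest()
            values.extend(byte / 255.0 for byte in digest)
            counter += 1
        return values[:dim]
    return encode


encoders = {name: stand_in_encoder(name, dim) for name, dim in EMBEDDING_DIMS.items()}


def extract_embeddings(sequence):
    """Run one sequence through every encoder and keep the outputs separate."""
    return {name: encode(sequence) for name, encode in encoders.items()}


embeddings = extract_embeddings("MKTAYIAKQR")  # hypothetical sequence
dims = {name: len(vector) for name, vector in embeddings.items()}
```

In a real pipeline each stub would be replaced by a call into the corresponding pre-trained model; the point of the sketch is only that each model contributes its own embedding vector.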
[0044] In this embodiment, multiple self-supervised models are used to extract features from the protein sequence, without relying on manually extracted features. This solves the technical problem of limited expressive power of manually extracted features and also improves the speed of information extraction. Since each self-supervised model has a different architecture, it can extract embedding vectors at different levels and in different dimensions from the protein sequence.
[0045] In other embodiments of this application, the number of the plurality of self-supervised models can be greater in order to give the protein sequence more embedding representations.
[0046] S13, Based on the multiple embedding vectors and the multiple pre-trained classification models, generate the target probability vector of the protein sequence.
[0047] In some embodiments, the plurality of classification models include, but are not limited to: Random Forest (RF), Adaptive Boosting (AdaBoost, AB), Bootstrap Aggregating (Bagging, BA), Extreme Gradient Boosting (XGBoost, XGB), and LightGBM (LGB).
[0048] In one embodiment, the electronic device generates the target probability vector of the protein sequence based on the plurality of embedding vectors and the plurality of pre-trained classification models by: the electronic device inputting each embedding vector into the plurality of classification models to obtain a plurality of initial probability vectors for each embedding vector, and then the electronic device concatenating all the initial probability vectors of the plurality of embedding vectors to obtain the target probability vector.
[0049] Since the multiple classification models are different models, the process by which each classification model generates an initial probability vector from each input embedding vector can refer to the classification method or classification formula of that classification model in related technologies.
[0050] For example, following the above embodiments, if the multiple embedding vectors are the four embedding vectors output by the SeqVec model, the ProtCNN model, the ProtTrans model, and the CPCProt model, and the multiple classification models are the Random Forest model, the AdaBoost model, the Bagging model, the XGBoost model, and the LightGBM model, the four embedding vectors are each input into the five classification models to obtain 20 initial probability vectors. If each initial probability vector has a dimension of 1, the 20 one-dimensional initial probability vectors are concatenated to obtain a target probability vector with a dimension of 20.
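The 4 × 5 = 20 arithmetic of this example can be sketched as a nested loop over embedding vectors and classification models. The classifiers below are hypothetical stubs that return a single probability; trained Random Forest, AdaBoost, Bagging, XGBoost, and LightGBM models would take their place in practice, and the constant embedding values are made up for illustration.

```python
import math


def stable_sigmoid(score):
    """Map a real-valued score to (0, 1) without overflowing for large negative scores."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def make_stub_classifier(weight):
    """Hypothetical stand-in for a trained classifier: returns P(preset category)."""
    def predict_probability(embedding):
        mean = sum(embedding) / len(embedding)
        return stable_sigmoid(weight * mean)
    return predict_probability


# Five stubs named after the abbreviations used in the text.
classifiers = {
    name: make_stub_classifier(weight)
    for name, weight in zip(["RF", "AB", "BA", "XGB", "LGB"], [0.5, 1.0, 1.5, 2.0, 2.5])
}

# Made-up embedding vectors with the dimensions quoted in the text.
embeddings = {
    "SeqVec": [0.1] * 1024,
    "ProtCNN": [0.2] * 1100,
    "ProtTrans": [-0.1] * 1024,
    "CPCProt": [0.3] * 1536,
}


def target_probability_vector(embeddings, classifiers):
    """Concatenate the 1-dimensional output of every (embedding, classifier) pair."""
    vector = []
    for embedding in embeddings.values():
        for predict_probability in classifiers.values():
            vector.append(predict_probability(embedding))
    return vector


target = target_probability_vector(embeddings, classifiers)
```

The iteration order fixes the position of each pair in the vector, so the same order has to be used at training time and at prediction time.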
[0051] In another embodiment of this application, the plurality of classification models can be other models, and this application does not limit them. The examples of the plurality of classification models given above are merely illustrative and do not constitute a limitation on the plurality of classification models.
[0052] In this embodiment, the multiple classification models can be cascaded. Each embedding vector is classified using a model integrated from these multiple models to obtain the target probability vector, which mitigates the overfitting problem of a single model. Furthermore, since the target probability vector is obtained by concatenating multiple initial probability vectors, it incorporates information from these initial probability vectors, enabling it to fully reflect the information contained within the protein sequence.
[0053] S14, predict the target probability vector based on the pre-trained prediction model to obtain the prediction result of the protein category of the protein sequence.
[0054] In one embodiment, the protein categories include preset protein categories and non-preset protein categories. The preset protein categories include, but are not limited to, thermophilic proteins and psychrophilic proteins. The non-preset protein categories are protein categories other than the preset protein categories. For example, if the preset protein category is a thermophilic protein, the non-preset protein category can be a non-thermophilic protein; or, if the preset protein category is a psychrophilic protein, the non-preset protein category can be a non-psychrophilic protein.
[0055] In some embodiments, the prediction result includes whether the protein sequence belongs to the preset protein category or not.
[0056] In some embodiments, the prediction model can be a machine learning model. For example, the prediction model can be a Support Vector Machine (SVM) model.
[0057] In this embodiment, since the target probability vector can fully reflect the information contained within the protein sequence, predicting the target probability vector can improve the accuracy of protein sequence identification.
[0058] As can be seen from the above technical solution, this application employs multiple self-supervised models with different network architectures to extract information from protein sequences.
[0059] Because the architectures differ, the extracted embedding vectors represent information at different levels within the protein sequence. Furthermore, this application relies on multiple pre-trained self-supervised models to extract information from the protein sequence, rather than depending on manually designed features. This solves the technical problem of limited expressive power of manually extracted features and also improves the speed of information extraction. Since this application inputs each embedding vector into each classification model, the output of each embedding vector on each classification model can be obtained. The outputs of the multiple embedding vectors on the multiple classification models are concatenated to obtain the target probability vector. This target probability vector integrates the information extracted from the protein sequence by the multiple self-supervised models and the multiple classification models, so it can fully reflect the information contained within the protein sequence. Predicting the target probability vector can therefore improve the efficiency of protein sequence identification.
[0060] In some embodiments of this application, the electronic device inputs the target probability vector into the prediction model to obtain a prediction result of the protein category of the protein sequence. Figure 3 is a flowchart of a method for generating prediction results according to an embodiment of this application, which includes the following steps:
[0061] S141, the electronic device encodes the preset protein category and the non-preset protein category respectively to obtain a category vector.
[0062] In some embodiments, the preset protein category includes, but is not limited to, thermophilic proteins and psychrophilic proteins. The non-preset protein category is any category other than the preset protein category. For example, if the preset protein category is a thermophilic protein, the non-preset protein category can be a non-thermophilic protein; or, if the preset protein category is a psychrophilic protein, the non-preset protein category can be a non-psychrophilic protein.
[0063] In some embodiments, the electronic device may perform one-hot encoding on the preset protein category and the non-preset protein category respectively to obtain the category vector. For example, the electronic device may encode the preset protein category as 1 and the non-preset protein category as 0 to obtain a category vector composed of 1 and 0.
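A minimal sketch of the 1/0 encoding in this example is given below, alongside a two-column variant for comparison. The function names and the label strings are hypothetical; the source only states that the preset category maps to 1 and the non-preset category to 0.

```python
def encode_labels(labels, preset_category):
    """Encode the preset category as 1 and every other category as 0."""
    return [1 if label == preset_category else 0 for label in labels]


def one_hot_labels(labels, preset_category):
    """Two-column variant: [1, 0] for the preset category, [0, 1] otherwise."""
    return [[1, 0] if label == preset_category else [0, 1] for label in labels]


labels = ["thermophilic", "non-thermophilic", "thermophilic"]
category_vector = encode_labels(labels, "thermophilic")
category_matrix = one_hot_labels(labels, "thermophilic")
```

For a two-category problem the single 1/0 column carries the same information as the two-column form, which may be why the text describes it as one-hot encoding.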
[0064] In other embodiments of this application, the electronic device may encode the preset protein category and the non-preset protein category in other ways, and this application does not limit the encoding method.
[0065] S142, the electronic device classifies the target probability vector based on the category vector to determine the predicted probability that the protein sequence belongs to the preset protein category.
[0066] In some embodiments, the electronic device classifies the target probability vector based on the category vector, and the method for determining the predicted probability that the protein sequence belongs to the preset protein category can refer to the classification formulas corresponding to the sigmoid function or softmax function in related technologies.
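For reference, the two functions named here can be written in a few lines. This is a generic sketch of the standard sigmoid and softmax definitions, not the prediction model itself; how the prediction model produces the raw score that is fed into them is not specified in the source.

```python
import math


def sigmoid(score):
    """Map one real-valued score to a probability in (0, 1), computed stably."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def softmax(scores):
    """Map a list of scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


p_single = sigmoid(0.0)        # a score of 0 gives probability 0.5
p_pair = softmax([0.0, 0.0])   # equal scores give equal probabilities
```

With two categories, the softmax over the scores `[s, 0]` gives the same probability for the first category as `sigmoid(s)`, so either function can serve for the preset versus non-preset decision.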
[0067] S143, the electronic device determines the prediction result based on the comparison result between the predicted probability and the preset threshold.
[0068] In some embodiments, the electronic device determines the prediction result based on a comparison between the predicted probability and a preset threshold, including: if the predicted probability is greater than or equal to the preset threshold, the electronic device determines that the predicted result is that the protein sequence belongs to the preset protein category; or, if the predicted probability is less than the preset threshold, the electronic device determines that the predicted result is that the protein sequence does not belong to the preset protein category, or the protein sequence belongs to the non-preset protein category.
[0069] In one embodiment of this application, if the electronic device determines that the prediction result is that the protein sequence belongs to the preset protein category, the electronic device outputs first preset data to indicate that the protein sequence belongs to the preset protein category; or, if the electronic device determines that the prediction result is that the protein sequence does not belong to the preset protein category (that is, the protein sequence belongs to the non-preset protein category), the electronic device outputs second preset data to indicate that the protein sequence does not belong to the preset protein category.
[0070] The first preset data and the second preset data can be in the form of numbers, letters, or a combination of numbers and letters. The first preset data and the second preset data can be set arbitrarily, and this application does not impose any restrictions on them. For example, when both the first preset data and the second preset data are numerical values, the first preset data can be 1, and the second preset data can be 0.
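Steps S143 and the output convention above amount to a single comparison. The sketch below uses 1 and 0 as the first and second preset data, as in the example; the default threshold of 0.5 is an assumption, since the source does not state a value, and the function name `decide` is hypothetical.

```python
FIRST_PRESET_DATA = 1   # sequence belongs to the preset protein category
SECOND_PRESET_DATA = 0  # sequence does not belong to the preset protein category


def decide(predicted_probability, threshold=0.5):
    """Compare the predicted probability with the preset threshold (0.5 is assumed)."""
    if not 0.0 <= predicted_probability <= 1.0:
        raise ValueError("predicted_probability must lie in [0, 1]")
    # "Greater than or equal to" maps to the preset category, as in paragraph [0068].
    if predicted_probability >= threshold:
        return FIRST_PRESET_DATA
    return SECOND_PRESET_DATA


result_at_threshold = decide(0.5)
result_below = decide(0.49)
result_strict = decide(0.9, threshold=0.95)
```

Raising the threshold trades recall for precision on the preset category, which may matter when false positives are costly to validate.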
[0071] In some embodiments of this application, the plurality of self-supervised models, the plurality of classification models, and the prediction model need to be generated before they are used. Figure 4 is a flowchart of a method for generating multiple self-supervised models, multiple classification models, and a prediction model according to an embodiment of this application, which includes the following steps:
[0072] S21, the electronic device acquires the dataset, divides the dataset, and obtains a first training set and a second training set.
[0073] In some embodiments, the dataset includes multiple protein sequences, each protein sequence having a corresponding category label. The dataset may include, but is not limited to, partial or all data from Uniprot database, Brenda database, and other common databases containing protein sequences. The first training set includes multiple first protein sequences and a first category label corresponding to each first protein sequence; the second training set includes multiple second protein sequences and a second category label corresponding to each second protein sequence. Both the first and second category labels include preset protein categories.
[0074] In some embodiments, each first protein sequence has a corresponding first category label, and each second protein sequence has a corresponding second category label. The first category label is used to indicate the category of each first protein sequence, and the second category label is used to indicate the category of each second protein sequence.
[0075] In this embodiment, the electronic device can configure its own method for partitioning the dataset, and this application does not impose any limitations on this. For example, the electronic device can use 80% of the data in the dataset as the first training set and the category labels of that 80% of the data as the first category labels. Then, the electronic device can use the remaining 20% of the data in the dataset as the second training set and the category labels of that 20% of the data as the second category labels. The above example is merely an illustration of one way to partition the dataset and does not constitute a limitation on the way the dataset can be partitioned.
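The 80% / 20% partition in this example could look like the sketch below. Shuffling before the cut and the fixed seed are assumptions added for illustration; the source only specifies the proportions. The record format `(sequence, label)` and the name `split_dataset` are likewise hypothetical.

```python
import random


def split_dataset(records, first_fraction=0.8, seed=0):
    """Split labelled records into a first and a second training set."""
    if not 0.0 < first_fraction < 1.0:
        raise ValueError("first_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # assumed: shuffle before cutting
    cut = int(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of ten (sequence, category label) records.
records = [(f"SEQ{i}", i % 2) for i in range(10)]
first_set, second_set = split_dataset(records)
```

Each record keeps its label through the split, so the first and second category labels are simply the labels carried by the records in each part.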
[0076] S22, the electronic device acquires multiple initial classifiers corresponding to the multiple classification models and a prediction learner corresponding to the prediction model.
[0077] In some embodiments, the plurality of initial classifiers may include the classifiers corresponding to the Random Forest model, the AdaBoost (adaptive boosting) model, the Bagging model, the XGBoost (extreme gradient boosting) model, and the LightGBM (gradient boosting decision tree) model, and the prediction learner may be the learner of the Support Vector Machine (SVM) model.
[0078] S23, the electronic device trains each initial classifier based on any one of the multiple self-supervised models, the multiple first protein sequences, and the first category label corresponding to each first protein sequence, to obtain a classification model corresponding to each initial classifier.
[0079] In some embodiments, the electronic device inputs each first protein sequence into each self-supervised model to obtain the initial embedding vector output by each self-supervised model for that sequence. Each first protein sequence thus has multiple initial embedding vectors, all associated with the same first category label. The electronic device then inputs these initial embedding vectors into the multiple initial classifiers to obtain the classification vector output by each initial classifier for each initial embedding vector. The multiple classification vectors corresponding to each first protein sequence are concatenated to obtain a concatenation probability vector for that sequence. The electronic device makes a prediction from the concatenation probability vector of each first protein sequence to obtain a predicted category for that sequence. An initial loss value is calculated based on the predicted categories and the first category labels of the multiple first protein sequences. The parameters of the multiple initial classifiers are adjusted based on the initial loss value until the initial loss value meets the configuration conditions, thus obtaining the classification model corresponding to each initial classifier.
[0080] In this embodiment, the generation method of the plurality of initial embedding vectors is basically the same as that of the plurality of embedding vectors described above, and the generation method of the concatenation probability vector is basically the same as that of the target probability vector described above; therefore, this application will not repeat the description. Furthermore, the prediction process of the electronic device for each concatenation probability vector can refer to the prediction process for the target probability vector described below, the calculation method of the initial loss value can refer to the calculation method of the training loss value described below, and the configuration conditions are basically the same as the preset conditions described below.
[0081] S24, the electronic device combines any self-supervised model and any classification model to obtain multiple pairs of combined models.
[0082] For example, if the multiple self-supervised models are represented by A, B, C, and D, and the multiple classification models are represented by a, b, c, d, and e, the electronic device combines any self-supervised model and any classification model to obtain multiple pairs of combined models, as shown in Table 1.
[0083] Table 1 Examples of Multiple Pair Combination Models
[0084]
[0085]
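The combination step in paragraph [0082] can be sketched as a Cartesian product of the two model lists. The pair labels such as `Aa` are an assumed notation for illustration and may differ from the layout used in Table 1.

```python
from itertools import product

# Placeholder names from paragraph [0082].
self_supervised_models = ["A", "B", "C", "D"]
classification_models = ["a", "b", "c", "d", "e"]

# Pair every self-supervised model with every classification model.
pairs = [ssm + clf for ssm, clf in product(self_supervised_models,
                                           classification_models)]
```

With four self-supervised models and five classification models, this yields 4 × 5 = 20 combined models.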
[0086] S25, the electronic device extracts information from each second protein sequence according to each pair of combined models to obtain the training probability vector corresponding to each second protein sequence.
[0087] In some embodiments, the generation method of each training probability vector is basically the same as the generation method of the target probability vector, so this application will not repeat the description.
[0088] S26, the electronic device uses the prediction learner to predict each training probability vector to obtain the training prediction result for each second protein sequence.
[0089] In some embodiments, the training prediction results are generated in a manner substantially the same as the prediction results of the protein sequence, so this application will not repeat the description here.
[0090] S27, the electronic device generates the prediction model based on the training prediction result, the second category label, and the prediction learner.
[0091] In some embodiments, the electronic device generates the prediction model based on the training prediction results, the second category label, and the prediction learner by: the electronic device calculating a training loss value based on multiple training prediction results and the second category label corresponding to each training prediction result, adjusting the parameters of the prediction learner based on the training loss value, until the training loss value meets a preset condition, and obtaining the prediction model.
[0092] The electronic device can adjust parameters such as the weights and biases of the multiple self-supervised models, the multiple initial classifiers, and the prediction learner. The preset conditions can be set independently, and this application does not impose any restrictions on them. The preset conditions relate to the training prediction results; for example, the preset conditions may be that the training loss value decreases to a preset range, that the training loss value decreases to a minimum, or that the training loss value no longer changes. The preset range can also be set independently, and this application does not impose any restrictions on it.
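A possible check for two of the preset conditions named above (the loss falling into a preset range, or the loss no longer changing) is sketched below. The function name, the `tolerance` parameter, and the representation of the preset range as a `(low, high)` pair are assumptions made for illustration.

```python
def meets_preset_condition(losses, preset_range=None, tolerance=1e-6):
    """Return True when the latest training loss satisfies a stop rule.

    losses: training loss values recorded so far, oldest first.
    preset_range: optional (low, high) interval (hypothetical parameter).
    tolerance: change below which the loss counts as "no longer changing".
    """
    if not losses:
        return False
    latest = losses[-1]
    # Condition 1: the loss has fallen into the preset range.
    if preset_range is not None:
        low, high = preset_range
        if low <= latest <= high:
            return True
    # Condition 2: the loss no longer changes between two adjustments.
    if len(losses) >= 2 and abs(losses[-2] - latest) <= tolerance:
        return True
    return False
```

A training loop would call this after each parameter adjustment and stop once it returns `True`.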
[0093] In some embodiments, the electronic device calculates the training loss value based on a plurality of training prediction results and a second category label corresponding to each training prediction result, including: the electronic device determines whether each training prediction result is correct based on the second category label corresponding to each training prediction result, and then the electronic device calculates the training loss value based on the correct training prediction results and the number of labels of the second category label.
[0094] The electronic device can compare each training prediction result with the corresponding second category label. If the protein category indicated by a training prediction result is the same as the corresponding second category label, the electronic device determines that the training prediction result is correct. Conversely, if the protein category indicated by a training prediction result differs from the corresponding second category label, the electronic device determines that the training prediction result is incorrect.
[0095] For example, if the training prediction result of any second protein sequence indicates that the second protein sequence belongs to the preset protein category, and if the second category label of the second protein sequence is the preset protein category, then the electronic device determines that the training prediction result of the second protein sequence is correct. Alternatively, if the training prediction result of any second protein sequence indicates that the second protein sequence belongs to the preset protein category, and if the second category label of the second protein sequence is not the preset protein category, then the electronic device determines that the training prediction result of the second protein sequence is incorrect.
[0096] In one embodiment, the electronic device uses an ensemble algorithm (such as a bagging algorithm or a boosting algorithm) to train the plurality of initial classifiers to obtain a model formed by cascading the plurality of classification models.
[0097] In this embodiment, the electronic device calculates the number of correct training predictions and determines the training loss value as the ratio between the number of correct predictions and the number of labels. When the training loss value no longer changes, the electronic device stops adjusting, thus obtaining the plurality of self-supervised models, the plurality of classification models, and the prediction model.
[0098] In other embodiments of this application, the training loss value can also be calculated in other ways. For example, the electronic device can determine the training loss value as the ratio between the number of incorrect training predictions and the number of labels. When the training loss value no longer decreases or decreases to a preset range, the electronic device stops adjusting, thus obtaining the plurality of self-supervised models, the plurality of classification models, and the prediction model.
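The two ratio-based training loss values described in paragraphs [0097] and [0098] can be sketched as below. The function name `training_loss` and the flag `count_correct` are hypothetical; categories are assumed to be encoded as comparable values such as 0 and 1.

```python
def training_loss(predictions, labels, count_correct=True):
    """Ratio-based training loss value.

    count_correct=True  -> correct predictions / number of labels ([0097])
    count_correct=False -> incorrect predictions / number of labels ([0098])
    """
    if len(predictions) != len(labels):
        raise ValueError("each prediction needs exactly one label")
    if not labels:
        raise ValueError("at least one label is required")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    counted = correct if count_correct else len(labels) - correct
    return counted / len(labels)

# Toy example: 3 of 4 predictions match their second category labels.
loss_correct = training_loss([1, 0, 1, 1], [1, 0, 0, 1])
loss_incorrect = training_loss([1, 0, 1, 1], [1, 0, 0, 1], count_correct=False)
```

The first variant grows as predictions improve, which is presumably why that embodiment stops when the value no longer changes, whereas the second variant shrinks and can be stopped when it no longer decreases or reaches a preset range.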
[0099] Figure 5 is a functional block diagram of a protein identification device provided in an embodiment of this application. The protein identification device 11 includes an acquisition unit 110, an extraction unit 111, a generation unit 112, and a prediction unit 113. A module / unit referred to in this application is a series of computer-readable instruction segments that can be acquired by the processor 13 in Figure 1, that perform a fixed function, and that are stored in the memory 12 in Figure 1. In this embodiment, the functions of each module / unit are described in detail in the following embodiments.
[0100] In some embodiments, the acquisition unit 110 is used to acquire protein sequences.
[0101] The extraction unit 111 is used to extract information from the protein sequence using multiple pre-trained self-supervised models to obtain multiple embedding vectors of the protein sequence.
[0102] The generation unit 112 is used to generate a target probability vector of the protein sequence based on the plurality of embedding vectors and the plurality of pre-trained classification models.
[0103] In some embodiments, the generation unit 112 is further configured to input each embedding vector into the plurality of classification models to obtain a plurality of initial probability vectors for each embedding vector, and concatenate all the initial probability vectors of the plurality of embedding vectors to obtain the target probability vector.
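The concatenation performed by the generation unit can be sketched as follows. The classifiers here are hypothetical stand-in functions that each return a two-entry probability vector `[P(preset), P(non-preset)]`; the actual classification models and the length of their outputs are not specified at this point in the text.

```python
def build_target_vector(embeddings, classifiers):
    """Feed every embedding to every classifier and concatenate the outputs."""
    target = []
    for embedding in embeddings:
        for classify in classifiers:
            # Each call yields one initial probability vector.
            target.extend(classify(embedding))
    return target

def _clip(p):
    """Keep a toy score inside [0, 1]."""
    return min(max(p, 0.0), 1.0)

# Hypothetical stand-in classifiers, for illustration only.
def clf_mean(embedding):
    p = _clip(sum(embedding) / len(embedding))
    return [p, 1.0 - p]

def clf_first(embedding):
    p = _clip(embedding[0])
    return [p, 1.0 - p]

# Two made-up embedding vectors, as if produced by two self-supervised models.
embeddings = [[0.2, 0.6], [0.9, 0.5]]
target = build_target_vector(embeddings, [clf_mean, clf_first])
```

Under the two-entry assumption, 2 embeddings and 2 classifiers give a target probability vector of length 2 × 2 × 2 = 8; four self-supervised models and five classification models would give 4 × 5 × 2 = 40.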
[0104] The prediction unit 113 is used to predict the target probability vector based on a pre-trained prediction model to obtain a prediction result of the protein category of the protein sequence. The preset protein categories include thermophilic proteins and psychrophilic proteins.
[0105] In some embodiments, the protein category includes a preset protein category and a non-preset protein category. The prediction unit 113 is further configured to predict the target probability vector according to a pre-trained prediction model to obtain a prediction result of the protein category of the protein sequence, including: encoding the preset protein category and the non-preset protein category respectively to obtain a category vector; classifying the target probability vector based on the category vector; determining the predicted probability that the protein sequence belongs to the preset protein category; and determining the prediction result based on the comparison result of the predicted probability and a preset threshold.
[0106] In some embodiments, the prediction unit 113 is further configured to determine the prediction result based on the comparison result of the prediction probability and the preset threshold, including: if the prediction probability is greater than or equal to the preset threshold, determining that the prediction result is that the protein sequence belongs to the preset protein category; or, if the prediction probability is less than the preset threshold, determining that the prediction result is that the protein sequence does not belong to the preset protein category.
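The threshold comparison in paragraphs [0105] and [0106] can be sketched as below. The category codes (1 and 0), the default threshold of 0.5, and the function name are assumptions; the text does not state concrete values.

```python
# Assumed encoding of the two categories; the codes are illustrative only.
CATEGORY_CODES = {"preset": 1, "non_preset": 0}

def prediction_result(predicted_probability, preset_threshold=0.5):
    """Map the predicted probability of the preset category to a category code."""
    if not 0.0 <= predicted_probability <= 1.0:
        raise ValueError("predicted_probability must lie in [0, 1]")
    # Greater than or equal to the threshold -> preset protein category.
    if predicted_probability >= preset_threshold:
        return CATEGORY_CODES["preset"]
    # Less than the threshold -> not the preset protein category.
    return CATEGORY_CODES["non_preset"]
```

A probability exactly equal to the threshold is assigned to the preset protein category, matching the "greater than or equal to" wording above.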
[0107] In some embodiments, the generation unit 112 is further configured to acquire a dataset, divide the dataset to obtain a first training set and a second training set, wherein the first training set includes multiple first protein sequences and a first category label corresponding to each first protein sequence, and the second training set includes multiple second protein sequences and a second category label corresponding to each second protein sequence. Both the first category label and the second category label include a preset protein category. The unit acquires multiple initial classifiers corresponding to the multiple classification models and a prediction learner corresponding to the prediction model. It trains each initial classifier based on any one of the multiple self-supervised models, the multiple first protein sequences, and the first category label corresponding to each first protein sequence to obtain a classification model corresponding to each initial classifier. It combines any self-supervised model and any classification model to obtain multiple pairs of combined models. It extracts information from each second protein sequence based on each pair of combined models to obtain a training probability vector corresponding to each second protein sequence. It uses the prediction learner to predict each training probability vector to obtain a training prediction result for each second protein sequence. Based on the training prediction result, the second category label, and the prediction learner, it generates the prediction model.
[0108] In some embodiments, the generation unit 112 is further configured to generate the prediction model based on the training prediction results, the second category label, and the prediction learner, including: calculating a training loss value based on multiple training prediction results and the second category label corresponding to each training prediction result; adjusting the parameters of the prediction learner based on the training loss value until the training loss value meets a preset condition, thereby obtaining the prediction model.
[0109] In some embodiments, the generation unit 112 is further configured to calculate a training loss value based on a plurality of training prediction results and a second category label corresponding to each training prediction result, including: determining whether each training prediction result is correct based on the second category label corresponding to each training prediction result, and calculating the training loss value based on the correct training prediction result and the number of labels of the second category label.
[0110] As can be seen from the above technical solution, this application employs multiple self-supervised models to extract information from protein sequences. Since these multiple self-supervised models have different network architectures, it can be ensured that the multiple extracted embedding vectors represent information at different levels within the protein sequence.
[0111] Furthermore, this application relies on multiple pre-trained self-supervised models to extract information from the protein sequence, rather than depending on manually designed features. This solves the technical problem of the limited expressive power of manually extracted features and also improves the speed of information extraction. Since this application inputs each embedding vector into each classification model, the output of each embedding vector on each classification model can be obtained. The outputs of the multiple embedding vectors on the multiple classification models are concatenated to obtain the target probability vector. This target probability vector integrates the information extracted from the protein sequence by the multiple self-supervised models and the multiple classification models, so it can fully reflect the information contained within the protein sequence. Making the prediction from the target probability vector can therefore improve the efficiency of protein sequence identification.
[0112] In one embodiment, continuing the description of the electronic device 1 in Figure 1, the processor 13 can be a Central Processing Unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor can be a microprocessor or any conventional processor. The processor 13 is the computing core and control center of the electronic device 1; it connects the various parts of the electronic device 1 through various interfaces and lines and acquires the operating system of the electronic device 1 as well as the various installed application programs, program code, etc.
[0113] The processor 13 acquires the operating system and the various installed applications of the electronic device 1. The processor 13 acquires these applications to implement the steps described in the various protein identification method embodiments above, for example, the steps shown in Figure 2, Figure 3, and Figure 4.
[0114] For example, a computer program may be divided into one or more modules / units, such as an acceleration unit. One or more modules / units are stored in memory 12 and retrieved by processor 13 to complete this application. One or more modules / units may be a series of computer program instruction segments capable of performing a specific function, which describe the process of retrieving the computer program in electronic device 1.
[0115] The memory 12 can be used to store computer programs and / or modules. The processor 13 implements various functions of the electronic device 1 by running or retrieving the computer programs and / or modules stored in the memory 12, and by calling the data stored in the memory 12. The memory 12 may mainly include a program storage area and a data storage area. The program storage area may store the operating system, application programs required for at least one function (such as sound playback function, image playback function, etc.), etc.; the data storage area may store data created according to the use of the server, etc. In addition, the memory 12 may include non-volatile memory, such as hard disk, memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one disk storage device, flash memory device, or other non-volatile solid-state storage device.
[0116] The memory 12 can be the external memory and / or internal memory of the electronic device 1. Furthermore, the memory 12 can be a physical memory, such as a memory stick, a TF card (Trans-flash Card), etc.
[0117] If the modules / units integrated in electronic device 1 are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when the computer program is acquired by a processor, it can implement the steps of the various method embodiments described above.
[0118] Computer programs include computer program code, which can be in the form of source code, object code, accessible files, or certain intermediate forms. Computer-readable media can include: any entity or device capable of carrying computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, and read-only memory (ROM).
[0119] With reference to Figure 2, the memory 12 in the electronic device 1 stores multiple instructions to implement a protein identification method, and the processor 13 can acquire the multiple instructions to implement: acquiring a protein sequence; extracting information from the protein sequence using multiple pre-trained self-supervised models to obtain multiple embedding vectors of the protein sequence; generating a target probability vector of the protein sequence based on the multiple embedding vectors and multiple pre-trained classification models; and predicting the target probability vector based on a pre-trained prediction model to obtain a prediction result of the protein category of the protein sequence.
[0120] Specifically, for the way the processor 13 implements the above instructions, refer to the descriptions of the relevant steps in the embodiment corresponding to Figure 2, which are not repeated here.
[0121] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and other division methods may be used in actual implementation.
[0122] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0123] Furthermore, the functional modules in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional modules.
[0124] Therefore, the embodiments should be considered exemplary and non-limiting in all respects, and the scope of this application is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be embraced within this application. No reference sign in the claims should be construed as limiting the scope of the claims.
[0125] Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices described in this application may also be implemented by a single unit or device through software or hardware. The terms "first," "second," etc., are used to indicate names and do not indicate any specific order.
[0126] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and are not intended to limit it. Although this application has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of this application without departing from the spirit and scope of the technical solutions of this application.
Claims
1. A method for protein identification, characterized in that the method includes: obtaining a protein sequence; extracting information from the protein sequence using multiple pre-trained self-supervised models to obtain multiple embedding vectors of the protein sequence; generating a target probability vector for the protein sequence based on the multiple embedding vectors and multiple pre-trained classification models, including: inputting each embedding vector into the multiple classification models to obtain multiple initial probability vectors for each embedding vector, and concatenating all the initial probability vectors of the multiple embedding vectors to obtain the target probability vector; and predicting the target probability vector based on a pre-trained prediction model to obtain a prediction result of the protein category of the protein sequence, wherein the protein category includes a preset protein category and a non-preset protein category, the prediction result indicates whether the protein sequence belongs to the preset protein category, and the preset protein category includes thermophilic proteins and psychrophilic proteins.
The training method for the prediction model includes: acquiring a dataset; dividing the dataset to obtain a first training set and a second training set, wherein the first training set includes multiple first protein sequences and a first category label corresponding to each first protein sequence, and the second training set includes multiple second protein sequences and a second category label corresponding to each second protein sequence, wherein both the first category label and the second category label include a preset protein category; acquiring multiple initial classifiers corresponding to the multiple classification models and a prediction learner corresponding to the prediction model; training each initial classifier based on any one of the multiple self-supervised models, the multiple first protein sequences, and the first category label corresponding to each first protein sequence to obtain a classification model corresponding to each initial classifier; combining any self-supervised model and any classification model to obtain multiple pairs of combined models; extracting information from each second protein sequence based on each pair of combined models to obtain a training probability vector corresponding to each second protein sequence; using the prediction learner to predict each training probability vector to obtain a training prediction result for each second protein sequence; and generating the prediction model based on the training prediction result, the second category label, and the prediction learner.
2. The protein identification method as described in claim 1, characterized in that the step of generating the prediction model based on the training prediction result, the second category label, and the prediction learner includes: calculating a training loss value based on the multiple training prediction results and the second category label corresponding to each training prediction result; and adjusting the parameters of the prediction learner based on the training loss value until the training loss value meets a preset condition, thereby obtaining the prediction model.
3. The protein identification method as described in claim 2, characterized in that the step of calculating the training loss value based on the multiple training prediction results and the second category label corresponding to each training prediction result includes: determining, based on the second category label corresponding to each training prediction result, whether each training prediction result is correct; and calculating the training loss value based on the correct training prediction results and the number of labels of the second category label.
4. The protein identification method as described in claim 1, characterized in that the step of predicting the target probability vector based on a pre-trained prediction model to obtain a prediction result of the protein category of the protein sequence includes: encoding the preset protein category and the non-preset protein category respectively to obtain a category vector; classifying the target probability vector based on the category vector to determine the predicted probability that the protein sequence belongs to the preset protein category; and determining the prediction result based on the comparison result of the predicted probability and a preset threshold.
5. The protein identification method as described in claim 4, characterized in that determining the prediction result based on the comparison result of the predicted probability and the preset threshold includes: if the predicted probability is greater than or equal to the preset threshold, determining that the prediction result is that the protein sequence belongs to the preset protein category; or, if the predicted probability is less than the preset threshold, determining that the prediction result is that the protein sequence does not belong to the preset protein category.
6. An electronic device, characterized in that the electronic device includes: a memory storing at least one instruction; and a processor that executes the at least one instruction to implement the protein identification method as described in any one of claims 1 to 5.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores at least one instruction, which is executed by a processor in an electronic device to implement the protein identification method as described in any one of claims 1 to 5.