Hierarchical multi-label classification method for protein function prediction

A classification method applied to hierarchical multi-label classification for protein function prediction. The method is of practical significance because predicting protein function computationally reduces experimental cost.

Active Publication Date: 2016-11-16
NAT INST OF ADVANCED MEDICAL DEVICES SHENZHEN
Cites: 1. Cited by: 28.

AI Technical Summary

Problems solved by technology

[0006] Problems in the prior art: the hierarchical multi-label classification problem combines the characteristics of multi-label classification and hierarchical classification. In this type of problem, each sample can carry multiple labels, the labels are related by a known, predefined hierarchical structure, and each sample can have two or more labels at any level of that hierarchy. Classical binary classification algorithms and flat multi-label classification algorithms cannot be applied directly to this problem.
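To make the role of the label hierarchy concrete, the following minimal sketch checks whether a set of labels respects a hierarchy. It assumes that the hierarchy is a tree stored as a child-to-parent map (the GO scheme mentioned later is a graph in which a node may have several parents) and that a sample carrying a label must also carry every ancestor of that label. The node names and the helpers `ancestors`, `is_consistent` and `close_upwards` are hypothetical and are not taken from the patent.

```python
# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "binding": "root",
    "catalysis": "root",
    "dna_binding": "binding",
    "kinase": "catalysis",
}


def ancestors(node, parent=PARENT):
    """Return all ancestors of `node`, nearest first."""
    result = []
    current = parent[node]
    while current is not None:
        result.append(current)
        current = parent[current]
    return result


def is_consistent(labels, parent=PARENT):
    """True if every label in the set is accompanied by all of its ancestors."""
    labels = set(labels)
    return all(a in labels for lab in labels for a in ancestors(lab, parent))


def close_upwards(labels, parent=PARENT):
    """Add the missing ancestors so that the label set becomes consistent."""
    closed = set(labels)
    for lab in labels:
        closed.update(ancestors(lab, parent))
    return closed
```

A flat multi-label classifier can output a set such as `{"dna_binding"}` without `"binding"`, which `is_consistent` rejects; this is the kind of output a hierarchical method has to rule out.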


Examples


Specific Embodiment 1

[0044] Specific Embodiment 1. The hierarchical multi-label classification method for protein function prediction described in this embodiment is carried out in the following steps:

[0045] The method is divided into two stages, training and prediction:

[0046] 1. Training stage

[0047] In the training stage, an SVM (support vector machine) classifier is trained on the data set of each node in the class label hierarchy, yielding a set of SVM classifiers that are called the basic classifiers.
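The per-node training loop can be sketched as follows. This is only an illustration under stated assumptions: the excerpt does not say which SVM implementation, kernel or parameters are used, so a small linear SVM trained by subgradient descent on the regularised hinge loss stands in for it. The function names, the `node_datasets` layout and the default hyperparameters are hypothetical.

```python
def train_linear_svm(X, y, epochs=200, lam=0.001, lr=0.1):
    """Stand-in linear SVM: subgradient descent on L2-regularised hinge loss.

    X is a list of feature vectors, y a list of labels in {+1, -1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularisation term shrinks the weights at every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # The hinge loss is active: move towards classifying xi correctly.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def decision(model, x):
    """Signed score of sample x; a positive value means 'has the node's function'."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_base_classifiers(node_datasets):
    """Train one binary classifier per node.

    node_datasets maps a node name to (X, y), the node's own training set.
    """
    return {node: train_linear_svm(X, y) for node, (X, y) in node_datasets.items()}
```

Each node gets its own positive and negative samples, so the classifiers are independent of one another at this stage; the hierarchy only enters again when their outputs are combined during prediction.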

[0048] 1.1. Use proteins with known functions as training samples to form a training set, and represent each protein as a multidimensional vector; each element of the vector is called an attribute.

[0049] The values in this vector are numerical representations of real experimental results taken from standard biological databases.

[0050] In the field of machine learning, attributes refer to the ...

Specific Embodiment 2

[0071] Specific Embodiment 2. This embodiment further describes the hierarchical multi-label classification method for protein function prediction of Specific Embodiment 1. The positive and negative sample sets are constructed according to the improved sibling principle, as described in step 1.2; the specific process is as follows:

[0072] For each node in the GO annotation scheme, the training samples belonging to the node are used as positive samples, and the training samples belonging to the node's sibling nodes are used as initial negative samples. Any sample in the initial negative sample set that also belongs to the positive sample set is then removed, and the result is taken as the final negative sample set. If a node has no sibling nodes, trace upwards and select the samples belonging to the sibling nodes of its parent node as negative samples.
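The sibling-based construction described above can be sketched as follows. The sketch assumes a tree stored as a child-to-parent map and an `annotations` map from sample identifiers to the set of nodes each sample belongs to; both layouts and all names are hypothetical. It also keeps climbing while no siblings are found, which goes slightly beyond the text, since the text only mentions the parent node.

```python
def siblings(node, parent):
    """Nodes that share a parent with `node` (empty set for the root)."""
    p = parent.get(node)
    if p is None:
        return set()
    return {n for n, q in parent.items() if q == p and n != node}


def build_sample_sets(node, parent, annotations):
    """Return (positives, negatives) for one node.

    annotations maps a sample id to the set of nodes the sample belongs to.
    """
    positives = {s for s, labels in annotations.items() if node in labels}

    # Find siblings; if there are none, trace upwards through the ancestors.
    current = node
    sibs = siblings(current, parent)
    while not sibs and parent.get(current) is not None:
        current = parent[current]
        sibs = siblings(current, parent)

    # Initial negatives: samples belonging to at least one sibling node.
    initial_negatives = {s for s, labels in annotations.items() if labels & sibs}
    # Remove samples that are also positive for this node.
    return positives, initial_negatives - positives
```

Removing the overlap matters in the multi-label setting: a protein annotated with both a node and one of its siblings would otherwise appear in the training data with contradictory labels.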

[0073] Specific symbols indicate:...

Specific Embodiment 3

[0077] Specific Embodiment 3. This embodiment further describes the hierarchical multi-label classification method for protein function prediction of Specific Embodiment 1 or 2. The specific process is as follows:

[0078] The minority-class oversampling method described in the present invention is a hierarchical SMOTE oversampling method for minority-class samples;

[0079] Let X be a training set containing n samples, $X=\{x_1, x_2, \ldots, x_n\}$, and let the total number of sample labels be m, that is, there are m nodes in total. For a node i ($1 \le i \le m$), the positive sample set is the minority class, that is, the samples that have the function represented by the node; this set is denoted POS. The negative sample set is the majority class, that is, the samples that do not have the function represented by the node; this set is denoted NEG. POS contains pn samples, written $POS=\{x_{pos_1}, x_{pos_2}, \ldots, x_{pos_{pn}}\}$; there are nn samples...
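For orientation, the interpolation step of standard SMOTE can be sketched as follows. This is the generic technique, not the hierarchical variant of this embodiment, whose details are not given in the excerpt. The function name `smote`, the parameter defaults and the fixed seed are assumptions made for illustration; `pos` plays the role of POS and must contain at least two samples.

```python
import random


def smote(pos, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by standard SMOTE interpolation.

    pos is a list of minority-class feature vectors (the set POS).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(pos)
        # k nearest minority neighbours of x by squared Euclidean distance.
        others = [p for p in pos if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        # Place the new sample at a random point on the segment x -> neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic
```

Because every synthetic sample lies between two existing positive samples, oversampling enlarges POS without duplicating samples exactly, which is the usual motivation for SMOTE when NEG is much larger than POS.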


Abstract

The invention relates to the fields of bioinformatics and data mining, in particular to a hierarchical multi-label classification method for protein function prediction. It aims to solve the data set imbalance problem, the multi-label problem and the hierarchical constraint problem that arise when conventional classification methods are used to predict protein functions. The method comprises the following steps: 1, a training stage, in which an SVM classifier is trained on the data set of each node in the class label hierarchy to obtain a group of basic classifiers; and 2, a prediction stage, in which preliminary results for unknown samples are first obtained with the group of basic classifiers from the training stage and are then processed with a weighted TPR algorithm to obtain a final result that satisfies the hierarchical constraint condition, thereby predicting the protein functions. The method is applied in the fields of bioinformatics and data mining.
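The effect of post-processing the preliminary results can be illustrated with a deliberately simple rule. The sketch below is not the weighted TPR algorithm, whose details do not appear in this excerpt; it assumes a tree stored as a child-to-parent map and simply caps each node's score at the corrected score of its parent, so that thresholded predictions can never include a node without its ancestors. All names and the threshold value are hypothetical.

```python
def enforce_hierarchy(scores, parent):
    """Cap every node's score at the corrected score of its parent (top-down)."""
    corrected = {}

    def resolve(node):
        if node not in corrected:
            p = parent.get(node)
            if p is None:
                corrected[node] = scores[node]  # the root keeps its own score
            else:
                corrected[node] = min(scores[node], resolve(p))
        return corrected[node]

    for node in scores:
        resolve(node)
    return corrected


def predict_labels(scores, parent, threshold=0.5):
    """Labels whose corrected score reaches the threshold."""
    corrected = enforce_hierarchy(scores, parent)
    return {node for node, s in corrected.items() if s >= threshold}
```

With this rule a confident child under an unconfident parent is suppressed. A TPR-style method additionally lets confident children raise the score of their parent, which this sketch does not attempt.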

Description

Technical field

[0001] The invention relates to the fields of bioinformatics and data mining, in particular to a hierarchical multi-label classification method for protein function prediction.

Background art

[0002] Protein is the most abundant component of biological cells after water and is one of the most important macromolecules in living organisms. Protein is an important material basis for life activities and has many important physiological functions. Almost all life phenomena are expressed through proteins. Clarifying the biological functions of proteins is therefore of great significance for understanding biological processes in organisms, analyzing disease mechanisms, and developing new drugs.

[0003] A protein consists of a linear sequence of amino acid residues; once this sequence folds into a specific spatial conformation, the protein acquires the corresponding biological activity and function. The functions of proteins are...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F19/18, G06F19/24
CPC: G16B20/00, G16B40/00
Inventor: 冯收付平石金龙郑文斌徐明珠鲁健捷贾现召周颖
Owner: NAT INST OF ADVANCED MEDICAL DEVICES SHENZHEN