Ensemble learning method for predicting DNA protein binding site

An ensemble learning technique for binding-site prediction, applied in the field of bioinformatics. It addresses the failure of existing methods to account for the imbalanced nature of DNA protein binding site data, improving model performance and accuracy and reducing the recognition error rate.

Pending Publication Date: 2018-11-06
CHENGDU UNIV OF INFORMATION TECH

AI Technical Summary

Problems solved by technology

[0005] However, most of these methods do not take into account the unbalanced nature of the DNA protein binding site data itself, that is, the data of




Embodiment Construction

[0022] In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with specific embodiments and with reference to the accompanying drawings. It should be understood that these descriptions are only exemplary and not intended to limit the scope of the present invention. In addition, in the following description, descriptions of well-known structures and technologies are omitted to avoid unnecessarily obscuring the concept of the present invention.

[0023] Figure 1 shows the schematic flow chart of the ensemble learning method (ENSEMBLE-CNN) of the present invention. For imbalanced data, the ensemble learning method for predicting DNA protein binding sites of the present invention broadly comprises the following steps:

[0024] S1) Obtain the protein sequence data of the DNA protein binding site;

[0025] S2) Preprocessing the protein seque...
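
The abstract of this record names one-hot coding as the way the input data is built from the protein sequences. The following is a minimal sketch of that idea only, not the patent's implementation: the 20-letter alphabet ordering, the function name `one_hot_encode`, and the choice to map unknown residues to an all-zero row are assumptions made for illustration.

```python
# The 20 standard amino acids in one-letter code.
# The ordering here is an arbitrary choice (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-element 0/1 row per residue.

    Residues outside the 20-letter alphabet become an all-zero row
    (an assumption; the source does not say how they are handled).
    """
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows


# Hypothetical three-residue fragment used only as an example.
encoded = one_hot_encode("MKV")
```

Each residue becomes a sparse row with a single 1, so a sequence of length L yields an L x 20 matrix that a convolutional network could take as input.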


Abstract

The invention relates to an ensemble learning method for predicting DNA protein binding sites. The method comprises the following steps: obtaining protein sequence data of the DNA protein binding site; preprocessing the protein sequence data of the DNA protein binding site; establishing input data using one-hot coding; combining the extracted features to establish features for each amino acid on each protein sequence, and taking these features as the input data; oversampling the positive sample data using the SMOTE algorithm; dividing the negative sample data into a plurality of parts according to the number of positive samples, and combining each part of the negative samples with the positive samples to form a new data subset, thereby obtaining N data subsets; training a convolutional neural network on each data subset; and integrating the results of the N convolutional neural networks by majority voting to obtain the prediction result. The method addresses the problem of predicting DNA protein binding sites under imbalanced-data conditions and improves prediction accuracy.
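
The balancing steps in the abstract (oversample positives, split negatives into parts sized by the positive count, vote across models) can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented procedure: `smote_like` is a bare-bones interpolation in the spirit of SMOTE rather than a full implementation, the convolutional networks are omitted entirely, ties in `majority_vote` are resolved toward the positive class by assumption, and all function names and sample data are hypothetical.

```python
import random
from collections import Counter


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        # Sort the remaining samples by squared Euclidean distance to base.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def build_balanced_subsets(positives, negatives, seed=0):
    """Split negatives into parts of about len(positives) samples and
    pair every part with all positives, giving N labelled subsets."""
    if not positives:
        raise ValueError("at least one positive sample is required")
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    size = len(positives)
    subsets = []
    for start in range(0, len(shuffled), size):
        part = shuffled[start:start + size]
        subsets.append([(x, 1) for x in positives] + [(x, 0) for x in part])
    return subsets


def majority_vote(predictions):
    """Combine one 0/1 prediction per model; ties go to 1 (assumption)."""
    counts = Counter(predictions)
    return 1 if counts[1] >= counts[0] else 0


# Hypothetical toy data: 2 positives, 6 negatives -> 3 balanced subsets.
subsets = build_balanced_subsets(["p1", "p2"], ["n1", "n2", "n3", "n4", "n5", "n6"])
new_points = smote_like([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], n_new=4)
label = majority_vote([1, 0, 1])
```

In a full pipeline, one classifier would be trained per subset and `majority_vote` would be applied to their per-residue predictions. Because every subset contains all positives and a disjoint share of the negatives, no negative sample is discarded while each model still sees roughly balanced classes.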

Description

Technical field

[0001] The invention relates to the field of bioinformatics, in particular to an ensemble learning method for predicting DNA protein binding sites.

Background

[0002] At present, DNA protein binding sites are mainly detected by chromatin immunoprecipitation (ChIP) and protein binding microarrays (PBM). ChIP-Seq, which combines ChIP with second-generation sequencing, can efficiently detect DNA segments that interact with histones and transcription factors across the whole genome. The principle of ChIP-Seq is as follows: first, the DNA fragments bound by the target protein are specifically enriched by chromatin immunoprecipitation (ChIP), then purified and used for library construction; the enriched DNA fragments are then subjected to high-throughput sequencing. Researchers accurately locate the millions of sequence tags obtained on the genome to obtain information abo...


Application Information

IPC(8): G06F19/20, G06F19/18, G06N3/04
CPC: G06N3/045
Inventor: 张永清, 郜东瑞, 王婷, 吴锡, 何嘉
Owner CHENGDU UNIV OF INFORMATION TECH