A Novel Solution to the Unbalanced Problem of Protein Training Set Integrating Genetic Information

A method that incorporates genetic (evolutionary) information to address imbalance in protein training sets. It expands the under-represented subsets directly from the protein sequence, with the aim of improving the prediction success rate of predictors trained on the balanced set.

Inactive Publication Date: 2018-03-16
JINGDEZHEN CERAMIC UNIV

AI Technical Summary

Problems solved by technology

[0008] The technical problem addressed by the present invention is to provide a new solution to the imbalance problem of protein training sets. The solution fuses protein evolution information and expands the training set directly from the sequence.


Examples


Embodiment Construction

[0019] To make the object, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the examples. The specific embodiments described here serve only to explain the present invention, not to limit it.

[0020] The specific steps of the present invention's solution to the protein training set imbalance problem, which fuses genetic information, are as follows:

[0021] 1) Use the PSI-BLAST program to search the Swiss-Prot database and generate a position-specific scoring matrix (PSSM) for the protein sequence P.

[0022] Take a human protein as an example:

[0023] > AAA61157

[0024] MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSHFNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV

[0025] To calculate its position-specific scoring matrix...
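Step 1 can be sketched in Python as follows. This is a minimal sketch, not the procedure from the patent: it only formats the example sequence as a FASTA record and assembles a command line for the BLAST+ `psiblast` tool. The database name `swissprot`, the iteration count, the E-value and the output file name are assumptions chosen for illustration, because the source does not give the search parameters.

```python
# Example query taken from paragraphs [0023] and [0024].
QUERY_ID = "AAA61157"
QUERY_SEQ = (
    "MVPSAGQLALFALGIVLAACQALENSTSPLSADPPVAAAVVSHFNDCPDSHTQFCFHATCRFLVHEDKPACVCHSGYVGARCEHADLLAVVAASQKKQAITALVVVSIVALAVLIITCVLIHCCQVRKHCEWCRALICRHEKPSALLKGRTACCHSETLV"
)


def fasta_record(seq_id, seq, width=60):
    """Return a FASTA-formatted record with the sequence wrapped at `width`."""
    lines = [">" + seq_id]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


def psiblast_command(fasta_path, db="swissprot", iterations=3,
                     evalue=0.001, pssm_out="query.pssm"):
    """Build the argument list for a psiblast run that writes an ASCII PSSM.

    All default values are illustrative assumptions, not values from the patent.
    """
    return [
        "psiblast",
        "-query", str(fasta_path),
        "-db", db,
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),
        "-out_ascii_pssm", str(pssm_out),
    ]
```

With BLAST+ installed and a local Swiss-Prot database built, the record could be written to a file and the list passed to `subprocess.run(psiblast_command("query.fasta"), check=True)`.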


Abstract

The present invention provides a new solution to the imbalance problem of protein training sets that fuses genetic information. The amino acids in the conserved regions of the protein sequence P are kept unchanged, and the amino acids in the non-conserved regions are converted in turn into other amino acids according to the mutation probabilities given by the PSSM. This yields 20 virtual proteins that carry the genetic information of protein P. The small subsets in an unbalanced data set can then be expanded until the data set is balanced, which benefits the training of related predictors and can improve their prediction success rate. Unlike existing methods for handling unbalanced data, the present invention incorporates protein evolution information and expands the data directly from the sequence, rather than interpolating in the discrete numerical model that describes the sequence. The expansion therefore has a clear biological meaning, which is expected to improve the prediction success rate of related predictors and gives the method a wide range of applications.
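The expansion described in the abstract can be illustrated with a short Python sketch. It is one possible reading, not the patented procedure: the source does not say how PSSM scores become probabilities, how the conserved region is identified, or how the 20 virtual proteins are ordered. The sketch assumes a softmax over each PSSM row, treats a position as conserved when its largest probability reaches a hypothetical `conserved_threshold`, and builds the k-th virtual protein by substituting the k-th most probable amino acid at every non-conserved position.

```python
import math

# Column order of PSI-BLAST ASCII PSSM files.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def row_probabilities(scores):
    """Softmax over one PSSM row (an assumed score-to-probability mapping)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def virtual_proteins(seq, pssm, conserved_threshold=0.8):
    """Return 20 virtual sequences derived from `seq` and its PSSM.

    `pssm` is a list of rows, one per residue, each holding 20 scores in
    AMINO_ACIDS order. `conserved_threshold` is a hypothetical parameter.
    """
    if len(seq) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    ranked, conserved = [], []
    for row in pssm:
        probs = row_probabilities(row)
        # Amino acid indices sorted from most to least probable.
        ranked.append(sorted(range(20), key=lambda j: -probs[j]))
        conserved.append(max(probs) >= conserved_threshold)
    variants = []
    for k in range(20):
        chars = [
            res if conserved[i] else AMINO_ACIDS[ranked[i][k]]
            for i, res in enumerate(seq)
        ]
        variants.append("".join(chars))
    return variants
```

Applying `virtual_proteins` to every member of the minority class multiplies that class by up to 20, which is how the abstract's balancing step could be approached. Whether all 20 variants per protein are kept is left open here.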

Description

Technical field

[0001] The invention relates to the technical fields of bioinformatics, protein pseudo-amino acid composition and traditional protein sequence analysis, and in particular to a new method for solving the imbalance problem of protein training sets that fuses genetic information.

Background

[0002] With the completion of the sequencing of the human genome, bioinformatics has entered a new stage of development: the post-genome era. The Genome Project has produced hundreds of millions of genome sequences. How to answer a series of questions from these sequences, such as how life originated, how it evolved, and how these genes make living organisms function, is a current research hotspot. These gene sequences can be analyzed at multiple levels, such as base sequence, protein and genome. Since many biological phenotype properties and gene regulation are determined by the amino acid sequence of the protein, there are certain advantages in...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F19/24
Inventor: 肖绚刘子
Owner: JINGDEZHEN CERAMIC UNIV