Gene sequence classification method based on group and graph sparsity

A gene sequence classification method, applied in the field of computer biological information processing, that addresses problems such as the growth of the feature space and the resulting difficulty or impossibility of processing it by computer

Active Publication Date: 2013-12-25
NANJING UNIV

AI Technical Summary

Problems solved by technology

For this problem, however, using first-, second-, third-, and fourth-order templates enlarges the feature space to about 660 million features, which is difficult or even impossible to process by computer.
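
The order of magnitude quoted above can be reproduced with a short count, under two assumptions that the text does not state: that a k-th order template selects any k of the sequence positions, contributing 4^k base combinations per selection, and that the sequences are 90 positions long. Both are guesses chosen for illustration, and the function name `feature_space_size` is hypothetical.

```python
from math import comb


def feature_space_size(seq_len, max_order, alphabet_size=4):
    """Count features when every choice of k positions (k = 1..max_order)
    contributes alphabet_size**k features.

    This template definition is an assumption made for illustration;
    it is not taken from the patent text.
    """
    return sum(comb(seq_len, k) * alphabet_size ** k
               for k in range(1, max_order + 1))


# Hypothetical sequence length of 90 positions, templates of order 1 to 4.
print(feature_space_size(90, 4))  # 661711800
```

Under these assumptions the fourth-order templates alone account for almost all of the total, which is why dropping or pruning high-order groups matters so much for tractability.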


Examples


Embodiment 1

[0078] Consider a gene sequence classification problem in which the gene sequences to be classified are:

[0079] A. Positive class: AAGA, denoted as d_1

[0080] B. Negative class: ATTG, denoted as d_2

[0081] If represented by a first-order template, the feature space becomes: A, C, T, G, A, C, T, G, A, C, T, G, A, C, T, G. Features 1-4 represent the four possible bases at position 1, features 5-8 the four possible bases at position 2, features 9-12 the four possible bases at position 3, and features 13-16 the four possible bases at position 4. According to the vector representation method described above, the sequences are finally expressed in the form of Table 1:

[0082] Table 1

[0083]

Category          Gene sequence vector representation
Positive class    x_1 = (1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0)
Negative class    x_2 = (1,0,0,0,0,0...
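
The first-order representation described in [0081] can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation; the names `BASES` and `encode_first_order` are hypothetical, and the base order A, C, T, G follows the feature order given in the text.

```python
BASES = "ACTG"  # feature order used in the text: A, C, T, G


def encode_first_order(sequence):
    """Return the first-order vector of a sequence: four 0/1 features
    per position, with a 1 marking the base found at that position."""
    vector = []
    for base in sequence:
        one_hot = [0] * len(BASES)
        one_hot[BASES.index(base)] = 1
        vector.extend(one_hot)
    return tuple(vector)


x1 = encode_first_order("AAGA")
print(x1)  # (1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0)
```

Each position contributes exactly one nonzero entry, so a sequence of length 4 yields a 16-dimensional vector with four ones. The same call applies to the negative-class sequence.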

Embodiment 2

[0107] The algorithms used in the present invention are all written and implemented in Python. The machine used in the experiment has an Intel Xeon X7550 processor with a clock frequency of 2.00 GHz and 32 GB of memory. The SPAMS toolkit used in the present invention is a currently available general-purpose open-source package for training classifiers.

[0108] More specifically, as shown in Figure 1, the present invention operates as follows:

[0109] 1. Group the feature space: use a sparse representation to express each gene sequence as a vector, and divide the entire feature space into mutually disjoint groups. The feature space is built using the first-order, second-order, and third-order templates, and the groups are likewise formed according to the first-order, second-order, and third-order templates;

[0110] 2. Establish a directed acyclic graph between groups: connect the groups with a directed acyclic graph, and assign a cost value to each edge of the graph;

[0111] 3. Classi...
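
Steps 1 and 2 above can be sketched as follows. The text here does not specify how groups are formed, which groups are linked, or how costs are chosen, so everything in this sketch is an assumption made for illustration: one group per choice of k positions, an edge from each group to every group that adds exactly one position, and an edge cost equal to the number of features in the larger group. The names `build_groups`, `build_dag`, and `edge_cost` are hypothetical.

```python
from itertools import combinations


def build_groups(seq_len, max_order=3):
    """One group per choice of k positions, k = 1..max_order (assumed).
    Each group stands for the 4**k base combinations at its positions,
    so different groups share no features."""
    return [positions
            for k in range(1, max_order + 1)
            for positions in combinations(range(seq_len), k)]


def build_dag(groups, edge_cost=lambda parent, child: 4 ** len(child)):
    """Link each group to the groups that add exactly one position.

    Edges always run from a smaller group to a larger one, so the graph
    cannot contain a cycle. The cost function is a placeholder.
    """
    edges = {}
    for parent in groups:
        for child in groups:
            if len(child) == len(parent) + 1 and set(parent) < set(child):
                edges[(parent, child)] = edge_cost(parent, child)
    return edges


groups = build_groups(seq_len=4)
edges = build_dag(groups)
print(len(groups))  # 14
print(len(edges))   # 24
```

For a length-4 sequence this gives 4 + 6 + 4 = 14 groups. Each two-position group has two parents and each three-position group has three, hence 24 edges.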


Abstract

The invention provides a gene sequence classification method based on group and graph sparsity. The method comprises the following steps: according to the characteristics of gene sequences, the features in the feature space are divided into non-overlapping groups, and a directed acyclic graph is built between the groups. A classification model based on group and graph sparsity is then used to classify the gene sequences. The method improves on the existing sparsity-based gene sequence classification method and addresses two problems that lower classification accuracy: the groups are treated as independent of one another, and the groups differ greatly in scale. Building a directed acyclic graph between the groups handles both problems well and improves learning efficiency. A logistic regression classifier based on group and graph sparsity can select useful groups according to the constructed directed acyclic graph, which improves classification accuracy and also makes the classification model more interpretable.

Description

Technical field

[0001] The invention relates to the field of computer biological information processing, and in particular to a gene sequence classification method based on group and graph sparsity.

Background

[0002] With the rapid development of science and technology, a large number of biological problems need to be handled. As the amount of data keeps growing, however, manual processing can no longer meet the demand. With the rapid spread and development of computer technology, using computers to process biological data automatically has become very important in both scientific research and applications. Among these tasks, gene sequence classification is a very important one. Gene sequence classification uses a computer to assign a category (positive class or negative class) to a sequence based on its specific base sequence. For example, in the classification task of gene sequences, it is judged whethe...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F19/24
Inventor: 戴新宇, 付强
Owner: NANJING UNIV