Non-supervision classification method for metagenome contigs

A classification method and metagenomics technology, which can be applied in special data processing applications, instruments, electronic digital data processing, etc.

Active Publication Date: 2017-04-26
JILIN UNIV

AI Technical Summary

Problems solved by technology

However, the conventional fuzzy c-harmonic means algorithm is less effective at clustering imbalanced data.

Examples

Embodiment Construction

[0041] The steps of the present invention are:

[0042] ① Acquisition of contig data. The present invention is applicable to all metagenomic contig data sets, and a variety of metagenomic data can be downloaded from public online databases. For example, human gut metagenomic data can be downloaded from http://gutmeta.genomics.org.cn/.

[0043] ② Establishment of feature vectors.

[0044] (1) The present invention uses the k-mer frequencies of the DNA sequence as the classification features of a contig. A k-mer frequency is the frequency with which a given subsequence of length k occurs in the contig sequence. In the present invention, k is 4. Since DNA is composed of four nucleotides, A (adenine), T (thymine), G (guanine), and C (cytosine), there are 4^4 = 256 possible 4-mers, so the 4-mer frequency vector has 256 dimensions.
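The 4-mer feature described in [0044] could be computed roughly as follows. This is a minimal sketch, not the patent's implementation. The function name `kmer_frequencies`, the lexicographic ordering of the 256 k-mers, the skipping of windows that contain characters other than A, C, G, T, and the division by the number of counted windows are all assumptions made for illustration.

```python
from itertools import product

K = 4  # k-mer length stated in the text
ALPHABET = "ACGT"

# All 4**K = 256 possible k-mers, in lexicographic order (ordering is an assumption).
KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(contig):
    """Return the 256-dimensional 4-mer frequency vector of a contig.

    Assumptions: sliding windows with step 1; windows containing any
    character other than A/C/G/T (for example N) are skipped; counts are
    divided by the number of counted windows.
    """
    seq = contig.upper()
    counts = [0] * len(KMERS)
    total = 0
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        # No valid window: return an all-zero vector.
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


features = kmer_frequencies("ACGTACGT")
print(len(features))  # number of dimensions
```

Whether raw counts or relative frequencies are used does not change the result of a later division by the vector's maximum, since both differ only by a constant factor per contig.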

[0045] (2) Normalize the feature vector calculated in step (1) by dividing each of its elements by the maximum element of that vector, namely...
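The formula in [0045] is truncated in this listing, so the sketch below follows only the verbal description: each element is divided by the largest element of the same vector. The function name `max_normalize` and the handling of an all-zero vector are assumptions, not taken from the patent.

```python
def max_normalize(features):
    """Divide every element of a feature vector by its largest element."""
    peak = max(features)
    if peak == 0:
        # Assumption: an all-zero vector is returned unchanged
        # to avoid dividing by zero.
        return list(features)
    return [x / peak for x in features]


print(max_normalize([1, 2, 4]))  # [0.25, 0.5, 1.0]
```

After this step the largest component of every non-zero vector equals 1, which puts contigs of different lengths on a comparable scale.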

Abstract

The invention discloses an unsupervised classification method for metagenome contigs. It belongs to the technical field of bioinformatics analysis and aims to provide an unsupervised classification method for metagenome contigs by improving the fuzzy c-harmonic means algorithm. The method comprises the steps of obtaining contig data; establishing feature vectors; establishing a cost function that takes the mass of each category into account; calculating the cluster centers according to a cluster-center formula; and updating the membership matrix through a membership-matrix formula. The improved fuzzy c-harmonic means algorithm overcomes the poor performance of the conventional method on imbalanced data sets. Applying the improved algorithm to the unsupervised classification of metagenome contigs can improve classification accuracy and provides a better foundation for analyzing species diversity in the metagenome.

Description

Technical field

[0001] The invention belongs to the technical field of bioinformatics analysis.

Background technique

[0002] Compared with traditional genomics research, the advantage of metagenomics technology is that it can obtain most of the genetic material in the environment without laboratory cultivation, so that the relationship between species in the environment and between species and the environment can be analyzed. However, the raw data of metagenomics are a large number of short DNA fragments (reads). According to the overlapping relationship between DNA fragments, researchers can assemble them into longer DNA sequences, which are called contigs in bioinformatics. Classifying these contigs according to their species affiliation is the basis for analyzing species diversity in metagenomics.

[0003] However, due to the different genome lengths between species and the different abundances among species, the number of contigs contained by different species often...

Application Information

IPC(8): G06F19/24
CPC: G16B40/00
Inventors: 刘云, 刘富, 侯涛, 康冰, 王柯, 姜守坤, 王婧媛
Owner: JILIN UNIV