An unsupervised classification method for metagenomic contigs

A method for the unsupervised classification of metagenomic contigs, applied in fields such as biostatistics, bioinformatics, and instruments. It addresses the poor clustering performance of existing methods on unbalanced data and improves classification accuracy.

Active Publication Date: 2021-07-23
JILIN UNIV

AI Technical Summary

Problems solved by technology

However, the traditional method is less effective at clustering imbalanced data.

Examples

Embodiment Construction

[0040] The steps of the present invention are as follows:

[0041] ① Acquisition of contig data. The present invention is applicable to any metagenomic contig data set, and a variety of metagenomic data can be downloaded from public online databases. For example, human gut metagenomic data can be downloaded from http://gutmeta.genomics.org.cn/.

[0042] ② Establishment of feature vectors.

[0043] (1) The present invention uses the k-mer frequency of the DNA sequence as the classification feature of a contig. The k-mer frequency is the frequency with which each subsequence of length k occurs in the contig sequence; in the present invention, k = 4. Since DNA is composed of four nucleotides, A (adenine), T (thymine), G (guanine), and C (cytosine), the 4-mer frequency vector has 4^4 = 256 dimensions.

[0044] (2) Normalize the feature vector calculated in step (1) as follows: each element in the feature vector is divided by the maximum value of the element in th...
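
The 4-mer frequency feature described in step (1) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `kmer_frequencies`, the lexicographic A, C, G, T ordering of the 256 dimensions, the choice to skip windows containing ambiguous bases such as N, and the division of counts by the number of counted windows are all assumptions made for the example. The normalization of step (2) is not included.

```python
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the k-mer frequency vector of a DNA sequence (4**k entries).

    Assumptions (not taken from the patent): dimensions are ordered
    lexicographically over "ACGT", windows containing any other
    character are skipped, and counts are divided by the number of
    counted windows.
    """
    alphabet = "ACGT"
    # Map every possible k-mer to a fixed position in the vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = sequence.upper()
    total = 0
    # Slide a window of length k along the contig.
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:
            counts[position] += 1
            total += 1
    if total == 0:
        return [0.0] * len(index)
    return [c / total for c in counts]


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGT")
    print(len(features))
```

With k = 4 the returned list always has 256 entries, matching the dimension stated in the text, regardless of contig length.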

Abstract

An unsupervised classification method for metagenomic contigs, belonging to the technical field of bioinformatics analysis. The object of the present invention is to provide a method for the unsupervised classification of metagenomic contigs based on an improved fuzzy c-harmonic mean algorithm. The steps of the invention are: acquisition of contig data; establishment of feature vectors; construction of a cost function that takes the volume of each class into account; calculation of the cluster centers according to the cluster center formula; and updating of the membership matrix according to the membership matrix formula. The improved fuzzy c-harmonic mean algorithm proposed by the present invention addresses the unsatisfactory performance of traditional methods on unbalanced data sets. Applied to the unsupervised classification of contigs, it improves classification accuracy and thereby provides a better basis for the analysis of species diversity in metagenomics.

Description

Technical field

[0001] The invention belongs to the technical field of bioinformatics analysis.

Background

[0002] Compared with traditional genomics research, the advantage of metagenomics is that it can obtain most of the genetic material in an environment without laboratory cultivation, so that the relationships among species in the environment, and between species and the environment, can be analyzed. However, the raw data of metagenomics consist of a large number of short DNA fragments (reads). Based on the overlaps between these fragments, researchers can assemble them into longer DNA sequences, which are called contigs in bioinformatics. Classifying these contigs according to the species they belong to is the basis for analyzing species diversity in metagenomics.

[0003] However, because genome lengths and abundances differ between species, the number of contigs contained by different species often...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G16B40/20
CPC: G16B40/00
Inventor: 刘云, 刘富, 侯涛, 康冰, 王柯, 姜守坤, 王婧媛
Owner: JILIN UNIV