Method for clustering nucleic acid sequences, equipment and storage medium

A nucleic acid sequence clustering technology, applied to computer equipment and computer-readable storage media, that addresses problems such as incomplete sequence information distorting the results of species analysis, and that aims to ensure authenticity, reduce errors, and ensure reliability.

Active Publication Date: 2019-08-09
BGI TECH SOLUTIONS

AI Technical Summary

Problems solved by technology

Since a partial sequence cannot fully represent the overall sequence information of the 16S gene, the information obtained is incomplete, which affects the results of species analysis.
[0004] Therefore, the cluster analysis method for specific sequences needs to be improved.


Examples

Embodiment 1

[0115] This example provides a specific implementation of the above technical solution and uses simulated data to compare the results of this patent's method with those of Mothur and CD-HIT.

[0116] Mothur is a hierarchical clustering method. It calculates the distance between each pair of sequences, merges the two closest sequences into a cluster, then treats that cluster as a single sequence and repeats these steps until the distance between the remaining sequences or clusters exceeds the threshold and no further merging is possible. In this embodiment, the cluster analysis results were obtained by following "Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities" (Patrick D. Schloss et al., Applied and Environmental Microbiology, Dec. 2009, Vol. 75, No. 23, pp. 7537-7541), and are shown in Table 1.
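The merge-until-threshold loop described in [0116] can be sketched as follows. This is only an illustration of the general hierarchical idea, not Mothur's actual implementation: the distance measure (fraction of mismatching positions between pre-aligned, equal-length sequences), the average-linkage rule, and the function names are all assumptions made for this example.

```python
from itertools import combinations


def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of differing positions between two aligned sequences (assumed distance)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_distance(c1: list[str], c2: list[str]) -> float:
    """Average pairwise distance between two clusters (assumed linkage rule)."""
    total = sum(mismatch_fraction(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(seqs: list[str], threshold: float) -> list[list[str]]:
    """Repeatedly merge the two closest clusters until none is within the threshold."""
    clusters = [[s] for s in seqs]  # every sequence starts as its own cluster
    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j is guaranteed by combinations).
        (i, j), dist = min(
            (((i, j), cluster_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break  # nothing left that is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    toy = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG"]
    for cluster in agglomerate(toy, threshold=0.2):
        print(cluster)
```

Recomputing every pairwise cluster distance on each pass keeps the sketch short but scales poorly; a practical tool would work from a precomputed distance matrix.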

[0117] CD-HIT is a heuristic clustering method. The basic method is to first take the ...

Embodiment approach

[0119] As shown in Figure 3, a flowchart for clustering multiple nucleic acid sequences is provided. The process mainly comprises a cluster generation module and a cluster optimization module. The cluster generation module includes the following processes:

[0120] First, the sequencing data are input. The largest cluster center is then estimated and optimized, and a cluster is generated from it. The sequences already contained in that cluster are then removed from the data, and each sequence is checked to see whether it has been assigned to a cluster. If not, the largest cluster center is re-estimated and another cycle is run, until every sequence has been assigned to a cluster, yielding the different clusters.
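The loop in [0120] can be sketched roughly as below. The excerpt does not say how the largest cluster center is estimated or optimized, so this example fills those steps with labeled assumptions: the center estimate is the most abundant remaining sequence, the optimization step re-centers on the member with the smallest total distance to the other members, and the distance is the fraction of mismatching positions between equal-length sequences. All function names are hypothetical.

```python
from collections import Counter


def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of differing positions between two aligned sequences (assumed distance)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def generate_clusters(reads: list[str], threshold: float) -> list[tuple[str, list[str]]]:
    """Peel off one cluster per cycle until every read belongs to a cluster."""
    remaining = list(reads)
    clusters = []
    while remaining:
        # Assumption: estimate the largest cluster center as the most abundant read.
        center = Counter(remaining).most_common(1)[0][0]
        members = [r for r in remaining if mismatch_fraction(r, center) <= threshold]

        # Assumption: "optimize" the center by moving it to the member with the
        # smallest total distance to all members (ties broken alphabetically).
        center = min(
            set(members),
            key=lambda c: (sum(mismatch_fraction(c, m) for m in members), c),
        )
        members = [r for r in remaining if mismatch_fraction(r, center) <= threshold]
        clusters.append((center, members))

        # Remove the assigned reads; the center itself is always removed,
        # so the loop is guaranteed to terminate.
        remaining = [r for r in remaining if mismatch_fraction(r, center) > threshold]
    return clusters


if __name__ == "__main__":
    toy = ["ACGTACGTAC"] * 3 + ["ACGTACGTAA"] + ["TTTTTTTTTT"] * 2 + ["TTTTTTTTTA"]
    for center, members in generate_clusters(toy, threshold=0.15):
        print(center, len(members))
```

The cluster optimization module that follows (identifying and eliminating error clusters) is not covered by this sketch.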

[0121] The cluster optimization module includes the following processes:

[0122] Take the largest cluster generated, calculate the number of belonging sequences and the belonging probabilities of other clusters, then eliminate the wrong cluster, and then re...

Abstract

The invention relates to a method and equipment for clustering nucleic acid sequences, computer equipment, and a computer-readable storage medium. The method comprises: classifying a plurality of nucleic acid sequences based on the distances between them to obtain an initial cluster set; determining an optimized initial cluster based on the number of nucleic acid sequences included in the initial cluster set; determining the number of ownership sequences and the ownership probability of the optimized initial cluster based on the sequencing quality of the nucleic acid sequences and the number of nucleic acid sequences included in the optimized initial cluster; and thereby identifying error clusters and eliminating them from the initial cluster set to obtain the optimized initial cluster set. Based on the method, the invention further provides equipment for clustering nucleic acid sequences, computer equipment, and a computer-readable storage medium. The method and equipment can effectively reduce errors in cluster analysis and can therefore be applied to the analysis of specific functional sequences.

Description

Technical field

[0001] The present invention relates to the field of gene sequencing, in particular to a method and equipment for clustering nucleic acid sequences, computer equipment, and a computer-readable storage medium.

Background

[0002] Species analysis is an important method for microbial community analysis. It uses certain biochemical or molecular markers to assess the composition and structure of microbial communities. 16S rRNA is a subunit of prokaryotic ribosomal RNA. Because its sequence is highly conserved, it is often used as a marker gene for species identification. In species analysis, because the genome or 16S sequence of some species is unknown, researchers generally rely on clustering methods, and sequences whose distance is below a certain threshold are considered to come from the same taxon (which can be a phylum, class, order, family, genus, species or other level...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B30/00, G16B40/00
CPC: G16B40/00, G16B30/00
Inventor: 徐煜, 朱钶锐
Owner: BGI TECH SOLUTIONS