Method for judging saturation of sequencing data, computer readable medium and application

A technique for analyzing sequencing data, applied in the field of sequencing. It addresses the problem that, when the data volume is uneven, existing indicators cannot clearly reflect the saturation of sequencing data, and it enables accurate saturation judgment so that the accuracy of subsequent analysis is safeguarded.

Pending Publication Date: 2019-09-13
GENEWIZ INC SZ

AI Technical Summary

Problems solved by technology

[0006] The main indicator of the amount of sequencing data is the sequencing depth: the ratio of the total number of bases obtained by sequencing to the size of the genome being sequenced. It can be understood as the average number of times each base in the genome is sequenced. Sequencing depth = read length × number of aligned reads / length of the reference sequence. Because the amount of data read from each fragment of the target is not uniform during sequencing, the sequencing depth cannot clearly show whether the sequencing data are saturated or whether some fragments remain undetected.
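The depth formula in [0006], and the reason an average hides gaps, can be sketched as follows. This is a minimal illustration, not part of the patent: the function names, the toy coverage lists and the example numbers are assumptions chosen for clarity.

```python
def sequencing_depth(read_length: int, aligned_reads: int, reference_length: int) -> float:
    """Sequencing depth = read length x number of aligned reads / reference length."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return read_length * aligned_reads / reference_length


def mean_depth_and_uncovered(coverage):
    """Return (mean per-base depth, fraction of positions with zero coverage)."""
    mean_depth = sum(coverage) / len(coverage)
    uncovered = sum(1 for c in coverage if c == 0) / len(coverage)
    return mean_depth, uncovered


# Hypothetical run: 1,000,000 aligned 250-bp reads on a 5 Mb reference.
print(sequencing_depth(250, 1_000_000, 5_000_000))  # 50.0

# Two toy coverage profiles with the same mean depth (hypothetical values).
uniform = [10] * 10            # every position covered 10 times
skewed = [50, 50] + [0] * 8    # two positions covered heavily, eight not at all
print(mean_depth_and_uncovered(uniform))  # (10.0, 0.0)
print(mean_depth_and_uncovered(skewed))   # (10.0, 0.8)
```

Both profiles report the same depth, yet the second leaves 80% of positions undetected, which is the limitation the paragraph describes.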

Examples

Embodiment 1

[0068] 16S sequencing data from 316 cases, generated on the Illumina sequencing platform with the PE250 sequencing strategy, were clustered into Clusters using 97% sequence similarity as the threshold. The S value was calculated and its distribution plotted. With the mean plus 1.282 times the variance as the threshold, the index Saturated for measuring data saturation was defined over the interval 0 to 0.44: a Saturated value less than 0.44 indicates that the sequencing data are saturated, and a value greater than or equal to 0.44 indicates that they are not. The result is shown in Figure 2.
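The thresholding rule in [0068] can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the rule is implemented literally as "mean plus 1.282 times the variance", and the population variance is used because the text does not say whether the population or sample variance is meant.

```python
from statistics import mean, pvariance


def saturation_threshold(values, factor=1.282):
    """Threshold = mean + factor * variance, following the wording of [0068].

    Assumption: population variance (pvariance); the source does not specify.
    """
    return mean(values) + factor * pvariance(values)


def is_saturated(saturated_value, threshold):
    """Data count as saturated when the Saturated value is strictly below the threshold."""
    return saturated_value < threshold


# Using the reference value 0.44 reported for the 16S data in the text.
print(is_saturated(0.30, 0.44))  # True: below the reference value
print(is_saturated(0.44, 0.44))  # False: greater than or equal to 0.44
```

In practice `saturation_threshold` would be applied to the per-sample values from a reference cohort, and the resulting cutoff reused to classify new samples.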

[0069] In addition, 16S sequencing data from 22 cases, generated on the Illumina sequencing platform with the PE250 sequencing strategy, were clustered into Clusters using 97% sequence similarity as the threshold, and the rarefaction (dilution) curve was plotted to calculate the Saturated value. According to the reference value of 0.44 obtained from Figure 2, it is judged whe...
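A rarefaction (dilution) curve of the kind mentioned above records how many distinct Clusters have been seen as reads are drawn one at a time. The sketch below estimates it by averaging over random read orders. It is a generic illustration, not the patent's procedure: the function name, the number of permutations, the seed and the toy cluster labels are all assumptions.

```python
import random


def rarefaction_curve(cluster_labels, n_permutations=20, seed=0):
    """Mean number of distinct Clusters observed after drawing k reads, for k = 1..X.

    cluster_labels holds one Cluster label per read. Reads are drawn without
    replacement by shuffling, and the curve is averaged over n_permutations orders.
    """
    rng = random.Random(seed)
    labels = list(cluster_labels)
    totals = [0] * len(labels)
    for _ in range(n_permutations):
        rng.shuffle(labels)
        seen = set()
        for k, label in enumerate(labels):
            seen.add(label)
            totals[k] += len(seen)  # distinct Clusters after k + 1 reads
    return [t / n_permutations for t in totals]


# Toy data (hypothetical): 10 reads in 3 Clusters of sizes 6, 3 and 1.
labels = ["c1"] * 6 + ["c2"] * 3 + ["c3"]
curve = rarefaction_curve(labels)
print(curve)
```

A curve that flattens well before the last read suggests that additional reads rarely reveal new Clusters, which is the intuition behind a saturation index.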

Embodiment 2

[0072] 261 Arabidopsis transcriptome sequencing datasets, generated on the Illumina sequencing platform with the SE50 sequencing strategy, were clustered into Clusters using 95% sequence similarity as the threshold. The Saturated value was calculated and its distribution plotted. With the mean plus 1.282 times the variance as the threshold, the index Saturated for measuring data saturation was defined over the interval 0 to 0.48: a Saturated value less than 0.48 indicates that the sequencing data are saturated, and a value greater than or equal to 0.48 indicates that they are not. The result is shown in Figure 3.

[0073] In addition, transcriptome sequencing data from 10 cases, generated on the Illumina sequencing platform with the SE50 sequencing strategy, were clustered into Clusters using 95% sequence similarity as the threshold, and the rarefaction (dilution) curve was plotted to calculate the Saturated value, according to Figure 3...

Abstract

The invention provides a method for judging the saturation of sequencing data, a computer-readable medium and an application, and relates to the field of sequencing technology. The method comprises the steps of: (a) providing the sequencing data, the sequencing data being a data set A comprising X reads; (b) clustering the X reads according to a preset sequence similarity threshold to generate N Clusters; (c) obtaining Probability, which is the probability that i-1 Clusters have been observed after k-1 reads are extracted and that extracting one more read brings the number of Clusters to i, wherein k is a positive integer less than or equal to X and i is a positive integer less than or equal to N; and (d) obtaining an index Saturated that measures the degree of data saturation, wherein the closer Saturated is to zero, the closer the sequencing data are to saturation. The method expresses the degree of saturation of the sequencing data as a numerical value, so that saturation can be judged more accurately and the accuracy of subsequent data analysis is ensured.
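To make step (c) more concrete, the sketch below computes a related but simpler quantity: the probability that the k-th read, drawn uniformly without replacement, belongs to a Cluster not seen among the first k-1 reads. This is not the patent's Probability or Saturated formula, which the abstract does not give; the function name, the sampling model and the toy Cluster sizes are assumptions.

```python
from math import comb


def new_cluster_probability(cluster_sizes, k):
    """P(the k-th read drawn without replacement opens a previously unseen Cluster).

    For a Cluster with n of the X reads, the k-th read falls in it with
    probability n / X, and the k-1 earlier reads all miss it with probability
    C(X - n, k - 1) / C(X - 1, k - 1). Summing over Clusters gives the result.
    """
    total = sum(cluster_sizes)
    if not 1 <= k <= total:
        raise ValueError("k must satisfy 1 <= k <= X")
    denom = comb(total - 1, k - 1)
    return sum((n / total) * comb(total - n, k - 1) / denom for n in cluster_sizes)


# Toy data (hypothetical): X = 10 reads in N = 3 Clusters of sizes 6, 3 and 1.
sizes = [6, 3, 1]
for k in (1, 2, 5, 10):
    print(k, new_cluster_probability(sizes, k))
```

Under this model the probabilities over k = 1..X sum to N, because every Cluster is opened exactly once; a value near zero at large k means additional reads are unlikely to reveal new Clusters.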

Description

Technical field

[0001] The present invention relates to the technical field of sequencing, in particular to a method for judging the saturation of sequencing data, a computer-readable medium and an application.

Background technique

[0002] Sequencing technology refers to the analysis of the base sequence of nucleic acids. For example, DNA sequencing determines the order of adenine (A), thymine (T), cytosine (C) and guanine (G) in DNA. Since Frederick Sanger et al. established dideoxy chain-termination sequencing in 1977, sequencing technology has developed rapidly over several decades. Since 2005, the emergence of next-generation sequencing technologies represented by Roche 454, Illumina, Life SOLiD / Ion Torrent and PacBio RS has rapidly increased sequencing throughput and greatly reduced sequencing cost.

[0003] High-throughput sequencing technology can simultaneously sequence millions of DNA molecules, ...

Application Information

IPC(8): G16B30/10
Inventor: 贾瑞凯, 叶桦, 肖芳, 郭森, 贾延凯, 廖国娟
Owner: GENEWIZ INC SZ