Third generation sequencing alignment algorithm

A sequence and iterative technology, applied in the field of third-generation sequencing alignment algorithms, which can solve problems such as high error rate and confusion

Inactive Publication Date: 2018-10-23
THE BOARD OF TRUSTEES OF THE LELAND STANFORD JUNIOR UNIV

AI Technical Summary

Problems solved by technology

Compared to first- and second-generation sequencing technologies, third-generation sequencing (TGS) tools produce longer reads but suffer from higher error rates, mainly in the form of insertions and deletions (indels).


Examples


[0124] The following are examples of specific embodiments for carrying out the invention. Examples are provided for illustrative purposes only and are not intended to limit the scope of the invention in any way.

[0125] Efforts have been made to ensure accuracy with respect to the numbers used (e.g., amounts, temperature, etc.), but some experimental error and deviation should be allowed for.

Example 1

[0127] Demonstration of the effect of the cosine similarity measure using the E. coli genome.

[0128] Cosine similarity measures the similarity between two vectors as the cosine of the angle between them. To demonstrate the effect of this metric, 1000 sequences, each 5000 bases long, were selected at random positions in the E. coli genome. For each sequence, the cosine distance (1 - cosine similarity) was calculated between non-overlapping windows of lengths w = 50, 100, 500, 1000 and 5000 bases, and between the sequence within each window and 10 random mutation patterns of it, with average substitution rates of 15% and 35%. Figures 7-8 and 9-10 show the cosine distance distributions for k=3 and k=4, respectively. They illustrate how the distribution of cosine distances between short k-mer count vectors at random positions is distinguishable from the distribution between sequences and their mutation patterns. Furthermore, as expected, the d...
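The comparison described in this example can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the patent: it draws uniformly random sequences instead of E. coli windows, and the function names (kmer_count_vector, cosine_distance, substitute, mean_distances) and the seed are assumptions introduced here. It contrasts the mean cosine distance between a window and a substituted copy of itself with the mean distance between the window and an unrelated sequence.

```python
import itertools
import math
import random


def kmer_count_vector(seq, k):
    """Count overlapping k-mers of seq in a fixed ACGT order (vector length 4**k)."""
    index = {"".join(p): i for i, p in enumerate(itertools.product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # ignore k-mers containing symbols other than A, C, G, T
            counts[j] += 1
    return counts


def cosine_distance(u, v):
    """Return 1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def substitute(seq, rate, rng):
    """Replace each base with a different base with probability `rate`."""
    return "".join(
        rng.choice([b for b in "ACGT" if b != base]) if rng.random() < rate else base
        for base in seq
    )


def mean_distances(w=500, k=4, rate=0.15, trials=20, seed=0):
    """Mean distance (window vs. mutated copy, window vs. unrelated sequence)."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    to_mutated = to_unrelated = 0.0
    for _ in range(trials):
        window = "".join(rng.choice("ACGT") for _ in range(w))
        unrelated = "".join(rng.choice("ACGT") for _ in range(w))
        base_vec = kmer_count_vector(window, k)
        to_mutated += cosine_distance(base_vec, kmer_count_vector(substitute(window, rate, rng), k))
        to_unrelated += cosine_distance(base_vec, kmer_count_vector(unrelated, k))
    return to_mutated / trials, to_unrelated / trials


d_mutated, d_unrelated = mean_distances()
print(f"mean distance to mutated copy: {d_mutated:.3f}")
print(f"mean distance to unrelated sequence: {d_unrelated:.3f}")
```

Only substitutions are modeled here, matching the substitution rates quoted in the example; insertions and deletions, which dominate real third-generation reads, would need a separate mutation model.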

Example 2

[0134] Accuracy and performance analysis using the E. coli genome

[0135] The accuracy and performance of this method were evaluated using a dataset of 20x simulated reads from the E. coli genome with average lengths of 5 kbp and 10 kbp and sequence accuracies of 85%, 75%, 65% and 55%. Reads were simulated using PBSIM (Ono et al., 2013) with the options (--data-type CLR --depth 20 --model_qc model_qc_clr --accuracy-min 0.5 --length-mean [5000|10000] --length-sd 2000 --accuracy-mean [.85|.75|.65|.55] --accuracy-sd 0.02).

[0136] The performance for k=3 and k=4 under the default settings (w = 500, L_t = 7500, f = 2, g = 1, max-num-top-peak = 10, max-fft-block-size = 32768) is reported in Tables 1 and 2 for the data sets with average read lengths of 5 kbp and 10 kbp, respectively. Even with a ~45% error rate, k=4 achieves almost perfect accuracy. As expected, Table 2 shows that longer reads resulted in a higher overall alignment rate, particularly when mapping reads covering long repeat ...


Abstract

Methods, software, and systems for aligning a read sequence to a reference sequence are disclosed. In certain embodiments, the methods, software, and systems involve determining similarity of distribution of k-mers between a region of the read sequence and a region of the reference sequence in order to determine whether the region of the read sequence maps to the region of the reference sequence.
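As a rough illustration of the idea in the abstract, the sketch below scans a reference for the window whose k-mer distribution is most similar to that of a read region. It is a brute-force sketch under stated assumptions, not the disclosed method: the names kmer_profile, cosine_similarity and best_window, the step size, and the toy random reference are all hypothetical, and the scan is written for clarity rather than speed.

```python
import math
import random
from collections import Counter


def kmer_profile(seq, k):
    """Sparse k-mer count profile of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(p, q):
    """Cosine similarity of two sparse count profiles (0.0 if either is empty)."""
    dot = sum(c * q[kmer] for kmer, c in p.items())  # Counter returns 0 for missing k-mers
    norm = math.sqrt(sum(c * c for c in p.values())) * math.sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0


def best_window(read_region, reference, k=4, step=50):
    """Return (start, score) of the reference window most similar to read_region.

    Windows have the same length as read_region and start every `step` bases.
    """
    w = len(read_region)
    target = kmer_profile(read_region, k)
    best_start, best_score = -1, -1.0
    for start in range(0, len(reference) - w + 1, step):
        score = cosine_similarity(target, kmer_profile(reference[start:start + w], k))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Toy data: a random "reference" and a read region copied from position 2000.
rng = random.Random(1)  # hypothetical seed
reference = "".join(rng.choice("ACGT") for _ in range(5000))
exact_region = reference[2000:2500]
# A noisy copy: each base is replaced by a random base with probability 0.15.
noisy_region = "".join(rng.choice("ACGT") if rng.random() < 0.15 else b for b in exact_region)

exact_start, exact_score = best_window(exact_region, reference)
noisy_start, noisy_score = best_window(noisy_region, reference)
print("exact copy maps to", exact_start, "with similarity", round(exact_score, 3))
print("noisy copy maps to", noisy_start, "with similarity", round(noisy_score, 3))
```

Because the comparison uses k-mer counts rather than exact seed matches, a noisy region still tends to score highest near its true origin, although neighboring overlapping windows can score almost as well.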

Description

[0001] Cross Reference

[0002] This application claims the benefit of US Provisional Patent Application No. 62/294,205, filed February 11, 2016, which is hereby incorporated by reference in its entirety.

[0003] Statement Regarding Federally Sponsored Research or Development

[0004] This invention was made with government support under Contract R01HG007834 awarded by the National Institutes of Health. The government has certain rights in this invention.

Background

[0005] Whole-genome sequencing has revolutionized biology and medicine, driving comprehensive characterization of DNA sequence variation, resequencing of multiple species, sequencing of microbial communities, detection of methylated regions of the genome, quantification of transcript abundance, characterization of the different isoforms of genes present in a given sample, and identification of the degree to which mRNA transcripts are actively translated. Indeed, the field of pharmac...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): C12Q1/6874, G06F19/28, G06F19/22, G16B30/10, G16B40/00, G16B50/00
CPC: C12Q1/6874, G16B50/00, G16B30/00, G16B30/10, G16B40/00, C12Q2535/122, G16B45/00, C12Q1/6869
Inventor: W. H. Wong, P. T. Afshar
Owner: THE BOARD OF TRUSTEES OF THE LELAND STANFORD JUNIOR UNIV