Compact next generation sequencing database and efficient sequence processing using same

A compact storage technology for gene sequencing data, applied in the field of gene analysis, which addresses the problems of increased storage cost and high computing cost while preserving compatibility.

Inactive Publication Date: 2014-11-26
KONINKLIJKE PHILIPS NV

AI Technical Summary

Problems solved by technology

[0009] The combination of the size of large genomic datasets and the rapidly decreasing cost of performing NGS means that genetic data storage is a major part of the total cost of sequencing applications, and this share is expected to continue growing as sequencing becomes cheaper and produces larger datasets. Furthermore, large raw read datasets translate into higher computational costs for downstream processing (such as alignment).

Examples

Embodiment Construction

[0025] Disclosed herein is a method for formatting raw read data, including base quality scores, in a manner that allows for a substantial reduction in file size while preserving most of the useful information. As discussed earlier, in the regular FASTQ format a read occupies slightly more than 2L_seq (ASCII) characters, where L_seq is the number of bases. Other existing text-based storage formats that store base sequences and corresponding base quality scores also occupy a considerable amount of storage. For example, the Qseq format stores base sequences and quality scores, but arranges them in a single line of text. The FASTA format is able to cut this storage roughly in half, but it does so by losing all base quality score information. Alternatively, one can convert a text-formatted read entry to a non-text format (e.g., a binary format in which two bits encode a base and the phred score is represented by a binary integer value). However, the most downstream processing components...
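The size comparison in the paragraph above can be checked with a small sketch. It counts the characters of a four-line FASTQ record and a two-line FASTA record, and packs bases at two bits each as in the binary alternative mentioned. The function names and the A/C/G/T-to-bit mapping are illustrative assumptions, not taken from the patent, and ambiguous bases such as N are not handled.

```python
# Illustrative sketch only: names and the 2-bit code table are assumptions.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_LETTERS = "ACGT"


def fastq_record_size(read_id: str, bases: str) -> int:
    """Characters in a four-line FASTQ record, newlines included."""
    header = 1 + len(read_id) + 1   # "@" + id + newline
    sequence = len(bases) + 1       # bases + newline
    separator = 2                   # "+" + newline
    qualities = len(bases) + 1      # one quality character per base + newline
    return header + sequence + separator + qualities


def fasta_record_size(read_id: str, bases: str) -> int:
    """Characters in a two-line FASTA record (no quality scores)."""
    return (1 + len(read_id) + 1) + (len(bases) + 1)


def pack_bases_2bit(bases: str) -> bytes:
    """Pack an A/C/G/T string at two bits per base (four bases per byte)."""
    packed = bytearray((len(bases) + 3) // 4)
    for i, base in enumerate(bases):
        packed[i // 4] |= _CODE[base] << (2 * (i % 4))
    return bytes(packed)


def unpack_bases_2bit(packed: bytes, n_bases: int) -> str:
    """Inverse of pack_bases_2bit; n_bases is needed to drop padding bits."""
    return "".join(
        _LETTERS[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(n_bases)
    )
```

For a 10-base read with a 2-character identifier, the FASTQ record takes 28 characters, the FASTA record 15, and the packed bases 3 bytes. The binary form is smaller but is no longer plain text, which is the compatibility trade-off the paragraph raises.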

Abstract

In a method operative on a genetic sequencing read comprising a base sequence acquired by processing a tissue sample, a compact text representation of the genetic sequencing read is generated. The compact text representation includes (1) a text string representing the base sequence and (2) a base quality text field identifying the longest sub-sequence of the base sequence for which the base quality scores of the bases of the sub-sequence satisfy a base quality score threshold. The compact text representation of the genetic sequencing read is then stored in a raw reads storage. To provide flexibility, the base quality text field may identify the longest sub-sequence for each of two or more different base quality score thresholds. During reads alignment, offset boundaries for the genetic sequencing reads can be efficiently chosen using the content of the base quality text field.
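A minimal sketch of the idea in the abstract: instead of one quality character per base, store only the position of the longest run of bases whose scores meet each threshold. The abstract does not give the field layout, so the `Q<threshold>:<start>-<end>` syntax (0-based, end exclusive), the tab separator, the default thresholds, and the reading of "satisfy" as "greater than or equal to" are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def longest_passing_run(quals: Sequence[int], threshold: int) -> Tuple[int, int]:
    """Return (start, length) of the longest contiguous run with score >= threshold.

    Ties are resolved in favour of the earliest run; (0, 0) means no base passes.
    """
    best_start, best_len = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= threshold:
            if run_start is None:
                run_start = i
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None
    return best_start, best_len


def compact_record(bases: str, quals: Sequence[int],
                   thresholds: Sequence[int] = (20, 30)) -> str:
    """Build a one-line text record: bases, a tab, then one field per threshold.

    The field syntax is hypothetical, not the format defined in the patent.
    """
    fields = []
    for t in thresholds:
        start, length = longest_passing_run(quals, t)
        fields.append(f"Q{t}:{start}-{start + length}")
    return bases + "\t" + ";".join(fields)
```

For example, `compact_record("ACGTACGTA", [10, 25, 32, 35, 31, 22, 8, 30, 30])` yields the bases followed by `Q20:1-6;Q30:2-5`. The quality field stays a few characters long regardless of read length, and an aligner could use the stored boundaries to pick the sub-sequence to align first.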

Description

Technical field

[0001] The following relates to the field of genetic analysis, and to applications of the same in medical fields, including oncology, veterinary medicine, and so forth.

Background

[0002] Efficient gene-sequencing systems, sometimes called "next-generation sequencing" (NGS) systems, are capable of rapidly and essentially automatically sequencing entire genomes. Although NGS accuracy is sufficient for clinical applications and is expected to improve as the technology matures, existing NGS systems sometimes exhibit lower read accuracy.

[0003] To assess read accuracy (or reliability), a base quality score is typically calculated for each base of a read. In the case of Sanger sequencing, the phred quality score is calculated from the spectrogram data by calculating parameters such as peak shape and resolution for the sequenced bases, and comparing these values to an empirically built lookup table. Phred scores are generally considere...
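For orientation, the sketch below shows how per-base quality characters in a FASTQ-style text line are commonly decoded. It relies on the standard phred definition, Q = -10·log10(P), where P is the estimated probability that the base call is wrong. The ASCII offset of 33 is an assumption about the file's encoding (some older files use 64), and the function names are illustrative.

```python
from typing import List


def decode_quality_string(qual_string: str, offset: int = 33) -> List[int]:
    """Convert quality characters to integer phred scores.

    offset=33 is assumed here; it depends on how the file was encoded.
    """
    return [ord(ch) - offset for ch in qual_string]


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to get the estimated base-call error probability."""
    return 10 ** (-q / 10)
```

Under the assumed offset, the characters `I`, `5`, and `+` decode to scores 40, 20, and 10, which correspond to error probabilities of roughly 1 in 10,000, 1 in 100, and 1 in 10.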

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F19/22, C12Q1/68, G16B30/10, G16B30/20
CPC: G06F19/22, G16B30/00, G16B30/10, G16B30/20
Inventor: S·库马尔, R·辛格, B·查克拉巴蒂
Owner: KONINKLIJKE PHILIPS NV