Cloud computing acceleration method for gene sequence alignment

Gene sequencing and cloud computing technology, applied in computing, sequence analysis, and special data processing applications, addresses the long processing time of gene sequence data and achieves easy development, simple code maintenance, and good flexibility.

Active Publication Date: 2018-02-16
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0007] The purpose of the present invention is to overcome the deficiencies of the prior art and provide a cloud computing acceleration method for gene sequence alignment, which is based on big data technology and runs a se



Examples


Embodiment Construction

[0033] The present invention will be further described below in conjunction with specific examples.

[0034] As shown in Figure 1, the cloud computing acceleration method for gene sequence alignment provided in this embodiment includes the following steps:

[0035] S1. Preprocess the Fastq data files output by the gene sequencer so that data integrity is preserved when the data is distributed. Preprocessing includes reading the data, merging multiple input files, and saving the data to the file system.

[0036] Figure 2 shows the layout of a Fastq file and the modified file form. In a Fastq file, every four lines form the complete information of one read, which is one data unit within a Fastq file in Figure 2. Paired-end sequencing produces two Fastq files, shown as Fastq1 and Fastq2 in Figure 2. The data units in the two Fastq files are in one-to-one correspondence, and together constitute a complete piece of information that the gene seq...
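The four-line data unit and the pairing of Fastq1 with Fastq2 can be illustrated with a minimal Python sketch. The source text is truncated before it specifies the modified file form, so the layout used here (the eight lines of a read pair joined into one tab-separated line) is an assumption made only for illustration, and the names `read_fastq_records`, `merge_pair`, and `split_pair` are hypothetical.

```python
from itertools import islice

SEP = "\t"  # assumed separator; Fastq lines are assumed not to contain tabs


def read_fastq_records(lines):
    """Yield one data unit (a tuple of four lines) at a time from Fastq text lines."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(stripped, 4))
        if not record:
            return
        # A complete unit has four lines: '@' header, bases, '+' line, qualities.
        if (len(record) != 4
                or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError(f"malformed Fastq record: {record!r}")
        yield record


def merge_pair(rec1, rec2):
    """Join the two units of a read pair into one line (assumed modified form)."""
    return SEP.join(rec1 + rec2)


def split_pair(line):
    """Recover the two original four-line units from one merged line."""
    fields = line.split(SEP)
    if len(fields) != 8:
        raise ValueError("expected 8 fields in a merged read pair")
    return tuple(fields[:4]), tuple(fields[4:])
```

Keeping a read pair on a single line means that any line-based splitting of the data cannot separate the two mates or cut a data unit in half, which is one plausible way to preserve integrity during distribution.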


Abstract

The invention discloses a cloud computing acceleration method for gene sequence alignment. The method comprises the steps of: 1) preprocessing the Fastq data files output by a gene sequencer to ensure data integrity during data distribution; 2) distributing the modified gene sequencing data across multiple nodes through Spark; 3) restoring, on each node, the original Fastq file format from the modified gene data that the node received; 4) executing a gene sequence alignment program script on each node through the pipe operator in Spark, and storing the result in a Spark resilient distributed dataset (RDD); and 5) storing the result in a distributed file system such as HDFS or Amazon S3. With this method, an alignment tool runs in the Spark framework in a simpler way; Spark's own mechanisms can be used for multi-machine computing scheduling, data distribution, monitoring, and fault tolerance; and compared with a JNI-based implementation, the development threshold is lower, code maintenance is simpler, performance is better, and scalability is approximately linear.
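The idea behind step 4, streaming each node's records through an external program's standard input and collecting its standard output, can be emulated with the Python standard library alone. This is a rough sketch and not Spark code: `partition` stands in for Spark's data distribution, `pipe_partition` imitates what a pipe operator does for one partition, and `FAKE_ALIGNER` is a hypothetical placeholder for a real alignment script that simply reports each read's length.

```python
import subprocess
import sys


def partition(items, num_partitions):
    """Split items round-robin into lists (stand-in for multi-node distribution)."""
    parts = [[] for _ in range(num_partitions)]
    for index, item in enumerate(items):
        parts[index % num_partitions].append(item)
    return parts


def pipe_partition(lines, command):
    """Write lines to the command's stdin and return its stdout as a list of lines."""
    completed = subprocess.run(
        command,
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        text=True,
        check=True,  # raise if the external program exits with an error
    )
    return completed.stdout.splitlines()


# Hypothetical placeholder for an alignment script: echoes each read and its length.
FAKE_ALIGNER = [
    sys.executable,
    "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    read = line.strip()\n"
    "    print(read + ' ' + str(len(read)))\n",
]


def run_pipeline(reads, num_partitions):
    """Pipe every partition through the external program and gather all results."""
    results = []
    for part in partition(reads, num_partitions):
        results.extend(pipe_partition(part, FAKE_ALIGNER))
    return results
```

Because the external program only sees text on standard input and writes text to standard output, it needs no changes to run under such a scheme, which matches the abstract's point about a lower development threshold than a JNI-based integration.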

Description

Technical field

[0001] The invention relates to the field of biological gene data processing, in particular to a cloud computing acceleration method for gene sequence alignment, specifically a method for accelerating a general gene sequence alignment program based on a cloud computing framework.

Background

[0002] With the development of next-generation sequencing (NGS), the cost of sequencing a single human genome has dropped below $1,000. At the same time, gene sequencing data is exploding. Taking the Illumina HiSeq X™ Ten as an example, one run can generate 6 billion reads. Relevant data show that the amount of genetic data doubles every 6 months; at this growth rate, annual genetic data will reach 1 exabase by 2020 (every 4 bases equal 1 byte) and 1 zettabase per year by 2025. The increase in the amount of gene sequencing data and the reduction in cost are developing at a speed far exceedi...
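The figures quoted in the background can be checked with a short calculation. The sketch below uses only the numbers given in the text (4 bases per byte, a 6-month doubling period, and the 2020 and 2025 milestones); the function and constant names are illustrative.

```python
BASES_PER_BYTE = 4   # from the text: every 4 bases equal 1 byte (2 bits per base)
EXA = 10**18
ZETTA = 10**21


def bases_to_bytes(bases):
    """Convert a base count to bytes at 4 bases per byte."""
    return bases // BASES_PER_BYTE


def growth_factor(years, doubling_months=6):
    """Multiplicative growth over a number of years for a given doubling period."""
    return 2 ** (years * 12 / doubling_months)


exabase_in_bytes = bases_to_bytes(EXA)     # 2.5 * 10**17 bytes, i.e. 0.25 exabytes
factor_2020_to_2025 = growth_factor(5)     # 10 doublings, i.e. 2**10 = 1024
ratio_zetta_to_exa = ZETTA // EXA          # 1000
```

Doubling every 6 months gives 10 doublings in 5 years, a factor of 1024, which is consistent with the stated growth from 1 exabase in 2020 to roughly 1 zettabase (a factor of 1000) in 2025.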


Application Information

IPC(8): G06F19/22, G06F19/28
CPC: G16B30/00, G16B50/00
Inventor: 董守斌, 刘柽, 张铃启
Owner: SOUTH CHINA UNIV OF TECH