DNA sequence k-mer frequency statistical method based on MapReduce

A MapReduce-based statistical method for counting k-mer frequencies in DNA sequences, applied in the field of bioinformatics. It addresses three problems of existing approaches: the large share of total processing time taken by I/O overhead, excessive demands on computer performance, and unsatisfactory processing efficiency. Its effects are to reduce I/O and network transmission overhead, shorten processing time, and reduce the amount of computation.

Active Publication Date: 2017-05-31
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

Because the BTKC algorithm must count all n sequences and load the results into memory, it consumes a large amount of memory.
Because it must also frequently write intermediate results to disk, I/O overhead accounts for an excessive share of its total processing time.
As a result, the BTKC algorithm can handle only small amounts of DNA sequence data. With large amounts of DNA sequence data it places excessive demands on computer performance, and its processing efficiency is poor.



Examples


Embodiment Construction

[0041] The technical solutions in the embodiments of the present invention will be described clearly and in detail below with reference to the drawings in the embodiments of the present invention. The described embodiments are only some of the embodiments of the invention.

[0042] Figure 1 shows the main flow of the method of the present invention. Its main steps are as follows:

[0043] Step 1: Preprocessing stage. Receive the DNA sequence file to be processed and the user-supplied range of k for the k-mers; the initial value is k₁ and the final value is k₂, with k₁ ≤ k ≤ k₂. First, the cluster environment running the MapReduce parallel computing model automatically cuts the input DNA sequence file into data blocks of a certain size and distributes them evenly across the nodes. Each node then processes the sequence data assigned to it, removing error sequences and non-DNA coding sequences. The s...
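The preprocessing in Step 1 can be sketched in plain Python. This is only an illustration under stated assumptions, not the patent's implementation: the text does not define what counts as an "error sequence", so the sketch assumes it means any sequence containing a character other than A, C, G, or T. The names `is_valid_sequence`, `preprocess`, and `split_into_blocks` are hypothetical, and the round-robin split merely stands in for the block distribution that the cluster performs automatically.

```python
from typing import Iterable, List

# Assumption: only these four symbols are treated as valid DNA bases.
VALID_BASES = frozenset("ACGT")


def is_valid_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and contains only A, C, G, T."""
    return bool(seq) and set(seq) <= VALID_BASES


def preprocess(lines: Iterable[str]) -> List[str]:
    """Strip whitespace, upper-case each line, and drop invalid sequences."""
    cleaned = []
    for line in lines:
        seq = line.strip().upper()
        if is_valid_sequence(seq):
            cleaned.append(seq)
    return cleaned


def split_into_blocks(seqs: List[str], n_nodes: int) -> List[List[str]]:
    """Deal sequences round-robin so each hypothetical node gets a near-equal share."""
    return [seqs[i::n_nodes] for i in range(n_nodes)]


if __name__ == "__main__":
    raw = ["acgt", "ACNNT", "", " GGCC \n", ">header"]
    print(split_into_blocks(preprocess(raw), n_nodes=2))
```

In the method as described, the file is cut into blocks first and each node filters its own block; the sketch filters before splitting only to keep the example short.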



Abstract

The invention discloses a MapReduce-based method for counting k-mer frequencies in DNA sequences. The method comprises the following steps. The sequence file to be processed is preprocessed in a distributed cluster environment running the MapReduce computing model, and error sequences are removed. The processed sequence file is hashed and then used as the input of a Map function, which uses a defined algorithm to compute the frequencies of all k-mers for every k in the given range; these frequencies serve as the input of a Combine function. The Combine function merges the intermediate results locally, and the merged results serve as the input of a Reduce function. The Reduce function merges the key-value pairs with the same key sent by all Combine nodes and outputs the final results. The method can effectively process large-scale sequence data sets, lowers the performance required of the processing equipment, addresses the prior-art problem of I/O overhead taking up a large share of total processing time, and significantly increases processing speed.
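The Map, Combine, and Reduce stages named in the abstract can be simulated on a single machine as follows. This is a minimal sketch, not the patented algorithm: the abstract does not spell out its "defined algorithm" or its hash processing, so the Map step here is assumed to be a plain sliding window, and hashing is omitted. All function names are hypothetical, and each inner list in `blocks` stands in for the data held by one node.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple


def map_kmers(seq: str, k1: int, k2: int) -> Iterator[Tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every window of every length k in [k1, k2]."""
    for k in range(k1, k2 + 1):
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k], 1


def combine(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Combine step: sum counts locally on one node before sending them on."""
    local = Counter()
    for kmer, count in pairs:
        local[kmer] += count
    return dict(local)


def reduce_counts(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Reduce step: merge the per-node dictionaries, adding counts of equal keys."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing
    return dict(total)


def count_kmers_mapreduce(blocks: List[List[str]], k1: int, k2: int) -> Dict[str, int]:
    """Run map and combine per block, then a single reduce over all blocks."""
    partials = []
    for block in blocks:  # each block stands in for one node
        pairs = (pair for seq in block for pair in map_kmers(seq, k1, k2))
        partials.append(combine(pairs))
    return reduce_counts(partials)


if __name__ == "__main__":
    print(count_kmers_mapreduce([["ACGT"], ["CGTA"]], k1=2, k2=3))
```

Merging locally in `combine` means each node forwards one pair per distinct k-mer rather than one pair per occurrence, which is the kind of reduction in transmitted data that the abstract attributes to the Combine function.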

Description

Technical field

[0001] The invention relates to the fields of bioinformatics and big data processing, and in particular to a MapReduce-based method for counting k-mer frequencies in DNA sequences.

Background

[0002] In recent years, with the development of third-generation sequencing technology, the volume of gene sequence data measured for various species by research institutions and enterprises has grown explosively. Processing and analyzing this massive body of DNA/RNA sequence data quickly and effectively poses a severe challenge to current computer processing capabilities.

[0003] DNA/RNA sequences are the storage and control center of biological genetic information. Counting how often each subsequence of length k occurs in a DNA/RNA sequence is a basic and important biological problem, known as the k-mer frequency counting problem. K-mer frequency has important applications in gene sequence assembly, repeated sequence ...
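To make the k-mer frequency counting problem concrete, here is a small single-machine illustration. It is not taken from the patent; the function names are hypothetical, and the alphabet is assumed to be the four DNA bases. A sequence of length L contains L − k + 1 windows of length k, and at most 4^k distinct k-mers can occur.

```python
from collections import Counter


def kmer_frequencies(seq: str, k: int) -> Counter:
    """Count every length-k window of seq (overlapping windows included)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def max_distinct_kmers(k: int) -> int:
    """Upper bound on distinct k-mers over the alphabet {A, C, G, T}."""
    return 4 ** k


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGACG", 3)
    print(freqs)                  # windows: ACG, CGA, GAC, ACG
    print(sum(freqs.values()))    # 6 - 3 + 1 = 4 windows in total
    print(max_distinct_kmers(3))  # 64
```

The 4^k bound is why memory use grows quickly with k when all counts are held in memory on a single machine.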


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F19/24
CPC: G16B40/00
Inventor: 谭军, 孟光伟
Owner: CHONGQING UNIV OF POSTS & TELECOMM