DNA sequence k-mer frequency statistical method based on MapReduce

A DNA sequence k-mer frequency statistical technology, applied in the field of bioinformatics. It addresses the large share of total processing time taken up by I/O overhead, the excessive demands on computer performance, and unsatisfactory processing efficiency. Its effects are to reduce I/O overhead and network transmission overhead, shorten processing time, and reduce the amount of computation.

Active Publication Date: 2017-05-31

CHONGQING UNIV OF POSTS & TELECOMM

6 Cites, 12 Cited by

Problems solved by technology

Because the BTKC algorithm must process all n sequences and load the counting results into memory, it consumes a large amount of memory.

In addition, because intermediate results must be written to disk frequently, the algorithm's I/O overhead accounts for an excessively large share of the total processing time.

For these reasons, the BTKC algorithm can only handle a small amount of DNA sequence data. When it processes a large amount of DNA sequence data, it places excessive demands on computer performance and its processing efficiency is poor.

Embodiment Construction

[0041] The technical solutions in the embodiments of the present invention will be described clearly and in detail below with reference to the drawings in the embodiments of the present invention. The described embodiments are only some of the embodiments of the invention.

[0042] Figure 1 shows the main flow of the method of the present invention. Its main steps are:

[0043] Step 1: Preprocessing stage. Receive the DNA sequence file to be processed and the user-supplied range of k for the k-mers, with the initial value set to k1 and the final value set to k2, so that k1 ≤ k ≤ k2. First, the cluster environment running the MapReduce parallel computing model automatically cuts the input DNA sequence files into data blocks of a certain size and distributes them evenly across the nodes. Then, each node processes the sequence files assigned to it, removing error sequences and non-DNA coding sequences. The s...
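The preprocessing stage described in Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the excerpt does not define what counts as an "error sequence", so the sketch treats any line containing characters other than A, C, G, T as invalid, and it uses a simple round-robin split in place of the cluster's size-based block splitting. The names `split_into_blocks`, `is_valid_dna`, and `clean_block` are hypothetical.

```python
VALID_BASES = frozenset("ACGT")


def split_into_blocks(lines, num_nodes):
    """Distribute input lines evenly over the nodes.

    Assumption: round-robin assignment stands in for the size-based
    block splitting that the cluster environment performs automatically.
    """
    blocks = [[] for _ in range(num_nodes)]
    for index, line in enumerate(lines):
        blocks[index % num_nodes].append(line)
    return blocks


def is_valid_dna(sequence):
    """Assumed validity rule: non-empty and made only of A, C, G, T."""
    return bool(sequence) and set(sequence) <= VALID_BASES


def clean_block(block):
    """Per-node cleaning: normalise case and drop invalid sequences."""
    cleaned = []
    for line in block:
        sequence = line.strip().upper()
        if is_valid_dna(sequence):
            cleaned.append(sequence)
    return cleaned


raw_lines = ["ACGT", "acgn", "", "GGTA\n", ">header", "TTAA"]
node_blocks = [clean_block(b) for b in split_into_blocks(raw_lines, 2)]
```

Each node cleans only its own block, which matches the order given in the text: the input is split first and filtered afterwards.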

Abstract

The invention discloses a MapReduce-based method for counting k-mer frequencies in DNA sequences. In a distributed cluster environment running the MapReduce computing model, the sequence file to be processed is first preprocessed to remove error sequences. The processed sequence file is then hashed and passed as input to a Map function, which uses a defined algorithm to compute the frequencies of all k-mers within the given range of k; these results serve as input to a Combine function. The Combine function merges the intermediate results locally and passes them to a Reduce function. The Reduce function merges the key-value pairs that share the same key across all Combine nodes and outputs the final results. The method can process large-scale sequence data sets effectively, lowers the performance requirements on the processing equipment, solves the prior-art problem of I/O overhead taking up a large share of the total processing time, and significantly increases processing speed.
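The Map, Combine, and Reduce roles named in the abstract can be illustrated with a single-process Python sketch. It is not the patent's implementation: the "defined algorithm" used inside Map is not given in this excerpt, so a plain sliding window is used instead, the hashing step is omitted, and the function names (`map_kmers`, `combine`, `reduce_counts`, `run_pipeline`) are hypothetical. Each inner list in `node_blocks` stands for the sequences held by one node.

```python
from collections import Counter
from itertools import chain


def map_kmers(sequence, k1, k2):
    """Map: emit a (k-mer, 1) pair for every k with k1 <= k <= k2."""
    for k in range(k1, k2 + 1):
        for start in range(len(sequence) - k + 1):
            yield sequence[start:start + k], 1


def combine(pairs):
    """Combine: merge the pairs produced on one node into local totals."""
    local = Counter()
    for kmer, count in pairs:
        local[kmer] += count
    return local


def reduce_counts(partials):
    """Reduce: merge the local totals that share the same key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_pipeline(node_blocks, k1, k2):
    """Run Map and Combine per node, then Reduce over all nodes."""
    partials = [
        combine(chain.from_iterable(map_kmers(s, k1, k2) for s in block))
        for block in node_blocks
    ]
    return reduce_counts(partials)


result = run_pipeline([["ACGT", "ACG"], ["CGT"]], k1=2, k2=3)
```

Because Combine collapses repeated keys on each node before anything is sent on, Reduce receives at most one pair per distinct k-mer per node, which is the source of the reduced network transmission the abstract describes.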

Description

Technical field

[0001] The invention relates to the fields of bioinformatics and big data processing, and in particular to a MapReduce-based method for counting k-mer frequencies in DNA sequences.

Background

[0002] In recent years, with the development of third-generation sequencing technology, the volume of gene sequence data measured for various species by research institutions and enterprises has grown explosively. Processing and analysing this massive amount of DNA/RNA sequence data quickly and effectively poses a severe challenge to current computer processing capabilities.

[0003] DNA/RNA sequences are the storage and control center of biological genetic information. Counting how often each subsequence of length k occurs in a DNA/RNA sequence is a basic and important biological problem, known as the k-mer frequency counting problem. K-mer frequency has important applications in gene sequence assembly, repeated sequence ...
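As a concrete illustration of the k-mer frequency counting problem defined in [0003], the following minimal single-machine sketch counts every length-k window of one sequence. The function name `kmer_counts` is hypothetical, and the code is a baseline for understanding the problem rather than any algorithm from the patent.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every length-k subsequence (k-mer) of one sequence.

    A sequence of length L has L - k + 1 windows, so the counts sum
    to that value whenever L >= k.
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


counts = kmer_counts("ACGTAC", 2)
```

For the sequence ACGTAC and k = 2, the five windows are AC, CG, GT, TA, AC, so AC is counted twice and the other three once each. Over the four-letter DNA alphabet there are up to 4^k distinct k-mers, so an in-memory table of this kind grows quickly with k, which is one reason the memory cost of single-machine counting becomes a concern for large inputs.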

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F19/24

CPC: G16B40/00

Inventors: 谭军 (Tan Jun), 孟光伟 (Meng Guangwei)

Owner: CHONGQING UNIV OF POSTS & TELECOMM
