A Parallel Gene Assembling Method Based on Cluster Graph Structure

A technology of gene splicing and graph structure, applied in the field of bioinformatics

Inactive Publication Date: 2017-10-27
TIANJIN POLYTECHNIC UNIV

AI Technical Summary

Problems solved by technology

[0013] However, a Nature report published by Alkan et al. in 2011 pointed out that a template-free (de novo) assembly of the human genome built from short reads was 16% shorter than one obtained using long reads.

Examples

Embodiment 1

[0067] A parallel gene splicing algorithm based on the cluster graph structure, which includes creating a cluster graph and building a parallel framework;

[0068] Creating the cluster graph means the following: based on the mapping results between the original gene data (short reads) and the long sequences (scaffolds) generated by other algorithms, the similarity and matching degree of the scaffolds are calculated, and the scaffolds are then clustered. Two matching scaffolds constitute a scaffold pair (scaffold-pair), and the scaffold-pairs have multiple matching regions; these regions are used as nodes, and the connections between them form the edges of the cluster graph;
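
The clustering and graph construction described in [0068] can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the function names, the use of shared-read counts as the "matching degree", the `min_shared_reads` threshold, and the rule that two regions are connected when their scaffold-pairs share a scaffold are all hypothetical choices made for the example.

```python
from collections import defaultdict
from itertools import combinations


def find_scaffold_pairs(read_hits, min_shared_reads=2):
    """Count the reads shared by each pair of scaffolds and keep the
    pairs supported by at least `min_shared_reads` reads (assumed threshold)."""
    shared = defaultdict(int)
    for scaffolds in read_hits.values():
        for a, b in combinations(sorted(set(scaffolds)), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared_reads}


def build_cluster_graph(matching_regions):
    """Nodes are (scaffold_pair, region) tuples. Two nodes are linked when
    their scaffold-pairs have a scaffold in common (assumed edge rule)."""
    nodes = [(pair, region)
             for pair, regions in matching_regions.items()
             for region in regions]
    graph = {node: set() for node in nodes}
    for u, v in combinations(nodes, 2):
        if set(u[0]) & set(v[0]):
            graph[u].add(v)
            graph[v].add(u)
    return graph


# Hypothetical mapping result: read id -> scaffolds the read maps to.
read_hits = {
    "r1": ["abyss_1", "velvet_3"],
    "r2": ["abyss_1", "velvet_3"],
    "r3": ["velvet_3", "soap_7"],
    "r4": ["velvet_3", "soap_7"],
    "r5": ["abyss_1", "soap_9"],  # only one supporting read, so it is dropped
}
pairs = find_scaffold_pairs(read_hits)
# One made-up matching region per surviving scaffold-pair.
graph = build_cluster_graph({pair: ["region_0"] for pair in pairs})
```

In a real pipeline the matching regions would come from the read alignment coordinates rather than from a placeholder label.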

[0069] Building the parallel framework means the following: parallelism runs through every step of the gene splicing algorithm, including reading and writing files, building indexes, short-read mapping, scaffold clustering, building cluster graphs, and searching paths; the tasks in each step are divided, executed, and mer...
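
The divide, execute, and merge pattern named in [0069] might look like the following for the index-building step. This is a sketch under assumptions: the k-mer index layout, the function names, and the round-robin chunking are hypothetical, and a thread pool is used only to keep the example self-contained (for CPU-bound work in CPython, `ProcessPoolExecutor` from the same module would be the more likely choice).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_chunk(chunk, k):
    """Index every k-mer of the scaffolds in one chunk:
    k-mer -> list of (scaffold_id, offset)."""
    index = defaultdict(list)
    for scaffold_id, seq in chunk:
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((scaffold_id, i))
    return index


def build_index_parallel(scaffolds, k, workers):
    """Divide the scaffolds into chunks, index the chunks concurrently,
    then merge the partial indexes into one dictionary."""
    items = sorted(scaffolds.items())
    chunks = [items[i::workers] for i in range(workers)]  # divide
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: index_chunk(c, k), chunks))  # execute
    merged = defaultdict(list)
    for partial in partials:  # merge
        for kmer, hits in partial.items():
            merged[kmer].extend(hits)
    return {kmer: sorted(hits) for kmer, hits in merged.items()}


# Toy scaffolds; real inputs would be read from FASTA files.
scaffolds = {"s1": "ACGTAC", "s2": "GTACGG"}
index = build_index_parallel(scaffolds, k=4, workers=2)
```

Sorting the merged hit lists makes the result independent of how the work was divided, which is convenient when comparing runs with different worker counts.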

Embodiment 2

[0087] The parallel gene splicing algorithm based on the cluster graph structure proposed by the present invention runs on multiple operating systems (Linux, Mac, Windows) and is simple to run. The program is operated through the following steps:

[0088] (1) Install all software package dependencies listed in the claims on the operating system;

[0089] (2) Prepare two types of data: the first is the set of short reads from the original paired-end sequencing of the genome, and the second is the output (long sequences) obtained by feeding the first data set to multiple other gene splicing algorithms;

[0090] (3) Modify the paths and parameters in the config.cfg file;

[0091] #------------input----------------

[0092] #########Mapping reads#####

[0093] Kmer_Size=30

[0094] Available_Processor_Num=20

[0095] Read_1=/home/ub/genome/realdata/SRR034959/fasta/SRR034959_1.fasta

[0096] Read_2=/home/ub/genome/realdata/SRR034959/fasta/SRR034959_2.fa...
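
For illustration, a file in the `key=value` style shown above could be read as follows. This is a minimal sketch, not the program's actual loader: `parse_config` is a hypothetical helper, and the sample text repeats only the settings that appear in full in the listing.

```python
def parse_config(text):
    """Parse 'key=value' lines, skipping blank lines and '#' comment lines.
    All values are returned as strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


sample = """\
#------------input----------------
#########Mapping reads#####
Kmer_Size=30
Available_Processor_Num=20
Read_1=/home/ub/genome/realdata/SRR034959/fasta/SRR034959_1.fasta
"""

cfg = parse_config(sample)
kmer_size = int(cfg["Kmer_Size"])              # numeric settings are converted by the caller
workers = int(cfg["Available_Processor_Num"])
```

Python's standard `configparser` is not used here because it expects `[section]` headers, which the listing does not show.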

Embodiment 3

[0121]-[0123] The following table compares the method of the present invention with three existing conventional gene splicing algorithms (ABySS, Velvet, SOAPdenovo) on the Escherichia coli K-12 MG1655 data set (NCBI SRA accession ERR022075, http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=run_browser&run=ERR022075). Here #Aby, #Vel, and #Soa denote the experiment numbers of ABySS, Velvet, and SOAPdenovo, respectively, and #Cob denotes the experiment number of the present invention. The results show a clear advantage for the present invention.

[0124]

[0125] In conclusion:

[0126] (1) The method of the present invention greatly increases the length of the scaffold sequences. In tests on the E. coli gene data set, the length percentage of the longest sequence obtained is 50% higher than that of the other conventional algorithms.

[0127] (...

Abstract

The invention provides a parallel gene splicing algorithm based on a cluster graph structure. The algorithm takes as input the long sequences (scaffolds) produced by several other gene splicing algorithms together with the short paired-end reads (read-pairs) generated by a paired-end sequencer, and splices complementary scaffolds into longer sequences through the steps of building an index, mapping read-pairs, clustering scaffolds, building the cluster graph, and searching for a path. The index-building and read-mapping steps use the reads to obtain the correlation and matching degree between the scaffolds produced by the different algorithms, and the scaffolds are clustered according to these values; all scaffolds within a cluster are complementary and are candidate sequences for splicing. Finally, the cluster graph is built and its overall longest path is solved, which yields the spliced long gene sequence.
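
The final path search can be illustrated with a small sketch. It assumes that the cluster graph has been oriented into a directed acyclic graph and that each node carries the length of sequence it contributes. Neither assumption is stated in the abstract, and the longest-path problem on a general graph is much harder, so the names and data below are hypothetical.

```python
def longest_path(edges, node_length):
    """Return (total_length, path) for the longest path in a directed
    acyclic graph.

    edges: dict node -> list of successor nodes (assumed acyclic).
    node_length: dict node -> sequence length contributed by that node.
    """
    best = {}  # memo: node -> (length of best path starting here, that path)

    def visit(node):
        if node not in best:
            tail_len, tail = 0, []
            for nxt in edges.get(node, []):
                cand_len, cand = visit(nxt)
                if cand_len > tail_len:
                    tail_len, tail = cand_len, cand
            best[node] = (node_length[node] + tail_len, [node] + tail)
        return best[node]

    return max((visit(n) for n in node_length), key=lambda r: r[0])


# Toy graph: region lengths in base pairs, made up for the example.
node_length = {"a": 500, "b": 300, "c": 900, "d": 200}
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
total, path = longest_path(edges, node_length)
# total == 1600, path == ["a", "c", "d"]
```

The recursion is memoized, so each node is solved once. For very deep graphs an iterative topological-order version would avoid Python's recursion limit.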

Description

Technical field

[0001] The invention belongs to the technical field of bioinformatics, and in particular relates to a novel parallel gene splicing algorithm based on a cluster graph structure.

Background technique

[0002] On May 18, 2006, Nature reported that scientists had completed the sequencing of human chromosome 1, which contains 223 million base pairs and accounts for about 8% of the total base pairs in the human genome, and with it the Human Genome Project was fully completed. As an important milestone in the history of natural science, research on the human genome has moved from the "structural genome" stage into the "functional genome" stage. The Rice Genome Project, Potato Genome Project, and Grass Carp Genome Project, launched successively after the Human Genome Project, together with the rapid growth of microbial gene sequencing and the accumulation of massive genetic information, gave rise to the "functional genome" era. Bioinformatics, aiming t...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F19/18
Inventor: 陈科, 徐魁
Owner: TIANJIN POLYTECHNIC UNIV