Flexible distributed sequence alignment system and method based on Spark and SIMD

A distributed sequence alignment technology, applied in special data processing applications, instruments, electrical digital data processing, and related fields. It addresses problems such as unsatisfactory performance and limited scalability, and achieves good scalability.

Inactive Publication Date: 2017-11-17
UNIV OF SCI & TECH OF CHINA
1 Cites, 6 Cited by

AI Technical Summary

Problems solved by technology

Although SparkSW has good scalability, its performance is not ideal.



Examples


Embodiment

[0034] Figure 1 shows the architecture of DSA, the elastic distributed sequence alignment system based on Spark and SIMD technology. DSA adopts a standard master-slave architecture consisting of one Master and several Slaves. The Master is responsible for managing metadata and the cluster; each Master node hosts the Spark Master, the Alluxio Master, and the HDFS NameNode. The Slaves, also called Workers, are responsible for data storage and computation, and a deployment generally has several of them. Each Worker node has two layers: a storage layer and a computing layer. In the storage layer, to speed up data reads and writes, the memory-based distributed file system Alluxio serves as the main storage component in place of a traditional disk-based distributed file system; in DSA, HDFS is used only for data persistence. The computing layer is mainly based on the memory distributed computing frame...
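The storage-layer idea in [0034] (a memory-speed tier for reads and writes, with a disk-based tier kept only for persistence) can be illustrated with a toy model. This is a minimal sketch, not the Alluxio or HDFS API: the class `TieredStore`, its methods, and the example path are all hypothetical names introduced here for illustration.

```python
class TieredStore:
    """Toy two-tier store: a memory tier in front of a persistent tier.

    Hypothetical illustration only; it does not use or mirror the real
    Alluxio or HDFS interfaces.
    """

    def __init__(self):
        self.memory = {}      # stands in for the memory-based tier
        self.persistent = {}  # stands in for the disk-based tier

    def write(self, path, data):
        # Serve future reads from memory, and keep a durable copy.
        self.memory[path] = data
        self.persistent[path] = data

    def read(self, path):
        # Fast path: the data is already in the memory tier.
        if path in self.memory:
            return self.memory[path]
        # Slow path: load from the persistent tier and cache it.
        data = self.persistent[path]
        self.memory[path] = data
        return data

    def evict(self, path):
        # Dropping the in-memory copy does not lose the data.
        self.memory.pop(path, None)


store = TieredStore()
store.write("/db/seqs.fa", b">s1\nACGT\n")  # hypothetical path
store.evict("/db/seqs.fa")
restored = store.read("/db/seqs.fa")
```

After the eviction, the read falls back to the persistent copy and repopulates the memory tier, which is the role the text assigns to HDFS.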


Abstract

The invention discloses a flexible distributed sequence alignment system based on Spark and SIMD. The system includes a master node and multiple worker nodes connected to it. The master node manages metadata and the cluster, and comprises a master of the distributed computing framework Spark, a master of a distributed in-memory file system, and a master of the Hadoop Distributed File System (HDFS). The worker nodes store data and perform computation, and each comprises a storage layer and a computing layer. The storage layer includes Alluxio and HDFS; the computing layer includes Spark and a SIMD instruction set, and Spark calls a SIMD-based sequence alignment algorithm through a mediation module to perform the alignment. Alluxio and HDFS provide distributed data storage, Spark provides distributed computation, and SIMD is used on each node for the alignment itself, which improves performance.
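The division of labour in the abstract (partition the data, run an alignment kernel on each partition, merge the results) can be sketched in plain Python. This is a simplified stand-in under stated assumptions, not the patented implementation: a thread pool plays the role that Spark's partition-wise processing would play in a real deployment, and `placeholder_kernel` (which merely measures the longest common substring) stands in for the native SIMD alignment kernel reached through the mediation module. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def placeholder_kernel(query, target):
    """Stand-in for a native SIMD kernel: longest common substring length."""
    best = 0
    prev = [0] * (len(target) + 1)
    for q in query:
        curr = [0] * (len(target) + 1)
        for j, t in enumerate(target, start=1):
            if q == t:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def align_partition(query, partition, kernel):
    # Each worker scores only the records in its own partition.
    return [(kernel(query, seq), name) for name, seq in partition]


def top_hits(query, records, num_partitions=2, k=3, kernel=placeholder_kernel):
    # Round-robin split of the database records into partitions.
    partitions = [records[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        scored = list(pool.map(lambda p: align_partition(query, p, kernel),
                               partitions))
    # Merge the per-partition results and keep the k best (score, name) pairs.
    merged = [hit for part in scored for hit in part]
    return heapq.nlargest(k, merged)


records = [("s1", "ACGTACGT"), ("s2", "TTTT"), ("s3", "GGACGG"), ("s4", "ACG")]
hits = top_hits("ACGT", records, num_partitions=2, k=2)
```

Passing the kernel as a parameter keeps the distribution logic independent of how a single pair of sequences is scored, which mirrors the separation between the framework and the per-node alignment routine described above.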

Description

Technical field

[0001] The present invention relates to a system and method for sequence alignment, in particular to an elastic distributed sequence alignment system and method based on Spark and SIMD.

Background

[0002] Sequence alignment is used to identify highly similar regions between two sequences. The alignment score is generally used to evaluate the similarity between two sequences and, to facilitate subsequent analysis, the optimal alignment path is also calculated. Sequence alignment is a basic and crucial algorithm in bioinformatics, widely used in gene string matching, local re-alignment and other calibration operations, variation analysis, protein database search, and other fields. It includes local, global, and semi-global sequence alignment. Currently the most commonly used local sequence alignment algorithm is the Smith-Waterman (SW) algorithm, whic...
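For readers unfamiliar with the alignment score mentioned in [0002], the following is a minimal scalar sketch of Smith-Waterman scoring. It assumes a linear gap penalty and illustrative scoring values (`match=2`, `mismatch=-1`, `gap=-2`), none of which come from the source; it computes only the best local score, not the alignment path, and it is not the SIMD variant the invention relies on.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score under a linear gap penalty.

    The scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(target) + 1)  # previous row of the score matrix
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


score_identical = smith_waterman_score("ACGT", "ACGT")
score_embedded = smith_waterman_score("TTACGTTT", "ACG")
```

Only two rows of the score matrix are kept at a time, so memory grows with the target length rather than with the product of both lengths; recovering the optimal alignment path would require keeping the full matrix or traceback pointers.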


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F19/10, G06F17/30
CPC: G06F16/182, G16B99/00
Inventor: 徐波, 王超, 周学海, 李曦, 陈香兰, 李昌龙, 庄航, 王茄力, 王庆凤
Owner: UNIV OF SCI & TECH OF CHINA