A method and system for distributed storage of gene variation data

A distributed storage technology for gene variation data, applied in chemical information database systems, chemical data mining, chemical informatics data warehouses, and related fields. It addresses problems such as high data maintenance costs, long data flow delays, and poor scalability, and achieves high batch processing efficiency, reduced data redundancy, and good random read performance.

Active Publication Date: 2020-08-18
SOUTH CHINA UNIV OF TECH
Cites: 7, Cited by: 0

AI Technical Summary

Problems solved by technology

The genome-wide association analysis scenario requires both low-latency random reads and efficient batch reads and writes. An inappropriate storage architecture may lead to low efficiency, complex models, and poor scalability, so a suitable storage architecture is needed to improve the efficiency of genome-wide association analysis.
[0003] The storage scheme based on the Hadoop Distributed File System (HDFS) stores variant call files (VCF files) as blocks across multiple nodes. It scales well and responds efficiently to batch analysis tasks, but it cannot provide low-latency random data access, nor can it support data update operations.
The HBase-based storage solution stores VCF files as key-value pairs. HBase is a distributed database that scales easily to multiple nodes and supports low-latency random reads and writes. However, because HBase is a column-family store that holds key-value pairs, its scan overhead is relatively large, and efficient batch analysis cannot be achieved.
The hybrid HDFS+HBase architecture can achieve both low-latency random reads and writes and efficient batch analysis, but its model is complex, its data maintenance cost is high, and there is a long delay between the time data is generated and the time it becomes available for batch analysis.
In addition, there are genotype query tools such as gqt, which build bitmap indexes over VCF files to speed up retrieval. Such a tool covers only part of the functionality the scenario requires, more complex queries need several tools in combination, and most of these tools run on a single node and therefore scale poorly.


Embodiment Construction

[0048] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not constitute a conflict with each other.

[0049] As shown in Figure 1, the distributed storage method for genetic variation data provided by the present invention includes the following steps:

[0050] S1. Preprocess the VCF file: strip the VCF header, vertically split the VCF file into two parts (metadata information and sample genotype information), and further vertically split the sample genotype data into more parts according to th...
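The vertical split described in step S1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `split_vcf`, the integer `row_id` used to re-join the parts, and the `samples_per_part` grouping parameter are all hypothetical: the criterion the source uses for the further split is truncated above, so a fixed group size stands in for it here. The only fact taken from the VCF format itself is that the first nine tab-separated columns (CHROM through FORMAT) are fixed and the remaining columns hold per-sample genotypes.

```python
N_FIXED = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT


def split_vcf(lines, samples_per_part=2):
    """Split VCF text lines into a metadata table and genotype column groups.

    samples_per_part is a hypothetical grouping rule, assumed for illustration.
    Returns (metadata_rows, parts); every row starts with a shared row_id.
    """
    metadata = []
    parts = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # drop the meta-information header lines
        fields = line.split("\t")
        if fields[0] == "#CHROM":
            # The column header line names the samples; create one part per group.
            samples = fields[N_FIXED:]
            parts = [
                {"samples": samples[i:i + samples_per_part], "rows": []}
                for i in range(0, len(samples), samples_per_part)
            ]
            continue
        row_id = len(metadata)
        metadata.append((row_id, *fields[:N_FIXED]))
        genotypes = fields[N_FIXED:]
        for p, part in enumerate(parts):
            start = p * samples_per_part
            part["rows"].append((row_id, *genotypes[start:start + samples_per_part]))
    return metadata, parts


EXAMPLE = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\trs1\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\t0/0",
    "1\t200\trs2\tC\tT\t60\tPASS\t.\tGT\t0/0\t0/1\t1/1",
]
metadata, parts = split_vcf(EXAMPLE, samples_per_part=2)
```

Each resulting part could then be loaded into its own column group or table, with `row_id` serving as the join key back to the metadata rows.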

Abstract

A distributed storage method and architecture for gene variation data. The method comprises a distributed data storage process, a distributed bitmap index creation process, and a distributed query and retrieval process. The architecture comprises a distributed column storage module, a distributed bitmap index module, and a query and retrieval module. In the method, data is stored in a distributed manner using the columnar storage engine Kudu, and distributed local bitmap indexes are built on the sample columns. This effectively resolves the poor random data access performance of the existing HDFS solution and the poor batch analysis performance of the HBase solution, simplifies the storage architecture model, and removes the need to combine multiple genotype query tools. The distributed local bitmap index scheme also enables high concurrency and improves scalability.
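To make the idea of distributed local bitmap indexes on sample columns concrete, here is a small in-memory Python sketch. It does not use Kudu and is not the patent's implementation. It assumes the data is divided into row partitions, that each partition builds its own index over its own rows, and that a query is sent to every partition and the hits are merged. The names `build_local_index`, `query_partition`, and `query_all` are hypothetical, and Python integers stand in for bitmaps.

```python
from collections import defaultdict


def build_local_index(rows, samples):
    """Build one partition's index: (sample, genotype) -> bitmap over local rows."""
    index = defaultdict(int)
    for local_row, genotypes in enumerate(rows):
        for sample, genotype in zip(samples, genotypes):
            index[(sample, genotype)] |= 1 << local_row  # set this row's bit
    return dict(index)


def query_partition(index, conditions):
    """AND the bitmaps of all (sample, genotype) conditions; return local row numbers."""
    if not conditions:
        raise ValueError("at least one condition is required")
    bitmap = -1  # all bits set, narrowed by each condition
    for condition in conditions:
        bitmap &= index.get(condition, 0)
    hits, position = [], 0
    while bitmap:
        if bitmap & 1:
            hits.append(position)
        bitmap >>= 1
        position += 1
    return hits


def query_all(partitions, conditions):
    """Query every (row_offset, local_index) partition and merge into global row numbers."""
    hits = []
    for offset, index in partitions:
        hits.extend(offset + row for row in query_partition(index, conditions))
    return hits


samples = ["S1", "S2"]
partition_0 = build_local_index([("0/1", "1/1"), ("0/0", "0/1")], samples)
partition_1 = build_local_index([("0/1", "1/1"), ("1/1", "1/1")], samples)
partitions = [(0, partition_0), (2, partition_1)]

both = query_all(partitions, [("S1", "0/1"), ("S2", "1/1")])
hom_alt_s2 = query_all(partitions, [("S2", "1/1")])
```

Because each index covers only its own partition, partitions can be indexed and queried independently, which is the property that allows concurrent execution across nodes in a real deployment.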

Description

Technical field

[0001] The invention relates to the field of big data storage, in particular to a method and system for distributed storage of genetic variation data based on columnar storage and bitmap indexing.

Background

[0002] With the rapid development of gene sequencing technology and the urgent need for personalized medicine, genome-wide association analysis has become an increasingly popular research field. Genome-wide association analysis relies on large-scale genetic variation detection data, which is typical big data. The data organization, indexing, and scaling methods of different storage architectures have a great impact on data retrieval and analysis. The genome-wide association analysis scenario requires both low-latency random reads and efficient batch reads and writes. An inappropriate storage architecture may lead to problems such as low efficiency, complex models, and poor scalability. It is ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G16C20/90, G16C20/70
CPC: G16B40/00, G16B50/00
Inventor: 董守斌, 王博, 董守玲, 袁华
Owner: SOUTH CHINA UNIV OF TECH