A sampling dictionary tree index-based compression method and system for genetic data

A gene data compression technology based on a sampling dictionary tree index. It addresses time-consuming steps that slow down compression, and aims to preserve the compression rate while reducing matching time and improving speed.

Active Publication Date: 2020-07-10
INST OF COMPUTING TECH CHINESE ACAD OF SCI
AI Technical Summary

Problems solved by technology

Overall, this method achieves a high compression rate, but it needs to read the reference file to construct the index table before performing the compression. This process is time-consuming and affects the compression speed.

Examples

Embodiment Construction

[0049] To make the above features and effects of the present invention clearer and easier to understand, specific examples are described in detail below with reference to the accompanying drawings.

[0050] System structure:

[0051] The distributed DNA file compression system mainly implements compressed storage of gene files. The system consists of three parts: the Client, the Server, and the Compressor. The relationships among these three parts are shown in Figure 4.

[0052] In the distributed compression system, the Client faces the user directly. The user can initiate write (compression), read (decompression), query, and delete requests on data blocks, and all requests and data are sent to the Server.

[0053] The Server acts as a bridge connecting the Client and the Compressor nodes. The Server maintains a request queue for receiving data and requests. After receiving the request, the serve...
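
The request flow described in [0052] and [0053] can be illustrated with a minimal Python sketch. It only models the four request types and a Server-side request queue. The names `Op`, `Request`, `Server`, `block_id`, and `payload` are hypothetical and do not come from the patent, and what the Server does with a request after dequeuing it is left out because the source text is truncated at that point.

```python
import queue
from dataclasses import dataclass
from enum import Enum


class Op(Enum):
    """The four request types a user can initiate from the Client."""
    WRITE = "write"    # compression
    READ = "read"      # decompression
    QUERY = "query"
    DELETE = "delete"


@dataclass
class Request:
    op: Op
    block_id: str          # hypothetical identifier for a data block
    payload: bytes = b""   # data to compress; empty for non-write requests


class Server:
    """Holds a FIFO request queue between the Client and the Compressor nodes."""

    def __init__(self) -> None:
        self._requests: "queue.Queue[Request]" = queue.Queue()

    def submit(self, request: Request) -> None:
        # Called on behalf of the Client: enqueue the request and its data.
        self._requests.put(request)

    def pending(self) -> int:
        # Number of requests waiting in the queue.
        return self._requests.qsize()

    def next_request(self) -> Request:
        # Dequeue the oldest request; raises queue.Empty if none is waiting.
        # Forwarding to a Compressor node is not modeled here.
        return self._requests.get_nowait()
```

A FIFO queue is used here only because it is the simplest reading of "request queue"; the actual scheduling policy is not stated in the visible text.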

Abstract

The invention relates to a compression method and system for gene data based on a sampling dictionary tree index. The method comprises the following steps: a user uploads the gene data to be compressed, wherein the gene data comprises an identifier, a sequence, and a quality score; a substring of preset length is extracted from the sequence and looked up in a dictionary tree index structure. If the dictionary tree index structure contains the substring, the substring is compressed into its position and length in the dictionary tree index structure, and the position and length serve as the index values of the substring; otherwise, the substring is added to the dictionary tree index structure, and its position and length in the dictionary tree index structure are recorded as the index values of the substring. The method and system improve the compression of the sequence, and the sampling index uses the quality score to decide whether a substring is added to the index structure, which reduces the memory occupied by the dictionary tree.
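
The lookup-or-insert step in the abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the "position" of a substring is taken to be an integer id assigned to the trie node where the substring ends, the sampling rule is assumed to be "insert only if the mean Phred quality of the substring reaches a threshold", quality characters are assumed to use the Phred+33 encoding, and substrings that are neither found nor inserted are emitted as literals. The names `SampledTrieIndex`, `min_mean_quality`, and `mean_phred` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

Token = Union[Tuple[int, int], str]  # (position, length) or a literal substring


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    node_id: Optional[int] = None  # set only where a stored substring ends


def mean_phred(quality: str) -> float:
    """Mean quality of a quality string, assuming Phred+33 encoding."""
    return sum(ord(c) - 33 for c in quality) / len(quality)


class SampledTrieIndex:
    def __init__(self, k: int, min_mean_quality: float) -> None:
        self.k = k                                # preset substring length
        self.min_mean_quality = min_mean_quality  # assumed sampling threshold
        self.root = TrieNode()
        self.next_id = 0

    def lookup(self, sub: str) -> Optional[int]:
        node = self.root
        for ch in sub:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.node_id

    def insert(self, sub: str) -> int:
        node = self.root
        for ch in sub:
            node = node.children.setdefault(ch, TrieNode())
        if node.node_id is None:
            node.node_id = self.next_id
            self.next_id += 1
        return node.node_id

    def encode(self, sequence: str, quality: str) -> List[Token]:
        tokens: List[Token] = []
        for start in range(0, len(sequence), self.k):
            sub = sequence[start:start + self.k]
            qual = quality[start:start + self.k]
            pos = self.lookup(sub)
            if pos is not None:
                tokens.append((pos, len(sub)))               # already indexed
            elif mean_phred(qual) >= self.min_mean_quality:
                tokens.append((self.insert(sub), len(sub)))  # sampled into index
            else:
                tokens.append(sub)                           # kept out of the index
        return tokens
```

For example, with `k=4` and a threshold of 20, encoding the sequence `ACGTACGTTTTT` with quality string `IIIIIIII!!!!` inserts `ACGT` once, reuses it for the second chunk, and leaves the low-quality `TTTT` chunk out of the trie. Decoding would additionally require the trie to be stored or rebuilt, which this sketch does not cover.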

Description

Technical field

[0001] The invention relates to the field of DNA data compression, in particular to the compression of files in FASTQ format, and more specifically to a compression method and system for genetic data based on a sampling dictionary tree index.

Background technique

[0002] In recent years, DNA data research has found a wide range of applications, including important fields and disciplines such as genetic engineering, medical diagnosis, forensic biology, and genetic genealogy. DNA sequencing projects, which provide basic data for these research fields, have gradually become national key research projects. At the same time, with the continuous reduction of sequencing costs, the data obtained with modern sequencing technology has reached the PB level. According to the official statistics of the National Center for Biotechnology Information (NCBI), as of November 21, 2018, the number of sequence bases stored in SRA (Sequence Read Archive) through NGS seque...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): H03M7/30
CPC: H03M7/3059
Inventor: 高艳珍, 包小圳, 邢晶, 魏征, 霍志刚, 马捷, 张佩珩
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI