Method and system for generating summary data of biological gene sequence

A gene sequence summary data technology, applied in the field of biological data processing. It addresses the limited performance of hash value calculation and the lack of a faster method, and achieves the effects of reducing dependencies, improving program performance, and avoiding prediction failures.

Active Publication Date: 2021-10-12
SHANDONG UNIV
Cites: 10; Cited by: 2

AI Technical Summary

Problems solved by technology

[0006] The inventors found that in applications based on MinHash sketches and HyperLogLog sketches, hash value calculation is one of the most computationally intensive steps, so the performance of the entire program is limited by it. A better-performing method for this calculation is still lacking.


Examples

Embodiment 1

[0053] Embodiment 1 of the present disclosure provides a method for generating summary data of biological gene sequences. For hash value calculation, this embodiment provides several SIMD-based improved hash functions, including MurmurHash3, CityHash, xxHash, and wangHash. These hash functions are used to construct the hash value list of the gene sequence, and a different hash function can be chosen for different situations, which widens applicability. The vectorized implementation makes the calculation faster.

[0054] The original hash function calculates the hash value as follows:

[0055] For the sequence data to be processed, a sliding window is used to generate a K-mer, and the K-mer is then processed to obtain its reverse complementary strand (DNA generally has a double-stranded structure formed by two coiled single strands, and the two single strands are complementary, that is, base pairs are formed between every two bases. This ...
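The sliding-window step and the reverse complement described in [0055] can be sketched as follows. This is a minimal scalar illustration in Python, not the vectorized implementation of the disclosure; the function names `kmers` and `reverse_complement` are hypothetical, and the input is assumed to contain only the bases A, C, G, and T.

```python
# Complement table for the four DNA bases (assumption: input has no N or lowercase).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the order of the bases."""
    return kmer.translate(COMPLEMENT)[::-1]


def kmers(sequence: str, k: int):
    """Yield every length-k substring by sliding a window one base at a time."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


if __name__ == "__main__":
    # Print each forward K-mer next to its reverse complement.
    for forward in kmers("ACGTAC", 3):
        print(forward, reverse_complement(forward))
```

A sequence of length n yields n - k + 1 K-mers, so each base after the first k - 1 contributes exactly one new window.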

Embodiment 2

[0099] Embodiment 2 of the present disclosure provides a system for generating summary data of biological gene sequences, including the following modules:

[0100] The data acquisition module is configured to: acquire the gene sequence to be processed;

[0101] The K-mer decomposition module is configured to: perform K-mer decomposition on the gene sequence to be processed using a sliding window, cutting out one fixed-length K-mer at each step, and obtain the reverse complementary strand of the gene sequence; pack M K-mers and the K-mers of their reverse complementary strands into vectors, respectively; compare the forward and reverse K-mers in a vectorized manner using a binary mask, that is, for each pair of forward and reverse K-mers, select the K-mer with the smaller character value, so that M K-mers remain; and perform a vectorized transposition operation on the remaining M K-mers;
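The mask-based selection in [0101] can be illustrated lane by lane in plain Python. This sketch only emulates what a SIMD compare followed by a masked blend would do over M lanes; it is not the disclosure's implementation. The 2-bit base encoding (A=0, C=1, G=2, T=3) and the names `encode`, `revcomp_code`, and `select_smaller` are assumptions made for the example. With this encoding, integer order matches lexicographic order for K-mers of equal length.

```python
from typing import List

# Assumed 2-bit encoding; it preserves the order A < C < G < T.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer: str) -> int:
    """Pack a K-mer into an integer, 2 bits per base, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def revcomp_code(code: int, k: int) -> int:
    """Reverse-complement a packed K-mer: the complement of base b is 3 - b."""
    out = 0
    for _ in range(k):
        out = (out << 2) | (3 - (code & 3))  # take the last base, complement it
        code >>= 2
    return out


def select_smaller(forward: List[int], reverse: List[int]) -> List[int]:
    """Build one mask bit per lane, then blend: keep forward where the mask is set."""
    mask = [f <= r for f, r in zip(forward, reverse)]
    return [f if m else r for f, r, m in zip(forward, reverse, mask)]


if __name__ == "__main__":
    batch = ["AACG", "TTTT", "GATC", "CCCA"]  # M = 4 lanes
    fwd = [encode(kmer) for kmer in batch]
    rev = [revcomp_code(code, 4) for code in fwd]
    print(select_smaller(fwd, rev))
```

Computing the mask first and blending afterwards avoids a data-dependent branch per K-mer, which is the property that makes the comparison suitable for vector instructions.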

[0102] The hash calculation module is c...

Embodiment 3

[0107] Embodiment 3 of the present disclosure provides a computer-readable storage medium on which a program is stored. When the program is executed by a processor, the steps in the method for generating summary data of biological gene sequences as described in Embodiment 1 of the present disclosure are implemented.


Abstract

The invention provides a method and a system for generating summary data of a biological gene sequence. The method comprises the following steps: acquiring a gene sequence to be processed; performing K-mer decomposition on the gene sequence using a sliding window; comparing M K-mers with the K-mers of the corresponding M reverse complementary strands and, for each pair of forward and reverse K-mers, selecting the K-mer with the smaller character value, so that M K-mers are obtained; performing a vectorized transposition operation on these K-mers; inputting the vectors obtained by the transposition operation into a hash function improved on the basis of single instruction, multiple data (SIMD) to obtain the hash values corresponding to the vectors; continuing to slide the window to obtain new K-mer subsequences and repeating the operation until hash values have been calculated for all K-mers of the gene sequence; constructing a hash value list of the gene sequence from all the hash values; and generating summary data of the gene sequence according to the hash value list. Because the method and system use a vectorized implementation, the calculation is faster and biological gene sequences can be processed more efficiently.
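The last two steps of the abstract, building a hash value list and deriving summary data from it, can be sketched as below. Two assumptions are made here and are not taken from the source: `hashlib.blake2b` truncated to 64 bits stands in for the SIMD-improved hash functions, and the summary is taken to be a bottom-s MinHash sketch (the s smallest distinct hash values). The names `hash64`, `hash_list`, and `bottom_s_sketch` are hypothetical.

```python
import hashlib
import heapq
from typing import Iterable, List


def hash64(kmer: str) -> int:
    """Stand-in 64-bit hash (truncated BLAKE2b), not the disclosure's SIMD hash."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def hash_list(kmers: Iterable[str]) -> List[int]:
    """Hash every selected K-mer; duplicates are kept at this stage."""
    return [hash64(kmer) for kmer in kmers]


def bottom_s_sketch(hashes: Iterable[int], s: int) -> List[int]:
    """Keep the s smallest distinct hash values, in ascending order."""
    return heapq.nsmallest(s, set(hashes))


if __name__ == "__main__":
    selected = ["AACG", "ACGT", "AACG", "CCCA"]  # K-mers after forward/reverse selection
    print(bottom_s_sketch(hash_list(selected), 2))
```

Under this assumption the sketch size s bounds the summary regardless of sequence length, which is why the hash calculation, performed once per K-mer, dominates the run time.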

Description

Technical field

[0001] The present disclosure relates to the technical field of biological data processing, in particular to a method and system for generating summary data of biological gene sequences.

Background

[0002] The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

[0003] With the development of sequencing technology, biological gene databases are growing ever larger. From public gene banks that initially held fewer than 50 million nucleotide sequences in total to a single sequencing instrument that can now generate more than one trillion sequences at a time, the scale of data has increased dramatically, and the data-generation capability of new sequencing technologies has surpassed "Moore's Law". In order to process genetic data efficiently, tools such as Mash and Dashing have been developed one after another. In such tools, the processing of gen...


Application Information

IPC(8): G16B40/00, G16B50/00
CPC: G16B40/00, G16B50/00
Inventor: 刘卫国, 林浩然, 徐晓明, 殷泽坤
Owner: SHANDONG UNIV