High-performance k-mer frequency counting method and system based on clustering algorithm

A clustering-based counting technique applied in the field of bioinformatics. It addresses the low efficiency of hash algorithms and the wasted GPU resources and large memory overhead of existing approaches, with the effect of avoiding insufficient memory, facilitating GPU computation, and improving parallel speed.

Active Publication Date: 2022-07-29
SICHUAN INNOVATION RES INST OF TIANJIN UNIV

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to overcome the deficiencies of the prior art by providing a high-performance k-mer frequency counting method and system based on a clustering algorithm. The method builds on a coordinate offset algorithm and an unsupervised machine-learning clustering algorithm, and expresses each k-mer sequence by a unique centroid instead of a hash value, which addresses the low efficiency of the Hash algorithm. It uses CUDA streams together with asynchronous operations to address wasted GPU resources and large memory overhead and to improve calculation speed. It adopts CPU+CUDA heterogeneous programming to ensure high-speed communication and collaborative computing between the CPU and the GPU, achieving high-accuracy, high-efficiency k-mer counting.



Examples


Embodiment 1

[0041] In this embodiment, as shown in Figure 1, a high-performance k-mer frequency counting method based on a clustering algorithm includes the following steps:

[0042] The CPU reads the genome format file and obtains the gene sequence reads of length L;

[0043] The CPU creates a CUDA stream that includes the following GPU kernel functions and the operations they perform:

[0044] S1: Split reads into L-K+1 k-mer sequences according to the preset subsequence length K;

[0045] S2: Take each base position in the k-mer sequence as a coordinate point, convert the multi-segment k-mer sequences obtained by splitting into corresponding coordinate arrays and add coordinate offsets;

[0046] S3: Use the clustering algorithm to calculate the centroid feature value of each coordinate array, and return all k-mer sequences together with their corresponding centroid feature values to the CPU;

[0047] The CPU counts the frequency of occurrence of all k-mers according to the k-m...
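Step S1 and the final counting step can be illustrated with a minimal CPU-only sketch. It is a plain Python stand-in for the GPU kernel, not the patented implementation: it counts k-mers by their string rather than by a centroid feature value, and the function names `split_kmers` and `count_kmers` are hypothetical.

```python
from collections import Counter


def split_kmers(read: str, k: int) -> list[str]:
    """Return the L-K+1 overlapping k-mers of a read of length L."""
    if k <= 0 or k > len(read):
        return []  # no k-mer fits in the read
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count how often each k-mer occurs across all reads.

    Simplification: the key is the k-mer string itself, whereas the
    method described above keys on the centroid feature value.
    """
    counts: Counter = Counter()
    for read in reads:
        counts.update(split_kmers(read, k))
    return counts


if __name__ == "__main__":
    # Two toy reads of length L = 6, split with K = 3.
    for kmer, n in sorted(count_kmers(["ACGTAC", "CGTACG"], 3).items()):
        print(kmer, n)
```

Each read of length 6 yields 6 - 3 + 1 = 4 k-mers here, which matches the L-K+1 count stated in S1.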

Embodiment 2

[0069] In this embodiment, based on the method of Embodiment 1, the Kmeans cluster analysis algorithm is used in place of the mean-shift clustering algorithm to calculate the centroid. The specific centroid calculation process is as follows:

[0070] After the k-mer sequence is converted into coordinate points, Kmeans clustering can be used to calculate the centroid of the set of coordinate points, with the mean square error of the Euclidean distance between sample points (coordinate points) as the criterion function. This centroid can serve as the centroid feature value of the k-mer sequence and uniquely expresses that k-mer sequence. At this point, all k-mer sequences of the reads in the sequencing file can be converted into centroid expressions for the subsequent statistics of k-mer occurrence frequency.
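The idea of a centroid feature value can be sketched as follows. With a single cluster, the point that minimises the mean squared Euclidean distance to the samples is their arithmetic mean, so the sketch computes the mean directly instead of iterating Kmeans. The text above does not give the coordinate or offset formula, so the base codes in `BASE_CODE` and the `4 ** position` offset are assumptions, chosen only so that distinct k-mers of the same length get distinct centroids.

```python
from fractions import Fraction
from itertools import product

# Assumed base-to-integer mapping (not taken from the patent text).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_points(kmer: str) -> list[tuple[int, int]]:
    """Map each base to a point (position, code * 4**position).

    The 4**position factor is a hypothetical coordinate offset that
    keeps the sum of the y values unique for each k-mer of length k.
    """
    return [(i, BASE_CODE[base] * 4 ** i) for i, base in enumerate(kmer)]


def centroid(points: list[tuple[int, int]]) -> tuple[Fraction, Fraction]:
    """Arithmetic mean of the points, i.e. the one-cluster Kmeans centroid.

    Fractions keep the result exact so it can be used as a dictionary key.
    """
    n = len(points)
    return (Fraction(sum(x for x, _ in points), n),
            Fraction(sum(y for _, y in points), n))


def centroid_feature(kmer: str) -> tuple[Fraction, Fraction]:
    """Centroid feature value of a non-empty k-mer under the assumed encoding."""
    return centroid(kmer_to_points(kmer))


if __name__ == "__main__":
    features = {"".join(p): centroid_feature("".join(p))
                for p in product("ACGT", repeat=3)}
    print(len(set(features.values())), "distinct centroids for", len(features), "3-mers")
```

Under this assumed encoding the y coordinate of the centroid is the base-4 value of the k-mer divided by k, which is why the mapping is one-to-one for a fixed k. A different offset scheme would need its own uniqueness argument.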

[0071] Among them, regarding the Kmeans cluster analysis algorithm, its related concepts include: (1) K value, that is, the number of clusters to be obtained...

Embodiment 3

[0084] On the basis of the method provided in Embodiment 1, the present invention further provides a high-performance k-mer frequency counting system based on a clustering algorithm. The system includes a k-mer preprocessing module, a coordinate conversion module, a Kmeans calculation module, and a k-mer frequency statistics module, where:

[0085] The k-mer preprocessing module is used to read the sequencing file from the disk through the CPU, transmit it to the GPU through the CUDA stream, and use the GPU to split the sequencing file into multiple k-mer sequences according to the preset subsequence length K.

[0086] The coordinate conversion module is used to take each base position in the k-mer sequence as a coordinate point, convert the multi-segment k-mer sequences obtained by splitting into corresponding coordinate arrays, and add coordinate offsets.

[0087] The Kmeans calculation module is used to calculate the centroid feature values of each coordinate array u...



Abstract

The invention discloses a high-performance k-mer frequency counting method and system based on a clustering algorithm, and relates to the technical field of bioinformatics. The method comprises the following steps: a CPU reads a genome format file and obtains gene sequence reads of length L; the file is transmitted to a GPU (Graphics Processing Unit) through a CUDA (Compute Unified Device Architecture) stream, where the gene sequence is split by a character string splitting method and the bases of each k-mer sequence are converted into coordinate values; the centroid of each coordinate array is then obtained through Kmeans clustering, and each k-mer sequence and its centroid are returned to the CPU; finally, the CPU counts the occurrence frequency of the k-mers and outputs a k-mer frequency distribution result. According to the disclosure, the method greatly improves k-mer frequency counting speed, reduces the occupation of computing resources, and helps bioinformatics analysts obtain accurate analysis results in a shorter time.

Description

Technical field

[0001] The invention belongs to the field of bioinformatics and relates to a genome k-mer frequency counting method, in particular to a k-mer frequency counting method based on a clustering algorithm.

Background

[0002] Since Roche launched the first second-generation sequencer, Roche 454, in 2005, life science has officially entered the era of high-throughput sequencing. The launch of the Illumina series of sequencing platforms has greatly reduced the price of next-generation sequencing, making high-throughput sequencing widely used in many research fields of the life sciences. So far, second-generation short-read sequencing technology still holds a dominant position in the global sequencing market.

[0003] In the field of bioinformatics, counting the number of occurrences of each k-mer (substring of length k) in a long string is a core sub-problem of many applications, including genome assembly, error correction of sequencing reads, fast multipl...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B40/30, G06K9/62
CPC: G16B40/30, G06F18/23213, Y02D10/00
Inventor: 李国良张也吉祥宇刘宇驰杨诗宇武晟祥陈松林谢宇涛杨月刘子祯
Owner: SICHUAN INNOVATION RES INST OF TIANJIN UNIV