Method and system for realizing large-scale database clustering by double-buffer model

A double-buffer model and clustering method, applied in the field of data processing, that addresses time-consuming data reading and low resource utilization and improves parallelism.

Active Publication Date: 2020-07-14
SHANDONG UNIV

AI Technical Summary

Problems solved by technology

[0007] The technical problem to be solved in this application is: when the amount of data is too large, reading the data to be processed is time-consuming. The question is how to use double buffering so that, in programs that perform exact matching and compute similarity, calculation and data reading run in parallel and the computation time masks the data-reading time.
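The idea of masking read time with computation can be illustrated with a minimal sketch. It is not the patent's implementation: the function name `double_buffered_map`, the two-queue hand-off, and the use of an in-memory iterable in place of a real file reader are all assumptions made for illustration. A reader thread fills one buffer while the main thread processes the other, and the two buffers are swapped through queues.

```python
import queue
import threading


def double_buffered_map(chunks, process):
    """Apply `process` to each chunk while a reader thread prefetches the next.

    `chunks` is any iterable of lists (hypothetical stand-in for slow file
    reads); `process` is the computation applied to each chunk.
    """
    free = queue.Queue()    # buffers that may be overwritten by the reader
    filled = queue.Queue()  # buffers that hold data ready for processing
    for _ in range(2):      # exactly two buffers: the "double buffer"
        free.put([])

    def reader():
        for chunk in chunks:
            buf = free.get()   # blocks until the consumer releases a buffer
            buf.clear()
            buf.extend(chunk)
            filled.put(buf)
        filled.put(None)       # sentinel: no more data

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    results = []
    while True:
        buf = filled.get()     # blocks until the reader has filled a buffer
        if buf is None:
            break
        results.append(process(list(buf)))
        free.put(buf)          # hand the buffer back to the reader
    thread.join()
    return results


if __name__ == "__main__":
    print(double_buffered_map([[1, 2], [3], [4, 5, 6]], sum))
```

In CPython, blocking file I/O releases the global interpreter lock, so a reader thread of this kind can overlap with computation in the main thread; pure-Python computation in both threads would not run in parallel.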

Examples


Embodiment 1

[0046] This embodiment discloses the general flow of using the double-buffer model to implement the large-scale database clustering algorithm and the maximal exact matches (MEMs) algorithm:

[0047] The gene sequence database is sorted in descending order of length. When similarity matching is performed on two sequences, the longer one is by default the representative sequence and the shorter one is the redundant sequence. The first item after sorting is therefore always a representative sequence, and any later sequence whose similarity to it reaches the threshold is marked as its redundant sequence.
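The sort-then-mark flow described above can be sketched as follows. This is an illustrative sketch only: `greedy_cluster` and `kmer_containment` are hypothetical names, and the k-mer containment score is a toy stand-in for the similarity measure, which the text derives from exact matching rather than from k-mers.

```python
def kmer_containment(rep, seq, k=3):
    """Toy similarity (assumption): fraction of seq's k-mers found in rep."""
    seq_kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if not seq_kmers:
        return 0.0
    rep_kmers = {rep[i:i + k] for i in range(len(rep) - k + 1)}
    return len(seq_kmers & rep_kmers) / len(seq_kmers)


def greedy_cluster(sequences, similarity, threshold):
    """Return a list of (representative, redundant_sequences) pairs."""
    # Longest first, so the first unmarked sequence is always a representative.
    ordered = sorted(sequences, key=len, reverse=True)
    redundant = [False] * len(ordered)
    clusters = []
    for i, rep in enumerate(ordered):
        if redundant[i]:
            continue  # already assigned to an earlier representative
        members = []
        for j in range(i + 1, len(ordered)):
            if not redundant[j] and similarity(rep, ordered[j]) >= threshold:
                redundant[j] = True
                members.append(ordered[j])
        clusters.append((rep, members))
    return clusters


if __name__ == "__main__":
    seqs = ["ACGTAC", "TTTT", "ACGTACGT", "TTTTTT"]
    for rep, members in greedy_cluster(seqs, kmer_containment, 0.9):
        print(rep, members)
```

Because each representative is compared only against shorter or equal-length sequences that have not yet been marked, every sequence ends up either as a representative or in exactly one redundant list.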

[0048] The algorithm first needs to build a matching dictionary, which is implemented as a sparse suffix array (SSA). One gene sequence is built into a sparse suffix array that serves as the dictionary, and the other gene sequences are matched against this dictionary suffix array. During the matching process, a certain position of the query sequence adopts binary searc...
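A sparse suffix array and a binary-search lookup against it can be sketched as below. This is a simplified exact-pattern search, not the MEM algorithm itself, and it does not use the further optimizations the text mentions. The function names, the sparseness parameter `k`, and the naive sort-based construction are assumptions chosen for brevity. Only every `k`-th suffix is indexed, so a pattern is searched once per offset `d < k` and each candidate is verified against the text; occurrences are guaranteed to be found only for patterns of length at least `k`.

```python
def build_sparse_suffix_array(text, k):
    """Sorted start positions of the suffixes at 0, k, 2k, ... (naive build)."""
    return sorted(range(0, len(text), k), key=lambda i: text[i:])


def _lower_bound(text, ssa, query):
    """Binary search: first index whose suffix prefix is >= query."""
    lo, hi = 0, len(ssa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[ssa[mid]:ssa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_occurrences(text, ssa, k, pattern):
    """All start positions of `pattern` in `text` (for len(pattern) >= k)."""
    hits = set()
    for d in range(min(k, len(pattern))):
        query = pattern[d:]  # the part that must begin at a sampled position
        i = _lower_bound(text, ssa, query)
        while i < len(ssa) and text[ssa[i]:ssa[i] + len(query)] == query:
            start = ssa[i] - d
            # Verify the d characters that precede the sampled position.
            if start >= 0 and text[start:start + len(pattern)] == pattern:
                hits.add(start)
            i += 1
    return sorted(hits)


if __name__ == "__main__":
    text = "ACGTACGTTACG"
    ssa = build_sparse_suffix_array(text, 2)
    print(find_occurrences(text, ssa, 2, "ACG"))
```

A larger `k` shrinks the index roughly by a factor of `k` at the cost of up to `k` binary searches per query, which is the usual memory-for-time trade-off of a sparse suffix array.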

Embodiment 2

[0068] The purpose of this embodiment is to provide a system that uses a double-buffer model to realize large-scale database clustering. The system includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the program, the processor implements the following steps:

[0069] Sort the gene sequence database in descending order of length;

[0070] Build a matching dictionary as a sparse suffix array: one gene sequence is built into a sparse suffix array that serves as the dictionary, and the other gene sequences are matched against this dictionary suffix array. During matching, binary search is used at a given position of the query sequence, and an inverse suffix array, a longest common prefix (LCP) array, and suffix links are used to optimize and improve the search. Once the calculated matching value reaches the threshold, the sequence is determined to be a redundant sequence.


Abstract

The invention discloses a method and a system for realizing large-scale database clustering with a double-buffer model. The method comprises the following steps: sorting a gene sequence database in descending order of length; and building a matching dictionary as a sparse suffix array, in which one gene sequence is built into a sparse suffix array that serves as the dictionary and the other gene sequences are matched against this dictionary suffix array. During matching, binary search is used at a given position of the query sequence, an inverse suffix array, a longest common prefix (LCP) array, and suffix links are used for optimization and improvement, and a sequence is judged to be redundant once the calculated matching value reaches a threshold. Both the clustering operation and the redundant-sequence removal operation on biological gene sequences in a large-scale database rely on exact matching of gene sequences, and under these conditions double-buffered multi-threaded parallel operation enables fast processing of the I/O on large-scale data files.

Description

Technical field

[0001] The invention belongs to the technical field of data processing, and in particular relates to a method and system for realizing large-scale database clustering with a double-buffer model.

Background

[0002] The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.

[0003] Gene sequencing data is doubling at an ever faster rate, so the most prominent core problem in biological big-data research on gene and health data is the sheer volume of data.

[0004] Genome processing is based on matching operations on genome sequences. Performing genome matching on such data is not particularly difficult algorithmically. However, the data to be processed now amounts to terabytes or more; in other words, the memory of an ordinary machine cannot hold that much data at all. Moreover, for huge genetic data, many algorithms and related process...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B40/30, G16B50/30
CPC: G16B40/30, G16B50/30, Y02D10/00
Inventors: 刘卫国, 徐晓明
Owner: SHANDONG UNIV