Compression and clustering-based batch protein homology search method

A protein homology search method, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the long running time and the large memory and storage consumption of existing searches, and reduces the time spent on repeated sequence alignment and gapless extension.

Active Publication Date: 2016-10-12
DALIAN UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0004] To solve the problem that existing batch protein homology searches take a lot of time and consume a lot of memory...



Examples


Embodiment Construction

[0028] Specific embodiments of the present invention are described in detail below with reference to the technical solutions and the accompanying drawings.

[0029] The method mainly comprises three stages: protein sequence compression, clustering, and batch search.

[0030] 1. The query sequences and the protein database sequences are compressed offline in the following steps:

[0031] 1) Scan a protein sequence from left to right to create a key-entry mapping set, as shown in Figure 1. In each key-entry mapping of the set, the key is a protein sequence fragment composed of 5 amino acids, and the entry has three attributes: the sequence number, the starting amino acid position, and a pointer to the next sequence.

[0032] 2) Scan a new protein sequence from left to right; each new protein sequence fragment is also composed of 4-6 amino acids. Apply the Needleman-Wunsch algorithm to perform similarity between each new protein sequence f...
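The two steps above can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names `build_key_entry_map` and `needleman_wunsch_score` are hypothetical, as are the one-residue scan step and the scoring values (match +1, mismatch -1, gap -1), since the text does not state a step size or a scoring scheme. A Python list stands in for the entry's "pointer to the next sequence", and the example sequences are made up.

```python
from collections import defaultdict


def build_key_entry_map(sequences, k=5):
    """Map each k-residue fragment (key) to its entries: (sequence number, start position)."""
    index = defaultdict(list)
    for seq_no, seq in enumerate(sequences):
        # Left-to-right scan; advancing one residue at a time is an assumption.
        for start in range(len(seq) - k + 1):
            # Appending to a list replaces the linked "pointer to the next sequence".
            index[seq[start:start + k]].append((seq_no, start))
    return dict(index)


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b (scoring values are illustrative assumptions)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


index = build_key_entry_map(["MKVLAAGIV", "AAGIVKVLA"])
print(index["AAGIV"])                            # [(0, 4), (1, 0)]
print(needleman_wunsch_score("AAGIV", "AAGLV"))  # 3
```

The fragment `AAGIV` occurs in both example sequences, so its key carries two entries. A fragment that differs by one residue (`AAGLV`) still scores close to the maximum of 5, which is the kind of near-duplicate a compression step could detect.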


Abstract

The invention discloses a batch protein homology search method based on compression and clustering, and belongs to the intersection of computer application technology and biotechnology. First, the method compresses the query sequences and the protein database through redundancy analysis and redundancy removal, making full use of the sequence similarity information present in the database sequences and the query sequences. Second, it clusters similar subsequences in the compressed protein database. Third, it searches the clustered database using a mapping principle to discover potential results, and builds an executable database from the set of potential results found. Finally, it performs a homology search in the executable database to obtain the final homologous sequences. Because the homology search runs in the executable database, the time spent on repeated sequence alignment and gapless extension is greatly shortened.

Description

Technical field

[0001] The invention belongs to the intersection of computer application technology and biotechnology, and relates to a batch protein homology search method based on compression and clustering.

Background technique

[0002] Batch protein homology search is a very common task for molecular biologists. Because the number of protein sequences is growing exponentially, homology search faces a computational bottleneck. For example, identifying proteins across species requires searching the NR database for sequences with high homology to unknown sequences. In addition, some public databases (PDB, NR, SWISS-PROT) are updated frequently, which makes protein homology searches increasingly expensive to compute. At the same time, with the rapid development of bioinformatics technology, users query protein databases for homology more and more often. Therefore, for large-scale protein databases, it is necessary to deve...


Application Information

IPC(8): G06F19/24
CPC: G16B40/00
Inventors: 葛宏伟, 余景洪
Owner: DALIAN UNIV OF TECH