Sequence-order dependent frequency matrix-based biological sequence evolution information extraction method and application thereof

A biological sequence and frequency matrix technology, applied in the field of bioinformatics, that addresses the inability of existing methods to extract information dependent on adjacent positions.

Inactive Publication Date: 2017-03-22
HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL

AI Technical Summary

Problems solved by technology

To overcome the defect that the prior-art PSSM cannot extract information that depends on adjacent positions, the present invention improves the prior-art method of counting the frequencies of nucleotides or amino acids occurring at different positions in multiple sequence alignments, so that the resulting probability distribution information contains sequence site dependency information.

Examples

Embodiment 1

[0034] Taking a protein sequence as an example: for any query protein sequence, first use a sequence alignment tool such as PSI-BLAST to search a large-scale protein database such as NRdb90 and obtain the multiple sequence alignment (MSA) of the query protein. Then count the frequency with which each amino acid substring of length 3 appears at each position in the MSA. As shown in Figure 1, the histogram in each column represents the probability distribution of the amino acid substrings occurring at that position, and the label of each row is the type of amino acid substring. For a protein of length L, the corresponding SDFM has size 20^k × (L-2); with k = 3 this is 8000 × (L-2). Figure 1 shows the process of generating the sequence-dependent frequency matrix of a protein sequence for k = 3.
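The counting step in [0034] can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `sdfm`, the toy alignment, and the choice to skip any window containing a gap or non-standard residue are all assumptions, and the alignment is taken as a plain list of equal-length strings rather than real PSI-BLAST output.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def sdfm(msa, k=3, alphabet=AMINO_ACIDS):
    """Return (labels, matrix); matrix[r][j] is the frequency of the
    length-k substring labels[r] starting at alignment column j."""
    if not msa:
        raise ValueError("empty alignment")
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("aligned rows must have equal length")
    # One row per possible substring: 20^k rows for proteins.
    labels = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {s: r for r, s in enumerate(labels)}
    n_cols = length - k + 1  # L - 2 columns when k = 3
    matrix = [[0.0] * n_cols for _ in labels]
    for j in range(n_cols):
        counts = Counter(row[j:j + k] for row in msa)
        # Assumption: windows with gaps or unknown residues are ignored.
        valid = {s: c for s, c in counts.items() if s in index}
        total = sum(valid.values())
        for s, c in valid.items():
            matrix[index[s]][j] = c / total
    return labels, matrix


# Hypothetical toy alignment of four sequences, each of length 4.
toy_msa = ["ACDE", "ACDF", "GCDE", "ACDE"]
labels, matrix = sdfm(toy_msa, k=3)
print(len(labels), len(matrix[0]))      # 8000 2
print(matrix[labels.index("ACD")][0])   # 0.75
```

Each column of the result is a probability distribution over the substrings observed at that position, which corresponds to one histogram column in Figure 1.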

[0035] When only counting the occurrence probability of biological substrings with a length of 1 in multiple sequence alignments, that is, k=1 in formula (1), the...

Embodiment 2

[0037] On the basis of Embodiment 1, multiple SDFMs generated from biological sequence substrings of different lengths can be combined to capture more biological sequence evolution information. Taking a protein SDFM as an example, the SDFMs with k = 1, 2, 3 can be combined into one matrix. Figure 2 shows a schematic diagram of combining the sequence-dependent frequency matrices of a specific protein sequence. First, SDFMs of biological substrings of different lengths are generated; they are then aligned and spliced according to the corresponding amino acid positions to form a matrix of larger dimension.
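A rough sketch of the splicing step in [0037] follows. The names `kmer_frequencies` and `combined_sdfm` are hypothetical, and so is the alignment rule: the text only says the matrices are aligned by amino acid position, so this sketch assumes each substring is anchored at its start position and that the last k-1 columns of a length-k block are zero-padded so all blocks share the same width.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(msa, k):
    """Per-column frequencies of length-k substrings (20^k rows)."""
    labels = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {s: r for r, s in enumerate(labels)}
    n_cols = len(msa[0]) - k + 1
    matrix = [[0.0] * n_cols for _ in labels]
    for j in range(n_cols):
        # Windows containing gaps or unknown residues are skipped.
        counts = Counter(row[j:j + k] for row in msa if row[j:j + k] in index)
        total = sum(counts.values())
        for s, c in counts.items():
            matrix[index[s]][j] = c / total
    return labels, matrix


def combined_sdfm(msa, ks=(1, 2, 3)):
    """Stack the SDFMs for several k values into one taller matrix."""
    width = len(msa[0])
    labels, rows = [], []
    for k in ks:
        k_labels, k_matrix = kmer_frequencies(msa, k)
        labels.extend(k_labels)
        # Assumption: pad the trailing k-1 columns with zeros so that
        # column j of every block refers to the same start position.
        rows.extend(r + [0.0] * (width - len(r)) for r in k_matrix)
    return labels, rows


toy_msa = ["ACDE", "ACDF", "GCDE", "ACDE"]  # hypothetical alignment
labels, matrix = combined_sdfm(toy_msa)
print(len(labels), len(matrix[0]))  # 8420 4
```

With k = 1, 2, 3 the combined matrix has 20 + 400 + 8000 = 8420 rows. Other padding or trimming conventions would be equally consistent with the text.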

[0038] The technical scheme of the present invention takes into account the interdependence between biological sequence sites, adds biological sequence site dependency information on top of the original position-specific scoring matrix (PSSM), and uses the sequence-dependent frequency matrix SDFM to perform biological sequence evolution informa...

Abstract

The invention provides a method for extracting biological sequence evolution information based on a sequence-order dependent frequency matrix (SDFM). The SDFM is obtained by the following steps: for any biological sequence, first search a corresponding biological sequence database and generate the corresponding multiple sequence alignment (MSA); then count the occurrence frequency of the biological sequence substrings at each site in the MSA to obtain the SDFM shown in formula (1). The technical scheme takes the dependency between adjacent sites in a biological sequence into account, so that the resulting probability distribution information contains sequence site dependency information, and more, and more accurate, evolutionary characteristics relating to function, structure and the like can be extracted from the multiple sequence alignment.

Description

Technical field

[0001] The invention belongs to the technical field of bioinformatics, and in particular relates to a method for extracting biological sequence evolution information based on a sequence-dependent frequency matrix and its application.

Background

[0002] How to extract more evolutionary information from biological sequences is one of the most important basic tasks in the field of bioinformatics. The position-specific scoring matrix (PSSM), derived from a multiple sequence alignment of the biological sequence, is a commonly used technique for extracting biological sequence evolution information. Multiple sequence alignment aligns two or more biological sequences by inserting gaps. The PSSM then counts the frequencies of the nucleotides (DNA, RNA) or amino acids (protein) that appear at different positions in the multiple sequence alignment. According to the statistical probability ...
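A small sketch may help show why per-position counts alone lose adjacent-site information. The function `column_frequencies` and the two toy nucleotide alignments are illustrative assumptions; the code computes only raw frequencies, not the scoring step of a full PSSM.

```python
from collections import Counter


def column_frequencies(msa, k=1):
    """Frequency of each length-k substring starting at each column."""
    n_cols = len(msa[0]) - k + 1
    result = []
    for j in range(n_cols):
        counts = Counter(row[j:j + k] for row in msa)
        total = sum(counts.values())
        result.append({s: c / total for s, c in counts.items()})
    return result


# Two hypothetical alignments with different pairings of the same residues.
msa_a = ["AC", "GT"]
msa_b = ["AT", "GC"]

# Single-position counts (k = 1) cannot tell the alignments apart...
print(column_frequencies(msa_a) == column_frequencies(msa_b))            # True
# ...while counts over adjacent pairs (k = 2) can.
print(column_frequencies(msa_a, k=2) == column_frequencies(msa_b, k=2))  # False
```

Both alignments have A and G equally likely in the first column and C and T equally likely in the second, so their single-position profiles are identical even though the residues co-occur differently.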

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F19/22
CPC: G16B30/00
Inventor: 刘滨, 陈俊杰, 郭明月
Owner: HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL