Method and system for analyzing biological sequence with known sequence

A biological sequence analysis technology, applied in the field of bioinformatics, that addresses problems such as increased computational complexity, the inability to predict target sequences, and low recognition sensitivity.

Pending Publication Date: 2020-05-08
SHENZHEN HUADA GENE INST

AI Technical Summary

Problems solved by technology

[0009] (1) Methods based on exact string matching cannot guarantee finding all possible tandem repeat sequences in a sequence. In addition, their computational complexity grows exponentially with the length of the tandem repeat copies in the sequence.
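The first limitation in [0009] can be illustrated with a minimal sketch. The function name `exact_tandem_repeats` and the test sequences are hypothetical and are not taken from the patent or from any existing tool. The sketch only shows how a single substitution breaks an exact-match scan; it makes no claim about the complexity of real exact-matching tools.

```python
def exact_tandem_repeats(seq, unit_len, min_copies=2):
    """Return (start, copies) for runs of an exactly repeated unit of length unit_len."""
    hits = []
    i = 0
    n = len(seq)
    while i + unit_len * min_copies <= n:
        unit = seq[i:i + unit_len]
        copies = 1
        # Count how many times the unit repeats back to back without any mismatch.
        while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
            copies += 1
        if copies >= min_copies:
            hits.append((i, copies))
            i += copies * unit_len
        else:
            i += 1
    return hits


perfect = "ACGACGACGACG"
mutated = "ACGACGATGACG"  # one substitution in the third copy (hypothetical example)

print(exact_tandem_repeats(perfect, 3))  # the full four-copy repeat is reported
print(exact_tandem_repeats(mutated, 3))  # only the first two copies are reported
```

With the mutated input, the scan stops at the substituted copy, so the repeat is reported as two copies rather than as one four-copy region containing a variant unit.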
[0010] (2) Methods based on digital signal processing use a binary representation for nucleic acid sequences, so four discrete Fourier transforms are required per sequence to obtain its spectrum, which is computationally expensive. In addition, a windowed Fourier transform must be calculated for each tandem repeat copy frequency to locate all tandem repeat sequences in the nucleic acid sequence, and the recognition sensitivity is low.
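The binary-representation approach described in [0010] can be sketched as follows. This is a generic illustration of the indicator-sequence technique and not the implementation of any specific tool. The function names are hypothetical, and a naive O(n^2) transform is used so the example has no dependencies.

```python
import cmath


def indicator_spectrum(seq):
    """Sum the DFT power spectra of the four binary indicator sequences (A, C, G, T)."""
    n = len(seq)
    power = [0.0] * n
    for base in "ACGT":  # one transform per base, hence four transforms per sequence
        x = [1.0 if c == base else 0.0 for c in seq]
        for k in range(n):
            s = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            power[k] += abs(s) ** 2
    return power


def dominant_period(seq):
    """Return n / k for the strongest non-zero frequency index k, an estimate of the repeat period."""
    n = len(seq)
    power = indicator_spectrum(seq)
    # Skip k = 0, which only reflects base composition.
    k_best = max(range(1, n // 2 + 1), key=lambda k: power[k])
    return n / k_best


print(dominant_period("ACG" * 8))  # a perfect period-3 repeat peaks at k = n / 3
```

The spectrum gives the period but not the position of the repeat. That is why a further windowed transform is needed to locate repeats, as the paragraph above notes.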
[0011] (3) Current methods handle only nucleic acid sequences and cannot make predictions for protein sequences, even though studies have found that repeated functional domains in protein sequences also play an important role.
[0012] (4) Current methods implement only ab initio prediction and cannot make predictions for some target sequences.

Examples

Embodiment 1

[0107] The inventor selected four plants as test objects and predicted the periodic repeat sequences of each. To evaluate the performance of the present invention, the inventor compared it with the TRF software; the results are shown in Table 1.

[0108] Table 1: Comparison of TRF and SearchPRE prediction results for repetitive sequences

[0109]

[0110] Note: Both programs were run on a single CPU, with default values for all other parameters.

[0111] The tests show that, for the four plants, the present invention predicts more periodic repeat sequences, in both number and length, and that its results include most of the TRF predictions, indicating higher accuracy. The remaining differences arise mainly because the present invention allows differences within the repeating units and, as a special case, other sequences between adjacent repeating units. The running time of the present invention is also lower than that of the conventional method...

Embodiment 2

[0113] In this example, the region Chr1:500000-600000 of species X is taken as the research object to determine whether it contains a periodic repeat sequence. The specific steps are as follows:

[0114] 1) Since the analyzed region is local, it can be selected by setting the -Region parameter: -Region Chr1,500000,600000;

[0115] 2) Run the present invention on the nucleic acid sequence of the region, with the following command and parameters:

[0116] SearchPRE -InFile Ref.fasta -Region Chr1,500000,600000 -Window 1000 -K 10 -Hkmer 0.3 -HkR 20 -HkRGap 200 -HkRExt 1000 -PRElen 0.8-1.2 -PREgap 3 -PRscore 0.5 -PRCratio 0.8 -Outdir All
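As a rough illustration of what the -Region and -Window values imply, the sketch below parses a region string and splits it into fixed-size windows. The helper names `parse_region` and `windows` are hypothetical. The 1-based inclusive coordinates are an assumption, and none of this is taken from the SearchPRE implementation.

```python
def parse_region(text):
    """Parse a region string such as 'Chr1,500000,600000' into (name, start, end)."""
    name, start, end = [part.strip() for part in text.split(",")]
    return name, int(start), int(end)


def windows(start, end, size):
    """Yield consecutive (win_start, win_end) pairs covering [start, end].

    Coordinates are assumed to be 1-based and inclusive; the last window may be shorter.
    """
    pos = start
    while pos <= end:
        yield pos, min(pos + size - 1, end)
        pos += size


name, start, end = parse_region("Chr1,500000,600000")
wins = list(windows(start, end, 1000))
print(name, len(wins), wins[0], wins[-1])
```

Under these assumptions, each window would be analyzed separately, consistent with the abstract's statement that the biological sequence is obtained by windowing a large-fragment sequence.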

[0117] As shown in Figure 6 and Table 2, only one periodic repeat sequence region was detected in this region, and it is located in a coding region. To verify the authenticity of this periodic repeat sequence, the protein sequences of all genes in the region were predicted.

[0118] 3) Accordi...

Abstract

The invention provides a method for analyzing a biological sequence with a known sequence. The method comprises the following steps: (a) determining all Kmer sequences of a biological sequence on the basis of the biological sequence, wherein the biological sequence is obtained by windowing a large-fragment amino acid sequence or nucleotide sequence; (b) determining the frequency of each of the Kmer sequences, and determining at least one high-frequency Kmer and at least one low-frequency Kmer based on the frequency; and (c) determining a preliminary repetition candidate region based on the at least one high-frequency Kmer, and determining whether to integrate the low-frequency Kmer into the preliminary repetition candidate region based on the distance between the low-frequency Kmer and the adjacent high-frequency Kmer.
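Steps (a) to (c) can be sketched in a few lines. This is a simplified illustration under stated assumptions and not the patented implementation. The function names, the absolute count threshold `min_count`, and the `max_gap` rule are all hypothetical. In particular, the distance test in step (c) is approximated here by the gap between adjacent high-frequency Kmers.

```python
from collections import Counter


def kmer_positions(seq, k):
    """Step (a): list every (position, Kmer) pair in the windowed sequence."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def classify_kmers(seq, k, min_count):
    """Step (b): count Kmers and split them into high- and low-frequency sets.

    min_count is an assumed absolute threshold, chosen only for illustration.
    """
    counts = Counter(kmer for _, kmer in kmer_positions(seq, k))
    high = {kmer for kmer, c in counts.items() if c >= min_count}
    low = set(counts) - high
    return counts, high, low


def candidate_regions(seq, k, min_count, max_gap):
    """Step (c), simplified: chain high-frequency Kmers into half-open regions.

    Low-frequency Kmers lying between two high-frequency Kmers are absorbed
    when the gap between those high-frequency Kmers is at most max_gap.
    """
    _, high, _ = classify_kmers(seq, k, min_count)
    regions = []
    for i, kmer in kmer_positions(seq, k):
        if kmer not in high:
            continue
        if regions and i - regions[-1][1] <= max_gap:
            regions[-1][1] = i + k  # extend the current region over the gap
        else:
            regions.append([i, i + k])
    return [tuple(r) for r in regions]


seq = "ACGACGATGACGACG"  # hypothetical repeat with one substituted unit
print(candidate_regions(seq, 3, min_count=2, max_gap=0))  # split at the variant unit
print(candidate_regions(seq, 3, min_count=2, max_gap=3))  # variant unit absorbed
```

Tolerating a small gap lets a repeat with a divergent unit be reported as one region, which is the behavior an exact-match approach lacks.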

Description

Technical field

[0001] The present invention relates to the field of bioinformatics, and in particular to a method and system for analyzing biological sequences with known sequences.

Background art

[0002] Studies have shown that in the genomes of higher organisms, non-coding regions account for the vast majority of the genome sequence; for example, the human genome is ~3 Gb, of which non-coding regions account for up to ~97%. Most non-coding sequences exist in the form of highly repetitive sequences, such as satellites, minisatellites, microsatellites, long interspersed elements, short interspersed elements, etc. The types of the various repetitive sequences are closely related to their distribution on chromosomes.

[0003] Previously, repeat sequences were thought to be nothing more than redundant or "useless" DNA. However, a large number of experiments and studies have shown that repetitive sequences are not garbage, but affect th...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B30/10
Inventor: 方东明, 陈晓丽, 吴晨雨, 陈利鹏, 刘欢, 刘心
Owner: SHENZHEN HUADA GENE INST