Method for identifying biological sequence and deducing species genetic relationship through digitals

A technique for identifying biological sequences with numerical codes, belonging to the application of informatics in biology. It addresses the inability of earlier methods to identify organisms from a small amount of data.

Active Publication Date: 2014-02-05
高扬 +1
Cites: 4; cited by: 8

AI Technical Summary

Problems solved by technology

The prior method is sensitive to high dimensionality and small sample sizes, so it is unsuitable for small genomes or short sequences such as parvoviruses [3], and it cannot identify organisms (sequences) from a small amount of data.

Examples

Embodiment 1

[0048] As an example, an identification code is compiled for the HIV-1 virus genome with ID number gi|2745742|. The genome sequence consists of 9290 bases. Because of the large amount of data, only part of the genome (980 bases) is shown in Figure 1 for illustration.

[0049] Following the identification method described in the present invention, an identification code with 20 rows and 17 columns in total is compiled for HIV-1, as shown in Figure 2. The form of the identification code is given by formula 4.

[0050] F A ( k 0 ) A / D k 0 ...

Embodiment 2

[0051] Example 2: Reconstruction of the well-known mammalian evolutionary tree

[0052] Statistical tools are used to screen for species-specific partial information associations: (i) take 36 mammals as sample species and randomly select 100 sequences of length 1 kb from the genome of each sample species as sample sequences; (ii) calculate the information association and partial information association between each sample sequence and the mitochondrial genome sequence, with k ranging from 0 to 248; (iii) choose 50 different starting points k_0, form vectors X with maximum dimension d = 8, and perform analysis of variance and multiple comparisons.
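Step (i) above can be illustrated with a minimal Python sketch. It is an illustration only, not the inventors' implementation: the function name `sample_fragments`, uniform start positions, and sampling with replacement (fragments may overlap) are assumptions, since the text does not say how the random selection is done. The toy genome is randomly generated and is not real data.

```python
import random

def sample_fragments(genome, n_samples=100, length=1000, seed=None):
    """Draw n_samples random substrings of a fixed length from a genome string.

    Assumption (not stated in the source): start positions are uniform and
    drawn with replacement, so fragments may overlap or repeat.
    """
    if len(genome) < length:
        raise ValueError("genome is shorter than the requested fragment length")
    rng = random.Random(seed)
    max_start = len(genome) - length  # last valid start index
    starts = [rng.randint(0, max_start) for _ in range(n_samples)]
    return [genome[s:s + length] for s in starts]

# Toy stand-in for a genome: 16,000 random bases (hypothetical data).
toy_rng = random.Random(0)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(16000))

# 100 fragments of 1 kb each, as in step (i).
fragments = sample_fragments(toy_genome, n_samples=100, length=1000, seed=1)
```

Passing a fixed `seed` makes the draw reproducible, which is convenient when the same sample sequences have to be reused across several values of k_0 and d.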

[0053] Table 1 shows the results of the above statistics. The table lists the average failure scores of vector X for d = 2, 4, 6 and 8, where X represents information association or partial information association. The average failure score is the mean, over 50 random starting points k_0, of the failure score W(k_0, d) of the corresponding d-dimensional vector X. Normalized t...

Embodiment 3

[0059] Example 3: Construction of a parvovirus (Parvoviruses) phylogenetic tree

[0060] Statistical tools are used to test the species specificity of partial information association: (i) take all 32 viral genomes as sample genomes and randomly select 50 sequences of length 1 kb from each sample genome as sample sequences; (ii) calculate the information association and partial information association between each sample sequence and the whole genome, with k ranging from 0 to 198; (iii) choose 50 different starting points k_0, form vectors X with maximum dimension d = 10, and perform analysis of variance and multiple comparisons.
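The analysis of variance mentioned in step (iii) can be sketched as a plain one-way ANOVA F statistic. This is a generic textbook computation offered as an assumption about the kind of test involved; the source does not give the exact test design. The function name `one_way_anova_f` and the toy numbers are hypothetical and are not taken from the patent.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups.

    Each group would hold, for one species, the values of a single
    component of vector X over that species' sample sequences
    (an assumed layout, not specified in the source).
    """
    k = len(groups)                      # number of groups (species)
    n = sum(len(g) for g in groups)      # total number of observations
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand = mean(x for g in groups for x in g)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy values for three hypothetical species (not real measurements).
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_anova_f(toy_groups)
```

A large F value indicates that the component differs more between species than within a species, which is the sense in which a component can be called species-specific. Multiple comparisons would then examine which pairs of species differ.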

[0061] The results of the multiple comparisons are shown in Table 2, which lists the average failure scores of vector X for d = 4, 6, 8 and 10, where X represents information association or partial information association. The average failure score is the mean, over 50 random starting points k_0, of the failure score W of the corresponding d-dimensional identification code ...

Abstract

The invention provides a biological sequence identification code based on the base-linkage characteristics of an organism's nucleotide sequence. It further provides a complete method for identifying biological sequences and analyzing species phylogenetic relationships by means of identification codes, together with a system for evaluating the results. The phylogenetic analysis serves as a strict test of the validity of the identification codes. The inferred mammalian cladogram and parvovirus phylogenetic relationships agree with the classification knowledge of biologists, which indicates that the method is effective and that the identification codes have high resolution. The identification code combines strong identification capability with a small data volume: a huge genome sequence can be identified by a small set of numbers, which simplifies the identification of biological sequences, and the code can also be used for analysis and comparison, so it has considerable practical value.

Description

Technical field

[0001] The invention uses bioinformatics methods to mine and integrate the information association features of sequences, and then uses numbers to identify biological sequences and species and to analyze phylogenetic relationships. It belongs to the application of informatics in the field of biology.

Background

[0002] Biological sequences include amino acid sequences and nucleotide sequences, and nucleotide sequences are further divided into deoxyribonucleic acid (DNA) sequences and ribonucleic acid (RNA) sequences. A DNA sequence is composed of four nucleotide monomers, adenylic acid (A), cytidylic acid (C), guanylic acid (G) and thymidylic acid (T), and is usually written as a sequence of four letters. Similarly, an RNA sequence can be represented by a four-letter sequence consisting of A, C, G and U, where U (uridylic acid) takes the place of T. Whole-genome sequences of sequenced species range in length from a few thousand to millions, or even billions, of letters. [000...

Application Information

IPC(8): G06F19/24
Inventors: 高扬, 罗辽复
Owner: 高扬