
A method and system for calculating and characterizing non-occurring k-mer subsequences in multiple species

A feature-analysis and subsequence technique, applied in sequence analysis, bioinformatics, instruments, etc., that addresses problems such as high algorithmic complexity and the neglect of sequences that never appear, and that improves computational efficiency.

Active Publication Date: 2019-05-17
SICHUAN UNIV
8 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0005] (1) Most studies start from the sequence arrangements that already occur and do not consider sequences that have never appeared.
[0006] (2) There are very few analyses of the composition and arrangement properties of sequences that have never appeared.
[0007] (3) When k-mers are used to discover new non-coding RNA or to study RNA structure prediction, the algorithmic complexity becomes very high if the selected RNA length is too long.

Examples

Embodiment 1

[0059] In a specific embodiment, the technical solution of the present invention is described in detail through the following example. The technical solution comprises the following steps:

[0060] 1. Data preprocessing

[0061] First, obtain the original whole-genome data of the species. Because whole-genome data usually carries extensive markers and annotations, the sequence data must be extracted from it. In a specific embodiment, the whole-genome data of the species is obtained in .gbff format from the NCBI database, and the sequence data is extracted with a program written in Perl. The extraction can proceed as follows: in the .gbff format, the sequence data starts with "ORIGIN" and ends with the "//" symbol; a Perl statement extracts the part in between, and the numeric values that number the sequence positions must also be removed. Secondly, since DNA is a double helix structure and the sequence data is unidirectional, the ...
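The extraction step described above can be sketched as follows. This is a minimal illustration in Python rather than the Perl program the embodiment refers to; the function names `extract_sequences` and `reverse_complement` and the sample record are assumptions made for the example, not part of the patent text.

```python
import re

def extract_sequences(gbff_text):
    """Return one uppercase sequence string per record in .gbff text.

    Sequence data is taken to start after a line beginning with "ORIGIN"
    and to end at a line containing only "//".
    """
    sequences = []
    chunks = []
    in_origin = False
    for line in gbff_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ORIGIN"):
            in_origin = True
            chunks = []
        elif stripped == "//":
            if in_origin:
                sequences.append("".join(chunks).upper())
            in_origin = False
        elif in_origin:
            # Drop the leading position numbers and spaces; keep letters only.
            chunks.append(re.sub(r"[^A-Za-z]", "", line))
    return sequences

# Translation table for complementing bases (hypothetical helper).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse-complement strand of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

# Tiny made-up record, for illustration only.
sample = """LOCUS       DEMO
ORIGIN
        1 acgtacgtaa ccggtt
       17 aa
//
"""
forward = extract_sequences(sample)
preprocessed = forward + [reverse_complement(s) for s in forward]
```

Here `preprocessed` holds the forward sequence together with its reverse complement, mirroring the idea that both strands make up the preprocessing data.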

Embodiment 2

[0077] In another specific embodiment, the present invention also provides a system for the calculation and feature analysis of multi-species non-appearing k-mer subsequences, the system comprising:

[0078] A data preprocessing module, used to obtain the original whole-genome data of a species, extract the sequence data, and obtain the reverse-complement strand data; the sequence data and the reverse-complement strand data together constitute the preprocessed data;

[0079] A single-species calculation module, used to calculate, from the preprocessed data, the non-appearing k-mer subsequences of one species at different k value lengths;

[0080] A multi-species calculation module, used to calculate, from the whole-genome data of multiple species, the k-mer subsequences that are commonly absent across those species at the same k value length;

[0081] A feature analysis module, configured to perform feature analysis based on the data results of k-m...
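What the two calculation modules compute can be illustrated with a simple set-based sketch. It enumerates all 4^k candidate k-mers, so it is only practical for small k, and it is not the patent's efficiency-oriented algorithm. The function names and the toy input are assumptions made for illustration.

```python
from itertools import product

ALPHABET = frozenset("ACGT")

def observed_kmers(sequences, k):
    """Collect every k-mer over A/C/G/T that occurs in the given sequences."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous symbols such as N.
            if ALPHABET.issuperset(kmer):
                seen.add(kmer)
    return seen

def absent_kmers(sequences, k):
    """Return the k-mers that never occur in one species' sequences."""
    all_kmers = {"".join(p) for p in product("ACGT", repeat=k)}
    return all_kmers - observed_kmers(sequences, k)

def common_absent_kmers(species_to_sequences, k):
    """Return the k-mers absent from every species at the same k."""
    absent_sets = [absent_kmers(seqs, k) for seqs in species_to_sequences.values()]
    return set.intersection(*absent_sets) if absent_sets else set()

# Toy input: each species maps to its preprocessed strands (made-up data).
toy = {
    "species_a": ["ACGT"],
    "species_b": ["AAAA", "TTTT"],
}
common = common_absent_kmers(toy, 2)
```

Intersecting the per-species absent sets yields the k-mers missing from all species at once; in practice each species' list would contain both the forward sequence and its reverse complement.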

Abstract

The present invention provides a method and system for calculating and characterizing non-occurring k-mer subsequences of multiple species. The method includes: acquiring the original whole-genome data of a species and preprocessing it to obtain preprocessed whole-genome sequence data; calculating, from the preprocessed whole-genome sequence data of the species, the non-appearing k-mer subsequences at different k value lengths; calculating, for the whole-genome data of multiple different species, the common non-appearing k-mer subsequences at the same k value length; and, based on the resulting non-appearing k-mer subsequences, performing length-variable analysis, proportion statistics and significance-of-difference analysis of GC content and AG content, and motif discovery. The method can effectively process multi-species whole-genome data, accurately calculate the non-appearing k-mer subsequences, analyze them effectively, and greatly improve computational efficiency.
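The GC-content and AG-content statistics mentioned in the abstract can be sketched as a simple per-k-mer proportion averaged over a set of absent k-mers. This is only an assumed reading of "proportional statistical analysis"; the function names `base_fraction` and `content_summary` are hypothetical, and the significance testing and motif discovery steps are not shown.

```python
def base_fraction(kmer, bases):
    """Fraction of positions in kmer whose base is one of `bases`."""
    return sum(1 for b in kmer if b in bases) / len(kmer)

def content_summary(kmers):
    """Mean GC and AG content over a collection of k-mers."""
    kmers = list(kmers)
    if not kmers:
        return {"gc": None, "ag": None}
    gc = sum(base_fraction(k, "GC") for k in kmers) / len(kmers)
    ag = sum(base_fraction(k, "AG") for k in kmers) / len(kmers)
    return {"gc": gc, "ag": ag}

# Made-up absent k-mers, for illustration only.
summary = content_summary(["GGCC", "AATT"])
```

Summaries like this, computed per species and per k, would be the inputs to any later comparison between groups.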

Description

Technical field

[0001] The present invention relates to a method for the calculation and feature analysis of non-appearing k-mer subsequences (namely LAUPs) at different k value lengths in multi-species whole-genome data.

Background technique

[0002] k-mer frequency statistics of biological sequences are a basic and important problem in biological information processing. They can be used to reveal the distribution of various subsequences in biological sequences and are an important tool for measuring sequence similarity. They therefore have important applications in many biological problems, such as species identification, metagenomic classification, sequence assembly, multiple sequence alignment, RNA secondary structure prediction, and CpG island research.

[0003] From domestic and foreign studies, at the level of k-mer frequency calculation, Wang Shulin et al. studied the frequency counting of k-long DNA subsequences in the whole DNA sequence, des...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G16B30/10
CPC: G16B30/00
Inventor: 章乐, 肖铭
Owner: SICHUAN UNIV