A method and system for calculating and characterizing non-occurring k-mer subsequences in multiple species
A feature-analysis and subsequence technology applied in sequence analysis, bioinformatics, instruments, and related fields. It addresses problems such as high algorithmic complexity and the lack of sequence consideration, and improves computational efficiency.
Examples
Embodiment 1
[0059] In a specific embodiment, the technical solution of the present invention is described in detail through the following example. The technical solution of the present invention comprises the following steps:
[0060] 1. Data preprocessing
[0061] First, the original whole-genome data of the species are obtained. Because whole-genome data files usually carry many markers and annotations, the sequence data must be extracted from them. In a specific embodiment, the whole-genome data of the species are obtained in .gbff format from the NCBI database, and the sequence data are extracted with a program written in Perl. The extraction can proceed as follows: in the .gbff format, the sequence data start at "ORIGIN" and end at the "//" symbol; a Perl statement is used to extract this portion, and the numeric values that index the sequence positions must also be removed. Second, since DNA has a double-helix structure and the sequence data are unidirectional, the ...
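The preprocessing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source uses Perl, whereas Python is used here, and the function names `extract_sequences` and `reverse_complement` are hypothetical. It assumes each record's bases lie between a line starting with "ORIGIN" and a line starting with "//", and that each sequence line begins with a numeric position counter.

```python
import re

def extract_sequences(gbff_text):
    """Return one uppercase sequence string per ORIGIN ... // block."""
    records, chunks, in_origin = [], [], False
    for line in gbff_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin, chunks = True, []
        elif line.startswith("//"):
            if in_origin:
                records.append("".join(chunks))
            in_origin = False
        elif in_origin:
            # Drop the leading position counter and the spaces between blocks.
            chunks.append(re.sub(r"[^A-Za-z]", "", line).upper())
    return records

# Assumption: only A, C, G, T and N are translated; other symbols pass through.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]

sample = "LOCUS X\nORIGIN\n        1 aacgt tgca\n       11 gg\n//\n"
forward = extract_sequences(sample)
reverse = [reverse_complement(s) for s in forward]
```

Keeping one string per record, rather than concatenating all records, avoids creating artificial k-mers that span the boundary between two chromosomes or contigs.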
Embodiment 2
[0077] In yet another specific embodiment, the present invention also provides a system for the calculation and feature analysis of non-occurring k-mer subsequences in multiple species, said system comprising:
[0078] A data preprocessing module, configured to obtain the original whole-genome data of a species, extract the sequence data, and obtain the reverse-complement strand data, the sequence data and the reverse-complement strand data together constituting the preprocessed data;
[0079] A same-species calculation module, configured to calculate, based on the preprocessed data, the non-occurring k-mer subsequences of a single species for different values of k;
[0080] A multi-species calculation module, configured to process the whole-genome data of multiple species and calculate the k-mer subsequences that are commonly non-occurring across the multiple species for the same value of k;
[0081] A feature analysis module, configured to perform feature analysis based on the data results of k-m...
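To make the roles of the same-species and multi-species calculation modules concrete, the sketch below computes non-occurring k-mers for one species and then intersects the results across species. It is a naive set-based illustration under stated assumptions, not the algorithm claimed by the invention: all function names are hypothetical, windows containing symbols other than A, C, G, T are skipped, and enumerating all 4^k candidates is practical only for small k.

```python
from itertools import product

BASES = frozenset("ACGT")

def observed_kmers(sequences, k):
    """Collect every k-mer over A/C/G/T that appears in any input strand."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if BASES.issuperset(kmer):  # skip windows containing N, etc.
                seen.add(kmer)
    return seen

def absent_kmers(sequences, k):
    """Non-occurring k-mers of one species: all 4**k candidates minus observed."""
    all_kmers = {"".join(p) for p in product("ACGT", repeat=k)}
    return all_kmers - observed_kmers(sequences, k)

def common_absent_kmers(species_to_sequences, k):
    """k-mers that are absent from every species for the same k."""
    absent_sets = [absent_kmers(seqs, k) for seqs in species_to_sequences.values()]
    return set.intersection(*absent_sets) if absent_sets else set()

# Toy input: each species lists its forward and reverse-complement strands.
toy = {
    "species_a": ["ACGT", "ACGT"],
    "species_b": ["AAAA", "TTTT"],
}
shared = common_absent_kmers(toy, 2)
```

With the toy input, species_a contains the 2-mers AC, CG, and GT, and species_b contains AA and TT, so the shared set holds the remaining 2-mers. Calling `absent_kmers` for several values of k on the same preprocessed data corresponds to the same-species module's "different values of k".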