System for biological species homology analysis based on protein / gene sequence data

By using an analysis system based on protein/gene sequence data, and combining amino acid frequency and physicochemical property information with dynamic programming to perform homology analysis, the high cost and low efficiency problems of existing technologies have been solved, and rapid and accurate homology analysis of biological species has been achieved.

CN116153397B · Active · Publication Date: 2026-06-23 · 36TH RES INST OF CETC +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: 36TH RES INST OF CETC
Filing Date: 2021-11-19
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing methods for analyzing the homology of biological species are costly and inefficient, making it difficult to quickly and effectively identify species and perform homology analysis during the pandemic.

Method used

A protein/gene sequence data analysis system, including data preprocessing, protein sequence analysis, gene sequence analysis, and species homology determination modules, is used to perform homology analysis by combining amino acid frequency, physicochemical properties, and position probability information with dynamic programming.

Benefits of technology

It enables rapid and accurate analysis of biological species homology, reducing time, human and financial costs, and supporting medical professionals to quickly adopt targeted solutions.



Abstract

The present application relates to a biological species homology analysis system based on protein / gene sequence data. It belongs to the technical field of biological species homology analysis and addresses the high cost and low efficiency of detection methods in the prior art. The system includes: a data preprocessing module, which cleans the acquired raw protein / gene data to obtain protein / gene sequence data; a data analysis module comprising a protein sequence analysis unit, which generates a numerical representation vector of a protein sequence from the frequency, position, and physicochemical properties of each amino acid in the sequence, and a gene sequence analysis unit, which obtains the optimal matching sequence between two sets of gene sequences; and a species homology determination module, which calculates the distance between every two sets of protein / gene sequence data from the numerical representation vectors / optimal matching sequences and analyzes how closely homologous the protein / gene sequences are according to the magnitude of that distance.

Description

Technical Field

[0001] This invention relates to the field of biological species homology analysis technology, and in particular to a biological species homology analysis system based on protein / gene sequence data.

Background Technology

[0002] In recent years, various viral diseases have gradually increased, significantly impacting the economy and society. In particular, the global pandemic caused by COVID-19 has led to a worldwide economic recession and numerous casualties. In this context, rapid and effective homology analysis and species identification of biological species causing emergencies are crucial. Timely analysis of homology and species classification among biological species is of great significance for medical professionals in developing corresponding drugs and antibodies, and for relevant government departments in responding to epidemic prevention and control.

[0003] Currently, the identification and homology analysis of biological species often involve conducting biomedical experiments and extracting relevant comparative features, which requires significant time, manpower, material resources, and financial costs. During the pandemic, prolonged duration often means a wider spread of the disease, greater difficulty in prevention and control, and more severe economic and social losses. Existing testing methods are not only costly but also inefficient.

[0004] Therefore, there is a lack of existing technologies for analyzing the homology of biological species based on protein / gene sequence data.

Summary of the Invention

[0005] In view of the above analysis, the present invention aims to provide a biological species homology analysis system based on protein / gene sequence data to solve the problems of high cost and low efficiency of existing detection methods.

[0006] On one hand, embodiments of the present invention provide a biological species homology analysis system based on protein / gene sequence data, comprising:

[0007] The data preprocessing module is used to clean the acquired raw protein / gene data to obtain protein / gene sequence data;

[0008] The data analysis module includes a protein sequence analysis unit and a gene sequence analysis unit. The protein sequence analysis unit is used to generate a numerical representation vector of the protein sequence based on the frequency, position, and physicochemical properties of each amino acid in the protein sequence. The gene sequence analysis unit is used to obtain the optimal matching sequence between two sets of gene sequences.

[0009] The species homology determination module is used to calculate the distance between every two sets of protein / gene sequence data based on the numerical representation vector / optimal matching sequence of the protein sequence, and to analyze the degree of homology of the protein / gene sequences based on the magnitude of the distance.

[0010] Furthermore, the gene sequence analysis unit is used to calculate all possible similarity comparison scores of two sets of gene sequences using an iterative method, and to find the optimal matching sequence by backtracking using dynamic programming based on the highest score among all possible similarity comparison scores, and to insert a gap symbol in the gap region of the optimal matching sequence; and to calculate the number of nucleotides with the same corresponding position in the two sets of optimal matching sequences, and the number of gap regions at the corresponding positions.

[0011] Further, the protein sequence analysis unit includes:

[0012] The amino acid frequency information calculation subunit is used to generate the amino acid frequency information vectors of the X groups of protein sequences based on the frequency of occurrence of each type of amino acid in the protein sequences of the X groups of biological species.

[0013] The amino acid physicochemical property average value calculation subunit is used to generate the amino acid physicochemical property average value vectors of the X groups of protein sequences based on the physicochemical properties of each amino acid in the protein sequences of the X groups of biological species.

[0014] The amino acid position probability calculation subunit is used to generate the amino acid position probability vectors of the X groups based on the position information of the 1-word to K-word amino acids in the protein sequences of the X groups of biological species; when K≥2, the amino acid position probability vector of each group is reduced in dimensionality to obtain the dimensionality-reduced amino acid position probability vector; a k-word amino acid is a specified run of k consecutive amino acids, where 1≤k≤K.

[0015] Furthermore, the amino acid position probability calculation subunit generates the X groups of amino acid position probability vectors from the position information of the 1-word to K-word amino acids in the protein sequences as follows:

[0016] For each group of protein sequences, the following operations are performed to obtain the amino acid position probability vectors of the X groups:

[0017] The amino acids of the protein sequence are numbered consecutively from 1, and the number of the first amino acid of a k-word amino acid is used as the position information value of that k-word amino acid.

[0018] For each type of k-word amino acid in turn, calculate the sum of its position information values in the protein sequence, where i denotes the i-th type of k-word amino acid, $1 \le i \le 20^K$;

[0019] Through the ratio of the sum of position information values of each type of k-word amino acid to the ..., obtain the position probability vector $D_k$ of the k-word amino acids, where 1≤k≤K;

[0020] The position probability vectors $D_1$ to $D_K$ of the 1-word to K-word amino acids are concatenated to form the amino acid position probability vector $V'_d$; $V'_d$ is an M1-dimensional vector, where M1 = $20 + 20^2 + \cdots + 20^k + \cdots + 20^K$.

[0021] Furthermore, the amino acid position probability calculation subunit is also used to reduce the dimensionality of each group of amino acid position probability vectors to obtain a dimensionality-reduced amino acid position probability vector when K≥2, including:

[0022] The amino acid position probability vector is zero-mean normalized to obtain the measurement matrix X′;

[0023] Eigendecomposition of the covariance matrix S of the measurement matrix X′ yields M1 eigenvalues, which are arranged in descending order. The eigenvectors corresponding to the first M eigenvalues form an eigenvector matrix, and the amino acid position probability vector $V_d$ corresponding to this eigenvector matrix is obtained;

[0024] $V_d$ is the M-dimensional amino acid position probability vector obtained after dimensionality reduction.
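As an illustration only, the dimensionality reduction of paragraphs [0022] to [0024] can be sketched in Python. The function name `pca_reduce` and its parameters are hypothetical. The eigenvectors are approximated by power iteration with deflation, a choice made for this sketch so that it needs no external library; the text does not specify how the eigendecomposition is carried out. Each row of `rows` is assumed to be the position probability vector of one group.

```python
def pca_reduce(rows, M, iters=200):
    """Project each row onto the M leading eigenvectors of the covariance matrix.

    rows: list of equal-length numeric vectors (one per group), at least two rows.
    Hypothetical helper; power iteration is an assumption of this sketch.
    """
    n, d = len(rows), len(rows[0])
    # Zero-mean normalization: subtract the column means to get X'.
    means = [sum(r[c] for r in rows) / n for c in range(d)]
    X = [[r[c] - means[c] for c in range(d)] for r in rows]
    # Covariance matrix S of X' (d x d).
    S = [[sum(x[a] * x[b] for x in X) / (n - 1) for b in range(d)] for a in range(d)]
    components = []
    for _ in range(M):
        # Fixed, arbitrary start vector, normalized to unit length.
        v = [1.0 / (c + 1) for c in range(d)]
        norm = sum(t * t for t in v) ** 0.5
        v = [t / norm for t in v]
        for _ in range(iters):
            w = [sum(S[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(t * t for t in w) ** 0.5
            if norm == 0.0:
                break
            v = [t / norm for t in w]
        # Rayleigh quotient gives the eigenvalue estimate for v.
        lam = sum(v[a] * S[a][b] * v[b] for a in range(d) for b in range(d))
        components.append(v)
        # Deflation: remove the found component so the next one can emerge.
        S = [[S[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    # Reduced vectors: coordinates of each centered row along the components.
    return [[sum(x[c] * comp[c] for c in range(d)) for comp in components] for x in X]
```

For vectors with hundreds of dimensions, a linear algebra library would be a more practical choice than this pure-Python loop.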

[0025] Furthermore, the amino acid frequency information calculation subunit is also used to count the number of occurrences of each type of amino acid in the protein sequence and to obtain the amino acid frequency information vector from the ratio of the number of occurrences of each amino acid to the total number of amino acids in the protein sequence. The amino acid frequency information vector $V_f$ is expressed as:

[0026] $V_f = (f_1, f_2, \ldots, f_i, \ldots, f_{20})$

[0027] $f_i = n_i / N$

[0028]

[0029] Here, $f_i$ is the frequency information of the i-th type of amino acid, $n_i$ is the number of times that amino acid appears, and N is the total number of amino acids in the protein sequence; the i-th type of amino acid is the i-th member of the set of 1-word amino acids.

[0030] Furthermore, the amino acid physicochemical property average value calculation subunit is used to select J physicochemical property parameter values for each type of 1-word amino acid, and to standardize the physicochemical property parameter values of each amino acid according to the maximum and minimum values of those parameter values across the 1-word amino acids, obtaining the standardized physicochemical property parameters of each amino acid.

[0031] Based on the standardized physicochemical property parameters of each amino acid and its frequency of occurrence, the average value of each physicochemical property is calculated to obtain the amino acid physicochemical property average value vector $V_p$, expressed as:

[0032]

[0033]

[0034]

[0035] Here, the raw physicochemical property data are standardized before averaging; $P_{ji}$ is the value of the j-th physicochemical property parameter of the i-th type of amino acid; $P_{ab}$ is the value of the a-th physicochemical property parameter of the b-th type of amino acid; the elements of $V_p$ are the average values of each physicochemical property in the protein sequence; $f_i$ is the frequency information of the i-th type of amino acid; and 1≤j≤J.

[0036] Furthermore, the species homology determination module includes: a protein sequence species homology determination unit and a gene sequence species homology determination unit;

[0037] The protein sequence species homology determination unit is used to analyze the degree of homology between two sets of protein sequences based on the distance between them.

[0038] When the distance between a protein sequence of an unknown biological species and a protein sequence of a known biological species is less than a protein distance threshold, the unknown biological species and the known biological species are homologous.

[0039] When the distances between the protein sequence of a group of unknown biological species and the protein sequences of all known biological species are greater than the protein distance threshold, the biological species with the closest homology to the unknown biological species is determined by the shortest of those distances.

[0040] The gene sequence species homology determination unit is used to analyze the degree of gene sequence homology based on the distance between two sets of gene sequences.

[0041] When the distance between the gene sequence of a group of unknown biological species and the gene sequence of a group of known biological species is less than a gene distance threshold, the unknown biological species and the known biological species are homologous.

[0042] When the distances between the gene sequence of a group of unknown biological species and the gene sequences of all known biological species are greater than the gene distance threshold, the biological species with the closest homology to the unknown biological species is determined by the shortest of those distances.
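The decision rule of paragraphs [0037] to [0042] is the same for protein and gene sequences, so it can be sketched once in Python. The function name `judge_homology`, its parameters, and the dictionary layout of `known` are hypothetical; the distance function and threshold are passed in, because their concrete form is defined elsewhere in the description.

```python
def judge_homology(unknown, known, threshold, distance):
    """Apply the threshold rule to one unknown sample.

    unknown:   representation of the unknown species (vector or sequence)
    known:     dict mapping species name -> representation (hypothetical layout)
    threshold: protein or gene distance threshold
    distance:  callable returning the distance between two representations

    Returns (homologous, closest): the known species whose distance is below
    the threshold, and the known species at the shortest distance, which is
    the answer to use when the first list is empty.
    """
    dists = {name: distance(unknown, rep) for name, rep in known.items()}
    homologous = sorted(name for name, d in dists.items() if d < threshold)
    closest = min(dists, key=dists.get)
    return homologous, closest
```

When `homologous` is empty, every known species lies beyond the threshold and `closest` names the species with the closest homology.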

[0043] Furthermore, the protein sequence species homology determination unit is used to construct the numerical representation vectors of the different protein sequences from the amino acid frequency information vector, the amino acid physicochemical property average value vector, and the amino acid position probability vector, and, based on these numerical representation vectors, to calculate the distance d(S,T) between every two groups of protein sequences S and T. The distance d(S,T) between two groups of protein sequences is expressed as:

[0044]

[0045] $V = (V_f, V_d, V_p)$

[0046] Here, $V_S[q]$ and $V_T[q]$ are the q-th elements of the numerical representation vectors of protein sequence S and protein sequence T, respectively, with 1≤q≤Q and Q=20+M+8, where M is the dimension of the amino acid position probability vector $V_d$; V is the numerical representation vector, $V_f$ is the amino acid frequency information vector, $V_d$ is the amino acid position probability vector, and $V_p$ is the amino acid physicochemical property average value vector.
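A minimal Python sketch of how the numerical representation vector of [0045] could be assembled and compared. The formula for d(S,T) in [0044] did not survive in this text, so the Euclidean distance used below is an assumption chosen only for illustration, not the patent's stated formula. The function names are hypothetical.

```python
import math


def representation_vector(v_f, v_d, v_p):
    """Concatenate V_f, V_d and V_p into V = (V_f, V_d, V_p)."""
    return list(v_f) + list(v_d) + list(v_p)


def euclidean_distance(v_s, v_t):
    """ASSUMED distance: Euclidean distance over the Q elements of V_S and V_T."""
    if len(v_s) != len(v_t):
        raise ValueError("vectors must have the same dimension Q")
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(v_s, v_t)))
```

With 20 frequency components, M position components, and 8 property averages, the concatenated vector has Q = 20 + M + 8 elements, as stated in [0046].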

[0047] Furthermore, the gene sequence species homology determination unit is used to calculate the distance between each pair of gene sequence data based on the optimal matching sequence of the two sets of gene sequence data. The distance between the two sets of gene sequence data, d(S,T), is expressed as:

[0048]

[0049] Here, math is the number of nucleotides that are identical at corresponding positions in the two optimal matching sequences, and gap is the number of gap regions at corresponding positions in the two optimal matching sequences.

[0050] Compared with the prior art, the present invention can achieve at least one of the following beneficial effects:

[0051] 1. This invention combines the frequency information of amino acid occurrence, the average value information of amino acid physicochemical properties, and the position probability information of the 1-word to K-word amino acids to analyze protein sequences comprehensively and accurately. By comparing the distance between two protein sequences, protein homology analysis can be performed more accurately.

[0052] 2. This invention can accurately and quickly analyze gene sequence homology by measuring the distance between the optimal matching sequences in the gene sequence;

[0053] 3. This invention uses a species homology comparison analysis method based on protein / gene sequence data to quickly classify the genetic information of species, which is beneficial for relevant medical workers to take targeted measures.

[0054] 4. Analyzing the homology of biological species and identifying species from protein / gene sequence data greatly reduces the time required compared with traditional experimental methods, saving human, material and financial costs.

[0055] 5. In practice, different organizations can choose to conduct homology analysis of biological species based on the convenience of work and cost control, and choose gene or protein sequences that are economical, effective and familiar to them.

[0056] In this invention, the above-described technical solutions can be combined with each other to achieve more preferred combinations. Other features and advantages of this invention will be set forth in the following description, and some advantages may become apparent from the description or be learned by practicing the invention. The objects and other advantages of this invention can be realized and obtained from what is particularly pointed out in the description and drawings.

Attached Figure Description

[0057] The accompanying drawings are for illustrative purposes only and are not intended to limit the invention. Throughout the drawings, the same reference numerals denote the same parts.

[0058] Figure 1 is a schematic diagram of a biological species homology analysis system based on protein / gene sequence data, as shown in one embodiment of this application.

Detailed Implementation

[0059] Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form part of this application and are used together with the embodiments of the present invention to illustrate the principles of the present invention, but are not intended to limit the scope of the present invention.

[0060] As shown in Figure 1, a specific embodiment of the present invention discloses a biological species homology analysis system based on protein / gene sequence data, comprising:

[0061] Data preprocessing module 10 is used to clean the acquired raw protein / gene data to obtain protein / gene sequence data;

[0062] Specifically, with the emergence and rapid development of new technologies such as big data, artificial intelligence, and transfer learning, bioinformatics has also entered a period of rapid development. Analyzing the homology between species and identifying species categories through the combination of bioinformatics and existing new technologies is characterized by "data-driven, rapid, and accurate" approaches. Whether it's a virus or other organism, the main functions are accomplished by protein and gene sequences. Protein synthesis is controlled by genes, meaning that proteins are the dominant expression of genetic information. Therefore, homology analysis and species identification based on biological sequence data are very helpful in studying the evolutionary relationships of genetic information in different species and classifying species, and are also key to quickly identifying the source of an epidemic.

[0063] Specifically, a protein sequence is a sequence of amino acids arranged in a particular order. There are 20 types of amino acids, and the types, numbers, and order of amino acids vary between protein sequences with different functions. To become active and functional, a protein may ultimately require further processes such as rotation and folding, but all of these build on the protein sequence that has been produced. When conducting homology analysis of biological species based on protein sequence data, the raw protein sequence data of the biological species is first extracted using a protein sequence extraction method; the raw protein sequence data is then preprocessed to remove outliers. Optionally, protein sequence extraction includes, but is not limited to, biological gene transcription, chemical methods, or electromagnetic methods; preprocessing of the raw data includes, but is not limited to, specified data extraction, data cleaning, and data feature transformation. More specifically, the content and types of amino acids differ between different types of protein sequences; the 20 types of amino acids are indexed by i, where 1 ≤ i ≤ 20.

[0064] Specifically, a gene sequence is the primary structure of a real or hypothetical DNA molecule carrying genetic information, represented by a string of letters. The only possible letters are A, C, G, and T, and each letter represents a nucleotide.

[0065] Many raw data sources suffer from various problems such as garbled characters, disordered formatting, and missing data. Therefore, when performing homology analysis of gene sequences in biological species, gene sequence data of the biological species are first extracted using various biological gene sequence extraction methods. The raw data is then preprocessed to remove outliers. Optionally, biological gene sequence extraction methods include, but are not limited to, methods using biological gene transcription, chemical methods, and electromagnetic methods; preprocessing of the raw data includes, but is not limited to, specified data extraction, data cleaning, and data feature transformation.

[0066] The data analysis module 20 includes a protein sequence analysis unit and a gene sequence analysis unit; the protein sequence analysis unit is used to generate a numerical representation vector of the protein sequence based on the frequency, position and physicochemical properties of each amino acid in the protein sequence; the gene sequence analysis unit is used to obtain the optimal matching sequence between two sets of gene sequences.

[0067] The gene sequence analysis unit is used to calculate all possible similarity comparison scores of two sets of gene sequences using an iterative method, and to find the optimal matching sequence by backtracking using dynamic programming based on the highest score among all possible similarity comparison scores. It also inserts a gap symbol in the gap region of the optimal matching sequence and calculates the number of nucleotides with the same corresponding position and the number of gap regions at the corresponding position in the two sets of optimal matching sequences.

[0068] More specifically, in this embodiment, the gene data is analyzed directly using the Smith-Waterman algorithm:

[0069] First, obtain the two gene sequences to be compared, $A = a_1 a_2 \ldots a_i \ldots a_n$ and $B = b_1 b_2 \ldots b_j \ldots b_m$, where n and m are the lengths of gene sequence A and gene sequence B, respectively;

[0070] Define scoring criteria:

[0071]

[0072] $W_k = 2k$ is the gap penalty for a gap of length k; where 1≤i≤n, 1≤j≤m;

[0073] That is, a nucleotide match between two sites earns 3 points, and a mismatch earns -3 points. If a gap occurs, the penalty score is increased linearly (i.e., a penalty score of 2 for one gap and 4 for two gaps).

[0074] Secondly, construct the score matrix H, with gene sequence $A = a_1 a_2 \ldots a_i \ldots a_n$ as the horizontal gene sequence and gene sequence $B = b_1 b_2 \ldots b_j \ldots b_m$ as the vertical gene sequence; the score matrix H has n+1 rows and m+1 columns;

[0075] Initialize the first row and first column of the score matrix H: the scores of the first row $H_{0j}$ and the first column $H_{i0}$ are initialized to 0, where 1≤i≤n, 1≤j≤m;

[0076] Fill the remaining part of the score matrix H from left to right and top to bottom. For each position, consider the match or mismatch score (the diagonal score) and the scores for introducing a gap (the horizontal and vertical scores), and take the highest of the candidates derived from the positions above, to the left, and to the upper left; if that score is less than 0, the score of the position is set to 0. Specifically, the element $H_{ij}$ of the score matrix H is expressed as:

[0077]

[0078] Where 1≤i≤n, 1≤j≤m;
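The scoring rule of [0070] to [0073] and the fill rule of [0076] are consistent with the standard Smith-Waterman recurrence, which can be written as follows. The symbol $s(a_i, b_j)$ for the substitution score is an assumption introduced here; the formulas originally given in [0071] and [0077] are not legible in this text.

```latex
\[
s(a_i, b_j) =
\begin{cases}
  3,  & a_i = b_j \\
  -3, & a_i \neq b_j
\end{cases}
\qquad
W_k = 2k
\]
\[
H_{ij} = \max
\begin{cases}
  H_{i-1,j-1} + s(a_i, b_j) \\
  \max_{k \ge 1} \left( H_{i-k,j} - W_k \right) \\
  \max_{l \ge 1} \left( H_{i,j-l} - W_l \right) \\
  0
\end{cases}
\qquad 1 \le i \le n,\ 1 \le j \le m
\]
```

Because $W_k$ is linear in k, the two inner maxima reduce to $H_{i-1,j} - 2$ and $H_{i,j-1} - 2$, so each cell depends only on its three immediate neighbours.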

[0079] Finally, using dynamic programming, we backtrack from the element with the highest score in the score matrix to the element with a score of 0. Specifically, we establish a vertical gene sequence A′ and a horizontal gene sequence B′. Starting from the element with the highest score in the score matrix, we compare the upper left, upper, and left directions of each current score matrix position. If the maximum value appears at the upper position, we introduce a "-" in the horizontal gene sequence B′ and take the nucleotide from that position in the vertical gene sequence A′. If the maximum value appears at the left position, we introduce a "-" in the vertical gene sequence A′ and take the nucleotide from that position in the horizontal gene sequence B′. If the maximum value appears at the upper left position, we do not introduce a "-", and both the vertical gene sequence A′ and the horizontal gene sequence B′ take the nucleotide from that position. It is worth noting that the backtracking process stops when the backtracked value is 0. Finally, we flip the entire sequences of the vertical gene sequence A′ and the horizontal gene sequence B′ to obtain the final optimal matching sequence of gene sequence A and gene sequence B.
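The matrix fill and backtracking described above can be sketched in Python as follows. The function name `smith_waterman` and its parameters are hypothetical; the default scores (match 3, mismatch -3, gap penalty 2 per gap position) are taken from [0073]. In this sketch sequence `a` indexes the rows and `b` the columns. The traceback follows the neighbour that actually produced each cell's score, which is the usual formulation and may differ in tie handling from the literal wording of [0079].

```python
def smith_waterman(a, b, match=3, mismatch=-3, gap=2):
    """Local alignment of a and b; returns (aligned_a, aligned_b, best_score)."""
    n, m = len(a), len(b)
    # Score matrix with an extra zero row and zero column.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] - gap      # gap in b
            left = H[i][j - 1] - gap    # gap in a
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Backtrack from the highest score until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = H[i][j]
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score == H[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    # The alignment was collected back to front, so reverse it.
    return "".join(reversed(out_a)), "".join(reversed(out_b)), best
```

For example, aligning `"ACGTACGT"` with `"ACGACGT"` yields seven matches and one gap position, for a score of 7 × 3 − 2 = 19.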

[0080] More specifically, after the optimal matching sequences of the two sequences are obtained, the number math of nucleotides that are identical at corresponding positions and the number gap of gap regions at corresponding positions in the two optimal matching sequences are calculated. To compute math, set the variable math = 0 and compare the nucleotides at corresponding positions of the two optimal matching sequences pairwise; each time they are the same, math is increased by 1.

[0081] To compute gap, set the variable gap = 0. If the symbol at a position in either of the two optimal matching sequences is "-", move on to the next position. As long as at least one of the two sequences has "-" at the next position, gap remains unchanged and the scan continues; when the next position compared is no longer a gap "-", gap is increased by 1. That is, one or more consecutive gaps constitute one gap region, and gap is the number of gap regions at corresponding positions in the two optimal matching sequences.
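The two counters can be computed in a single pass over the aligned sequences, as in the following Python sketch. The function name is hypothetical, and the local variables are called `matches` and `gap_regions` (corresponding to math and gap in the text) to avoid shadowing Python's `math` module.

```python
def count_matches_and_gap_regions(aligned_a, aligned_b):
    """Return (matches, gap_regions) for two equal-length aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    gap_regions = 0
    in_gap = False  # True while scanning inside a run of gap columns
    for x, y in zip(aligned_a, aligned_b):
        is_gap = x == "-" or y == "-"
        if is_gap and not in_gap:
            gap_regions += 1  # a new run of one or more gap columns starts
        in_gap = is_gap
        if not is_gap and x == y:
            matches += 1
    return matches, gap_regions
```

A run of consecutive gap columns counts once, whichever of the two sequences holds the "-" symbols.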

[0082] Alternatively, dynamic programming or the Needleman-Wunsch algorithm can be used to find the longest common subsequence between gene sequence data of unknown species or between gene sequence data of unknown and known species, thereby determining which two species are more homologous.
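For the alternative mentioned above, a longest common subsequence can be found with a textbook dynamic programming table, sketched here in Python. The function name is hypothetical, and this is a generic formulation rather than a procedure spelled out in the text.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of strings a and b."""
    n, m = len(a), len(b)
    # L[i][j] = length of the LCS of a[:i] and b[:j].
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Backtrack from the bottom-right corner to recover the subsequence.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

A longer common subsequence relative to the sequence lengths would then indicate closer homology between the two species being compared.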

[0083] Specifically, the protein sequence analysis unit includes:

[0084] The amino acid frequency information calculation subunit is used to generate the amino acid frequency information vectors of the X groups of protein sequences based on the frequency of occurrence of each type of amino acid in the protein sequences of the X groups of biological species.

[0085] The amino acid frequency information calculation subunit is also used to count the number of occurrences of each type of amino acid in the protein sequence and to obtain the amino acid frequency information vector from the ratio of the number of occurrences of each amino acid to the total number of amino acids in the protein sequence. The amino acid frequency information vector $V_f$ is expressed as:

[0086] $V_f = (f_1, f_2, \ldots, f_i, \ldots, f_{20})$

[0087] $f_i = n_i / N$

[0088]

[0089] Here, $f_i$ is the frequency information of the i-th type of amino acid, $n_i$ is the number of times that amino acid appears, and N is the total number of amino acids in the protein sequence; the i-th type of amino acid is the i-th member of the set of 1-word amino acids.
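A minimal Python sketch of the frequency vector described above. The function name `frequency_vector` is hypothetical, and the ordering of the 20 amino acids (alphabetical by one-letter code) is an assumption, since the text does not fix an order. Letters outside the 20 standard codes are ignored here on the assumption that preprocessing has already cleaned the sequence.

```python
from collections import Counter

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_vector(seq):
    """Return V_f = (f_1, ..., f_20) with f_i = n_i / N."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)  # N, the number of amino acids
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]
```

For the toy sequence `"AACD"`, the entries for A, C and D are 0.5, 0.25 and 0.25, and the remaining 17 entries are 0.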

[0090] Specifically, the amino acid physicochemical property average value calculation subunit is used to generate the amino acid physicochemical property average value vectors of the X groups of protein sequences based on the physicochemical properties of each amino acid in the protein sequences of the X groups of biological species. Different amino acids have a variety of physicochemical (physical and chemical) properties, whose values are fixed and known for each amino acid; the average value vector is generated from a selection of commonly used properties. Optionally, this embodiment uses 8 physicochemical properties of amino acids: hydrophobicity, molecular weight, solubility, specific rotation in water ([α]D(H2O)), specific rotation in hydrochloric acid ([α]D(HCl)), isoelectric point, ionization of the carboxyl group of the amino acid in aqueous solution (pK1(-COOH)), and ionization of the amino group of the amino acid in aqueous solution (pK2(-NH3)).

[0091] More specifically, the amino acid physicochemical property average value calculation subunit performs the following operations on each group of protein sequences to obtain the amino acid physicochemical property average value vectors of the X groups of protein sequences: select J physicochemical property parameter values for each type of 1-word amino acid; standardize the physicochemical property parameter values of each amino acid according to their maximum and minimum values to obtain the standardized physicochemical property parameters of each amino acid; and calculate the average value of each physicochemical property from the standardized physicochemical property parameters and the frequency information of each amino acid to obtain the amino acid physicochemical property average value vector $V_p$, expressed as:

[0092]

[0093]

[0094]

[0095] Here, the raw physicochemical property data are standardized before averaging; $P_{ji}$ is the value of the j-th physicochemical property parameter of the i-th type of amino acid; $P_{ab}$ is the value of the a-th physicochemical property parameter of the b-th type of amino acid; the elements of $V_p$ are the average values of each physicochemical property in the protein sequence; $f_i$ is the frequency information of the i-th type of amino acid; and 1≤j≤J.
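The standardization and averaging steps can be sketched in Python under two stated assumptions, because the formulas of [0092] to [0094] are not legible in this text: each property is min-max scaled using its own minimum and maximum across the amino acids, and the average of a property is the frequency-weighted sum of its standardized values. The function names and the table layout are hypothetical, and no real property values are included.

```python
def minmax_standardize(table):
    """Scale every property to [0, 1] (ASSUMED: per-property min and max).

    table: list of J dicts, table[j][amino_acid] = raw value of property j.
    """
    scaled = []
    for prop in table:
        lo, hi = min(prop.values()), max(prop.values())
        span = hi - lo
        # A constant property carries no information; map it to 0.0.
        scaled.append({aa: (v - lo) / span if span else 0.0 for aa, v in prop.items()})
    return scaled


def property_mean_vector(freq, table):
    """Return V_p: one frequency-weighted mean per property (ASSUMED form).

    freq: dict mapping amino acid -> frequency information f_i.
    """
    return [
        sum(freq.get(aa, 0.0) * v for aa, v in prop.items())
        for prop in minmax_standardize(table)
    ]
```

With the 8 properties listed in [0090], `table` would hold J = 8 dictionaries and the result would be an 8-element vector.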

[0096] Specifically, the amino acid position probability calculation subunit is used to generate the amino acid position probability vectors of the X groups based on the position information of the 1-word to K-word amino acids in the protein sequences of the X groups of biological species. When K ≥ 2, the amino acid position probability vector of each group is reduced in dimensionality to obtain a dimensionality-reduced amino acid position probability vector. A k-word amino acid is a specified run of k consecutive amino acids, where 1 ≤ k ≤ K. The value of K can be chosen freely according to the protein sequence length and the available computing power. When K = 1, only the 20 amino acids appearing individually are analyzed; when K = 2, pairs of amino acids appearing together are also analyzed, for example the 400 two-amino-acid combinations such as II, IV, VI, and IL; and so on, so that for a given K the $20^K$ combinations of K amino acids appearing together are analyzed. The larger the value of K, the more computing power is required, so K can be selected to suit the actual application platform.

[0097] Specifically, the amino acid position probability calculation subunit generates the X groups of amino acid position probability vectors from the position information of the 1-word to K-word amino acids in the protein sequences as follows:

[0098] For each group of protein sequences, the following operations are performed to obtain the amino acid position probability vectors of the X groups:

[0099] The amino acids of the protein sequence are numbered consecutively from 1, and the number of the first amino acid of a k-word amino acid is used as the position information value of that k-word amino acid.

[0100] For each type of k-word amino acid in turn, calculate the sum of its position information values in the protein sequence, where i denotes the i-th type of k-word amino acid, $1 \le i \le 20^K$;

[0101] Through the ratio of the sum of position information values of each type of k-word amino acid to the ..., obtain the position probability vector $D_k$ of the k-word amino acids, where 1≤k≤K;

[0102] The position probability vectors $D_1$ to $D_K$ of the 1-word to K-word amino acids are concatenated to form the amino acid position probability vector $V'_d$, expressed as:

[0103] $V'_d = (D_1, D_2, \ldots, D_k, \ldots, D_K)$

[0104]

[0105]

[0106]

[0107]

[0108] where k is the number of consecutive amino acids in a k-word amino acid, 1 ≤ k ≤ K; $D_k$ is the position probability vector of the k-word amino acids, whose i-th element is the proportion of position information of the i-th kind of k-word amino acid; this proportion is based on the sum of the position information values of the i-th kind of k-word amino acid appearing in the protein sequence; and N is the total number of amino acids in the protein sequence.
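The construction of $D_k$ and $V'_d$ can be sketched as follows. The exact denominator of the ratio is not recoverable from the text above, so this sketch assumes that each entry is the k-word's summed start positions divided by the total of all such sums, which makes every $D_k$ sum to 1. The function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def position_probability_vector(seq, k):
    """D_k: share of summed 1-based start positions held by each k-word.

    Assumption: the denominator is the total over all k-words of their
    position sums; the source formula is not available.
    """
    words = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    sums = dict.fromkeys(words, 0)
    for start in range(len(seq) - k + 1):
        word = seq[start:start + k]
        if word in sums:              # skip words with non-standard symbols
            sums[word] += start + 1   # positions are numbered from 1
    total = sum(sums.values())
    return [sums[w] / total if total else 0.0 for w in words]


def concatenated_position_vector(seq, K):
    """V'_d = (D_1, ..., D_K), of dimension 20 + 20^2 + ... + 20^K."""
    vector = []
    for k in range(1, K + 1):
        vector.extend(position_probability_vector(seq, k))
    return vector
```

For the toy sequence "ACA" and K = 2, amino acid A starts at positions 1 and 3 and C at position 2, so the first block holds 4/6 for A and 2/6 for C, and the whole vector has 420 entries.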

[0109] More specifically, the amino acid position probability vector $V'_d$ is an M1-dimensional vector, where M1 = 20 + $20^2$ + … + $20^k$ + … + $20^K$. Different values of K are chosen so that, after a large amount of data on the arrangement and combination of amino acids in the protein sequence has been collected, patterns can be found for protein sequence similarity analysis. However, a large amount of data increases the workload of data analysis, and, more importantly, many of the data points may be correlated, which makes the analysis more complex. Therefore, during analysis the high-dimensional data can be preprocessed by dimensionality reduction, which retains the most important features and removes noise and unimportant features, thereby improving data processing. Within an acceptable range of information loss, this saves considerable time and cost in engineering practice.
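As a quick check of how fast the dimension grows with K, the sum above is a geometric series. The closed form below is standard arithmetic rather than something stated in the source:

```latex
M1 = \sum_{k=1}^{K} 20^{k} = \frac{20\,(20^{K} - 1)}{19},
\qquad
K = 1:\ M1 = 20, \quad
K = 2:\ M1 = 20 + 400 = 420, \quad
K = 3:\ M1 = 20 + 400 + 8000 = 8420 .
```

Moving from K = 2 to K = 3 multiplies the dimension by about 20, which is why the dimensionality reduction step matters for K ≥ 2.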

[0110] The amino acid position probability calculation subunit is further configured to reduce the dimensionality of each group of amino acid position probability vectors when K≥2, to obtain a dimensionality-reduced amino acid position probability vector, including:

[0111] The amino acid position probability vector is zero-mean normalized to obtain the measurement matrix X′;

[0112] Eigenvalue decomposition is performed on the covariance matrix S of the measurement matrix X′ to obtain the M1 eigenvalues of S, which are arranged in descending order; the eigenvectors corresponding to the first M eigenvalues form an eigenvector matrix, and the amino acid position probability vector $V_d$ corresponding to the eigenvector matrix is obtained. Optionally, M = 72;

[0113] $V_d$ is the M-dimensional amino acid position probability vector obtained after dimensionality reduction; with M = 72, it is the final 72-dimensional amino acid position probability vector.

[0114] Specifically, take obtaining a 72-dimensional amino acid position probability vector with K = 2 as an example:

[0115] When K = 2, M1 = 20 + $20^2$ = 420, i.e., the amino acid position probability vector $V'_d$ is a 420-dimensional vector;

[0116] The amino acid position probability vector is used to construct a 1×420 amino acid position probability matrix X, where... 1 ≤ m ≤ 420;

[0117] The amino acid position probability matrix X is zero-mean normalized to obtain the 1×420 measurement matrix X′;

[0118] Eigenvalue decomposition is performed on the covariance matrix S of the 1×420 measurement matrix X′ to obtain the M1 eigenvalues of S, which are arranged in descending order; the eigenvectors corresponding to the first M eigenvalues form an eigenvector matrix, and the amino acid position probability vector $V_d$ corresponding to the eigenvector matrix is obtained. Optionally, M = 72; where the covariance matrix S is expressed as:

[0119] The 420 values on the diagonal of the covariance matrix S are the eigenvalues of S, i.e., $z_{m,m}$ (1 ≤ m ≤ 420) denotes the eigenvalues of the covariance matrix S. The 420 eigenvalues are arranged in descending order, and the proportions of amino acid position information corresponding to the first 72 eigenvalues are taken to form the M-dimensional amino acid position probability vector $V_d$ obtained after dimensionality reduction, i.e., the 72-dimensional amino acid position probability vector.
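The zero-mean normalization, covariance, eigenvalue ordering, and projection steps follow the usual principal component analysis recipe, which can be sketched with NumPy as follows. One loudly stated assumption: the position probability vectors of all X groups are stacked as the rows of one matrix, because a covariance estimate needs more than one observation, whereas the example above describes a single 1×420 matrix. The function name pca_reduce is hypothetical.

```python
import numpy as np


def pca_reduce(vectors, m):
    """Project row vectors onto the top-m eigenvectors of their covariance.

    Assumption: `vectors` has one row per group (X rows, M1 columns).
    """
    x = np.asarray(vectors, dtype=float)
    x_centered = x - x.mean(axis=0)           # zero-mean normalization -> X'
    s = np.cov(x_centered, rowvar=False)      # M1 x M1 covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(s)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]     # indices of the m largest eigenvalues
    return x_centered @ eigvecs[:, order]     # reduced vectors, one row per group
```

With K = 2 and M = 72, the input would have 420 columns and the output 72 columns per group. Note that no more than X − 1 components can carry non-zero variance when only X groups are available.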

[0120] The species homology determination module 30 is used to calculate the distance between every two sets of protein / gene sequence data based on the numerical representation vector / optimal matching sequence of the protein sequence, and to analyze the degree of homology of the protein / gene sequences based on the magnitude of the distance.

[0121] More specifically, the species homology determination module 30 includes: a protein sequence species homology determination unit and a gene sequence species homology determination unit;

[0122] The protein sequence species homology determination unit is used to analyze the similarity of protein sequence homology based on the distance between the two sets of protein sequences. Optionally, the distance between the two sets of protein sequences can be calculated by Euclidean distance, Manhattan distance, Chebyshev distance or genetic distance.

[0123] Specifically, this invention chooses to use Euclidean distance to calculate the distance between two sets of protein sequences.

[0124] More specifically, the protein sequence species homology determination unit is used to construct numerical representation vectors of different protein sequences based on the amino acid frequency information vector, the average value vector of amino acid physicochemical properties, and the amino acid position information vector. Based on these numerical representation vectors, it calculates the distance d(S,T) between every two sets of protein sequences S and T. The distance d(S,T) between the two sets of protein sequences is expressed as:

[0125] $d(S,T) = \sqrt{\sum_{q=1}^{Q} \left( V_S[q] - V_T[q] \right)^2}$

[0126] $V = (V_f, V_d, V_p)$

[0127] where $V_S[q]$ and $V_T[q]$ represent the q-th element of the numerical representation vectors of protein sequence S and protein sequence T, respectively, 1 ≤ q ≤ Q, Q = 20 + M + 8, and M is the dimension of the amino acid position probability vector $V_d$; V is the numerical representation vector, $V_f$ is the amino acid frequency information vector, $V_d$ is the amino acid position information vector, and $V_p$ is the vector of average values of the physicochemical properties of amino acids.
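The concatenation of the three vectors and the Euclidean distance chosen in [0123] can be sketched as follows. The function names are hypothetical, and the component vectors are assumed to have been computed already.

```python
import math


def numerical_vector(v_f, v_d, v_p):
    """V = (V_f, V_d, V_p): frequency, position, and property-average parts."""
    return list(v_f) + list(v_d) + list(v_p)


def euclidean_distance(v_s, v_t):
    """d(S, T) = sqrt(sum over q of (V_S[q] - V_T[q])^2)."""
    if len(v_s) != len(v_t):
        raise ValueError("vectors must have the same dimension Q")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_s, v_t)))
```

With a 20-dimensional $V_f$, a 72-dimensional $V_d$, and an 8-dimensional $V_p$, the concatenated vector has Q = 100 elements.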

[0128] Specifically, when the distances between a certain group of protein sequences and at least two other groups of protein sequences from other species cannot be clearly distinguished, a second dimensionality reduction is performed on the amino acid position probability vector $V'_d$ formed by concatenating the position probability vectors $D_1$ to $D_K$ of the 1-word to K-word amino acids. The first 20 dimensions of the resulting position probability vector, the amino acid frequency information vector, and the average value vector of amino acid physicochemical properties are selected to form a second numerical representation vector of the protein sequence, and the distance between the protein sequences is then recalculated.

[0129] Specifically, the protein sequences of biological species in group X include protein sequences of unknown biological species in group x1 and protein sequences of known biological species in group x2, where X = x1 + x2; the distance between each protein sequence in group x1 of unknown biological species and all protein sequences in group x2 of known biological species is calculated, and the homology of the protein sequences of unknown biological species is analyzed by the distance between each protein sequence of unknown biological species and all protein sequences of known biological species;

[0130] More specifically, when the distance between the protein sequences of a certain group of unknown biological species and the protein sequences of a certain group of known biological species is less than a protein distance threshold, then the unknown biological species and the known biological species are homologous.

[0131] When the distances between the protein sequence of a certain group of unknown biological species and the protein sequences of all known biological species are greater than the protein distance threshold, the biological species with the closest homology to the unknown biological species is determined based on the shortest distance between the protein sequence of the unknown biological species and the protein sequences of all known biological species.
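The threshold rule of [0130] and [0131] can be sketched as a nearest-neighbour lookup. This is a simplified illustration: it reports only the single closest known species together with a flag saying whether that distance is under the threshold. The function names, species labels, and threshold value are hypothetical.

```python
import math


def euclidean_distance(v_s, v_t):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_s, v_t)))


def closest_known_species(unknown_vec, known_vecs, threshold):
    """Return (name, distance, is_homologous) for the nearest known species.

    known_vecs maps a species label to its numerical representation vector.
    is_homologous is True when the shortest distance is below the threshold;
    otherwise the returned species is only the closest in homology.
    """
    name, dist = min(
        ((label, euclidean_distance(unknown_vec, vec))
         for label, vec in known_vecs.items()),
        key=lambda item: item[1],
    )
    return name, dist, dist < threshold


# Toy 2-dimensional vectors standing in for full Q-dimensional ones.
known = {"species_a": [0.0, 0.0], "species_b": [10.0, 0.0]}
```

Unknown sequences that map to the same known species could then be grouped together, as described in the following paragraph.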

[0132] When multiple groups of protein sequences from unknown biological species are homologous to, or have the closest homology to, protein sequences from known biological species in the same group, they can be grouped together for easier subsequent analysis.

[0133] More specifically, the gene sequence species homology determination unit is used to analyze the closeness of gene sequence homology based on the distance between the two sets of gene sequences:

[0134] The gene sequence species homology determination unit is used to calculate the distance between any two sets of gene sequence data based on the optimal matching sequence of the two sets of gene sequence data. The distance between the two sets of gene sequence data, d(S,T), is expressed as:

[0135]

[0136] math represents the number of nucleotides that are identical at corresponding positions in the two sets of optimal matching sequences, where math > 1, and gap represents the number of missing regions at corresponding positions in the two sets of optimal matching sequences.
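The two counts can be obtained from a dynamic-programming global alignment with traceback, sketched below. The scoring values (match +1, mismatch −1, gap −1), the "-" gap symbol, and the choice to count every aligned column that contains a gap symbol are assumptions. The distance formula itself is not reproduced because it is not recoverable from the text above. The variable `matches` plays the role of the quantity the text calls math.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style global alignment; returns the two aligned strings."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


def count_match_and_gap(aligned_s, aligned_t):
    """Count identical nucleotide columns and columns containing a gap symbol."""
    pairs = list(zip(aligned_s, aligned_t))
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    gaps = sum(1 for x, y in pairs if x == "-" or y == "-")
    return matches, gaps
```

For example, aligning "ACGT" with "AGT" yields three identical positions and one gap column under these scoring assumptions.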

[0137] When the distance between the gene sequences of a group of unknown biological species and the gene sequences of a group of known biological species is less than a gene distance threshold, the unknown biological species and the known biological species are homologous.

[0138] When the distances between the gene sequence of a certain group of unknown biological species and the gene sequences of all known biological species are greater than the gene distance threshold, the biological species with the closest homology to the unknown biological species is determined based on the shortest distance between the gene sequence of the unknown biological species and the gene sequences of all known biological species.

[0139] Compared with existing technologies, the biological species homology analysis method based on protein / gene sequence data proposed in this invention has the following advantages. First, by combining the frequency information of amino acid occurrences, the average value information of amino acid physicochemical properties, and the position probability information of K-word amino acids, it can analyze protein sequences comprehensively and accurately; by comparing the distance between two protein sequences, it can analyze protein homology more accurately; and by comparing the distance between the optimal matching sequences of gene sequences, it can analyze gene sequence homology accurately and quickly. Second, through the species homology comparison analysis method based on protein / gene sequence data, this invention can quickly classify the genetic information of species, which helps medical workers take targeted measures. Finally, compared with traditional methods, the biological species homology analysis and species identification method and system based on protein / gene sequence data greatly reduces the time required for experiments, saving human, material, and financial costs.

[0140] Those skilled in the art will understand that all or part of the processes of the methods described in the above embodiments can be implemented by a computer program instructing related hardware, and the program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a disk, optical disk, read-only memory, or random access memory, etc.

[0141] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.

Claims

1. A biological species homology analysis system based on protein / gene sequence data, characterized in that it comprises: a data preprocessing module, used to clean the acquired raw protein / gene data to obtain protein / gene sequence data; a data analysis module, comprising a protein sequence analysis unit and a gene sequence analysis unit, wherein the protein sequence analysis unit is used to generate a numerical representation vector of the protein sequence based on the frequency, position, and physicochemical properties of each amino acid in the protein sequence, and the gene sequence analysis unit is used to obtain the optimal matching sequences of two sets of gene sequences; the protein sequence analysis unit includes an amino acid position probability calculation subunit, used to generate X groups of amino acid position probability vectors based on the position information of the 1-word to K-word amino acids in the protein sequences of X groups of biological species, and, when K ≥ 2, to reduce the dimensionality of each group's amino acid position probability vector to obtain a dimensionality-reduced amino acid position probability vector; a k-word amino acid is a specified run of k consecutive amino acids, where 1 ≤ k ≤ K; the value of K is selected based on the protein sequence length and computing power; reducing the dimensionality of each group's amino acid position probability vector when K ≥ 2 includes: zero-mean normalizing the amino acid position probability vector to obtain the measurement matrix X′; performing eigenvalue decomposition on the covariance matrix S of the measurement matrix X′ to obtain the M1 eigenvalues of S, arranging them in descending order, and forming an eigenvector matrix from the eigenvectors corresponding to the first M eigenvalues; and obtaining the amino acid position probability vector $V_d$ corresponding to the eigenvector matrix, $V_d$ being the M-dimensional amino acid position probability vector obtained after dimensionality reduction; a species homology determination module, used to calculate the distance between every two sets of protein / gene sequence data based on the numerical representation vector / optimal matching sequence of the protein sequence, and to analyze the degree of homology of the protein / gene sequences based on the magnitude of the distance; the species homology determination module includes a gene sequence species homology determination unit, used to calculate the distance $d(S,T)$ between any two sets of gene sequence data based on the optimal matching sequences of the two sets of gene sequence data, the distance being expressed in terms of math and gap, where math represents the number of nucleotides that are identical at corresponding positions in the two sets of optimal matching sequences and gap represents the number of missing regions at corresponding positions in the two sets of optimal matching sequences; the amino acid position probability calculation subunit is further configured to generate the X groups of amino acid position probability vectors based on the position information of the 1-word to K-word amino acids in the protein sequences, including: for each group of protein sequences, performing the following operations to obtain the X groups of amino acid position probability vectors: numbering the amino acids of the protein sequence sequentially starting from 1, and using the number of the first amino acid of a k-word amino acid as the position information value of that k-word amino acid; calculating in turn the sum of the position information values of each kind of k-word amino acid in the protein sequence, where i denotes the i-th kind of k-word amino acid, $1 \le i \le 20^K$; obtaining the position probability vector $D_k$ of the k-word amino acids from the ratio of the sum of the position information values of each kind of k-word amino acid to ..., where 1 ≤ k ≤ K; and concatenating the position probability vectors $D_1$ to $D_K$ of the 1-word to K-word amino acids to form the amino acid position probability vector $V'_d$ of the group, $V'_d$ being an M1-dimensional vector, where M1 = 20 + $20^2$ + … + $20^k$ + … + $20^K$; calculating the distance between any two sets of protein sequences includes: when the distances between a certain group of protein sequences and at least two other groups of protein sequences cannot be clearly distinguished, performing a second dimensionality reduction on the amino acid position probability vector corresponding to each of these groups of protein sequences, and recalculating the distance between the protein sequence and the other two groups of protein sequences.

2. The biological species homology analysis system based on protein / gene sequence data according to claim 1, characterized in that, The gene sequence analysis unit is used to calculate all possible similarity comparison scores of two sets of gene sequences using an iterative method, and to find the optimal matching sequence by backtracking using dynamic programming based on the highest score among all possible similarity comparison scores. It also inserts a gap symbol in the gap region of the optimal matching sequence and calculates the number of nucleotides with the same corresponding position and the number of gap regions at the corresponding position in the two sets of optimal matching sequences.

3. The biological species homology analysis system based on protein / gene sequence data according to claim 1, characterized in that, The protein sequence analysis unit includes: The amino acid frequency information calculation subunit is used to generate an amino acid frequency information vector of the protein sequence of group X based on the frequency of occurrence of various amino acids in the protein sequence of group X biological species. The subunit for calculating the average value of amino acid physicochemical properties is used to generate the average value vector of amino acid physicochemical properties of protein sequences in group X based on the physicochemical properties of each amino acid in the protein sequences of group X biological species.

4. The biological species homology analysis system based on protein / gene sequence data according to claim 3, characterized in that the amino acid frequency information calculation subunit is also used to: count the number of occurrences of each 1-word amino acid in the protein sequence, and obtain the amino acid frequency information vector from the ratio of the number of occurrences of each amino acid to the total number of amino acids in the protein sequence; the amino acid frequency information vector $V_f$ is expressed in terms of the following quantities: $f_i$, the frequency information of the i-th kind of 1-word amino acid; the number of times that amino acid appears in the protein sequence; and N, the total number of amino acids in the protein sequence.

5. The biological species homology analysis system based on protein / gene sequence data according to claim 3, characterized in that the subunit for calculating the average value of amino acid physicochemical properties is used to select J kinds of physicochemical property parameter values for each kind of 1-word amino acid, and to standardize the physicochemical property parameter values of each amino acid according to the maximum and minimum of those parameter values, to obtain the standardized physicochemical property parameters of each amino acid; based on the standardized physicochemical properties of each amino acid and its frequency of occurrence, the average value of each physicochemical property is calculated to obtain the average value vector of amino acid physicochemical properties $V_p$, which is expressed in terms of the following quantities: the standardized physicochemical property data; $P_{ji}$, the value of the j-th physicochemical property parameter of the i-th type of amino acid; $P_{ab}$, the value of the a-th physicochemical property parameter of the b-th type of amino acid; the average value of each physicochemical property in the protein sequence; and $f_i$, the frequency information of the i-th type of amino acid.

6. The biological species homology analysis system based on protein / gene sequence data according to claim 3, characterized in that the species homology determination module also includes a protein sequence species homology determination unit; the protein sequence species homology determination unit is used to analyze the degree of homology between two sets of protein sequences based on the distance between them: when the distance between a protein sequence of an unknown biological species and a protein sequence of a known biological species is less than a protein distance threshold, the unknown biological species and the known biological species are homologous; when the distances between the protein sequence of a certain group of unknown biological species and the protein sequences of all known biological species are greater than the protein distance threshold, the biological species with the closest homology to the unknown biological species is determined based on the shortest distance between the protein sequence of the unknown biological species and the protein sequences of all known biological species; the gene sequence species homology determination unit is also used to analyze the closeness of gene sequence homology based on the distance between the two sets of gene sequences: when the distance between the gene sequences of a group of unknown biological species and the gene sequences of a group of known biological species is less than a gene distance threshold, the unknown biological species and the known biological species are homologous; when the distances between the gene sequence of a certain group of unknown biological species and the gene sequences of all known biological species are greater than the gene distance threshold, the biological species with the closest homology to the unknown biological species is determined based on the shortest distance between the gene sequence of the unknown biological species and the gene sequences of all known biological species.

7. The biological species homology analysis system based on protein / gene sequence data according to claim 6, characterized in that the protein sequence species homology determination unit is used to construct numerical representation vectors for different protein sequences based on the amino acid frequency information vector, the average value vector of amino acid physicochemical properties, and the amino acid position information vector, and, based on these numerical representation vectors, to calculate the distance between every two sets of protein sequences S and T; the distance $d(S,T)$ between the two sets of protein sequences is expressed as: $d(S,T) = \sqrt{\sum_{q=1}^{Q} \left( V_S[q] - V_T[q] \right)^2}$, with $V = (V_f, V_d, V_p)$, where $V_S[q]$ and $V_T[q]$ are the q-th elements of the numerical representation vectors of protein sequences S and T, respectively, 1 ≤ q ≤ Q, Q = 20 + M + 8, and M is the dimension of the amino acid position probability vector $V_d$; V is the numerical representation vector, $V_f$ is the amino acid frequency information vector, $V_d$ is the amino acid position information vector, and $V_p$ is the vector of average values of the physicochemical properties of amino acids.