Method for determining optimal sequence alignment threshold for gene database

A technology relating to sequence alignment and gene databases, applied in the field of biotechnology, that achieves accurate alignment results and superior alignment performance.

Active Publication Date: 2021-02-12
PEKING UNIV

AI Technical Summary

Problems solved by technology

[0006] The optimal sequence alignment thresholds (similarity and E value) for gene databases are currently selected largely on an empirical basis. To address this, the present invention proposes, for the first time, to construct a simulated data set from all protein sequences of a protein sequence database (for example, the Swiss-Prot protein sequence database), align the simulated sequences against the gene database, and evaluate the alignment performance by sensitivity, accuracy and the Matthews correlation coefficient (MCC). This provides a rapid and scientifically grounded key technique for determining the optimal sequence alignment thresholds (i.e., similarity and E value) of a gene database.
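
For orientation, the three evaluation measures named above are commonly defined from the confusion-matrix counts as shown below. These are the standard textbook definitions, given here as an assumption: the excerpt does not reproduce the patent's own formulas, and in particular it does not show how the separately counted "mismatch" results enter them, nor whether "accuracy" is meant in this sense or in the sense of precision. TP, FP, FN and TN denote the numbers of true positives, false positives, false negatives and true negatives.

```latex
\begin{align}
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Accuracy}    &= \frac{TP + TN}{TP + FP + FN + TN} \\
\text{MCC}         &= \frac{TP \cdot TN - FP \cdot FN}
                           {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align}
```

MCC ranges from -1 to 1 and stays informative when the classes are very unbalanced, which is the case when a small true gene data set is mixed with a much larger false one.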


Examples


Embodiment

[0052] Taking the quorum sensing gene database as an example, the process of determining the optimal sequence alignment threshold (i.e., similarity and E value) is described in detail below.

[0053] Step 1), download all protein sequences from the Swiss-Prot protein sequence database of the UniProt protein database (https://www.uniprot.org/uniprot/?query=reviewed:yes) to a local machine, 557,134 protein sequences in total.

[0054] Step 2), from the 557,134 protein sequences obtained in step 1), remove the protein sequences already included in the quorum sensing gene database (245 in total), and use the remaining 556,889 protein sequences as the false quorum sensing gene data set. Each protein sequence in the false quorum sensing gene data set is marked with "F" after the sequence name. Taking the sequence F4HRV8 as an example, the marking is as follows:

[0055]
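
The marking example for [0055] is not reproduced in this excerpt. As a rough illustration of step 2) only, the sketch below drops every sequence whose name already appears in the gene database and appends an "F" label to the names that remain. The function name `build_false_dataset`, the `_` separator, the accession `P00001` and the toy amino-acid strings are all hypothetical; only the accession F4HRV8 and the "F" label come from the text.

```python
def build_false_dataset(all_seqs, database_ids, label="F", sep="_"):
    """Return the sequences NOT in the gene database, with `label` appended to each name.

    all_seqs     -- dict mapping sequence name -> protein sequence
    database_ids -- set of names already included in the gene database
    sep          -- separator between name and label (an assumption; the
                    patent's actual marking format is not shown in the excerpt)
    """
    false_set = {}
    for seq_id, seq in all_seqs.items():
        if seq_id in database_ids:
            continue  # already a known (true) database sequence: exclude it
        false_set[f"{seq_id}{sep}{label}"] = seq
    return false_set


# Toy input: the sequences are placeholders, not real Swiss-Prot entries.
toy_swiss_prot = {"F4HRV8": "MKVLA", "P00001": "MAAGT"}
toy_database_ids = {"P00001"}
toy_false_set = build_false_dataset(toy_swiss_prot, toy_database_ids)
```

With the full data described above, the same filter would take 557,134 input sequences, exclude 245, and leave 556,889 labelled entries.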

[0056] Step 3), divide the protein sequences in the quorum sensing gene database into 11 subcategories according to t...


Abstract

The invention discloses a method for determining an optimal sequence alignment threshold for a gene database. The method comprises the following steps: 1) acquiring protein sequences; 2) removing the sequences already included in the gene database from the protein sequences to create a false gene data set; 3) dividing the protein sequences in the gene database into subclasses to serve as a true gene data set; 4) combining the false gene data set and the true gene data set and, for each protein sequence, simulating a DNA sequence of a specific length as generated by high-throughput sequencing, to obtain a simulated data set; 5) performing sequence alignment and assigning values to the alignment thresholds; 6) judging the sequence alignment results and counting the true positives, mismatches, false positives, false negatives and true negatives; 7) calculating the sensitivity, accuracy and Matthews correlation coefficient; 8) drawing a three-dimensional surface plot with the similarity as the X axis, the E value as the Y axis, and the sensitivity, accuracy or Matthews correlation coefficient as the Z axis; and 9) determining the optimal sequence alignment threshold of the gene database from the three-dimensional surface plot.
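
Steps 5) to 9) can be pictured as a scan over a grid of (similarity, E value) thresholds. The sketch below is a simplified stand-in, not the patented procedure: it assumes each simulated read has at most one best hit, it ignores the separate "mismatch" count, and it picks the grid point with the highest MCC instead of reading the optimum off a three-dimensional surface plot. The function names, the tuple layout and the toy hits are all hypothetical.

```python
import math
from itertools import product


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def confusion(hits, min_identity, max_evalue):
    """Count TP, FP, FN, TN for one threshold pair.

    hits -- list of (identity_percent, evalue, is_true_gene);
            identity_percent is None when the read had no hit at all.
    """
    tp = fp = fn = tn = 0
    for identity, evalue, is_true in hits:
        called = (identity is not None
                  and identity >= min_identity
                  and evalue <= max_evalue)
        if called and is_true:
            tp += 1
        elif called:
            fp += 1
        elif is_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def best_threshold(hits, identities, evalues):
    """Scan the grid and return (best_mcc, min_identity, max_evalue)."""
    best = None
    for min_identity, max_evalue in product(identities, evalues):
        score = mcc(*confusion(hits, min_identity, max_evalue))
        if best is None or score > best[0]:  # keep the first grid point on ties
            best = (score, min_identity, max_evalue)
    return best


# Toy alignment results: two reads from true genes, three from false ones.
toy_hits = [
    (95.0, 1e-20, True),
    (60.0, 1e-8, True),
    (45.0, 1e-3, False),
    (35.0, 1e-1, False),
    (None, None, False),
]
toy_best = best_threshold(toy_hits, [30, 50, 70], [1e-10, 1e-5, 1])
```

The per-grid-point scores computed inside `best_threshold` are the Z values that step 8) would plot against similarity (X) and E value (Y). Sensitivity or accuracy can be substituted for `mcc` in the same loop.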

Description

Technical field

[0001] The invention belongs to the field of biotechnology, and relates to a method for determining the optimal sequence alignment threshold for a gene database by combining simulated data set construction with evaluation of sequence alignment performance.

Background

[0002] In recent years, high-throughput sequencing technology has developed rapidly. Because of its high throughput, high accuracy, and rich information content, it is widely used in microbial ecology research to explore the diversity, community structure, and ecological functions of complex microbial communities. In particular, high-throughput sequencing overcomes the problem that most microorganisms cannot be isolated and cultured, and provides a powerful technical means for studying the metabolic potential and ecological function of these microorganisms.

[0003] The use of high-throughput sequencing technology to obt...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B30/10, G16B40/00, G16B50/00
CPC: G16B30/10, G16B40/00, G16B50/00
Inventor: 刘思彤, 潘珏君, 陈倩
Owner: PEKING UNIV