A Method for Analyzing High-Throughput Sequencing Gene Expression Levels Using Text Alignment

The method, applied in the field of bioinformatics, analyzes gene expression levels from high-throughput sequencing data. It addresses the large differences between the results of existing methods, and it is simple, reduces workload, and makes splicing simple and fast.

Active Publication Date: 2022-01-25
FOSHAN UNIVERSITY

AI Technical Summary

Problems solved by technology

[0003] Existing methods for determining gene expression levels from high-throughput sequencing data include CLC, Trinity, SOAP, Oases, ABySS, NextGENe, TopHat, RSEM, eXpress, Sailfish, kallisto, NURD, etc. These methods are still being improved. Each has its own characteristics and algorithmic principles, and the results obtained by different methods differ markedly (the same algorithm run with different parameter settings can also give very different results). It therefore remains necessary to develop a method suited to analyzing gene expression levels from high-throughput sequencing data.


Examples


Embodiment 1

[0062] Example 1 Camellia high-throughput sequencing

[0063] A sequencing company provided the sequencing service for fully developed leaves and for petals of unopened buds from camellia flowering branches during the flowering period, including total RNA extraction, library construction, and paired-end sequencing (Illumina HiSeq 4000). The sequences were delivered in fastq format: 6 G of high-quality data (clean data) and 7 G of unprocessed raw sequencing data (raw data). Each sequence is 150 nt long (150mer), and merging the two ends gives about 50 million sequencing sequences per sample.

Embodiment 2

[0064] Example 2 Sequencing sequences are numbered, broken up, and randomly combined

[0065] Extract the high-throughput sequencing sequences obtained in Example 1 and keep only the sequences themselves. Each sequence is numbered. There are 50 million sequences in total (about 25 million from each end of the paired-end sequencing), and the sequences from one end, from the first to the 25 millionth, are numbered 00000001-25000000. The sequence documents are then cut, randomly sorted, and randomly merged step by step, every 100,000 → 1 million → 50,000 sequences, and all the sequences are merged into one document. In this process, the documents cut into 1 million sequences each are divided into several directories, randomly sorted every 10,000 sequences, and then randomly merged to obtain 1 million sequences, and the documents obtained from all directories are rando...
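The numbering and step-by-step shuffling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patent describes working with text documents from the command line, whereas this sketch keeps everything in memory, and the function names (`number_reads`, `chunked_shuffle`, `stepwise_shuffle`) and the small chunk sizes are hypothetical stand-ins for the 100,000 → 1 million → 50,000 steps.

```python
import random

def number_reads(reads, width=8):
    """Pair each read with a zero-padded serial number (00000001, 00000002, ...)."""
    return [(f"{i:0{width}d}", seq) for i, seq in enumerate(reads, start=1)]

def chunked_shuffle(records, chunk_size, rng):
    """Cut records into chunks, shuffle inside each chunk, then merge chunks in random order."""
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    for chunk in chunks:
        rng.shuffle(chunk)   # random sort within one cut document
    rng.shuffle(chunks)      # random order in which the cut documents are merged
    return [rec for chunk in chunks for rec in chunk]

def stepwise_shuffle(records, chunk_sizes, seed=0):
    """Apply chunked_shuffle once per chunk size (assumed reading of the stepwise scheme)."""
    rng = random.Random(seed)
    for size in chunk_sizes:
        records = chunked_shuffle(records, size, rng)
    return records

if __name__ == "__main__":
    toy_reads = ["ACGT" * 3, "TTGA" * 3, "CCAG" * 3, "GATC" * 3, "AAGT" * 3, "CGCG" * 3]
    numbered = number_reads(toy_reads)
    for serial, seq in stepwise_shuffle(numbered, chunk_sizes=[2, 3, 2]):
        print(serial, seq)
```

Shuffling inside chunks first and then shuffling the chunk order mirrors the cut, sort, merge cycle, and it never needs more than one chunk to be reordered at a time, which is presumably why the text works on documents of bounded size.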

Embodiment 3

[0066] Example 3 Among 1 million sequences, 100,000 are selected as query sequences for comparison, and the expression level of each query sequence is obtained

[0067] Take 1,000,000 of the randomly shuffled sequences obtained in Example 2, divide them into blocks of 100,000 sequences each, and randomly select one block of 100,000 sequences as the query sequences.
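As a minimal sketch of this selection step, the pool can be cut into equal blocks and one block drawn at random. The function name `pick_query_block` and the seed parameter are assumptions made for illustration; the source does not say how the random choice is made.

```python
import random

def pick_query_block(pool, block_size, seed=0):
    """Split the pool into consecutive blocks and return one block chosen at random."""
    rng = random.Random(seed)
    blocks = [pool[i:i + block_size] for i in range(0, len(pool), block_size)]
    return rng.choice(blocks)

if __name__ == "__main__":
    # Toy stand-in: 1,000 items in blocks of 100 instead of 1,000,000 in blocks of 100,000.
    pool = [f"read_{i:04d}" for i in range(1000)]
    queries = pick_query_block(pool, block_size=100)
    print(len(queries), queries[0])
```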

[0068] For each of the 100,000 query sequences above, perform the following operations:

[0069] 1. Take a window of 20 consecutive nucleotides (20mer) every 5 nucleotides, so that each 150-nt query sequence is divided into 27 short 20mer sequences;

[0070] 2. From each query sequence, randomly select 9 of the 20mer short sequences;

[0071] 3. The randomly selected 20mer short sequences (at least 9) are matched against the 1 million sequences. At the same time, the complementary strands of these 20mer short sequences are also matched against the 1 million sequences, and the matching ...
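The three operations above can be sketched as follows. Several points are assumptions rather than the patent's own procedure: the text is truncated before it states the matching criterion, so this sketch counts a read as a hit when it contains at least one sampled 20mer as an exact substring; "complementary strand" is read here as the reverse complement; and Python substring tests stand in for the command-line text matching mentioned in the description. All function names are hypothetical.

```python
import random

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def tile_kmers(seq, k=20, step=5):
    """All k-mers starting every `step` nucleotides (27 of them for a 150-nt read)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

def reverse_complement(seq):
    """Complement each base and reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def sample_kmers(seq, n=9, k=20, step=5, rng=None):
    """Randomly pick n of the tiled k-mers (fewer if the read is too short)."""
    rng = rng or random.Random(0)
    kmers = tile_kmers(seq, k, step)
    return rng.sample(kmers, min(n, len(kmers)))

def matching_read_ids(kmers, pool):
    """Indices of pool reads containing at least one k-mer; the set removes duplicates."""
    return {i for i, read in enumerate(pool) if any(kmer in read for kmer in kmers)}

def count_hits(query, pool, n=9, seed=0):
    """Return (plus-strand hits, minus-strand hits) for one query read."""
    probes = sample_kmers(query, n=n, rng=random.Random(seed))
    plus = matching_read_ids(probes, pool)
    minus = matching_read_ids([reverse_complement(p) for p in probes], pool)
    return len(plus), len(minus)

if __name__ == "__main__":
    query = "AACCA" * 30                      # toy 150-nt read
    pool = [query, reverse_complement(query), "N" * 150]
    print(len(tile_kmers(query)), count_hits(query, pool))
```

Collecting read indices in a set means a read matched by several of the nine 20mers is counted once, which is one plausible reading of the deduplication mentioned in the abstract.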


Abstract

The invention belongs to the field of bioinformatics and provides a method for analyzing gene expression from high-throughput sequencing sequences. First, the sequencing sequences are numbered, broken up, and randomly combined; 100,000 sequences are selected as query sequences and each is compared against 1 million sequences. For the comparison, nine 20mers are randomly selected from each query sequence, and the number of transcripts of the sequence is obtained after deduplicating the matches among the 1 million sequences. The first and last 20mers of the query sequence are used to assemble the matched, aligned contigs. The expression amount of the spliced sequence is obtained by merging the expression amounts of all query sequence groups, and it is equivalent to the expression amount of the negative strand obtained by alignment with the complementary strand. The method can be used effectively for analyzing gene expression from high-throughput sequencing data and for de novo sequence assembly.

Description

Technical field

[0001] The invention belongs to the field of bioinformatics and relates to an analytical method for gene expression levels in individual tissues of organisms. The method uses the command line of an open source operating system for text matching, performs similarity comparison on the short nucleotide sequences obtained by high-throughput sequencing, and splices the matched contiguous sequence groups.

Background

[0002] High-throughput sequencing technology sequences millions of DNA molecules simultaneously, making it possible to analyze the transcriptome and genome of a species or sample in detail and comprehensively. At present, the commonly used high-throughput sequencing technologies mainly include Roche/454, ABI/SOLiD sequencing technology, Illumina/Solexa sequencing technology, single-molecule sequencing technology, and Ion Torrent sequencing technology. RNA-Seq high-throughput sequencing, also known as transcriptome sequencing, is ...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G16B30/10
CPC: G16B30/00
Inventor: 宋东光
Owner: FOSHAN UNIVERSITY