A method and system for automatic pairing of gene sequencing multi-sample data files

The method, applied to gene sequencing data files in the field of information processing, addresses the problems that such files are very numerous, that their file names are difficult to manage, and that file names are easily modified by people. Its effects are easier management, more efficient use of resources, and fewer program execution errors.

Active Publication Date: 2022-05-24
WEST CHINA HOSPITAL SICHUAN UNIV +2

AI Technical Summary

Problems solved by technology

However, high-throughput sequencing processes a very large number of samples, so paired-end sequencing produces a large number of files recording read information, and the reads of each sample are stored in a pair of data files (a template strand file and a complementary strand file) that must be matched to each other. How to quickly and accurately match the template strand and complementary strand files of each sample in multi-sample gene sequencing data is therefore a significant problem.
[0004] Most existing pairing methods rely on the file names generated during sequencing to match the template strand file with the complementary strand file and to tell the two apart. However, the data files of a large number of sequencing samples are difficult to manage by file name alone, and file names are easily modified by hand, which can ultimately cause the pairing program to fail.

Examples

Embodiment 1

[0040] Example 1. Pairing gene sequencing multi-sample data files with the method of the present invention

[0041] 1) The file search module determines whether each directory entry is a file, traverses all files, and parses each file name. Files whose name suffix includes fq or fastq are the FASTQ files to be processed; other files are not processed.

[0042] 2) The file decompression module treats files whose name suffix is .gz, .zip, or .bz as compressed files and decompresses them into text files; other files are left as they are. If an exception occurs during processing, it is handled by the exception handling module.

[0043] 3) The file reading module reads the content of each FASTQ file line by line and removes special characters such as spaces and carriage returns at the beginning and end of each line. If an exception occurs during processing, it is handled by the exception handling...
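
Steps 1) to 3) above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `find_fastq_files` and `read_lines` are hypothetical, .zip archives are left out for brevity, and compressed files are streamed rather than first decompressed to text files on disk as the description states.

```python
import bz2
import gzip
import os

FASTQ_MARKERS = ("fq", "fastq")
# Assumption: each compressed suffix maps to a standard-library opener.
OPENERS = {".gz": gzip.open, ".bz": bz2.open}


def find_fastq_files(root):
    """Step 1: walk root and keep files whose dotted suffix includes fq or fastq."""
    found = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            parts = name.lower().split(".")[1:]  # components after the first dot
            if any(part in FASTQ_MARKERS for part in parts):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def read_lines(path):
    """Steps 2 and 3: open the file (decompressing if needed), yield stripped lines."""
    ext = os.path.splitext(path)[1].lower()
    opener = OPENERS.get(ext, open)
    try:
        with opener(path, "rt") as handle:
            for line in handle:
                yield line.strip()  # drops leading/trailing spaces and CR/LF
    except (OSError, EOFError) as exc:
        # Stand-in for the exception handling module: report which file failed.
        raise RuntimeError(f"cannot read {path}: {exc}") from exc
```

Raising a single error type that names the failing file is one way a caller could collect decompression failures and continue with the remaining files.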

Experimental example 1

[0050] Experimental example 1. Automatic pairing effect of the method of the present invention

[0051] The /fastq1 directory contains 2900 FASTQ data files from 1450 samples. The traditional approach is to manually organize the paired data files of each sample by file name, which takes at least 1 hour.

[0052] Using the automatic pairing system of Example 1 of the present invention, the files are quickly matched and output, with the template strand and the complementary strand identified. The two paired data files share the same file name prefix; the file whose suffix contains R1 is the template strand data file, and the file whose suffix contains R2 is the complementary strand data file. The whole process takes less than 3 seconds, and two errors are also detected: a data file that cannot be matched and a file decompression error.
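
The output naming described in [0052] can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the strand is determined, so the sketch assumes the mate number is read from the FASTQ header, either from a comment field such as `1:N:0:...` or from a trailing `/1` or `/2`. The names `mate_number` and `output_names` and the `_R1.fastq` naming pattern are hypothetical.

```python
def mate_number(header):
    """Return 1 or 2 from a FASTQ header line, or None if it cannot be inferred."""
    fields = header.strip().split()
    if not fields:
        return None
    if len(fields) > 1 and fields[1][:2] in ("1:", "2:"):  # e.g. "@id 1:N:0:ATCACG"
        return int(fields[1][0])
    if fields[0].endswith(("/1", "/2")):                   # e.g. "@id/1"
        return int(fields[0][-1])
    return None


def output_names(prefix, headers_by_file):
    """Map each input file of one pair to '<prefix>_R1.fastq' or '<prefix>_R2.fastq'."""
    names = {}
    for path, header in headers_by_file.items():
        mate = mate_number(header)
        if mate is None:
            raise ValueError(f"cannot infer mate number for {path}")
        names[path] = f"{prefix}_R{mate}.fastq"
    # A valid pair has exactly one R1 file and one R2 file.
    if sorted(names.values()) != [f"{prefix}_R1.fastq", f"{prefix}_R2.fastq"]:
        raise ValueError(f"expected one R1 and one R2 file for {prefix}")
    return names
```

Because the labels come from file content in this sketch, renaming an input file would not change which output is marked R1 and which R2.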

[0053] It can be seen that the method of the present invention can accurately, effectiv...

Abstract

The present invention provides a method and system for automatic pairing of multi-sample data files for gene sequencing. The method includes the steps of reading the files in FASTQ format, obtaining the ID information of the sequencing fragments (reads), computing a digest of each temporary file with a message digest algorithm, and comparing the digests to find the files whose digests match. The method of the invention, and the system that implements it, can quickly and accurately match sample data files and distinguish the template strand file from the complementary strand file of the same sample, which reduces program execution errors caused by human intervention and improves the efficiency with which computer resources are used.
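
The digest comparison summarized in the abstract can be sketched in Python as follows. The details are assumptions rather than the patented procedure: SHA-256 stands in for the unspecified message digest algorithm, the read ID is taken as the first whitespace-delimited token of each header line with any trailing `/1` or `/2` removed, only the first `max_reads` records are hashed, and the read IDs are hashed directly instead of being written to temporary files first. The names `read_id_digest` and `pair_by_digest` are hypothetical.

```python
import hashlib
from collections import defaultdict


def read_id_digest(lines, max_reads=1000):
    """Hash the mate-independent part of the first max_reads read IDs."""
    digest = hashlib.sha256()
    for index, line in enumerate(lines):
        if index // 4 >= max_reads:
            break
        if index % 4 == 0:  # header line of each 4-line FASTQ record
            fields = line.split()
            read_id = fields[0] if fields else ""
            if read_id.endswith(("/1", "/2")):  # drop an old-style mate suffix
                read_id = read_id[:-2]
            digest.update(read_id.encode() + b"\n")
    return digest.hexdigest()


def pair_by_digest(files):
    """files maps a file name to its lines; returns (pairs, unmatched)."""
    groups = defaultdict(list)
    for name, lines in files.items():
        groups[read_id_digest(lines)].append(name)
    pairs = sorted(tuple(sorted(g)) for g in groups.values() if len(g) == 2)
    unmatched = sorted(n for g in groups.values() if len(g) != 2 for n in g)
    return pairs, unmatched
```

Since the two files of a paired-end run list the same read IDs in the same order, their digests agree whatever the files are named, and any file left without a partner is reported as unmatched.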

Description

Technical field

[0001] The invention belongs to the field of information processing, and in particular relates to a method and system for automatic pairing of gene sequencing multi-sample data files.

Background

[0002] In gene sequencing, single-end sequencing is the simplest sequencing method. A single sequencing primer is used so that PCR proceeds along the direction of that primer. All sequencing fragments (reads) can therefore be read in only one direction, and the quality of the reads decreases as sequencing progresses, so bases further along a read are less accurate. To overcome this shortcoming, paired-end sequencing has become popular. Sequencing is performed in two different directions, from the two ends toward the middle, to obtain reads in both directions. The length of each read must exceed half of the entire sequence to be tested. The overlapping parts of the two matched reads...

Application Information

Patent Type & Authority Patents(China)
IPC(8): G06F16/14; G16B50/30
Inventor 应志野辜永红陈一龙于浩澎杨绪亮葛平成孝禹盛玖黄蓉
Owner WEST CHINA HOSPITAL SICHUAN UNIV