A method and system for automatic pairing of gene sequencing multi-sample data files

This data file and gene sequencing technology, applied in the field of information processing, addresses the problems that sequencing data sets are very large, that data files are difficult to manage by file name, and that file names are easily modified by people. It facilitates file management, improves usage efficiency, and reduces program execution errors.

Active Publication Date: 2022-05-24

WEST CHINA HOSPITAL SICHUAN UNIV +2


AI Technical Summary

Problems solved by technology

High-throughput sequencing processes a very large number of samples, so paired-end sequencing produces a large number of files recording read information, and each sample's pair of data files (template strand file and complementary strand file) must be matched. How to quickly and accurately match the template strand file and complementary strand file of each sample in multi-sample gene sequencing data is therefore a major problem.

[0004] Most existing pairing methods rely on the sample file names generated during sequencing to match the template strand file with the complementary strand file and to tell the two apart. However, the data files of a large number of sequencing samples are difficult to manage by file name alone, and file names are easily modified by people, which ultimately causes the pairing program to fail.

Method used


Embodiment 1

[0040] Example 1. Pairing gene sequencing multi-sample data files by the method of the present invention

[0041] 1) The file search module determines whether each entry is a file, traverses all files, and parses each file name. Files whose name suffix contains fq or fastq are the FASTQ files to be processed; other files are not processed.

[0042] 2) The file decompression module treats files whose name suffix is .gz, .zip, or .bz as compressed files and decompresses them into text files; other files are not processed. If an exception occurs during processing, it is handled by the exception handling module.

[0043] 3) The file reading module reads the content of the FASTQ file line by line and removes special characters such as spaces and carriage returns at the beginning and end of each line. If an exception occurs during processing, it is handled by the exception handling...
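Steps 1) to 3) above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `find_fastq_files`, `open_text`, `read_lines`, and `FastqReadError` are hypothetical, `.bz` files are assumed to be bzip2 streams, and a `.zip` archive is assumed to hold a single FASTQ member.

```python
import bz2
import gzip
import io
import zipfile
from contextlib import contextmanager
from pathlib import Path

FASTQ_MARKERS = {"fq", "fastq"}


class FastqReadError(Exception):
    """Hypothetical stand-in for the exception handling module."""


def find_fastq_files(root):
    """Step 1: keep regular files whose name suffixes include fq or fastq."""
    found = []
    for path in Path(root).rglob("*"):
        suffixes = {s.lstrip(".").lower() for s in path.suffixes}
        if path.is_file() and suffixes & FASTQ_MARKERS:
            found.append(path)
    return sorted(found)


@contextmanager
def open_text(path):
    """Step 2: open a plain or compressed file as a text stream."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".zip":
        with zipfile.ZipFile(path) as archive:
            # Assumption: the archive holds exactly one FASTQ member.
            with archive.open(archive.namelist()[0]) as raw:
                yield io.TextIOWrapper(raw, encoding="utf-8")
    else:
        # Assumption: ".bz" files are bzip2 streams readable by bz2.
        opener = {".gz": gzip.open, ".bz": bz2.open}.get(suffix, open)
        with opener(path, "rt", encoding="utf-8") as stream:
            yield stream


def read_lines(path):
    """Step 3: yield each line with surrounding whitespace removed."""
    try:
        with open_text(path) as stream:
            for line in stream:
                yield line.strip()
    except (OSError, EOFError, zipfile.BadZipFile) as exc:
        # Decompression and read failures are routed to one error type.
        raise FastqReadError(f"cannot read {path}: {exc}") from exc
```

Streaming the file through a generator keeps memory use flat for large FASTQ files, and wrapping low-level errors in one exception type mirrors the idea of a single exception handling module.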

Experimental Example 1

[0050] Experimental Example 1. Automatic pairing effect of the method of the present invention

[0051] The /fastq1 directory contains 2900 FASTQ data files from 1450 samples. The traditional approach is to organize each sample's paired data files manually by file name, which takes at least 1 hour.

[0052] Using the automatic pairing method and system of Example 1 of the present invention, the files are quickly matched and output, and the template strand and complementary strand are identified. The two paired data files share the same file name prefix; the file whose suffix contains R1 is the template strand data file, and the file whose suffix contains R2 is the complementary strand data file. The whole process takes less than 3 seconds and also detects two errors, namely a data file that cannot be matched and a file decompression error.
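The naming of the paired output (shared prefix, R1 or R2 in the suffix) can be checked with a short sketch like the one below. It only verifies output names and is not how the method performs the pairing itself. The function names are hypothetical, and the exact delimiter around the R1/R2 marker is an assumption, since the text does not specify it.

```python
import re
from collections import defaultdict

# Assumption: the marker is "R1" or "R2", preceded by "_", "-" or "."
# and followed by one of those delimiters or the end of the name.
STRAND_MARKER = re.compile(r"^(?P<prefix>.+?)[._-]R(?P<read>[12])(?=[._-]|$)")


def split_name(filename):
    """Return (prefix, strand) for an output file name, or None."""
    match = STRAND_MARKER.match(filename)
    if match is None:
        return None
    strand = "template" if match.group("read") == "1" else "complementary"
    return match.group("prefix"), strand


def check_pairs(filenames):
    """Group output names by prefix; report complete pairs and leftovers."""
    by_prefix = defaultdict(dict)
    unmatched = []
    for name in filenames:
        parsed = split_name(name)
        if parsed is None:
            unmatched.append(name)
            continue
        prefix, strand = parsed
        by_prefix[prefix][strand] = name
    pairs = {}
    for prefix, strands in by_prefix.items():
        if len(strands) == 2:
            pairs[prefix] = (strands["template"], strands["complementary"])
        else:
            unmatched.extend(strands.values())
    return pairs, sorted(unmatched)
```

For example, `check_pairs(["S1_R1.fastq", "S1_R2.fastq", "S2_R1.fastq"])` pairs the two `S1` files and reports `S2_R1.fastq` as unmatched, which corresponds to the "data file cannot be matched" error mentioned above.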

[0053] It can be seen that the method of the present invention can accurately, effectiv...


Abstract

The present invention provides a method and system for automatic pairing of multi-sample data files for gene sequencing. The method includes the steps of reading files in FASTQ format, obtaining the ID information of the sequencing fragments (reads), calculating a digest of each temporary file with a message digest algorithm, and comparing the digests to pair the files whose digests match. The method and the system implementing it can quickly and accurately match sample data files and distinguish the template strand file from the complementary strand file of the same sample, reducing program execution errors caused by human intervention and improving the efficiency with which computer resources are used.
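The digest-based matching described in the abstract might be sketched as below. This is an illustration under assumptions rather than the patented implementation: the helper names are hypothetical, the input files are assumed to be uncompressed FASTQ with strict four-line records, the mate indicator (`/1`, `/2`, or the text after the first whitespace) is assumed to be stripped so that both files of a sample yield the same ID list, and SHA-256 is used as the digest algorithm because the text does not name one.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def read_id(header):
    """Return the read ID shared by both mates (assumed normalisation)."""
    fields = header[1:].split()  # drop '@' and any trailing description
    token = fields[0] if fields else ""
    if token.endswith(("/1", "/2")):
        token = token[:-2]
    return token


def write_id_file(fastq_path, out_path):
    """Write the read IDs of one FASTQ file to a temporary file."""
    with open(fastq_path, "rt", encoding="utf-8") as src, \
            open(out_path, "wt", encoding="utf-8") as dst:
        for index, line in enumerate(src):
            if index % 4 == 0 and line.strip():  # header of each record
                dst.write(read_id(line.strip()) + "\n")
    return out_path


def file_digest(path, algorithm="sha256"):
    """Digest a file in chunks so large files do not fill memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def pair_by_digest(fastq_paths):
    """Group files whose ID digests match; return (pairs, unmatched)."""
    groups = defaultdict(list)
    with tempfile.TemporaryDirectory() as tmp:
        for number, path in enumerate(fastq_paths):
            id_file = write_id_file(path, Path(tmp) / f"{number}.ids")
            groups[file_digest(id_file)].append(Path(path))
    pairs = sorted(tuple(sorted(g)) for g in groups.values() if len(g) == 2)
    unmatched = sorted(p for g in groups.values() if len(g) != 2 for p in g)
    return pairs, unmatched
```

Because the match depends only on file content, renaming an input file does not affect the result, which is the property the abstract contrasts with file-name-based pairing.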

Description

Technical field

[0001] The invention belongs to the field of information processing, and in particular relates to a method and system for automatic pairing of gene sequencing multi-sample data files.

Background

[0002] In gene sequencing, single-end sequencing is the simplest sequencing method. A single sequencing primer is used so that PCR proceeds along the direction of the sequencing primer. All sequencing fragments (reads) can therefore only be read in one direction, and read quality decreases as sequencing progresses, so bases further along a read are less accurate. To overcome this shortcoming, paired-end sequencing technology has become popular. Sequencing is performed in two different directions, from the two ends toward the middle, to obtain reads in both directions. The length of each read must exceed half of the entire sequence to be tested. The overlapping parts of the two matched reads...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G06F16/14, G16B50/30

Inventor: 应志野辜永红陈一龙于浩澎杨绪亮葛平成孝禹盛玖黄蓉

Owner: WEST CHINA HOSPITAL SICHUAN UNIV
