Gene sequencing data simulation system and method for simulating crowd background information

A background information and data simulation technology, applied in the field of data science. It addresses the strong limitations of existing simulators, their inability to simulate population polymorphism, and the high cost of sample extraction, with the effect of improving operating efficiency, enriching variation simulation functions, and saving computing time.

Active Publication Date: 2019-11-22
XI AN JIAOTONG UNIV

AI Technical Summary

Problems solved by technology

[0003] (1) In the parameter tuning stage, objective constraints such as high sample extraction costs and the rarity of some mutations leave developers of mutation detection software without enough varied samples to debug their software, and in particular to tune its parameters.
[0004] (2) In the software testing stage, the true set of mutations contained in the available test samples is unknown, so the accuracy of mutation detection software cannot be measured comprehensively and accurately.
However, most existing simulation software targets only specific scenarios.
The software that addresses the same scenario as the present invention, and its main disadvantages, are as follows. bamsurgeon requires an alignment file as input and produces an alignment file containing specific variants by modifying that file directly, but its parameter settings are not flexible enough and it is fairly limited.
GemSIM supports only the simulation of single nucleotide site variation, so its functionality is narrow.
dwgsim supports the simulation of single nucleotide site variation, small fragment insertion and deletion (indel), chromosomal inversion variation, and gene fusion variation, but does not support the simulation of gene copy number variation or tandem repeat variation.
SinC supports the simulation of gene copy number variation, and on top of it the simulation of single nucleotide site variation and small fragment insertion and deletion, but does not support the simulation of gene fusion variation, chromosomal inversion variation, or tandem repeat variation.
SeqMaker supports the simulation of single nucleotide site variation, small fragment insertion and deletion, gene fusion variation, copy number variation, and inversion variation, but does not support the simulation of large fragment insertion, complex structural variation (Complex Structural Variant, CSV), or tandem repeat variation.
[0006] According to a literature search, no current software fully supports all known major mutation types or can simulate population polymorphism, and none lets users train the main data features: template length distribution, adjacent-site depth distribution, overall depth distribution, and quality value distribution. A template here is a base sequence fragment, tens to hundreds of base pairs long, obtained by randomly fragmenting the reference genome.
In addition, in the face of massive data requirements, existing software can neither generate samples in batches under a specified target accuracy nor verify the specificity and sensitivity of variation detection software.
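Because a simulated sample is generated from a known list of variants, the sensitivity and specificity of a variation detection tool can be computed by direct comparison with that list. The following is a minimal Python sketch of such a check, not the patent's own procedure. It assumes variants are represented as (chromosome, position, alternate base) tuples and that the number of candidate sites examined is supplied by the caller; the function name `evaluate_calls` and its arguments are hypothetical.

```python
def evaluate_calls(truth, called, n_sites):
    """Compare detected variants against the known simulated variants.

    truth, called: sets of (chrom, pos, alt) tuples (assumed representation).
    n_sites: total number of candidate sites examined; an assumed
             denominator needed to count true negatives.
    """
    tp = len(truth & called)      # simulated variants that were detected
    fp = len(called - truth)      # detections with no simulated variant
    fn = len(truth - called)      # simulated variants that were missed
    tn = n_sites - tp - fp - fn   # sites correctly reported as unchanged
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Illustrative data only: three simulated variants, three detections.
truth = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 40, "G")}
called = {("chr1", 100, "A"), ("chr2", 40, "G"), ("chr2", 90, "C")}
metrics = evaluate_calls(truth, called, n_sites=1000)
print(metrics)
```

Matching on exact tuples is the simplest possible rule; structural variants such as inversions or copy number changes would likely need a tolerance on breakpoint coordinates instead.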



Examples


Embodiment Construction

[0076] Referring to Figure 1, the present invention is a gene sequencing data simulation method for simulating crowd background information. First, the reference genome file and the target capture area file are loaded; the target capture area file records the start and end coordinates, on the reference genome, of each target region the user is interested in. After the files are loaded, the system enters the variation simulation modules and completes the corresponding simulation according to the sequencing depth set by the user, the seven types of variation (single nucleotide site variation, insertion variation, deletion variation, copy number variation, inversion variation, gene fusion variation, and tandem duplication variation), the variation frequencies, and the coordinates of the variations on the reference genome. While this runs, the program's console displays a percentage indicating its progress; the sequencing file simulation is completed...
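The loading step and a single variation module can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the patented implementation: the target capture area file is assumed to be tab-separated with chromosome, start, and end columns, coordinates are assumed to be 0-based, and the names `load_target_regions` and `apply_snv` are hypothetical. The variation frequency is modeled as the probability that a template covering the site carries the alternate base.

```python
import io
import random


def load_target_regions(handle):
    """Parse 'chrom<TAB>start<TAB>end' lines into (chrom, start, end) tuples.

    The three-column tab-separated layout is an assumption; the source only
    says the file records start and end coordinates of each target region.
    """
    regions = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end = line.split("\t")[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def apply_snv(template_seq, template_start, snv_pos, alt_base, frequency, rng):
    """Return the template, carrying alt_base at snv_pos with probability `frequency`.

    template_start and snv_pos are reference coordinates (assumed 0-based).
    Templates that do not cover snv_pos are returned unchanged.
    """
    offset = snv_pos - template_start
    if 0 <= offset < len(template_seq) and rng.random() < frequency:
        return template_seq[:offset] + alt_base + template_seq[offset + 1:]
    return template_seq


# Illustrative input only.
regions = load_target_regions(io.StringIO("chr1\t100\t200\nchr1\t500\t650\n"))
rng = random.Random(0)
mutated = apply_snv("ACGTACGT", 100, 103, "C", 1.0, rng)
print(regions, mutated)
```

Drawing the alternate base independently per template means the observed allele fraction fluctuates around the requested frequency, which is the behavior expected of sequenced reads at finite depth.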


Abstract

The invention discloses a gene sequencing data simulation system and method for simulating crowd background information. The inputs are a target capture area file, the sequencing depth, seven types of mutation, the frequencies at which the mutations occur, and the coordinates of the mutations on a reference genome. The number of templates is determined from the sequencing depth, and the probability distribution of the corresponding template lengths is generated with an acceptance-rejection algorithm. The templates are then traversed one by one while the number already traversed is tracked. As long as templates remain untraversed, copy number mutation simulation, single nucleotide site mutation simulation, gene fusion simulation, tandem repeat simulation, inversion mutation simulation, insertion fragment simulation, and deletion fragment simulation are performed on each extracted template, and reads are generated and written to a sequencing file. When all templates have been traversed, generation of the sequencing file is complete; read alignment is then performed, the simulated sequencing file and its alignment file are output, and the simulation is finished. The system and method make it easy and fast to obtain a sample containing specific mutations.
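The first two steps of the abstract, fixing the number of templates from the sequencing depth and drawing template lengths by acceptance-rejection, can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the formula relating depth to template count, the uniform proposal, the bell-shaped target density with mean 300 and standard deviation 50, and the names `template_count`, `sample_template_lengths`, and `bell`. The patent excerpt does not state its own formula or target distribution.

```python
import math
import random


def template_count(depth, region_length, read_length, paired=True):
    """Templates needed so total read bases are about depth * region_length.

    Assumed relation: each template yields one read, or two if paired-end.
    """
    bases_per_template = read_length * (2 if paired else 1)
    return math.ceil(depth * region_length / bases_per_template)


def sample_template_lengths(n, density, lo, hi, density_max, rng):
    """Draw n integer lengths in [lo, hi] by acceptance-rejection.

    A candidate is proposed uniformly and accepted with probability
    density(candidate) / density_max, so density must satisfy
    0 <= density(x) <= density_max on [lo, hi]; it need not be normalized.
    """
    lengths = []
    while len(lengths) < n:
        candidate = rng.randint(lo, hi)
        if rng.random() * density_max < density(candidate):
            lengths.append(candidate)
    return lengths


def bell(x, mean=300.0, sd=50.0):
    """Unnormalized bell-shaped density; its maximum value is 1.0 at the mean."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2)


rng = random.Random(42)
n_templates = template_count(depth=30, region_length=10_000, read_length=100)
lengths = sample_template_lengths(n_templates, bell, 100, 500, 1.0, rng)
print(n_templates, sum(lengths) / len(lengths))
```

Acceptance-rejection only needs the target density up to a constant, which is convenient when the length distribution is an empirical histogram learned from real data rather than a closed-form curve.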

Description

Technical field
[0001] The invention belongs to the technical field of data science applied to precision medicine, and specifically relates to a gene sequencing data simulation system and method for simulating crowd background information.
Background
[0002] Precision diagnosis and treatment is the mainstream direction of modern medicine, and its basis is the analysis of genetic big data. In recent years, as genetic big data programs have been launched in countries around the world, data has accumulated rapidly and a variety of data analysis software has emerged in response. Mutation detection is the foundation of this analysis, and there are dozens of mainstream mutation detection tools, such as Samtools, GATK, Pindel, and Delly. In clinical applications, however, the accuracy of these variant detection tools still needs to be improved. In the face of a varie...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B5/00, G16B30/00, G16B20/20
CPC: G16B5/00, G16B30/00, G16B20/20
Inventor: 王申杰, 王嘉寅, 张选平, 韩博, 刘涛, 管彦芳, 王妙, 王旭文
Owner: XI AN JIAOTONG UNIV