A genome-wide association analysis method based on multiple genome comparisons and next-generation sequencing data

A genome-wide association analysis method based on whole-genome next-generation sequencing, applicable to genomics, sequence analysis, proteomics, and related fields. It addresses the problems that the organizational form of the pan-genome cannot be unified and that the pan-genome is not easy to understand, making the result convenient to understand and to apply downstream, and enabling accurate analysis and identification of genome structural variation.

Active Publication Date: 2022-03-15
RICE RES INST GUANGDONG ACADEMY OF AGRI SCI

AI Technical Summary

Problems solved by technology

However, current strategies for constructing a pan-genome have many deficiencies: 1. The organizational form of the pan-genome cannot be unified. The variety of genome variations indicates that the pan-genome may not be linear. Some researchers have proposed a genome representation based on graph theory, but a pan-genome in multidimensional graph form is not easy to understand, and classic alignment software such as BWA, which aligns existing second-generation sequencing data to a reference genome, is only suitable for linear reference genomes; 2. In terms of whole-genome sequencing data, compared with third-generation sequencing technology, Hi-C technology, etc., next-generation sequencing technology has advantages in price and accuracy and can provide effective sequencing data for mining genome structural variation across a population; 3. Currently, there are various methods for detecting structural variation using next-generation sequencing data and a linear reference genome, including methods based on sequencing coverage depth, paired-end reads, split reads, and assembly. Each method has its own characteristics, with both advantages and disadvantages.



Examples


Embodiment 1

[0052] Example 1: Rice Reference Genome and Gene Annotation Update

[0053] Using 33 assembled and annotated rice genomes, with rice Nipponbare MSU as the initial reference genome, the following genomes were aligned to the reference in sequence: DHX2, 02428, Kosh, ZH11, KY131, Lemont, NamRoo, LJ, G46, CN1, FS32, DG, D62, II32, R527, S548, 9311, Y58S, J4155, G8, Y3551, IR64, R498, TM, Tumba, G630, YX1, WSSM, FH838, N22, Basmati1, CG14.

[0054] As shown in Figure 2, in the first round of alignment, the reference genome (MSU) was aligned with the first de novo assembled genome (DHX2) using MUMmer software to obtain the collinearity features between the two genomes and generate a delta file;

[0055] Use Assemblytics software to extract the collinearity features, then use Python to organize the collinearity feature files and screen by insertion size (>50 bp), obtaining the structural variation position data file (Assemblytics_structural_variants.bed) based on the reference genome posit...
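
The size screening described in [0055] might look roughly like the following Python sketch. It is an illustration only, not the patent's script: the function name `filter_insertions`, the tab-separated column layout (size in column 5, variant type in column 7), and the type label `Insertion` are assumptions about the Assemblytics BED output and should be checked against the actual file.

```python
MIN_SIZE_BP = 50  # threshold from the text: keep insertions larger than 50 bp


def filter_insertions(bed_text, min_size=MIN_SIZE_BP):
    """Return (chrom, start, end, size) for insertions larger than min_size.

    Assumed layout (hypothetical, verify against the real file):
    col 1 = reference sequence, col 2 = start, col 3 = end,
    col 5 = variant size, col 7 = variant type.
    """
    kept = []
    for line in bed_text.splitlines():
        # Skip blank lines and the header line
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 7:
            continue  # malformed row, ignore
        chrom, start, end = cols[0], int(cols[1]), int(cols[2])
        size, sv_type = int(cols[4]), cols[6]
        if sv_type == "Insertion" and size > min_size:
            kept.append((chrom, start, end, size))
    return kept
```

In practice the text would be read from the Assemblytics_structural_variants.bed file, and the retained records would then feed the reference update step.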

Embodiment 3

[0074] Example 3: Sequence Structure Variation Mining Based on Next Generation Sequencing Data and Updated Reference Genome

[0075] As shown in Figure 4, use BWA software to map the second-generation sequencing data (fq) to the updated reference genome, and pipe the output through SAMtools to convert it into a bam file. Use Picard software to sort the bam file and remove duplicates, producing the sorted_add_dedup.bam file. Then use SAMtools to remove sequencing reads whose mapping quality is below 20, that cannot be mapped to the reference genome, or that map to multiple locations, producing the filtered mapQ20.bam file.

[0076] Use the self-written script 2Map_fq_to_Pan.py to map all the sequencing data in the fqd_dir directory to the reference genome and generate the bam_dir folder, which contains the mapQ20.bam files of all samples. The sequencing data fq.gz files are placed in the same directory, fqd_dir, and the format of the paired-...

Embodiment 4

[0083] Example 4: Genome-wide association analysis

[0084] Genome-wide association analysis was performed using GAPIT software.

[0085] Using the initial reference genome and the sequencing data, population SNP genotypes were obtained with BWA and GATK software, and genome-wide association analysis was performed on them using GAPIT software.

[0086] As shown in Figure 6, comparing the genome-wide association results obtained with the structural variation genotypes and with the SNP genotypes shows that almost all association sites found with the SNP genotypes can also be found with the structural variation genotypes, and that the structural variation genotypes additionally reveal new association sites. Because the updated gene annotation also brings in new genes, it compensates for the limited number of genes annotated in the initial reference genome.
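
One way to make the comparison in [0086] concrete is to classify association sites from the two analyses by proximity. The sketch below is an assumption-laden illustration: the function name `compare_sites`, the representation of a site as a (chromosome, position) pair, and the 50 kb matching window are all hypothetical and are not taken from the patent. It also assumes both site lists are expressed in the same coordinate system.

```python
def compare_sites(snp_sites, sv_sites, window=50_000):
    """Split association sites into three groups.

    Returns (snp_recovered, snp_only, sv_new):
      snp_recovered: SNP sites with an SV site within `window` bp
      snp_only:      SNP sites with no nearby SV site
      sv_new:        SV sites with no nearby SNP site
    """
    def has_neighbor(site, others):
        chrom, pos = site
        return any(c == chrom and abs(p - pos) <= window for c, p in others)

    snp_recovered = [s for s in snp_sites if has_neighbor(s, sv_sites)]
    snp_only = [s for s in snp_sites if not has_neighbor(s, sv_sites)]
    sv_new = [s for s in sv_sites if not has_neighbor(s, snp_sites)]
    return snp_recovered, snp_only, sv_new
```

A result with an almost empty `snp_only` list and a non-empty `sv_new` list would correspond to the pattern the text describes.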


Abstract

The invention discloses a genome-wide association analysis method based on multiple genome comparisons and second-generation sequencing data.
Step 1: use alignment software to compare a reference genome with a de novo assembled genome file;
Step 2: insert the structural variation sequences into the corresponding positions of the reference genome, and update the positions or structures of the annotated genes in the reference genome;
Step 3: if there are multiple de novo assembled genomes, update the reference genome iteratively, one genome at a time;
Step 4: use the alignment software to align the next-generation sequencing data of the samples to the updated reference genome;
Step 5: collect the breakpoint positions of all structural variations of all samples into one set, and construct the population genotype;
Step 6: perform functional analysis according to the associated sites and the updated reference genome annotation file.
The invention thus realizes genome-wide association analysis using structural variation genotype data.
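
Step 5 can be sketched as a union of breakpoints followed by a presence/absence matrix. This is a minimal illustration under stated assumptions, not the patented procedure: the function name `build_population_genotypes`, the representation of a breakpoint as a (chromosome, position) pair, and the 0/1 genotype coding are hypothetical.

```python
def build_population_genotypes(sample_breakpoints):
    """Build a site list and a per-sample 0/1 genotype vector.

    sample_breakpoints: dict mapping sample name -> iterable of
    (chrom, pos) breakpoints detected in that sample.
    Returns (sites, genotypes), where sites is the sorted union of all
    breakpoints and genotypes[sample][i] is 1 if the sample carries
    sites[i], else 0.
    """
    # Union of breakpoints across all samples, in a stable sorted order
    sites = sorted({bp for bps in sample_breakpoints.values() for bp in bps})
    genotypes = {}
    for sample, bps in sample_breakpoints.items():
        carried = set(bps)
        genotypes[sample] = [1 if site in carried else 0 for site in sites]
    return sites, genotypes
```

The resulting matrix has one column per breakpoint and one row per sample, which is the general shape association software expects for genotype input.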

Description

Technical field

[0001] The invention relates to reference genome and annotation updating based on multiple genome comparisons, mining of sequence structural variation based on next-generation sequencing data, population genotype construction based on multi-sample genome structural variation, and genome-wide association analysis based on structural variation genotypes. In particular, it relates to a genome-wide association analysis method based on multiple genome comparisons and next-generation sequencing data, and belongs to the field of bioinformatics.

Background technique

[0002] Genome-wide association analysis detects genetic variation polymorphisms of multiple individuals across the whole genome to obtain genotypes, then performs population-level statistical analysis of the genotypes against observable phenotypes to mine genes related to trait variation. Genome-wide genetic variation takes various forms, such as base substitut...


Application Information

Patent Type & Authority Patents(China)
IPC(8): G16B30/10, G16B20/20, G16B50/10
CPC: G16B30/10, G16B20/20, G16B50/10
Inventor: 王健, 赵均良, 杨武, 李方平, 刘斌, 董景芳
Owner RICE RES INST GUANGDONG ACADEMY OF AGRI SCI