
Whole genome association analysis method based on comparison of multiple genomes and next-generation sequencing data

A genome-wide association analysis technique, applied in genomics, sequence analysis, proteomics, and related fields. It addresses the difficulty of understanding the pan-genome and the lack of a unified pan-genome organization, making the result easier to understand and apply downstream and enabling accurate analysis and identification of genome structural variation.

Active Publication Date: 2021-11-09
RICE RES INST GUANGDONG ACADEMY OF AGRI SCI

AI Technical Summary

Problems solved by technology

1. However, current strategies for constructing a pan-genome have many deficiencies, and the organizational form of the pan-genome has not been unified. Because genome variation takes many forms, the pan-genome may not be linear. Some researchers have proposed genomes based on graph theory, but a pan-genome in multidimensional graph form is not easy to understand, and classic alignment software such as BWA, which aligns existing second-generation sequencing data to a reference genome, is only suitable for linear reference genomes.

2. In terms of whole-genome sequencing data, next-generation sequencing has advantages in price and accuracy over third-generation sequencing, Hi-C, and similar technologies, and can provide effective sequencing data for mining genome structural variation across a population.

3. There are currently various methods for detecting structural variation from next-generation sequencing data and a linear reference genome, including methods based on sequencing coverage depth, paired-end reads, split reads, and assembly. Each method has its own characteristics, with both advantages and disadvantages.


Examples

Embodiment 1

[0052] Example 1: Rice Reference Genome and Gene Annotation Update

[0053] Using 33 rice genomes that have been assembled and annotated, with rice Nipponbare MSU as the initial reference genome, the following genomes were aligned in sequence: DHX2, 02428, Kosh, ZH11, KY131, Lemont, NamRoo, LJ, G46, CN1, FS32, DG, D62, II32, R527, S548, 9311, Y58S, J4155, G8, Y3551, IR64, R498, TM, Tumba, G630, YX1, WSSM, FH838, N22, Basmati1, CG14.

[0054] As shown in Figure 2, in the first round of alignment, the reference genome (MSU) was aligned against the first de novo assembled genome (DHX2) using MUMmer to obtain the collinearity features between the two genomes and generate a delta file.

[0055] Assemblytics is used to extract the collinearity features, and Python is used to organize the collinearity feature files and screen by insert size (>50 bp), yielding a structural variation position file (Assemblytics_structural_variants.bed) based on the reference genome posit...
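The size screen described in [0055] can be sketched in a few lines of Python. This is a minimal illustration, not the patent's own script: the function name `filter_sv_bed`, the tab-separated layout, and the position of the size field (fifth column, index 4) are assumptions that should be checked against the actual Assemblytics output, and the example rows are synthetic.

```python
import csv
import io


def filter_sv_bed(lines, min_size=50, size_col=4):
    """Yield BED rows whose size field is strictly greater than min_size.

    size_col is the 0-based index of the size column. The default of 4 is
    an assumption about the file layout, not something stated in the text.
    """
    reader = csv.reader(lines, delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        if int(row[size_col]) > min_size:
            yield row


# Synthetic example data (not taken from the patent).
example = io.StringIO(
    "#reference\tref_start\tref_stop\tID\tsize\tstrand\ttype\n"
    "Chr1\t1000\t1001\tsv_1\t320\t+\tInsertion\n"
    "Chr1\t5000\t5030\tsv_2\t30\t+\tDeletion\n"
    "Chr2\t200\t900\tsv_3\t700\t+\tDeletion\n"
)
kept = list(filter_sv_bed(example))
kept_ids = [row[3] for row in kept]
```

Passing the column index as a parameter keeps the sketch usable if the real file orders its fields differently.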

Embodiment 3

[0074] Example 3: Sequence Structure Variation Mining Based on Next Generation Sequencing Data and Updated Reference Genome

[0075] As shown in Figure 4, BWA is used to align the second-generation sequencing data (fq) to the updated reference genome, and the output is piped to SAMtools and converted into a bam file. Picard is used to sort the bam file and remove duplicates, giving the sorted_add_dedup.bam file. SAMtools is then used to remove sequencing reads whose mapping quality is below 20, that cannot be aligned to the reference genome, or that map to multiple locations, giving the filtered mapQ20.bam file.
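The read-filtering rule in [0075] can be illustrated on plain SAM text records. The sketch below is not the SAMtools invocation used in the patent (the source does not give the options); it only shows the logic, using the FLAG bits and MAPQ column defined by the SAM format. Treating secondary and supplementary alignments as the "maps to multiple locations" case is an assumption, and the function names and example records are hypothetical.

```python
# FLAG bits defined by the SAM format.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def keep_read(sam_line, min_mapq=20):
    """Return True if an alignment record passes the filter."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    if flag & UNMAPPED:
        return False        # read could not be placed on the reference
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False        # assumption: extra placements count as multi-mapping
    return mapq >= min_mapq  # drop reads with mapping quality below 20


def filter_sam(lines, min_mapq=20):
    """Yield header lines unchanged and alignment records that pass keep_read."""
    for line in lines:
        if line.startswith("@") or keep_read(line, min_mapq):
            yield line


# Synthetic records (not taken from the patent).
tail = "\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"
records = [
    "@HD\tVN:1.6",
    "r1\t0\tChr1\t100\t60\t10M" + tail,
    "r2\t4\t*\t0\t0\t*" + tail,
    "r3\t0\tChr1\t200\t0\t10M" + tail,
    "r4\t256\tChr2\t300\t30\t10M" + tail,
    "r5\t16\tChr2\t400\t20\t10M" + tail,
]
kept_names = [
    line.split("\t")[0]
    for line in filter_sam(records)
    if not line.startswith("@")
]
```

With BWA, reads that align equally well to several locations usually receive a mapping quality of 0, so the quality threshold alone already removes most of them; the flag check is kept as a separate, explicit condition.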

[0076] The in-house script 2Map_fq_to_Pan.py aligns all the sequencing data in the fqd_dir directory to the reference genome and generates the bam_dir folder, which contains the mapQ20.bam files of all samples. The sequencing data fq.gz files are placed in the same directory, fqd_dir, and the format of the paired-...

Embodiment 4

[0083] Example 4: Genome-wide association analysis

[0084] Genome-wide association analysis was performed using GAPIT.

[0085] Using the initial reference genome and the sequencing data, population SNP genotypes were obtained with BWA and GATK, and genome-wide association analysis was performed on them with GAPIT.

[0086] As shown in Figure 6, comparing the genome-wide association results for structural variant genotypes and SNP genotypes shows that almost all the association sites found with SNP genotypes can also be found with structural variant genotypes, and that structural variant genotypes reveal new association sites as well. Because the updated gene annotation also brings in new genes, it compensates for the limited number of genes annotated in the initial reference genome.
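One way to make the comparison in [0086] concrete is to check, for each SNP-derived association site, whether a structural-variant-derived site lies nearby on the same chromosome. The sketch below is only an illustration: the source does not say how sites were matched, so the 100 kb window, the function names, and the example coordinates are all assumptions.

```python
def shared_sites(snp_sites, sv_sites, window=100_000):
    """Return SNP-derived sites with an SV-derived site within `window` bp.

    Sites are (chromosome, position) tuples. The window size is an
    assumption for illustration, not a value from the source.
    """
    shared = []
    for chrom, pos in snp_sites:
        if any(c == chrom and abs(p - pos) <= window for c, p in sv_sites):
            shared.append((chrom, pos))
    return shared


def novel_sites(snp_sites, sv_sites, window=100_000):
    """Return SV-derived sites with no SNP-derived site within `window` bp."""
    return [
        (chrom, pos)
        for chrom, pos in sv_sites
        if not any(c == chrom and abs(p - pos) <= window for c, p in snp_sites)
    ]


# Synthetic coordinates (not results from the patent).
snp = [("Chr1", 1_000_000), ("Chr3", 5_000_000)]
sv = [("Chr1", 1_050_000), ("Chr6", 200_000)]
recovered = shared_sites(snp, sv)
new_only = novel_sites(snp, sv)
```

Here `recovered` lists the SNP sites that the structural variant analysis also finds, and `new_only` lists the sites that appear only in the structural variant analysis.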

Abstract

The invention discloses a whole genome association analysis method based on comparison of multiple genomes and next-generation sequencing data, which comprises the following steps: 1, aligning a reference genome with a de novo assembled genome file using alignment software; 2, updating the positions or structures of the annotated genes in the reference genome; 3, if multiple de novo assembled genomes exist, iteratively updating the reference genome with each in turn; 4, aligning the next-generation sequencing data of the samples to the updated reference genome using alignment software; 5, collecting the breakpoint position information of all structural variations of all samples into a set and constructing population genotypes; and 6, identifying candidate functional genes from the association sites and the updated reference genome annotation file. According to the method, whole genome association analysis is performed using structural variation genotype data.
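Step 5 of the abstract, pooling the structural variation breakpoints of all samples into one set and deriving population genotypes from it, can be sketched as follows. This is a minimal illustration under stated assumptions: the 0/1 presence/absence coding, exact-position matching of breakpoints, the function name, and the sample data are all hypothetical and are not specified in the source.

```python
def build_population_genotypes(sample_breakpoints):
    """Build a presence/absence genotype table from per-sample breakpoints.

    sample_breakpoints maps a sample name to an iterable of
    (chromosome, position) breakpoints. Returns (sites, genotypes), where
    sites is the sorted union of all breakpoints and genotypes maps each
    sample to a list of 0/1 calls in the same order as sites.
    """
    per_sample = {name: set(bps) for name, bps in sample_breakpoints.items()}
    # The union of every sample's breakpoints defines the population site set.
    sites = sorted(set().union(*per_sample.values()))
    genotypes = {
        name: [1 if site in bps else 0 for site in sites]
        for name, bps in per_sample.items()
    }
    return sites, genotypes


# Synthetic breakpoints (not taken from the patent).
samples = {
    "S1": [("Chr1", 1000), ("Chr2", 500)],
    "S2": [("Chr1", 1000)],
    "S3": [("Chr1", 2000)],
}
sites, genotypes = build_population_genotypes(samples)
```

Each row of `genotypes` can then be used as a marker vector for association analysis, with one column per breakpoint in `sites`.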

Description

Technical field

[0001] The invention relates to updating a reference genome and its annotation based on multiple genome comparisons, mining sequence structural variation based on next-generation sequencing data, constructing population genotypes based on multi-sample genome structural variation, and a genome-wide association analysis method based on structural variation genotypes. In particular, it relates to a genome-wide association analysis method based on multiple genome comparisons and next-generation sequencing data, and belongs to the field of bioinformatics.

Background

[0002] Genome-wide association analysis detects the genetic polymorphisms of multiple individuals across the whole genome to obtain genotypes, then statistically analyzes the genotypes against observable phenotypes at the population level to mine genes related to trait variation. There are various forms of genome-wide genetic variation, such as base substitut...

Application Information

IPC(8): G16B30/10, G16B20/20, G16B50/10
CPC: G16B30/10, G16B20/20, G16B50/10
Inventor: 王健, 赵均良, 杨武, 李方平, 刘斌, 董景芳
Owner: RICE RES INST GUANGDONG ACADEMY OF AGRI SCI