MHC completion database, and establishment method and application thereof

A database and its construction method, applied in the field of MHC completion databases. It addresses problems such as false-positive bias in alignment results, high polymorphism and strong linkage disequilibrium in the MHC region, and easily missed pathogenic sites. Its effects are reduced CPU and memory usage, improved data accuracy, and more accurate haplotype information.

Active Publication Date: 2016-04-20
BGI GENOMICS CO LTD +1

AI Technical Summary

Problems solved by technology

However, because the MHC region is highly polymorphic and in strong linkage disequilibrium, many true pathogenic loci have not yet been well identified.
[0003] Most current disease research relies on genome-wide association studies (GWAS) based on genotyping chips, without full-coverage sequencing of the MHC region. Key pathogenic loci are therefore easily missed, which makes it necessary to complete the sites in these regions.
In addition, the highly repetitive fragments of the MHC region can easily introduce false-positive bias into alignment results, which affects the accuracy of an MHC database.



Examples


Embodiment 1

[0044] Example 1: Raw data preparation and comparison of variant-detection software

[0045] In this example, genomic DNA samples from 8,906 Chinese individuals were selected from the BGI project. An MHC capture chip was used to capture the human MHC region, and the captured MHC sequences were then sequenced.

[0046] The raw data produced by sequencing in this example are stored in fastq format files (fq for short). These files hold the read sequences (reads) together with the sequencing quality of each read and other information. After the raw fq data are obtained, basic processing is performed, such as removing adapters and removing low-quality reads. This example uses the basic processing method commonly applied to second-generation sequencing data. Basic processing yields clean sequences (clean reads), which constitute the sequencing results. It should be noted that ...
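The low-quality read removal described in [0046] could be sketched as follows. This is a minimal illustration, not the pipeline used in the patent: it assumes 4-line FASTQ records, Phred+33 quality encoding, and a hypothetical mean-quality threshold of 20; the function names are also hypothetical, and adapter removal is not shown.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding is assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def clean_reads(records, min_mean_quality=20.0):
    """Keep reads whose mean quality reaches the (hypothetical) threshold."""
    return [rec for rec in records if rec[2] and mean_phred(rec[2]) >= min_mean_quality]


# Toy input: read1 has high quality ('I' = Q40), read2 has very low quality ('!' = Q0).
example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
kept = clean_reads(parse_fastq(io.StringIO(example)))
print([name for name, _, _ in kept])  # ['read1']
```

In practice a dedicated trimming tool would handle adapters and quality together; the sketch only shows what "removing low-quality reads" means at the record level.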

Embodiment 2

[0053] Example 2: Data filtering and the genotype data set

[0054] In this example, the genotype data set of the MHC completion database is constructed from the variant-detection results of Example 1. The genotype data set contains the accurate genotype sites of all samples, including information on the single nucleotide polymorphism sites (SNPs) and the insertion/deletion polymorphism sites (INDELs) of the compared population.

[0055] In Example 1 we obtained the genotype result of each sample, that is, the variant-detection result. We use a merge program to extract the genotype results of all samples and combine them into a single file, which gives the raw genotype data set of all samples.

[0056] The raw genotype data set is then filtered according to the following three conditions:

[0057] a. Sites with sequencing depth ≥ 6 in the population;

[0058] b. Sites where the missing rate of data in the population is <0.05;

[0059] c. A site where the allelic base type occurs m...
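Conditions a and b above can be sketched as a per-site filter. This is only an illustration under stated assumptions: "sequencing depth in the population" is taken to mean the mean depth across samples, a missing genotype is represented as `None`, and the `SiteCalls` type and function names are hypothetical. Condition c is truncated in the source and is deliberately not implemented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SiteCalls:
    position: int
    depths: List[int]                # per-sample sequencing depth at this site
    genotypes: List[Optional[str]]   # per-sample genotype, None when missing


def mean_depth(site: SiteCalls) -> float:
    """Mean depth across samples (assumed reading of 'depth in the population')."""
    return sum(site.depths) / len(site.depths)


def missing_rate(site: SiteCalls) -> float:
    """Fraction of samples with no genotype call at this site."""
    return sum(g is None for g in site.genotypes) / len(site.genotypes)


def passes_filters(site: SiteCalls, min_depth: float = 6.0, max_missing: float = 0.05) -> bool:
    """Condition a (depth >= 6) and condition b (missing rate < 0.05)."""
    return mean_depth(site) >= min_depth and missing_rate(site) < max_missing


# Toy data with made-up positions and calls.
sites = [
    SiteCalls(101, [8, 7, 9, 6], ["A/A", "A/G", "G/G", "A/A"]),      # passes both
    SiteCalls(102, [3, 4, 5, 2], ["C/C", "C/T", "C/C", "T/T"]),      # depth too low
    SiteCalls(103, [10, 10, 10, 10], ["A/A", None, "A/A", "A/T"]),   # too many missing
]
kept_positions = [s.position for s in sites if passes_filters(s)]
print(kept_positions)  # [101]
```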

Embodiment 3

[0067] Example 3: MHC completion database

[0068] 1. Genotype data set

[0069] In Example 2 we obtained the genotype data set, but it is stored in genotype format. We use the GTOOL software to convert the genotype file into the ped and map formats that PLINK can read. The parameters are as follows: gtool -G --g sample.gen --s sample.sampleinfo --ped genotype.ped --map genotype.map --snp

[0070] 2. HLA typing data set and the data set of amino acid changes corresponding to each type

[0071] Based on the high-depth reads of each sample, we use the SOAPHLA typing software developed by BGI to perform HLA typing on each sample. The typing result of each sample is stored in ped and map formats, which gives the HLA typing data set. For the typing results, we look up the SNPs corresponding to each type in the IMGT database and compare them with the SNPs at the same positions in the human reference sequence hg18. If the two are different, we ...
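The comparison step in [0071] can be illustrated with a small sketch. It covers only the comparison of an HLA type's bases with the reference bases at the same positions; what is done with the differing sites is truncated in the source and is not shown. The function name, the dictionary layout, and the positions and bases are all hypothetical toy values, not data from IMGT or hg18.

```python
from typing import Dict, List, Tuple


def differing_sites(
    allele_bases: Dict[int, str],
    reference_bases: Dict[int, str],
) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, allele_base) where the two differ.

    Positions that are absent from the reference mapping are skipped.
    """
    diffs = []
    for position in sorted(allele_bases):
        ref = reference_bases.get(position)
        if ref is not None and ref != allele_bases[position]:
            diffs.append((position, ref, allele_bases[position]))
    return diffs


# Toy example: bases defined by one HLA type versus the reference sequence.
hla_type_bases = {1000: "A", 1005: "G", 1010: "T"}
reference = {1000: "A", 1005: "C", 1010: "T"}
print(differing_sites(hla_type_bases, reference))  # [(1005, 'C', 'G')]
```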



Abstract

The invention discloses an MHC completion database, together with a method for establishing it and its application. The MHC completion database combines a genotype data set, an HLA typing data set, an amino acid change information data set, and an HLA haplotype data set. In establishing the database, linkage disequilibrium (LD) and Hardy-Weinberg equilibrium (HWE) are used for the first time to filter the variant-detection results, which improves data accuracy. A simple, easy-to-operate method is used to obtain the smallest set of distinguishing SNPs, and MHC haplotype information is obtained through phasing analysis. Compared with phasing on the whole SNP data set, the method saves time, reduces CPU and memory use, and yields more accurate haplotype information. The MHC completion database contains the various data sets of the MHC region, can effectively complete sites, and lays a foundation for in-depth research on the MHC region.

Description

Technical field

[0001] This application relates to the field of gene databases, in particular to an MHC completion database, a method for constructing the database, and an application of the constructed database.

Background

[0002] The major histocompatibility complex (MHC) is a highly polymorphic gene group in vertebrates. The concept arose from early efforts to explain why recipients reject donor tissue cells in organ transplantation. In the course of evolution, the MHC has developed obvious differences both between species and among individuals within populations. The difference between species lies mainly in gene structure, and its genetic basis is point mutation of alleles, that is, nucleotide substitution. MHC polymorphism is driven mainly by pathogen pressure in the environment. Many studies have confirmed that the MHC is closely related to complex diseases, especially autoimmune diseases, a...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F19/28; C12Q1/68
Inventor: 刘小敏, 曹红志, 刘晓, 张涛
Owner: BGI GENOMICS CO LTD