An MHC completion database, its construction method and application

A construction method and database technology applied in the field of MHC completion databases. It addresses problems such as the high polymorphism and strong linkage disequilibrium of the MHC region and the false-positive bias in alignment results that affects the accuracy of an MHC database. Its effects are improved data accuracy, reduced CPU and memory usage, and time savings.

Active Publication Date: 2018-05-01
BGI GENOMICS CO LTD +1

AI Technical Summary

Problems solved by technology

However, due to the high polymorphism and strong linkage disequilibrium of the MHC region sequence, many true pathogenic loci have not yet been well identified.
[0003] Most current disease research is based on genome-wide association studies (GWAS) that use genotyping chips, without full-coverage sequencing of the MHC region, so some key pathogenic loci are easily missed. The sites in these regions therefore need to be completed.
However, the highly repetitive nature of MHC region fragments can easily cause false-positive bias in the alignment results, which affects the accuracy of the MHC database.

Examples

Embodiment 1

[0044] Example 1 Raw data preparation and variation detection software comparison

[0045] In this example, genomic DNA samples from 8,906 Chinese individuals were selected from BGI projects. An MHC capture chip was used to capture the sequence of the human MHC region, and the captured region was then sequenced.

[0046] The raw data obtained from sequencing in this example are stored in fastq format files (fq for short), which hold the read sequences (reads), the sequencing quality of the reads, and other information. After the raw fq data are obtained, basic processing is performed, such as removing adapters and discarding low-quality reads. This example uses the basic processing commonly applied to second-generation sequencing data. The result of the basic processing is a set of clean sequences, or clean reads, which constitute the sequencing result. It should be noted that in th...
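
The low-quality read removal described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the patent: the function names, the mean-quality cutoff of 20, the 10% limit on N bases and the Phred+33 encoding are all assumptions chosen for illustration, and adapter removal is not shown.

```python
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    buf: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line and not buf:
            continue  # skip blank lines between records
        buf.append(line)
        if len(buf) == 4:
            header, seq, plus, qual = buf
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {header!r}")
            yield header, seq, qual
            buf = []


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def clean_reads(
    reads: Iterable[Read],
    min_mean_q: float = 20.0,  # hypothetical cutoff, not from the patent
    max_n_frac: float = 0.1,   # hypothetical cutoff, not from the patent
) -> Iterator[Read]:
    """Keep reads with enough called bases and sufficient mean quality."""
    for header, seq, qual in reads:
        if not seq:
            continue
        if seq.upper().count("N") / len(seq) > max_n_frac:
            continue
        if mean_phred(qual) < min_mean_q:
            continue
        yield header, seq, qual
```

For real files, `parse_fastq` can be fed an open file handle, and the surviving records written back out in the same four-line layout.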

Embodiment 2

[0053] Example 2 Data filtering and genotype data set

[0054] In this example, the genotype data set of the MHC completion database is constructed from the variation detection results of Example 1. The genotype data set contains the accurate genotype sites of all samples, including information on the single nucleotide polymorphism sites (SNPs) and insertion-deletion polymorphism sites (INDELs) of the compared populations.

[0055] Example 1 yields the genotype result of each sample, that is, the variation detection result. A merge program is used to extract the genotype result of each sample and combine the results into a single file, which gives the raw genotype data set of all samples.

[0056] The raw genotype data set is then filtered according to the following three conditions:

[0057] a. Sites with sequencing depth ≥ 6 in the population;

[0058] b. Sites where the missing rate of data in the population is <0.05;

[0059] c. A site where the allelic base type occurs m...
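
Conditions a and b can be sketched in Python as follows. This is an illustrative sketch only: condition c is truncated in the source and is deliberately not implemented, the `SiteCall` structure and function names are hypothetical, and reading "sequencing depth ≥ 6 in the population" as the mean depth across samples is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SiteCall:
    """One sample's call at one site (hypothetical structure)."""
    depth: int
    genotype: Optional[str]  # None marks a missing genotype


def passes_filters(
    calls: List[SiteCall],
    min_depth: float = 6.0,     # condition a
    max_missing: float = 0.05,  # condition b (strictly less than)
) -> bool:
    """Apply conditions a and b to the calls of all samples at one site."""
    if not calls:
        return False
    # Assumption: population depth is taken as the mean depth over samples.
    mean_depth = sum(c.depth for c in calls) / len(calls)
    missing_rate = sum(c.genotype is None for c in calls) / len(calls)
    return mean_depth >= min_depth and missing_rate < max_missing


def filter_sites(sites: Dict[str, List[SiteCall]]) -> Dict[str, List[SiteCall]]:
    """Keep only the sites that satisfy both conditions."""
    return {name: calls for name, calls in sites.items() if passes_filters(calls)}
```

Because condition b is a strict inequality, a site with exactly 5% missing genotypes is dropped by this sketch.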

Embodiment 3

[0067] Example 3 MHC completion database

[0068] 1. Genotype data set

[0069] Example 2 yields the genotype data set, but it is stored in genotype (gen) format. GTOOL is used to convert the genotype file into the ped and map formats that PLINK can read, with the following parameters: gtool -G --g sample.gen --s sample.sampleinfo --ped genotype.ped --map genotype.map --snp
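
For readers unfamiliar with the target formats, the following Python sketch shows the line layout of PLINK text files: a map line holds the chromosome, variant identifier, genetic distance and base-pair position, and a ped line holds six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two alleles per variant. The helper names and all sample values are hypothetical, and the sketch does not reproduce what GTOOL does internally.

```python
from typing import List, Sequence, Tuple


def map_line(chrom: str, snp_id: str, bp: int, cm: float = 0.0) -> str:
    """One .map line: chromosome, variant ID, genetic distance, bp position."""
    return f"{chrom}\t{snp_id}\t{cm:g}\t{bp}"


def ped_line(
    fid: str,
    iid: str,
    genotypes: Sequence[Tuple[str, str]],
    pat: str = "0",   # 0 = father unknown
    mat: str = "0",   # 0 = mother unknown
    sex: int = 0,     # 0 = sex unknown
    pheno: int = -9,  # -9 = phenotype missing
) -> str:
    """One .ped line: six leading columns, then two alleles per variant."""
    alleles: List[str] = []
    for a1, a2 in genotypes:
        alleles.extend([a1, a2])
    return " ".join([fid, iid, pat, mat, str(sex), str(pheno), *alleles])


# Hypothetical sample with one called genotype and one missing genotype ("0 0").
example_ped = ped_line("F1", "S1", [("A", "G"), ("0", "0")])
example_map = map_line("6", "snp1", 1000)
```

The order of variants in the ped line must match the order of lines in the map file.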

[0070] 2. HLA typing data set and the amino acid change information data set corresponding to the typing

[0071] Based on the high-depth reads of each sample, the SOAPHLA typing software developed by BGI is used to perform HLA typing on each sample. The typing result of each sample is obtained and stored in ped and map formats, which gives the HLA typing data set. For the typing results, the SNP corresponding to each type is looked up in the IMGT database and compared with the base at the same position in the human reference sequence hg18. If the two are differe...
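
The comparison between the bases of a typed allele and the reference can be sketched as below. The source text is truncated before it states what is done with differing positions, so this sketch only reports them. The function name and the position-to-base dictionaries are hypothetical stand-ins for the IMGT and hg18 lookups.

```python
from typing import Dict, List, Tuple


def differing_positions(
    allele_bases: Dict[int, str],
    reference_bases: Dict[int, str],
) -> List[Tuple[int, str, str]]:
    """Return (position, reference base, allele base) where the two differ.

    Positions absent from the reference mapping are skipped.
    """
    diffs: List[Tuple[int, str, str]] = []
    for pos in sorted(allele_bases):
        ref = reference_bases.get(pos)
        if ref is None:
            continue
        base = allele_bases[pos]
        if base.upper() != ref.upper():
            diffs.append((pos, ref.upper(), base.upper()))
    return diffs
```

The comparison is case-insensitive so that soft-masked (lowercase) reference bases do not produce spurious differences.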

Abstract

The application discloses an MHC completion database, its construction method and its application. The MHC completion database of this application comprises a combined genotype data set, an HLA typing data set, an amino acid change information data set and an HLA haplotype data set. Filtering the variation results improves data accuracy. A simple, easy-to-operate method is used to obtain a data set with the smallest number of SNPs, from which MHC haplotype information is then obtained by phasing analysis. Compared with phasing the entire SNP data set, the construction method of this application saves time, reduces CPU and memory usage, and yields more accurate haplotype information. The MHC completion database of this application includes the various data sets of the MHC region, can effectively complete sites, and lays a foundation for in-depth research on the MHC region.

Description

Technical field

[0001] This application relates to the field of gene databases, in particular to an MHC completion database, a method for constructing the database, and an application of the constructed database.

Background

[0002] The major histocompatibility complex (MHC) is a highly polymorphic gene group in vertebrates. It was first identified in the course of explaining the rejection of donor tissue cells by recipients in organ transplantation. In the course of evolution, the MHC has developed marked differences both between species and among individuals within populations. The difference between species is mainly a difference in gene structure, and its genetic basis is the point mutation of alleles, that is, nucleotide substitution. MHC polymorphism is caused mainly by pathogen pressure in the environment. Many studies have confirmed that the MHC is closely related to complex diseases, especially autoimmune diseases, and some MHC types, haplotypes or speci...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F19/28, C12Q1/68
Inventor: 刘小敏, 曹红志, 刘晓, 张涛
Owner: BGI GENOMICS CO LTD