Methods of predicting pathogenicity of genetic sequence variants

A gene sequence and pathogenicity technology, applied in the field of predicting the pathogenicity of gene sequences, can solve problems such as time-consuming

Inactive Publication Date: 2018-02-16
COUNSYL INC
Cites: 2; Cited by: 14

AI Technical Summary

Problems solved by technology

Traditionally, gene sequence variant interpretation has been dominated by many manual, time-consuming processes due to disparate forms of relevant information in clinical databases and literature.
[0005] However, the high resolution of sequencing data poses challenges for the interpretation of genetic sequence variants.


Examples

[0120] Example 1: Training data, training a machine learning model, and testing the trained machine learning model

[0121] Figure 5A illustrates an exemplary embodiment of the present invention. At an electronic device having at least one processor and memory, a machine learning model is trained based on the training data. The training data includes a labeled benign gene sequence variant dataset and an unlabeled gene sequence variant dataset. As shown in Figure 5A, the labeled benign dataset was obtained from the 1000 Genomes Project by filtering the database for gene sequence variants with a derived allele frequency (DAF) greater than 95%, which were assumed to be benign because of their high frequency. The labeled benign dataset has 881,924 gene sequence variants. The unlabeled gene sequence variant dataset was simulated using CADD's variant simulation software, which mutates loci according to local mutation rates in sliding 1.1 Mb windows. Mutation rates we...
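The DAF filter described in this example can be sketched in a few lines of Python. This is only an illustration: the `Variant` record, the `select_presumed_benign` helper, and the sample values are hypothetical and do not come from the patent or from the 1000 Genomes Project tooling. Only the 95% threshold is taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record (not the patent's data model)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    daf: float  # derived allele frequency, as a fraction in [0, 1]


# Threshold from the text: variants with DAF greater than 95% are presumed benign.
DAF_THRESHOLD = 0.95


def select_presumed_benign(variants, threshold=DAF_THRESHOLD):
    """Keep variants whose DAF is strictly greater than the threshold."""
    return [v for v in variants if v.daf > threshold]


# Made-up sample records, for illustration only.
variants = [
    Variant("1", 12345, "A", "G", 0.99),
    Variant("1", 22345, "C", "T", 0.40),
    Variant("2", 555, "G", "A", 0.95),   # exactly 95%: not "greater than", so excluded
    Variant("X", 777, "T", "C", 0.97),
]

benign = select_presumed_benign(variants)
print([(v.chrom, v.pos) for v in benign])
```

The comparison is strict because the text says "greater than 95%"; whether the original pipeline treated the boundary this way is not stated.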

[0123] Example 2: Semi-supervised mutation clustering machine learning model compared with previous methods

[0124] The methods described herein perform better than previously known methods in predicting the pathogenicity of sequence variants. One embodiment of the method described herein (labeled in Figure 6A, Figure 6B, Figure 7A, Figure 7B, Figure 8 and Figure 10, and described herein, as "SSCM-Pathogenic") was compared with known methods for generating gene sequence variant pathogenicity scores, including CADD (described in Kircher et al., Nature Genetics, 46(3):310-5 (2014)) and other known methods.

[0125] As a proof of concept of one embodiment of the method described herein, a genetic sequence variant test dataset is classified into pathogenic clusters and benign clusters. The genetic sequence variant test dataset includes the known pathogenic gene sequence variation test dataset and the known benign gene sequence variation te...

[0139] Example 3: Comparison of semi-supervised mutation clustering machine learning model with supervised machine learning model

[0140] An exemplary embodiment of the method disclosed herein (e.g., SSCM-Pathogenic) is compared to a supervised machine learning model. The supervised machine learning model uses the same features as the exemplary model, but it is trained using a labeled benign gene sequence variant training dataset (obtained from 1000G; n = 20,133) and a labeled pathogenic gene sequence variant training dataset (obtained from HGMD; n = 63,363). In contrast, the exemplary machine learning model (SSCM-Pathogenic) was trained using a labeled training dataset of benign sequence variants and an unlabeled dataset containing a mixture of benign and pathogenic sequence variants.

[0141] To test supervised machine learning models and an exemplary model (SSCM-Pathogenic), these models were tested using the ...

Abstract

Recent developments in cost-effective DNA sequencing allow for individualized genomic screening of a subject for genetic sequence variants. Training a pathogenicity prediction model using semi-supervised training methods produces a better model for predicting the pathogenicity of a test genetic sequence variant. Provided herein are methods for predicting the pathogenicity of a test genetic sequence variant by utilizing a training data set comprising labeled benign genetic sequence variants and unlabeled genetic sequence variants, the unlabeled genetic sequence variants comprising a mixture of benign genetic sequence variants and pathogenic genetic sequence variants. The genetic sequence variants are annotated with one or more features and a machine learning model is trained in a semi-supervised process based on the training data. The test genetic sequence variant is then annotated using the one or more features and the probability that the test genetic sequence variant is pathogenic is predicted based on the trained machine learning model.
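The training setup described in the abstract (labeled benign variants plus an unlabeled mixture) can be illustrated with a toy semi-supervised mixture model. The sketch below is not the patent's model. It assumes a single synthetic numeric feature per variant, fits a two-component Gaussian mixture by expectation-maximization, and pins the labeled benign variants to the benign component. All function names, the synthetic data, and the choice of a Gaussian mixture are assumptions made for illustration.

```python
import math
import random
from statistics import fmean, pvariance


def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_semi_supervised_mixture(labeled_benign, unlabeled, n_iter=50):
    """Fit benign/pathogenic Gaussians; labeled points always count as benign."""
    mu_b, var_b = fmean(labeled_benign), pvariance(labeled_benign)
    # Crude initialization: start the pathogenic component above the unlabeled mean.
    mu_p = fmean(unlabeled) + math.sqrt(pvariance(unlabeled))
    var_p, pi_p = var_b, 0.5  # pi_p: pathogenic fraction of the unlabeled pool
    for _ in range(n_iter):
        # E-step: responsibility of the pathogenic component for each unlabeled point.
        resp = []
        for x in unlabeled:
            p = pi_p * normal_pdf(x, mu_p, var_p)
            b = (1.0 - pi_p) * normal_pdf(x, mu_b, var_b)
            resp.append(p / (p + b))
        # M-step: weighted updates; labeled benign points carry weight 1 for benign.
        w_p = sum(resp)
        w_b = len(labeled_benign) + len(unlabeled) - w_p
        pi_p = w_p / len(unlabeled)
        mu_p = sum(r * x for r, x in zip(resp, unlabeled)) / w_p
        mu_b = (sum(labeled_benign)
                + sum((1.0 - r) * x for r, x in zip(resp, unlabeled))) / w_b
        var_p = max(sum(r * (x - mu_p) ** 2 for r, x in zip(resp, unlabeled)) / w_p, 1e-6)
        var_b = max((sum((x - mu_b) ** 2 for x in labeled_benign)
                     + sum((1.0 - r) * (x - mu_b) ** 2
                           for r, x in zip(resp, unlabeled))) / w_b, 1e-6)
    return {"mu_b": mu_b, "var_b": var_b, "mu_p": mu_p, "var_p": var_p, "pi_p": pi_p}


def predict_pathogenic_probability(model, x):
    """Posterior probability of the pathogenic component for feature value x.

    Uses pi_p as the prior, i.e. assumes test variants resemble the unlabeled pool.
    """
    p = model["pi_p"] * normal_pdf(x, model["mu_p"], model["var_p"])
    b = (1.0 - model["pi_p"]) * normal_pdf(x, model["mu_b"], model["var_b"])
    return p / (p + b)


# Synthetic one-feature data (an assumption): benign ~ N(0, 1), pathogenic ~ N(4, 1).
rng = random.Random(0)
labeled_benign = [rng.gauss(0.0, 1.0) for _ in range(200)]
unlabeled = ([rng.gauss(0.0, 1.0) for _ in range(200)]
             + [rng.gauss(4.0, 1.0) for _ in range(200)])

model = fit_semi_supervised_mixture(labeled_benign, unlabeled)
print(round(model["pi_p"], 2), predict_pathogenic_probability(model, 4.0))
```

A real pipeline would use many annotated features per variant, as the abstract indicates. The single feature here only makes the role of the labeled benign set visible: it anchors one component so that the other can absorb the pathogenic part of the unlabeled mixture.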

Description

[0001] CROSS-REFERENCE TO RELATED APPLICATIONS

[0002] This application claims the priority benefit of U.S. Provisional Application No. 62/183,132, filed June 22, 2015; U.S. Provisional Application No. 62/221,487, filed September 21, 2015; and U.S. Provisional Application No. 62/236,797, filed October 2, 2015. The entire content of each of these applications is hereby incorporated by reference.

Technical field

[0003] The following disclosure relates generally to predicting the pathogenicity of genetic sequences, and more specifically to predicting the pathogenicity of genetic sequence variants.

Background

[0004] The advent of cost-effective DNA sequencing has provided the clinic with high-resolution information about genetic sequence variants in patients, which has led to a need for efficient interpretation of this genomic data. Such tests provide patients with actionable information that allows them to understand the...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F15/18, G16B40/20, G06N20/00, G16B20/00, G16B20/20, G16B20/40, G16B30/10, G16B40/30
CPC: G06N20/00, G16B20/00, G16B40/00, G16B30/00, G16B40/30, G16B30/10, G16B40/20, G16B20/20, G16B20/40, G06N7/01
Inventor: I·S·哈克, E·A·埃文斯, S·M·维克兰, M·D·拉斯穆森
Owner: COUNSYL INC