Methods and systems for identifying gene mutations

JP2025522368A5 · Pending · Publication Date: 2026-06-15 · ILLUMINA INC

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: ILLUMINA INC
Filing Date: 2023-06-05
Publication Date: 2026-06-15

Application Information

Patent Timeline

05 Jun 2023

Application

15 Jun 2026

Publication

JP2025522368A5

IPC: G16B20/20

CPC: G16B20/20; G16B30/10; G16B20/10

Abstract

This specification discloses systems, devices, and methods for identifying recombinant variants (such as gene conversion variants) of genes such as the RHD gene and the RHCE gene, the copy number of the recombinant variants, and the status of gene mutations (e.g., heterozygous or homozygous). In some embodiments, the disclosed systems, devices, and methods involve receiving sequence reads that align to the RHD gene or the RHCE gene, estimating the combined copy number of the RHD gene and the RHCE gene, estimating the copy number of RHD-specific bases and RHCE-specific bases at each of a plurality of predetermined differentiation sites of the RHD gene and the RHCE gene, and calculating the probability of RHCE*CE-D(2)-CE gene conversion in a nucleic acid sample.

Description

Technical Field

[0001] (Incorporation by Reference to Any Priority Applications) This application claims priority to U.S. Provisional Application No. 63/349,993, filed on June 7, 2022, the entire content of which is incorporated herein by reference.

Background Art

[0002] (Field of the Invention) The technology of the present disclosure relates to the field of nucleic acid sequencing. Specifically, the disclosed technology relates to the detection of RHCE*CE-D(2)-CE gene conversion events in nucleic acid samples.

[0003] (Description of Related Art) Rhesus (Rh) antigens play an important role in the phenotype of red blood cell (RBC) antigens. There are more than 330 RBC antigens. Variation in RBC antigens can be attributed to mutations within the RHD (Rh blood group D antigen) and RHCE (Rh blood group CcEe antigen) genes. Many different duplications, deletions, translocations, and gene conversion events within the RHD gene and the RHCE gene, including RHCE*CE-D(2)-CE gene conversion events, have been documented in populations.

Summary of the Invention

Means for Solving the Problems

[0004] In one aspect, disclosed herein are a system and a computer-implemented method for detecting an RHCE*CE-D(2)-CE gene conversion event in a nucleic acid sample. In some embodiments, the method includes receiving sequence reads that align to the RHD gene or the RHCE gene, estimating the combined copy number of the RHD gene and the RHCE gene in the nucleic acid sample, estimating the copy number of the RHD-specific bases and the RHCE-specific bases at each of a plurality of predetermined differentiation sites of the RHD gene and the RHCE gene, and estimating the probability of RHCE*CE-D(2)-CE gene conversion in the nucleic acid sample based on the estimated combined copy number of the RHD gene and the RHCE gene and the estimated copy number of the RHD-specific bases and the RHCE-specific bases at each of the plurality of predetermined differentiation sites.

[0005] In some embodiments, RHCE*CE-D(2)-CE gene conversion results in a first breakpoint. In some embodiments, the plurality of predetermined differentiation sites include at least two predetermined differentiation sites adjacent to the first breakpoint. In some embodiments, the method further includes identifying one or more sequence reads that span the first breakpoint and include RHD-specific bases at a first predetermined differentiation site adjacent to the first breakpoint and RHCE-specific bases at a second predetermined differentiation site adjacent to the first breakpoint.

[0006] In some embodiments, RHCE*CE-D(2)-CE gene conversion results in a second breakpoint. In some embodiments, the plurality of predetermined differentiation sites include at least two predetermined differentiation sites adjacent to the second breakpoint. In some embodiments, the method includes identifying one or more sequence reads that span the second breakpoint and include RHD-specific bases at a first predetermined differentiation site adjacent to the second breakpoint and RHCE-specific bases at a second predetermined differentiation site adjacent to the second breakpoint.

[0007] In some embodiments, estimating the copy number of RHD-specific bases and RHCE-specific bases at each of a plurality of predetermined differentiation sites of the RHD gene and the RHCE gene includes, for a predetermined differentiation site among the plurality of predetermined differentiation sites, counting sequence reads containing the RHD-specific base at that site and counting sequence reads containing the RHCE-specific base at that site.

[0008] In some embodiments, calculating the probability of RHCE*CE-D(2)-CE gene conversion includes estimating the gene-specific copy number at each of the plurality of predetermined differentiation sites based on a value obtained by multiplying the ratio of sequence reads containing the RHD-specific bases or RHCE-specific bases at the predetermined differentiation sites by the estimated combined copy number of the RHD gene and the RHCE gene. In some embodiments, calculating the probability of RHCE*CE-D(2)-CE gene conversion includes detecting a change in the gene-specific copy number at consecutive predetermined differentiation sites.
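
The arithmetic in paragraph [0008] can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function names, the input layout (one `(rhd_reads, rhce_reads)` pair per differentiation site, in genomic order), and the use of rounding to flag a change between consecutive sites are all assumptions made for illustration.

```python
from typing import List, Tuple


def gene_specific_copy_numbers(
    site_counts: List[Tuple[int, int]], combined_cn: float
) -> List[Tuple[float, float]]:
    """Scale the per-site read fractions by the combined RHD+RHCE copy number.

    site_counts holds (rhd_reads, rhce_reads) for each differentiation site.
    Returns (rhd_copy_number, rhce_copy_number) for each site.
    """
    estimates = []
    for rhd_reads, rhce_reads in site_counts:
        total = rhd_reads + rhce_reads
        if total == 0:
            raise ValueError("no informative reads at a differentiation site")
        rhd_fraction = rhd_reads / total
        estimates.append(
            (rhd_fraction * combined_cn, (1.0 - rhd_fraction) * combined_cn)
        )
    return estimates


def copy_number_change_points(estimates: List[Tuple[float, float]]) -> List[int]:
    """Return indices i where the rounded RHD copy number differs from site i-1.

    Rounding to the nearest integer is an illustrative choice, not taken
    from the specification.
    """
    rounded = [round(rhd_cn) for rhd_cn, _ in estimates]
    return [i for i in range(1, len(rounded)) if rounded[i] != rounded[i - 1]]


# Toy data: five sites, with the middle two showing extra RHD-specific signal.
counts = [(50, 50), (52, 48), (74, 26), (76, 24), (49, 51)]
cn = gene_specific_copy_numbers(counts, combined_cn=4)
changes = copy_number_change_points(cn)
```

In this toy input the RHD-specific copy number rises from about 2 to about 3 at the third site and returns to about 2 at the fifth, which is the kind of change at consecutive sites that the paragraph describes.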

[0009] In some embodiments, estimating the combined copy number of the RHD gene and the RHCE gene includes counting sequence reads that align to the RHD gene or the RHCE gene. In some embodiments, estimating the combined copy number includes normalizing the count of sequence reads that align to the RHD gene or the RHCE gene and applying a Gaussian mixture model. In some embodiments, this method takes into account that the orientations of the RHD gene and the RHCE gene are opposite.
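
As a rough illustration of paragraph [0009], the sketch below normalizes a read count and evaluates it against a Gaussian mixture whose components sit at integer copy numbers. The specification does not give the normalization or the mixture parameters, so everything here is assumed: `depth_per_copy` (expected reads per base for a single copy, taken as known), equal mixture weights, component means fixed at 0..`max_cn`, and a standard deviation proportional to the copy number (`rel_sd`).

```python
import math
from typing import Tuple


def combined_copy_number(
    read_count: int,
    region_length: int,
    depth_per_copy: float,
    max_cn: int = 8,
    rel_sd: float = 0.07,
) -> Tuple[int, float]:
    """Call the combined RHD+RHCE copy number from a normalized read count.

    Returns (most likely copy number, its posterior under equal weights).
    All parameters other than read_count are illustrative assumptions.
    """
    if region_length <= 0 or depth_per_copy <= 0:
        raise ValueError("region_length and depth_per_copy must be positive")
    # Normalized depth expressed in units of copies.
    normalized = read_count / region_length / depth_per_copy
    densities = {}
    for cn in range(max_cn + 1):
        sd = rel_sd * max(cn, 1)  # wider components at higher copy number
        z = (normalized - cn) / sd
        densities[cn] = math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    total = sum(densities.values())
    if total == 0.0:
        raise ValueError("normalized depth is outside the modeled range")
    best = max(densities, key=densities.get)
    return best, densities[best] / total


# A sample whose normalized depth is close to four copies (two RHD, two RHCE).
best_cn, posterior = combined_copy_number(4000, 1000, depth_per_copy=1.0)
```

A production caller would more likely fit the mixture parameters from a cohort rather than fix them, and would normalize against regions of stable copy number; the fixed-parameter version is only meant to show the shape of the calculation.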

[0010] In some embodiments, the plurality of predetermined differentiation sites are identified by a method that includes identifying single nucleotide differences between the sequences of the RHD gene and the RHCE gene in a reference sequence and selecting, as the differentiation sites, single nucleotide differences fixed across the population. In some embodiments, selecting, as the differentiation sites, single nucleotide differences fixed across the population includes receiving, for a plurality of nucleic acid samples, a plurality of sequence reads that align to the RHD gene and the RHCE gene, estimating, for each of the plurality of nucleic acid samples, the gene-specific copy number of the RHD gene and the copy number of the RHCE gene, selecting a subset of nucleic acid samples from the plurality of nucleic acid samples, the subset of nucleic acid samples including nucleic acid samples that are estimated to be diploid for the RHD gene and estimated to be diploid for the RHCE gene, and selecting single nucleotide differences having copy numbers that are consistent with diploidy of the RHD gene and the RHCE gene in at least 90% of the nucleic acid samples of the subset of nucleic acid samples.
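
The site-selection procedure in paragraph [0010] can be sketched as a two-step filter. The data layout below (a dictionary per sample holding gene-level and per-site copy numbers, already rounded to integers) and the function name are hypothetical; only the diploid subset and the 90% threshold come from the text.

```python
from typing import Dict, List, Tuple

SampleRecord = Dict[str, object]


def select_differentiation_sites(
    candidate_sites: List[str],
    samples: Dict[str, SampleRecord],
    min_fraction: float = 0.9,
) -> List[str]:
    """Keep candidate sites whose copy numbers look diploid in most samples.

    Each sample record is assumed to contain:
      "gene_cn": (rhd_cn, rhce_cn) at the gene level
      "site_cn": {site: (rhd_cn, rhce_cn)} at each candidate site
    """
    # Step 1: keep only samples estimated to be diploid for both genes.
    diploid = [rec for rec in samples.values() if rec["gene_cn"] == (2, 2)]
    if not diploid:
        return []
    # Step 2: keep sites consistent with diploidy in at least min_fraction
    # of the diploid subset.
    selected = []
    for site in candidate_sites:
        consistent = sum(1 for rec in diploid if rec["site_cn"].get(site) == (2, 2))
        if consistent / len(diploid) >= min_fraction:
            selected.append(site)
    return selected


# Toy cohort: ten diploid samples plus one RHD-deleted sample that is excluded.
cohort: Dict[str, SampleRecord] = {}
for i in range(10):
    cohort[f"s{i}"] = {
        "gene_cn": (2, 2),
        "site_cn": {
            "siteA": (2, 2) if i < 9 else (3, 1),  # consistent in 9 of 10
            "siteB": (2, 2) if i < 8 else (1, 3),  # consistent in 8 of 10
        },
    }
cohort["rhd_deleted"] = {"gene_cn": (0, 2), "site_cn": {"siteA": (0, 2), "siteB": (0, 2)}}
chosen = select_differentiation_sites(["siteA", "siteB"], cohort)
```

Here `siteA` passes the 90% threshold and `siteB` does not, so only `siteA` would be retained as a differentiation site.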

[0011] In some embodiments, the method further includes constructing one or more candidate haplotypes. In some embodiments, the one or more candidate haplotypes cover the breakpoint region of the RHCE*CE-D(2)-CE gene conversion. In some embodiments, constructing one or more candidate haplotypes includes phasing the predetermined differentiation sites using sequence reads aligned to the RHD gene or the RHCE gene. In some embodiments, phasing the predetermined differentiation sites includes constructing one or more candidate haplotypes based on all the bases sequenced at the first predetermined differentiation site and extending the one or more candidate haplotypes to the second predetermined differentiation site by aligning sequence reads of the RHD gene or the RHCE gene. In some embodiments, the first predetermined differentiation site and the second predetermined differentiation site are adjacent to the breakpoint of the RHCE*CE-D(2)-CE gene conversion.
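
The seed-and-extend phasing described in paragraph [0011] might look like the following sketch for a single pair of sites. Each read (or read pair) is represented as a mapping from site position to the observed base; that representation, the `min_support` threshold, and the function name are assumptions for illustration and do not come from the specification.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Haplotype = Tuple[str, Optional[str]]


def phase_two_sites(
    reads: List[Dict[int, str]],
    first_site: int,
    second_site: int,
    min_support: int = 2,
) -> List[Haplotype]:
    """Seed haplotypes with every base seen at first_site, then extend them
    to second_site using reads that cover both sites.

    A haplotype that cannot be extended is returned with None at the
    second site.
    """
    seeds = sorted({obs[first_site] for obs in reads if first_site in obs})
    links = Counter(
        (obs[first_site], obs[second_site])
        for obs in reads
        if first_site in obs and second_site in obs
    )
    haplotypes: List[Haplotype] = []
    for seed in seeds:
        extensions = [
            b2 for (b1, b2), n in sorted(links.items())
            if b1 == seed and n >= min_support
        ]
        if extensions:
            haplotypes.extend((seed, b2) for b2 in extensions)
        else:
            haplotypes.append((seed, None))
    return haplotypes


# Toy reads around a breakpoint: positions 100 and 200 are hypothetical.
toy_reads = (
    [{100: "G", 200: "T"}] * 3
    + [{100: "G", 200: "C"}] * 3
    + [{100: "A", 200: "C"}] * 2
    + [{100: "A"}]
)
haps = phase_two_sites(toy_reads, 100, 200)
```

Repeating the extension step site by site would grow each candidate haplotype across the breakpoint region; a haplotype that switches from RHD-specific to RHCE-specific bases between two adjacent sites is the pattern of interest.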

[0012] In some embodiments, the methods disclosed herein further comprise performing variant calling at the plurality of predetermined differentiation sites. In some embodiments, the methods disclosed herein further comprise performing variant calling of the RHCE*CE-D(2)-CE gene conversion. In some embodiments, the variant calling comprises homozygous or heterozygous variant calling. In some embodiments, the method further comprises creating a file comprising the variant calling.

[0013] In some embodiments, the predetermined differentiation sites comprise sites corresponding to positions selected from chr1:25405587, chr1:25405596, chr1:25409676, or chr1:25409958 of the reference genome hg38.

[0014] The features of the examples of the present disclosure will become apparent by reference to the following detailed description and drawings. In the drawings, like reference numerals correspond to components that may not be identical but are similar. For the sake of brevity, reference numerals or features having the aforementioned functions may or may not be described in connection with other drawings in which they appear.

Brief Description of the Drawings

[0015]

Figure 1A

Figure 1B

Figure 1C

Figure 2

Figure 3A

Figure 3B

Figure 4

[0016] All patents, patent applications, and other publications are hereby expressly incorporated by reference into this specification to the same extent as if each individual publication, patent, or patent application were specifically and individually indicated to be incorporated by reference. All cited documents are incorporated by reference in their entirety into this specification for the purposes indicated by the context of the citation in this specification in the relevant portions. However, the citation of any document should not be construed as an admission that it is prior art to the present disclosure.

[0017] (RHCE*CE-D(2)-CE) Accurate blood type determination is necessary for safe blood transfusion. Basic blood type determination by serological methods is the current standard of care (ABO/Rh+ or Rh-), and may be sufficient to avoid complications in most blood transfusions. However, patients who require repeated blood transfusions (such as those suffering from cancer, sickle cell disease, α-thalassemia, etc.) may benefit from a more comprehensive evaluation of blood antigens. Serological methods can be used for such a wide range of blood type determinations, but depend on the availability of antibodies specific to each blood type and can be cumbersome and expensive. Molecular blood type determination based on a patient's DNA can be an alternative means for more completely analyzing blood antigens.

[0018] The Rhesus (Rh) factor is a protein-based blood group system and the most widely used blood group system after ABO. The antigens of the Rh blood group are derived from two genes, RHD and RHCE, which are paralogous genes with approximately 97% identity to each other. Most people are either Rh+ (having an active copy of RHD) or Rh- (not having a copy of RHD), but there is a gray zone made up of numerous RHD variants, the so-called weak D, partial D, and DEL phenotypes. Apart from changes in the copy number of the complete RHD gene or the RHCE gene, two mechanisms can cause the formation of D variants, namely, small variants that lead to amino acid changes, and gene conversion in which a part of one gene is replaced by the other gene.

[0019] The detection of variants in RHD/RHCE can be complicated by the high sequence similarity observed between the two genes and the variable total copy number observed in such genes. In some cases, sequence reads from the RHD/RHCE genes may be misaligned to the wrong gene or mapped to both genes with the same confidence, resulting in a decrease in mapping quality. The RHCE*CE-D(2)-CE gene conversion event is a gene conversion of exon 2 of the RHCE gene. As illustrated in Figure 1A, in the RHCE*CE-D(2)-CE gene conversion event, exon 2 of the RHCE gene is replaced by a copy of exon 2 of the RHD gene.

[0020] As illustrated in Figures 1B and 1C, the RHD gene and the RHCE gene are paralogs and are oriented in opposite directions within the patient's genome. Furthermore, the RHCE*CE-D(2)-CE gene conversion event is not the only potential mutation in such genes. Other duplications, deletions, translocations, and gene conversion events in the RHD gene and the RHCE gene have been observed in the population. Therefore, due to the high homology between the RHCE gene and the RHD gene, it may be difficult to detect an RHCE*CE-D(2)-CE gene conversion event when sequencing the RHD gene and the RHCE gene in a nucleic acid sample. For example, RHCE*CE-D(2)-CE gene conversion may not be detected, resulting in false negatives when calling SNP variants in nucleic acid samples from patients. As detailed below, embodiments of the present invention overcome such problems.

[0021] (Overview) This specification describes methods and systems for detecting RHCE*CE-D(2)-CE gene conversion events in nucleic acid samples collected from patients. The disclosed systems and methods for detecting RHCE*CE-D(2)-CE gene conversion events have been found to improve the detection of RHCE*CE-D(2)-CE gene conversion and the specificity and sensitivity of variant calling in the RHD region and/or the RHCE region in a nucleic acid sample.

[0022] In some embodiments, the disclosed systems and methods include receiving sequence reads that align to the RHD gene or the RHCE gene. Upon receiving the sequence reads, the copy numbers of the RHD gene and the RHCE gene in the nucleic acid sample can be estimated. Estimating the copy numbers can include counting the sequence reads that align to either the RHD region or the RHCE region.

[0023] Next, the disclosed systems and methods can estimate the copy numbers of RHD-specific bases and RHCE-specific bases at each of a plurality of predetermined differentiation sites of the RHD gene and the RHCE gene. Such predetermined differentiation sites can include positions in the nucleic acid sequence of the RHD gene or the RHCE gene at which at least one base differs between the RHD gene and the RHCE gene, where the difference has been predetermined to be fixed in the population. Thus, using such predetermined differentiation sites, it is possible to determine whether a specific sequence read, including one arising from an RHCE*CE-D(2)-CE gene conversion event, is derived from the RHD gene or the RHCE gene.

[0024] Figure 1B illustrates an example of such a site that differs between the RHD gene and the RHCE gene. In some embodiments, the differentiation site is "predetermined", i.e., it has been identified (such as in a population study) before performing the methods described herein or implementing the system for detecting the RHCE*CE-D(2)-CE gene conversion event. In some embodiments, the process of detecting the RHCE*CE-D(2)-CE gene conversion event includes counting sequence reads that contain RHD-specific bases at a predetermined differentiation site and counting sequence reads that contain RHCE-specific bases at a predetermined differentiation site. Using the sequence read counts, the RHD-specific copy number and the RHCE-specific copy number can be estimated at each of the predetermined differentiation sites.
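
The counting step in paragraph [0024] can be sketched as follows. The example assumes ungapped alignments given as `(start, sequence)` pairs in 0-based reference coordinates, and a hypothetical `Site` record that stores the RHD-specific and RHCE-specific base at each position; a real pipeline would read alignments from a BAM/CRAM file and handle indels, clipping, and base quality, none of which is modeled here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Site:
    """A differentiation site (hypothetical record layout)."""
    position: int   # 0-based reference coordinate
    rhd_base: str   # base expected on RHD-derived reads
    rhce_base: str  # base expected on RHCE-derived reads


def count_supporting_reads(
    reads: Iterable[Tuple[int, str]], sites: List[Site]
) -> Dict[int, Dict[str, int]]:
    """Count reads carrying the RHD- or RHCE-specific base at each site.

    Reads showing any other base at a site are ignored for that site.
    """
    counts = {s.position: {"RHD": 0, "RHCE": 0} for s in sites}
    for start, sequence in reads:
        for site in sites:
            offset = site.position - start
            if 0 <= offset < len(sequence):
                base = sequence[offset]
                if base == site.rhd_base:
                    counts[site.position]["RHD"] += 1
                elif base == site.rhce_base:
                    counts[site.position]["RHCE"] += 1
    return counts


# Toy example with one site at position 10 (A on RHD, G on RHCE).
toy_sites = [Site(position=10, rhd_base="A", rhce_base="G")]
toy_reads = [
    (5, "CCCCCACCC"),  # covers position 10 with A: supports RHD
    (8, "TTGTT"),      # covers position 10 with G: supports RHCE
    (10, "T"),         # covers position 10 with T: counted for neither
    (0, "ACGT"),       # does not reach position 10
]
site_counts = count_supporting_reads(toy_reads, toy_sites)
```

The resulting per-site counts are the inputs from which RHD-specific and RHCE-specific copy numbers are then estimated.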

[0025] In some embodiments, the disclosed systems and methods include calling variants related to RHCE*CE-D(2)-CE gene conversion in a nucleic acid sample based on the copy number support information of each observed base at the predetermined differentiation sites. For example, the method may include calculating the probability of RHCE*CE-D(2)-CE gene conversion in a nucleic acid sample based on the estimated copy number that supports either the RHD-specific base or the RHCE-specific base at each of a plurality of predetermined differentiation sites, and also based on the estimated combined copy number of the RHD gene and the RHCE gene. For example, the probability of RHCE*CE-D(2)-CE gene conversion in a nucleic acid sample can be inferred by observing changes in the estimated copy numbers of the RHD-specific bases and the RHCE-specific bases across consecutive predetermined differentiation sites in the sequenced nucleic acid from a patient.
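
The specification does not state which probability model is used, so the following is only one plausible sketch of how read counts and the combined copy number could be turned into a probability of gene conversion. It assumes a binomial read model with a uniform prior over three hypotheses (no conversion, heterozygous, homozygous), where each converted RHCE copy adds one RHD-like copy at exon 2 sites without changing the combined copy number. The function name, `baseline_rhd_cn`, and `error_rate` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def conversion_posteriors(
    exon2_site_counts: List[Tuple[int, int]],
    combined_cn: int,
    baseline_rhd_cn: int,
    error_rate: float = 0.01,
) -> Dict[str, float]:
    """Posterior over conversion states from (rhd_reads, rhce_reads) counts
    at differentiation sites inside the converted region.

    Assumed model: reads are binomial with RHD fraction
    (baseline_rhd_cn + converted_copies) / combined_cn, clamped away from
    0 and 1 by error_rate; the prior over states is uniform.
    """
    hypotheses = {"none": 0, "heterozygous": 1, "homozygous": 2}
    log_likelihoods = {}
    for label, converted_copies in hypotheses.items():
        p = (baseline_rhd_cn + converted_copies) / combined_cn
        p = min(max(p, error_rate), 1.0 - error_rate)
        # The binomial coefficient is identical across hypotheses and cancels.
        log_likelihoods[label] = sum(
            rhd * math.log(p) + rhce * math.log(1.0 - p)
            for rhd, rhce in exon2_site_counts
        )
    peak = max(log_likelihoods.values())
    unnormalized = {k: math.exp(v - peak) for k, v in log_likelihoods.items()}
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}


# Two exon 2 sites where about three quarters of reads carry the RHD base.
posteriors = conversion_posteriors(
    [(75, 25), (72, 28)], combined_cn=4, baseline_rhd_cn=2
)
call = max(posteriors, key=posteriors.get)
```

With four combined copies and two RHD copies outside the converted region, an RHD read fraction near 0.75 inside exon 2 favors the heterozygous state, which matches the heterozygous versus homozygous calling mentioned elsewhere in the specification.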

[0026] To further detect RHCE*CE-D(2)-CE gene conversion events, one or more candidate haplotypes can be constructed, including candidate haplotypes covering the breakpoint region of the RHCE*CE-D(2)-CE gene conversion. The candidate haplotypes can be constructed, for example, by phasing the predetermined differentiation sites using sequence reads aligned to the RHD gene or the RHCE gene.

[0027] To further detect RHCE*CE-D(2)-CE gene conversion events, the methods and systems disclosed herein may include identifying one or more sequence reads that span the breakpoint of the RHCE*CE-D(2)-CE gene conversion event and contain an RHD-specific base at a first predetermined differentiation site adjacent to this breakpoint and an RHCE-specific base at a second predetermined differentiation site adjacent to this breakpoint.
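
A minimal sketch of the breakpoint-spanning read search in paragraph [0027] follows. It assumes ungapped alignments given as `(name, start, sequence)` in 0-based coordinates, and two flanking sites passed as `(position, expected_base)` pairs; the coordinates and names in the toy data are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[str, int, str]     # (read name, 0-based start, aligned sequence)
FlankSite = Tuple[int, str]     # (0-based position, expected base)


def breakpoint_spanning_reads(
    reads: List[Read], rhd_site: FlankSite, rhce_site: FlankSite
) -> List[str]:
    """Return names of reads that cover both flanking sites and carry the
    RHD-specific base at one and the RHCE-specific base at the other."""
    hits = []
    for name, start, sequence in reads:
        rhd_offset = rhd_site[0] - start
        rhce_offset = rhce_site[0] - start
        covers_both = (
            0 <= rhd_offset < len(sequence) and 0 <= rhce_offset < len(sequence)
        )
        if (
            covers_both
            and sequence[rhd_offset] == rhd_site[1]
            and sequence[rhce_offset] == rhce_site[1]
        ):
            hits.append(name)
    return hits


# Breakpoint lies between position 5 (RHCE side, base C) and 12 (RHD side, base A).
chimeric = "GGGGG" + "C" + "GGGGGG" + "A" + "GG"
unconverted = "GGGGG" + "C" + "GGGGGG" + "G" + "GG"
example_reads = [
    ("r1", 0, chimeric),      # spans the breakpoint with C then A
    ("r2", 0, unconverted),   # spans both sites but lacks the RHD base
    ("r3", 8, "GGGGAGG"),     # has the RHD base but does not reach position 5
]
spanning = breakpoint_spanning_reads(example_reads, rhd_site=(12, "A"), rhce_site=(5, "C"))
```

Reads such as `r1` give direct, single-molecule evidence for the junction, complementing the copy-number evidence gathered at the differentiation sites.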

[0028] The disclosed systems and methods can improve, for example by reducing false negatives, the recall (also called sensitivity, which is the proportion of true variants correctly detected) of single nucleotide polymorphisms (SNPs) generated by RHCE*CE-D(2)-CE gene conversion events by 20%, 50%, 80%, 100%, or more.

[0029] (Definitions) Unless otherwise defined, technical and scientific terms used in this disclosure have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. For example, see Singleton et al., Dictionary of Microbiology and Molecular Biology 2nd ed., J. Wiley & Sons (New York, NY 1994), and Sambrook et al., Molecular Cloning, A Laboratory Manual, Cold Spring Harbor Press (Cold Spring Harbor, NY 1989). For the purposes of this disclosure, the following terms are defined below.

[0030] As used herein, "nucleotide" includes a nitrogen-containing heterocyclic base, a sugar, and one or more phosphate groups. Nucleotides are the monomeric units of nucleic acid sequences. Examples of nucleotides include ribonucleotides and deoxyribonucleotides. In ribonucleotides (RNA), the sugar is ribose, and in deoxyribonucleotides (DNA), the sugar is deoxyribose, i.e., a sugar lacking the hydroxyl group present at the 2'-position of ribose. The nitrogen-containing heterocyclic base can be a purine base or a pyrimidine base. Examples of purine bases include adenine (A) and guanine (G), as well as their modified derivatives or analogs. Examples of pyrimidine bases include cytosine (C), thymine (T), and uracil (U), as well as their modified derivatives or analogs. The C-1 atom of deoxyribose is bonded to N-1 of pyrimidine or N-9 of purine. The phosphate group can be in mono-, di-, or triphosphate form. It should be further understood that these nucleotides may be natural nucleotides, but non-natural nucleotides, modified nucleotides, or analogs of the aforementioned nucleotides can also be used.

[0031] As used herein, "base" or "nucleobase" is a heterocyclic base such as adenine, guanine, cytosine, thymine, uracil, inosine, xanthine, hypoxanthine, or a heterocyclic derivative, analog, or tautomer thereof. Nucleobases can be naturally occurring or synthetic. Non-limiting examples of nucleobases are adenine, guanine, thymine, cytosine, uracil, xanthine, hypoxanthine, 8-azapurine, purine substituted at the 8-position with methyl or bromine, 9-oxo-N6-methyladenine, 2-aminoadenine, 7-deazaxanthine, 7-deazaguanine, 7-deaza-adenine, N4-ethanocytosine, 2,6-diaminopurine, N6-ethano-2,6-diaminopurine, 5-methylcytosine, 5-(C3-C6)-alkynylcytosine, 5-fluorouracil, 5-bromouracil, thiouracil, pseudoisocytosine, 2-hydroxy-5-methyl-4-triazolopyridine, isocytosine, isoguanine, inosine, 7,8-dimethylalloxazine, 6-dihydrothymine, 5,6-dihydrouracil, 4-methyl-indole, ethenoadenine, and non-naturally occurring nucleobases described in U.S. Pat. Nos. 5,432,272 and 6,150,510, and International Publications Nos. 92/002258, 93/10820, 94/22892, and 94/24144, and Fasman ("Practical Handbook of Biochemistry and Molecular Biology", pp. 385-394, 1989, CRC Press, Boca Raton, FL) (all of which are incorporated herein by reference in their entirety).

[0032] The terms "nucleic acid" or "polynucleotide" refer to deoxyribonucleotide or ribonucleotide polymers in single-stranded or double-stranded form, and include known analogs of natural nucleotides that hybridize to nucleic acids in a manner similar to that of natural nucleotides such as peptide nucleic acid (PNA) and phosphorothioate DNA, unless otherwise specified. Unless otherwise stated, a particular nucleic acid sequence includes its complementary sequence. Nucleotides include, but are not limited to, ATP, dATP, CTP, dCTP, GTP, dGTP, UTP, TTP, dUTP, 5-methyl-CTP, 5-methyl-dCTP, ITP, dITP, 2-amino-adenosine-TP, 2-amino-deoxyadenosine-TP, 2-thiothymidine triphosphate, pyrrolo-pyrimidine triphosphate, and 2-thiocytidine, and alpha-thiotriphosphates for all of the above, and 2'-O-methyl-ribonucleotide triphosphates for all of the above bases. Modified bases include, but are not limited to, 5-Br-UTP, 5-Br-dUTP, 5-F-UTP, 5-F-dUTP, 5-propynyl dCTP, and 5-propynyl-dUTP.

[0033] As used herein, the term "chromosome" refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones). The conventional internationally recognized individual human genome chromosome numbering system is used herein.

[0034] "Genome" means the complete genetic information of an organism or virus, expressed as a nucleic acid sequence.

[0035] As used herein, the term "reference genome" or "reference sequence" refers to any particular known genomic sequence, either partial or complete, of an organism or virus that can be used to reference a specified sequence from a subject. For example, reference genomes used for human subjects, as well as many other organisms, can be found at the National Center for Biotechnology Information (ncbi.nlm.nih.gov). In various embodiments, the reference sequence may be significantly larger than the reads aligned to it. For example, it may be at least about 100 times larger, or at least about 1000 times larger, or at least about 10,000 times larger, or at least about 10^5 times larger, or at least about 10^6 times larger, or at least about 10^7 times larger. In one example, the reference sequence is that of the full-length genome. Such a sequence may also be referred to as a genomic reference sequence. For example, the reference sequence may be a reference human genome sequence such as hg19 or hg38. In another example, the reference sequence is limited to a particular human chromosome such as chromosome 13. In some embodiments, the reference Y chromosome is the Y chromosome sequence from the human genome version hg19. Such a sequence may also be referred to as a chromosomal reference sequence. Other examples of reference sequences include genomes of other species, as well as chromosomes, partial chromosomal regions (such as strands), etc. of any species. In various embodiments, the reference sequence is a consensus sequence or other combination derived from multiple individuals. However, for certain applications, the reference sequence may be taken from a particular individual.

[0036] As used herein, the term "nucleic acid sample" refers to a sample typically derived from a biological fluid, cell, tissue, organ, or organism, which contains a nucleic acid or a mixture of nucleic acids that includes at least one nucleic acid sequence to be screened for copy number variations. In certain embodiments, the nucleic acid sample contains at least one nucleic acid sequence suspected of having a copy number variation. Such samples can include, but are not limited to, sputum/oral fluid, amniotic fluid, blood, blood fractions, biopsy samples (such as surgical biopsies, fine needle biopsies), urine, ascites, pleural effusion, etc. Samples are often taken from human subjects (such as patients), but can also be taken from any mammal, including, but not limited to, dogs, cats, horses, goats, sheep, cows, pigs, etc. The sample can be used directly upon obtaining it from the biological source or after performing a pretreatment that modifies the characteristics of the sample. For example, such pretreatment may include preparing plasma from blood, diluting viscous fluids, etc. Further, pretreatment methods can include, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, addition of reagents, lysis, etc. When such a pretreatment method is employed on a sample, the nucleic acid(s) of interest typically remain in the test sample and, in some cases, their concentration is proportional to the concentration in the untreated test sample (i.e., a sample not treated with such pretreatment method(s)). Such a "treated" or "processed" sample is still considered a biological "test" sample with respect to the methods described herein.

[0037] The term "read" or "sequence read" (or sequencing read) refers to a sequence obtained from a portion of a nucleic acid sample. A read may be represented by a string of nucleotides sequenced from any part, or the whole, of a nucleic acid molecule. Typically, but not necessarily, a read represents a short sequence of contiguous base pairs in a sample. A read may be symbolically represented by the base pair sequence (A, T, C, or G) of a sample portion. To determine whether a read aligns with a reference sequence or meets other criteria, it may be stored in a memory device and appropriately processed. A read may be obtained directly from a sequencing device or indirectly from stored sequence information regarding a sample. In some cases, a read is a DNA sequence of sufficient length (such as at least about 25 bp) that can be used to identify a larger sequence or region, for example, one that can be aligned and specifically assigned to a chromosome, a genomic region, or a gene. For example, a sequence read can be a short string of nucleotides (such as 20 - 150 bases) sequenced from one or both ends of a nucleic acid fragment, or the sequence of the entire nucleic acid fragment present in a biological sample. Sequence reads can be obtained by any method known in the art. For example, sequence reads can be obtained in various ways, such as by the use of sequencing techniques, by the use of probes such as hybridization arrays or capture probes, or by the use of amplification techniques such as polymerase chain reaction (PCR), linear amplification using a single primer, or isothermal amplification. Sequence reads can be generated by techniques such as sequencing by synthesis, sequencing by ligation, or sequencing by binding. Sequence reads can be generated using devices such as the MINISEQ, MISEQ, NEXTSEQ, HISEQ, and NOVASEQ sequencing instruments from Illumina, Inc. (San Diego, CA).

[0038] As used herein, the term "sequencing depth" generally refers to the number of times a locus is covered by sequence reads that are aligned to that locus. The locus may be as small as a nucleotide, or as large as a chromosome arm, or as large as the entire genome. Sequencing depth can be expressed as 50×, 100×, etc., where "×" refers to the number of times the locus is covered by sequence reads. Sequencing depth can also be applied to multiple loci or the entire genome, in which case "×" can refer to the average number of times each locus, or haploid genome, or entire genome is sequenced. When the average depth is cited, the actual depths of the different loci included in the dataset vary over a range of values. Ultra-deep sequencing can refer to at least 100× in sequencing depth.

[0039] As used herein, the terms "aligned", "alignment", or "aligning" refer to the process of comparing a read, or tag, to a reference sequence, thereby determining the likelihood that the reference sequence contains the read sequence. If the reference sequence contains the read, the read may be positioned relative to the reference sequence, or in certain alternative embodiments, mapped to a specific location within the reference sequence. For example, from the alignment of a read to a reference sequence of human chromosome 13, the likelihood that the read is present in the reference sequence of chromosome 13 can be determined. In some cases, the alignment further indicates where the read or tag maps within the reference sequence. For example, if the reference sequence is the entire human genome sequence, the alignment may indicate that the read is present on chromosome 13, and further may indicate that the read is on a specific strand and / or site of chromosome 13. A "site" can be a polynucleotide sequence, or a unique position (i.e., chromosome ID, chromosomal position, and orientation) on the reference genome. In some embodiments, the site may be the position of a residue, sequence tag, or segment on the sequence.

[0040] Aligned reads or tags are one or more sequences identified as matches with respect to the order of nucleic acid molecules from a reference genome to a known sequence. Alignment can be done manually, but is typically performed by a computer algorithm because it is not possible to align reads in a reasonable time period to implement the methods disclosed herein. The matching of sequence reads during alignment can be 100% sequence identity or less than 100% (imperfect match).

[0041] Alignment can be performed by variants of methods such as Burrows-Wheeler Aligner (BWA), iSAAC, BarraCUDA, BFAST, BLASTN, BLAT, Bowtie, CASHX, Cloudburst, CUDA-EC, CUSHAW, CUSHAW2, CUSHAW2-GPU, drFAST, ELAND, ERNE, GNUMAP, GEM, Gensearch NGS, GMAP and GSNAP, Geneious Assembler, LAST, MAQ, mrFAST and mrsFAST, MOM, MOSAIK, MPscan, Novoalign & NovoalignCS, NextGENe, Omixon, PALMapper, Partek, PASS, PerM, PRIMEX, QPalma, RazerS, REAL, cREAL, RMAP, rNA, RTInvestigator, Segemehl, SeqMap, Shrec, SHRiMP, SLIDER, SOAP, SOAP2, SOAP3 and SOAP3-dp, SOCS, SSAHA and SSAHA2, Stampy, SToRM, Subread and Subjunc, Taipan, UGENE, VelociMapper, XpressAlign, and ZOOM, and / or combinations thereof.

[0042] As used herein, the term "mapping" refers to the specific assignment of sequence reads to a larger sequence such as a reference genome by alignment.

[0043] "Genetic variation" or "gene mutation" refers to a specific genotype present in a particular individual, and in many cases, genetic variations exist in a statistically significant subpopulation of individuals. The presence or absence of a genetic variation can be determined using the methods or apparatuses described herein. In certain embodiments, the presence or absence of one or more genetic variations is determined according to the results provided by the methods and apparatuses described herein. In some embodiments, a genetic variation is a chromosomal abnormality (such as aneuploidy), a partial chromosomal abnormality, or a mosaic, etc., each of which will be further detailed herein. Non-limiting examples of genetic variations include one or more deletions (such as microdeletions), duplications (such as microduplications), insertions, mutations, polymorphisms (such as single nucleotide polymorphisms), fusions, repeats (such as short tandem repeats), different methylation sites, different methylation patterns, etc., and combinations thereof. An insertion, repeat, deletion, duplication, mutation, or polymorphism can be of any length, and in some embodiments, can be from about 1 base or base pair (bp) to about 250 megabase pairs (Mb) in length. In some embodiments, the length of an insertion, repeat, deletion, duplication, mutation, or polymorphism is from about 1 base or base pair (bp) to about 1000 kilobases (kb) (e.g., about 10 bp, 50 bp, 100 bp, 500 bp, 1 kb, 5 kb, 10 kb, 50 kb, 100 kb, 500 kb, or 1000 kb).

[0044] In some cases, the genetic variation is a deletion. In certain embodiments, a deletion is a mutation (such as a genetic abnormality) in which a part of a chromosome or a DNA sequence is missing. A deletion is often a loss of genetic material. Any number of nucleotides may be deleted. A deletion can include the deletion of one or more entire chromosomes, chromosomal segments, alleles, genes, introns, exons, any non-coding region, any coding region, segments thereof, or combinations thereof. A deletion can include a microdeletion. A deletion can include a single base deletion.

[0045] Genetic mutations are sometimes genetic duplications. In certain embodiments, a duplication is a mutation (such as a genetic abnormality) in which a portion of a chromosome or a DNA sequence is copied and reinserted into the genome. In certain embodiments, a genetic duplication (i.e., a duplication) is a duplication of a region of DNA. In some embodiments, the duplication is a nucleic acid sequence that is repeated, often tandemly, within the genome or chromosome. In some embodiments, the duplication can include copies of one or more entire chromosomes, chromosomal segments, alleles, genes, introns, exons, any non-coding region, any coding region, segments thereof, or combinations thereof. The duplication can include microduplications. The duplication sometimes includes one or more copies of the duplicated nucleic acid. In some cases, the duplication is characterized in that the gene region is repeated one or more times (such as 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 times). The duplication can, in some cases, range from a small region (thousands of base pairs) to an entire chromosome. Duplications frequently occur as a result of errors in homologous recombination or due to retrotransposon events. Duplications are associated with certain types of proliferative diseases. Duplications can be characterized using genomic microarrays or comparative genomic hybridization (CGH).

[0046] Genetic mutations are sometimes insertions. An insertion is sometimes an addition of a nucleic acid sequence of one or more nucleotide base pairs. An insertion is sometimes a microinsertion. In certain embodiments, an insertion includes the addition of a chromosomal segment to the genome, chromosome, or segments thereof. In certain embodiments, an insertion includes the addition of an allele, gene, intron, exon, any non-coding region, any coding region, segments thereof, or combinations thereof to the genome or segments thereof. In certain embodiments, an insertion includes the addition (i.e., insertion) of nucleic acid of unknown origin to the genome, chromosome, or segments thereof. In certain embodiments, an insertion includes the addition of a single base (i.e., an insertion).

[0047] Genetic variations may include copy number variations, i.e., variations in the copy number of a nucleic acid sequence present in a test sample as compared to the copy number of the nucleic acid sequence present in a reference sample. In certain embodiments, the nucleic acid sequence is 1 kb or more. In some cases, the nucleic acid sequence is an entire chromosome or a significant portion thereof. Copy number polymorphisms may also refer to nucleic acid sequences in which a difference in copy number has been found by comparing a nucleic acid sequence of interest in a test sample with the expected level of the nucleic acid sequence of interest. For example, the level of a target nucleic acid sequence in a test sample is compared to that present in a qualified sample. Copy number polymorphisms / copy number variations may include deletions including microdeletions, insertions including microinsertions, duplications, amplifications, and translocations. CNV encompasses aneuploidy and segmental aneuploidy.

Embodiments of a method and system for detecting an RHCE*CE-D(2)-CE gene conversion event

[0048] FIG. 2 is a block diagram schematically illustrating an exemplary method 200 for detecting an RHCE*CE-D(2)-CE gene conversion event in a nucleic acid sample. In some embodiments, method 200 is executed on a computer. Method 200 may be embodied as a set of executable program instructions stored on a computer-readable medium, such as one or more disk drives of a computing system. For example, server device 3102 shown in FIGS. 3A and 3B and described in more detail below can execute a set of executable program instructions for implementing method 200. When method 200 is initiated, the executable program instructions are loaded into a memory such as RAM and can be executed by one or more processors of server device 3102. Method 200 is described with respect to server device 3102 shown in FIG. 3B, but this description is illustrative only and not intended to be limiting. In some embodiments, method 200, or a portion thereof, can be executed sequentially or in parallel by multiple computing systems.

[0049] As seen in FIG. 2, method 200 for detecting an RHCE*CE-D(2)-CE gene conversion event in a nucleic acid sample can start at block 201. In this block, sequence reads that align to the RHD gene or the RHCE gene are received. For example, sequence reads can be mapped to a reference sequence to determine which of them align to the RHD gene or the RHCE gene. Next, method 200 can proceed to block 202, where the combined copy number of the RHD gene and the RHCE gene in the nucleic acid sample is estimated. Next, method 200 can proceed to block 203, where the copy numbers of the RHD-specific bases and the RHCE-specific bases at each of a plurality of predetermined differentiation sites of the RHD gene and the RHCE gene are estimated. Next, method 200 can proceed to block 204, where the probability of RHCE*CE-D(2)-CE gene conversion in the nucleic acid sample is calculated based on the estimated combined copy number of the RHD gene and the RHCE gene and the estimated copy numbers of the RHD-specific bases and the RHCE-specific bases at each of the plurality of predetermined differentiation sites.

Receiving sequence reads that align to the RHD gene or the RHCE gene

[0050] In some embodiments, the methods and systems disclosed herein include receiving a plurality of sequence reads that align to the RHD gene or the RHCE gene. In some embodiments, the sequence reads are generated from a sample obtained from a subject.

[0051] Sequence reads can be generated by techniques such as sequencing by synthesis, sequencing by ligation, or sequencing by combination. Sequence reads can be generated using devices such as the MINISEQ, MISEQ, NEXTSEQ, HISEQ, and NOVASEQ sequencing devices from Illumina, Inc. (San Diego, CA). Sequence reads can be, for example, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1250, 1500, 1750, 2000 or more base pairs (bp) in length. For example, the sequence reads are each about 100 base pairs to about 1000 base pairs in length. Sequence reads can include paired-end sequence reads. Sequence reads can include single-end sequence reads. Sequence reads can be generated by whole genome sequencing (WGS). WGS can be clinical WGS (cWGS). Samples can include cells, cell-free DNA, cell-free fetal DNA, amniotic fluid, blood samples, biopsy samples, or combinations thereof.

[0052] In some embodiments, the sequence reads are obtained by aligning reads to the RHD region or the RHCE region of a reference sequence. In some embodiments, the sequence reads are obtained by aligning a first plurality of sequence reads generated from a sample to a reference genome sequence to obtain a second plurality of sequence reads that align to the RHD gene or the RHCE gene in the reference genome sequence. In some embodiments, a computing system stores the first plurality of sequence reads in memory. The computing system can load the first plurality of sequence reads into memory. Sequence reads can be aligned to the RHD gene or the RHCE gene of a reference sequence with an alignment quality score of zero or more. Sequence reads can be aligned to the RHD gene or the RHCE gene in a reference sequence with an alignment quality score of about zero (e.g., when the reads are aligned to regions where the gene and its paralog are highly homologous).

[0053] In some embodiments, the sequence reads are obtained from a file containing sequencing information. In some embodiments, the file is on a computer storage medium (such as a computer hard drive, e.g., a rotating magnetic disk drive or a solid state drive). In some embodiments, the file is stored in the form of a BAM, SAM, CRAM, or VCF file. In some embodiments, the sequence reads cover the breakpoint regions of the RHCE*CE-D(2)-CE gene conversion event.

Estimation of combined copy number

[0054] In some embodiments, estimating the combined copy number of the RHD gene and the RHCE gene involves counting the sequence reads that align to the RHD gene or the RHCE gene. In some embodiments, the combined copy number of the RHD gene and the RHCE gene is estimated by counting the total number of reads that align to either RHD or RHCE in the reference genome sequence. In some embodiments, this count includes the sequence reads that can be mapped with equal confidence to either the RHD gene or the RHCE gene (mapping quality of zero). In some embodiments, the sequence reads align to regions in both the RHD gene and the RHCE gene; because of the high homology between these genes, the sequences of the two regions are identical, so the mapping quality is zero. In some embodiments, by counting the sequence reads with low mapping quality (including a mapping quality of zero), the combined copy number of the RHD gene and the RHCE gene can be estimated despite the high sequence homology.
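
The counting step described above can be sketched as follows. This is only an illustrative sketch: the `Read` record, the function names, and the region coordinates are hypothetical placeholders (the coordinates are not the actual gene coordinates). The point illustrated is that no minimum mapping-quality filter is applied, so reads with a mapping quality of zero are retained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    mapq: int   # 0 when the aligner cannot choose between RHD and RHCE


# Hypothetical placeholder intervals, NOT the real gene coordinates.
RHD_REGION = ("chr1", 1_000, 2_000)
RHCE_REGION = ("chr1", 5_000, 6_000)


def overlaps(read, region):
    """True if the read overlaps the (chrom, start, end) region."""
    chrom, start, end = region
    return read.chrom == chrom and read.start < end and read.end > start


def count_combined_reads(reads, regions=(RHD_REGION, RHCE_REGION)):
    """Count reads overlapping either paralog; MAPQ-0 reads are kept on purpose."""
    return sum(1 for r in reads if any(overlaps(r, reg) for reg in regions))
```

Because a read from a homologous region may be placed on either paralog, counting across both regions together yields a combined signal that does not depend on which paralog the aligner happened to choose.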

[0055] In some embodiments, estimating the copy number comprises normalizing the count of sequence reads that align to the RHD gene or the RHCE gene and applying a Gaussian mixture model. In some embodiments, the Gaussian mixture model comprises a plurality of Gaussians, each representing a different integer copy number given the normalized number of sequence reads aligned to the RHD gene or the RHCE gene (e.g., normalized and / or corrected sequence reads). For example, the read count can be normalized against the length of the region, and further against a set of 3000 genomic regions of 2000 bp each that are predicted to be diploid and consistent across the entire population. In some embodiments, the Gaussian mixture model is then used to infer the most likely combined copy number of the RHD and RHCE genes based on the observed normalized depth signal.
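
One way the normalization described above might be sketched is shown below. The use of the median across the reference regions, the scaling by a ploidy of 2, and the function name `normalized_copy_number` are assumptions made for illustration; the text only states that the count is normalized by region length and against a set of diploid reference regions.

```python
from statistics import median


def normalized_copy_number(target_reads, target_length, ref_counts,
                           ref_length=2000, ref_ploidy=2):
    """Scale the target read density so that diploid reference regions map to 2.

    target_reads  -- reads aligned to the RHD or RHCE regions (combined)
    target_length -- total length in bp of the regions counted
    ref_counts    -- read counts of the diploid reference regions
    ref_length    -- length in bp of each reference region
    """
    if target_length <= 0 or ref_length <= 0 or not ref_counts:
        raise ValueError("lengths must be positive and ref_counts non-empty")
    target_density = target_reads / target_length
    # Assumption: the median makes the baseline robust to outlier regions.
    ref_density = median(count / ref_length for count in ref_counts)
    if ref_density == 0:
        raise ValueError("reference regions have no coverage")
    return ref_ploidy * target_density / ref_density
```

For example, 400 reads over 10,000 bp against reference regions holding 40 reads per 2000 bp gives a normalized depth signal of about 4, which corresponds to two copies each of RHD and RHCE.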

[0056] The total copy number can be, for example, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more. The Gaussian mixture model can be a one-dimensional Gaussian mixture model. The plurality of Gaussians of the Gaussian mixture model can represent integer copy numbers, for example, 0-5, 0-6, 0-7, 0-8, 0-9, 0-10, 0-11, 0-12, 0-13, 0-14 or 0-15. For example, the plurality of Gaussians can represent integer copy numbers from 0 to 10. The mean of each of the plurality of Gaussians can be the integer copy number represented by that Gaussian (such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or more). The standard deviation of each Gaussian can be, for example, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 or more, or can be about 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 or more. The Gaussian mixture model can include, for example, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, or more Gaussians. For example, the Gaussian mixture model can include 5 Gaussians.

[0057] To estimate the combined copy number of the RHD gene and the RHCE gene, the computing system can apply the Gaussian mixture model and a predetermined posterior probability threshold to the normalized number of sequence reads aligned to the RHD gene or the RHCE gene, thereby determining the total copy number of the RHD gene and the RHCE gene. The posterior probability threshold can be, for example, 0.7, 0.75, 0.8, 0.85, 0.95, or higher.
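
The following is a minimal sketch of a one-dimensional Gaussian mixture with a posterior threshold. It assumes equal prior weights for all copy-number states and a single shared standard deviation of 0.3; neither assumption is stated in the text, and the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def call_total_copy_number(depth, max_cn=10, sd=0.3, threshold=0.95):
    """Return (copy_number or None, posterior of the best state).

    One Gaussian per integer copy number 0..max_cn, each centered on that
    integer. Equal priors are assumed, so the posterior is the normalized
    likelihood. A no-call (None) is returned below the threshold.
    """
    likelihoods = [gaussian_pdf(depth, cn, sd) for cn in range(max_cn + 1)]
    total = sum(likelihoods)
    if total == 0.0:
        return None, 0.0
    posteriors = [value / total for value in likelihoods]
    best = max(range(max_cn + 1), key=posteriors.__getitem__)
    best_posterior = posteriors[best]
    return (best if best_posterior >= threshold else None), best_posterior
```

A normalized depth of 4.05 is called as a total copy number of 4 with high posterior probability, whereas a depth of 4.5 lies midway between two states and produces a no-call because neither state reaches the threshold.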

Estimation of gene-specific copy number at predetermined differentiation sites

[0058] In some embodiments, the methods and systems disclosed herein include estimating the copy number of RHD-specific bases and RHCE-specific bases at each of a plurality of predetermined differentiation sites of the RHD gene and the RHCE gene.

[0059] In some embodiments, sequencing information (such as basecalls) is evaluated at one or more predetermined differentiation sites. As used herein, "predetermined differentiation site" refers to a site in a nucleic acid sequence that differs between the sequences of the RHD gene and the RHCE gene. Predetermined differentiation sites can be, for example, fixed in a population and can be observed as base differences between the RHD gene and the RHCE gene in at least 90%, at least 95%, at least 98%, or at least 99% of the population. In some embodiments, RHCE*CE-D(2)-CE gene conversion results in a first breakpoint, and the plurality of predetermined differentiation sites include at least two predetermined differentiation sites adjacent to the first breakpoint. In some embodiments, the plurality of predetermined differentiation sites can include sites corresponding to positions selected from chr1:25405587, chr1:25405596, chr1:25409676, or chr1:25409958 of the reference genome hg38 (available, for example, from GenBank assembly accession GCA_000001405.15).

[0060] In some embodiments, the proportion of reads supporting the predicted RHD-specific bases and RHCE-specific bases is evaluated at each of the plurality of predetermined differentiation sites. For example, at each predetermined differentiation site, the sequence reads containing the RHD-specific base at that site can be counted, and the sequence reads containing the RHCE-specific base at that site can be counted. These counts can be normalized using the methods described above for estimating the combined copy number of the RHD gene and the RHCE gene.
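
The per-site counting can be sketched as below. The site table, the site names, and the representation of each read as a mapping from site name to observed base are hypothetical and chosen only for illustration; extracting the base a read carries at a site from an alignment is outside the scope of this sketch.

```python
from collections import Counter

# Hypothetical site table: site name -> (RHD-specific base, RHCE-specific base).
SITES = {"site1": ("C", "A"), "site2": ("G", "T")}


def count_specific_bases(read_bases, sites=SITES):
    """Tally RHD-specific, RHCE-specific, and other bases at each site.

    read_bases -- iterable of dicts, one per read, mapping site name to the
                  base that the read carries at that site.
    """
    counts = {name: Counter() for name in sites}
    for bases in read_bases:
        for name, base in bases.items():
            if name not in sites:
                continue
            rhd_base, rhce_base = sites[name]
            if base == rhd_base:
                counts[name]["RHD"] += 1
            elif base == rhce_base:
                counts[name]["RHCE"] += 1
            else:
                counts[name]["other"] += 1  # neither expected base (e.g. an error)
    return counts


def rhd_fraction(site_counts):
    """Proportion of informative reads that support the RHD-specific base."""
    informative = site_counts["RHD"] + site_counts["RHCE"]
    return site_counts["RHD"] / informative if informative else None
```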

[0061] A computing system (such as server device 3102) can determine a normalized number of sequence reads containing RHD-specific bases or RHCE-specific bases at a given predetermined differentiation site. To do so, the computing system can use (1a) the depth of sequence reads that align to the predetermined differentiation site and contain RHD-specific bases or RHCE-specific bases, (1b) the length of the predetermined differentiation site, (2a) the depth of sequence reads aligned to regions of the RHD gene or the RHCE gene that do not include the predetermined differentiation site, and / or (2b) the length of each such region.

Estimation of the probability of an RHCE*CE-D(2)-CE gene conversion event

[0062] In some embodiments, the methods and systems disclosed herein include calculating the probability of RHCE*CE-D(2)-CE gene conversion in a nucleic acid sample based on the estimated combined copy number of the RHD gene and the RHCE gene and the estimated copy numbers of RHD-specific bases and RHCE-specific bases at each of a plurality of predetermined differentiation sites.

[0063] In some embodiments, gene-specific copy numbers (e.g., the copy numbers of the RHD gene and the RHCE gene, respectively) are estimated for each of a plurality of predetermined differentiation sites. The gene-specific copy number can be based on the proportion of sequence reads that include RHD-specific bases or RHCE-specific bases at the predetermined differentiation site. In some embodiments, the method includes estimating the gene-specific copy number at each predetermined differentiation site by multiplying the proportion of sequence reads that support the RHD-specific bases or the RHCE-specific bases at that site by the estimated combined copy number. The gene-specific copy number can be, for example, 0, 1, 2, 3, 4, or more. The gene-specific copy number can be an integer.
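
The multiplication described above can be illustrated with a short sketch. Rounding to the nearest integer and assigning the remainder to RHCE are assumptions made for illustration; the function name is hypothetical.

```python
def gene_specific_copy_numbers(rhd_reads, rhce_reads, combined_cn):
    """Split the combined copy number by the fraction of supporting reads.

    Returns (RHD-specific copy number, RHCE-specific copy number) at one site.
    """
    total = rhd_reads + rhce_reads
    if total == 0:
        raise ValueError("no informative reads at this site")
    rhd_fraction = rhd_reads / total
    # Assumption: round to the nearest integer state; the remainder is RHCE.
    rhd_cn = round(rhd_fraction * combined_cn)
    return rhd_cn, combined_cn - rhd_cn


print(gene_specific_copy_numbers(30, 30, 4))  # (2, 2)
print(gene_specific_copy_numbers(45, 15, 4))  # (3, 1)
```

In the second call, three quarters of the reads support the RHD-specific base, so with a combined copy number of 4 the site carries three RHD-like copies and one RHCE-like copy, the pattern expected inside a heterozygous conversion of RHCE sequence to RHD sequence.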

[0064] In some embodiments, the method includes detecting a change in the gene-specific copy number (e.g., a change in the proportion of reads that support the RHD-specific bases or the RHCE-specific bases) at successive predetermined differentiation sites, and estimating the probability of an RHCE*CE-D(2)-CE gene conversion event. For example, if a region of either the RHD gene or the RHCE gene is replaced by the corresponding region of the other gene, the proportion of reads that support the RHD-specific bases and the RHCE-specific bases at the predetermined differentiation sites in that region should increase or decrease accordingly.
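
Detecting such a change can be sketched as a scan over the gene-specific copy numbers at position-ordered sites. The function name and the example values are hypothetical; the sketch only locates transitions and does not compute a probability.

```python
def copy_number_transitions(site_copy_numbers):
    """Return (index, before, after) for each change between adjacent sites.

    site_copy_numbers -- RHD-specific copy numbers at the predetermined
                         differentiation sites, ordered by genomic position.
    index refers to the left-hand site of each adjacent pair.
    """
    pairs = zip(site_copy_numbers, site_copy_numbers[1:])
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]


# Hypothetical profile: an RHD-like segment flanked by unchanged sites.
print(copy_number_transitions([2, 2, 3, 3, 2, 2]))  # [(1, 2, 3), (3, 3, 2)]
```

Two transitions of opposite direction, as in this example, are consistent with a converted segment bounded by two breakpoints.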

[0065] In some embodiments, to determine the likelihood of RHCE*CE-D(2)-CE gene conversion, the computing system can determine the copy number of RHCE-specific bases at consecutive predetermined differentiation sites. For one or more pairs of consecutive predetermined differentiation sites among the plurality of predetermined differentiation sites, this determination can use (1) the sequence reads aligned to the RHD gene or the RHCE gene that each contain two or more RHCE-specific bases at the consecutive predetermined differentiation sites, (2) the sequence reads aligned to the RHD gene or the RHCE gene that each contain an RHD-specific base and an RHCE-specific base, or an RHCE-specific base and an RHD-specific base, at the consecutive predetermined differentiation sites, and / or (3) the sequence reads aligned to the RHD gene or the RHCE gene that each contain an RHCE-specific base at the consecutive predetermined differentiation sites.

Identification of sequence reads spanning breakpoints

[0066] In some embodiments, the methods and systems disclosed herein include identifying one or more sequence reads that span a first breakpoint and contain an RHD-specific base at a first predetermined differentiation site adjacent to the first breakpoint and an RHCE-specific base at a second predetermined differentiation site adjacent to the first breakpoint. Thus, the method may include identifying, among a plurality of sequence reads aligned to the RHD gene or the RHCE gene, one or more sequence reads that cover one of the two breakpoints of the RHCE*CE-D(2)-CE gene conversion, include at least two predetermined differentiation sites, one on each side of the breakpoint, and have an RHD-specific base at a first predetermined differentiation site adjacent to the breakpoint and an RHCE-specific base at a second predetermined differentiation site adjacent to the breakpoint.
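
A read-level check of this kind might look like the sketch below. As before, the site names, the bases in the site table, and the representation of a read as a mapping from site name to observed base are hypothetical.

```python
# Hypothetical flanking sites: name -> (RHD-specific base, RHCE-specific base).
FLANKING_SITES = {"left": ("C", "A"), "right": ("G", "T")}


def spans_breakpoint(read_bases, rhd_side, rhce_side, sites=FLANKING_SITES):
    """True if one read carries the RHD base on one side and the RHCE base on the other.

    read_bases -- dict mapping site name to the base observed in the read
    rhd_side   -- name of the flanking site expected to be RHD-like
    rhce_side  -- name of the flanking site expected to be RHCE-like
    """
    if rhd_side not in read_bases or rhce_side not in read_bases:
        return False  # the read does not cover both flanking sites
    rhd_base = sites[rhd_side][0]
    rhce_base = sites[rhce_side][1]
    return read_bases[rhd_side] == rhd_base and read_bases[rhce_side] == rhce_base


def breakpoint_spanning_reads(reads, rhd_side, rhce_side, sites=FLANKING_SITES):
    """Filter the reads that support the breakpoint between the two sites."""
    return [r for r in reads if spans_breakpoint(r, rhd_side, rhce_side, sites)]
```

A read that carries the RHD-specific base at one flanking site and the RHCE-specific base at the other is direct evidence of a junction between RHD-derived and RHCE-derived sequence on a single molecule.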

[0067] In some embodiments, RHCE*CE-D(2)-CE gene conversion results in a second breakpoint, and the plurality of predetermined differentiation sites include at least two predetermined differentiation sites adjacent to the second breakpoint. In some embodiments, the method further includes identifying one or more sequence reads that span the second breakpoint and contain an RHD-specific base at a first predetermined differentiation site adjacent to the second breakpoint and an RHCE-specific base at a second predetermined differentiation site adjacent to the second breakpoint. Thus, in some embodiments, the method includes, for each of the two breakpoints of the RHCE*CE-D(2)-CE gene conversion, identifying one or more sequence reads that span the breakpoint and contain an RHD-specific base at a first predetermined differentiation site adjacent to the breakpoint and an RHCE-specific base at a second predetermined differentiation site adjacent to the breakpoint.

[0068] In some embodiments, the predetermined differentiation site adjacent to the breakpoint is selected from sites corresponding to positions selected from chr1:25405587, chr1:25405596, chr1:25409676, or chr1:25409958 of the reference genome hg38.

Construction of candidate haplotypes

[0069] In some embodiments, the methods and systems disclosed herein further include constructing one or more candidate haplotypes. In some embodiments, the one or more candidate haplotypes cover the breakpoint region of the RHCE*CE-D(2)-CE gene conversion.

[0070] In some embodiments, constructing one or more candidate haplotypes involves phasing a given polymorphic site using sequence reads aligned to the RHD gene or the RHCE gene. In some embodiments, phasing a given polymorphic site involves constructing one or more candidate haplotypes based on all the sequenced bases of a first given polymorphic site and extending the one or more candidate haplotypes to a second given polymorphic site by aligning sequence reads of the RHD gene or the RHCE gene.

[0071] For example, a candidate haplotype can be formed from all the sequenced bases at a first given polymorphic site. For example, two candidate haplotypes can be formed if there can be two bases at the first given polymorphic site based on basecalls from sequence reads covering the first given polymorphic site. In some embodiments, the haplotype is then extended to the next given polymorphic site by considering all the sequence reads that can be uniquely assigned to a single candidate haplotype. In some embodiments, if such a sequence read supports only a single base at the next polymorphic site of a given candidate haplotype, that haplotype is extended with that base. In some embodiments, if a candidate haplotype can be extended by two hypothesized bases at a second given polymorphic site, both extended hypothesized haplotypes are included in the set of candidate haplotypes, and the set increases by one. In some embodiments, the next extension step is performed at a third given polymorphic site, and the steps can be repeated until processing of all sites is complete. In some embodiments, this process results in a set of candidate haplotypes based on the bases observed at multiple given polymorphic sites.
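
The extension procedure described above can be sketched as follows. The representation of each read as a mapping from site index to observed base is an assumption, as is the choice to carry forward a haplotype with the placeholder base "N" when no uniquely assignable read links it to the next site; the text does not specify that case.

```python
def consistent(read, hap):
    """True if the read shares at least one phased site with hap and agrees at all of them."""
    shared = [i for i in read if i < len(hap)]
    return bool(shared) and all(read[i] == hap[i] for i in shared)


def build_candidate_haplotypes(reads, n_sites):
    """Extend candidate haplotypes site by site using uniquely assignable reads.

    reads   -- list of dicts mapping site index (0-based) to observed base
    n_sites -- number of ordered polymorphic sites
    """
    # Seed: one candidate per base observed at the first site.
    haps = [(base,) for base in sorted({r[0] for r in reads if 0 in r})]
    for site in range(1, n_sites):
        extended = []
        for hap in haps:
            bases = set()
            for read in reads:
                if site not in read:
                    continue
                # Use the read only if it is consistent with exactly this candidate.
                if [h for h in haps if consistent(read, h)] == [hap]:
                    bases.add(read[site])
            # Assumption: with no linking read, mark the site as unknown ("N").
            for base in sorted(bases) or ["N"]:
                extended.append(hap + (base,))
        haps = extended
    return haps


reads = [
    {0: "C", 1: "G"}, {0: "A", 1: "T"},
    {1: "G", 2: "C"}, {1: "T", 2: "C"}, {1: "T", 2: "G"},
]
print(build_candidate_haplotypes(reads, 3))
# [('A', 'T', 'C'), ('A', 'T', 'G'), ('C', 'G', 'C')]
```

In this example the haplotype beginning A-T is supported by two different bases at the third site, so both extensions are kept and the candidate set grows from two to three.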

[0072] In some embodiments, the computing system constructs one or more candidate haplotypes derived from the RHD gene or the RHCE gene in a region of the RHCE gene that includes a plurality of predetermined differentiation sites, using sequence reads aligned to the RHD gene or the RHCE gene that include the plurality of predetermined differentiation sites. For example, the sequence reads can be aligned to a reference sequence such that the sequence reads overlap with the predetermined differentiation sites. The sequence reads can be aligned to a region of the RHD gene or the corresponding region of the RHCE gene that includes a plurality of predetermined differentiation sites with an alignment quality score of zero or more.

[0073] In some embodiments, one or more candidate haplotypes include a wild-type RHD haplotype, a wild-type RHCE haplotype, and / or an RHCE*CE-D(2)-CE haplotype. The RHCE*CE-D(2)-CE haplotype can include both RHD bases and RHCE bases. The RHCE*CE-D(2)-CE haplotype can be a recombinant variant. The RHCE*CE-D(2)-CE haplotype can include an RHCE mutant haplotype. Haplotypes can include reciprocal recombinant variants. Haplotypes can include non-reciprocal recombinant variants or gene conversion variants. The reference sequence can include a reference genomic sequence.

[0074] To phase one or more haplotypes derived from the RHD gene or the RHCE gene, the computing system can analyze linkage information between the predetermined differentiation sites, using sequence reads aligned to the RHD region or the RHCE region that include the plurality of predetermined differentiation sites. The phasing can use sequence reads that each align to two or more of the plurality of predetermined differentiation sites.

[0075] In some embodiments, the first predetermined differentiation site and the second predetermined differentiation site may be adjacent to a breakpoint of the RHCE*CE-D(2)-CE gene conversion. In some embodiments, the predetermined differentiation site adjacent to the breakpoint is selected from sites corresponding to positions selected from chr1:25405587, chr1:25405596, chr1:25409676, or chr1:25409958 of the reference genome hg38.

[0076] For example, the boundaries of the RHCE*CE-D(2)-CE gene conversion event can be confirmed by phasing the predetermined differentiation sites using sequencing reads mapped to either the RHD gene or the RHCE gene in each breakpoint region. In some embodiments, the method further includes confirming the RHCE*CE-D(2)-CE gene conversion by identifying a sequencing read or sequencing read pair that spans an RHCE*CE-D(2)-CE breakpoint and contains an RHD-specific base and an RHCE-specific base at consecutive predetermined differentiation sites.

Identification of predetermined differentiation sites

[0077] Disclosed herein are methods and systems for identifying a plurality of predetermined differentiation sites. In some embodiments, the method includes identifying single nucleotide differences between the sequences of the RHD gene and the RHCE gene of a reference sequence. For example, the reference sequence of the RHD gene and the reference sequence of the RHCE gene can be aligned and compared with each other to identify all sites containing single nucleotide differences between these two gene sequences. Then, the positions of these differentiation sites in both the RHD and RHCE genes can be stored in an electronic storage device. For example, a file containing a list of single nucleotide differences can be created.
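
Comparing the two aligned reference sequences can be sketched as below. The sketch assumes the two gene sequences have already been aligned to each other (gaps written as "-") and brought into the same orientation; the function name and the short example sequences are hypothetical.

```python
def single_nucleotide_differences(rhd_aligned, rhce_aligned):
    """List mismatching columns of a pairwise alignment.

    Returns tuples (rhd_pos, rhce_pos, rhd_base, rhce_base), where the
    positions are 1-based offsets within each ungapped gene sequence.
    Columns containing a gap are skipped; they are not single nucleotide
    differences.
    """
    if len(rhd_aligned) != len(rhce_aligned):
        raise ValueError("aligned sequences must have equal length")
    differences = []
    rhd_pos = rhce_pos = 0
    for rhd_base, rhce_base in zip(rhd_aligned.upper(), rhce_aligned.upper()):
        if rhd_base != "-":
            rhd_pos += 1
        if rhce_base != "-":
            rhce_pos += 1
        if "-" not in (rhd_base, rhce_base) and rhd_base != rhce_base:
            differences.append((rhd_pos, rhce_pos, rhd_base, rhce_base))
    return differences


print(single_nucleotide_differences("ACGT-AC", "ACCTGAT"))
# [(3, 3, 'G', 'C'), (6, 7, 'C', 'T')]
```

Tracking a separate position counter for each gene is what allows the position of every differentiation site to be recorded in both RHD and RHCE coordinates.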

[0078] In some embodiments, the method includes selecting, as the differentiation sites, single nucleotide differences that are fixed across the population. For example, the method may include receiving a plurality of sequence reads that align to the RHD gene and the RHCE gene for a plurality of nucleic acid samples (such as a plurality of nucleic acid samples from a population of individuals). In some embodiments, the plurality of nucleic acid samples are from a population of more than 100, more than 500, more than 1,000, more than 5,000, or more than 10,000 individuals. In some embodiments, the population is a diverse population, such as a genetically diverse population that includes individuals from multiple ethnic groups, which accounts for differences between population groups and increases the likelihood that the selected single nucleotide differences do not reflect population-specific variation. The method may further include estimating the gene-specific copy number of the RHD gene and of the RHCE gene for each of the plurality of nucleic acid samples. The method may further include selecting a subset of the nucleic acid samples from the plurality of nucleic acid samples, the subset including nucleic acid samples that are estimated to be diploid for the RHD gene and diploid for the RHCE gene (such as only using data from samples that are estimated not to include RHCE*CE-D(2)-CE gene conversion). The method may further include selecting a single nucleotide difference having a copy number that is consistent with diploid for the RHD gene and the RHCE gene in at least 90%, at least 95%, at least 97%, at least 98%, or at least 99% of the nucleic acid samples in the subset of nucleic acid samples.
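
The population-level selection can be sketched as a simple frequency filter. The input layout (one dictionary per sample in the diploid subset, mapping a site name to its estimated RHD-specific and RHCE-specific copy numbers) and the function name are assumptions made for illustration.

```python
def select_fixed_sites(per_sample_site_cn, min_fraction=0.98):
    """Keep sites whose copy numbers look diploid in enough samples.

    per_sample_site_cn -- list of dicts, one per sample in the diploid
                          subset, mapping site name to the pair
                          (RHD-specific copy number, RHCE-specific copy number)
    min_fraction       -- minimum fraction of samples in which the site must
                          show two RHD copies and two RHCE copies
    """
    n_samples = len(per_sample_site_cn)
    if n_samples == 0:
        return []
    site_names = sorted(set().union(*per_sample_site_cn))
    selected = []
    for name in site_names:
        diploid = sum(1 for sample in per_sample_site_cn
                      if sample.get(name) == (2, 2))
        if diploid / n_samples >= min_fraction:
            selected.append(name)
    return selected
```

A candidate site that shows an unexpected split in a noticeable fraction of otherwise diploid samples is likely polymorphic in the population rather than a fixed difference between the two genes, so it is excluded.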

[0079] The method may further include generating a file that includes the plurality of predetermined differentiation sites by listing the positions of the selected single nucleotide differences. In some embodiments, the file is on a computer storage medium (such as a computer hard drive, for example, a rotating magnetic disk drive or a solid state drive). In some embodiments, the file is stored in the form of a BAM, SAM, CRAM, or VCF file. The file may include information regarding each predetermined differentiation site, such as the name of the chromosome on which the site is located, the 1-based inclusive start position in RHCE, the base sequence expected for RHCE reads mapped to that RHCE start position, the 1-based inclusive start position in RHD, the base sequence expected for RHD reads mapped to that RHD start position, the RHCE region corresponding to the RHD start position, the unique name of the predetermined differentiation site, and / or the orientation of the predetermined differentiation site given by the gene orientation.

Variant calling

[0080] In some embodiments, the methods and systems disclosed herein further include performing variant calling at a plurality of predetermined differentiation sites. In some embodiments, variant calling is performed at each predetermined differentiation site in the gene that has undergone gene conversion (i.e., the RHCE gene), and the alternative allele is the base observed at the source of the gene conversion event (i.e., the RHD gene). In some embodiments, heterozygous or homozygous variant calling is performed based on the gene-specific copy number observed at each predetermined differentiation site within the gene conversion event region.

[0081] In some embodiments, variant calling is performed for RHCE*CE-D(2)-CE gene conversion. In some embodiments, variant calling includes homozygous or heterozygous variant calling for the RHCE*CE-D(2)-CE gene conversion and / or at each individual predetermined differentiation site.

[0082] In some embodiments, the methods and systems disclosed herein further include creating a file that includes variant calls. In some embodiments, the file is on a computer storage medium (such as a computer hard drive, for example, a rotating magnetic disk drive, or a solid state drive). In some embodiments, the file is stored in the form of a BAM, SAM, CRAM, or VCF file. In some embodiments, the file is a VCF file.

Consideration Regarding the Reverse Orientation of the RHD Gene and the RHCE Gene

[0083] In some embodiments, as depicted in FIGS. 1B and 1C, the RHD gene and the RHCE gene are paralogs with reverse orientations within the genome. Thus, in some embodiments, the methods and systems take into account the fact that the RHD gene and the RHCE gene have reverse orientations. In some embodiments, the reverse orientation of the RHD gene and the RHCE gene is considered when counting or identifying sequence reads that contain RHD-specific bases or RHCE-specific bases at a given differentiation site.

[0084] For example, the embodiment of FIG. 1B shows a given differentiation site having an RHD-specific base "C" (cytosine) and an RHCE-specific base "A" (adenine). As seen in the embodiment of FIG. 1C, the sequence reads that align to the RHD gene contain a C at the given differentiation site. If gene conversion from RHD to RHCE occurs at the given differentiation site, then, because of the reverse orientation of the RHD gene and the RHCE gene, the sequence reads that align to the RHCE gene are predicted to contain a "G" (guanine, the base complement of cytosine) at the given differentiation site, as seen in FIG. 1C.

[0085] Thus, in some embodiments, estimating the copy number of RHD-specific bases and RHCE-specific bases at each of a plurality of predetermined differentiation sites of the RHD gene and the RHCE gene includes counting sequence reads that include an RHD-specific base, or its complement, at a predetermined differentiation site among the plurality of predetermined differentiation sites, and counting sequence reads that include an RHCE-specific base, or its complement, at a predetermined differentiation site.
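The counting described in this paragraph can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the observed bases are reported relative to the forward reference strand at each locus, that the site lies in genes of opposite orientation, and that the inputs are plain lists of bases. The function and parameter names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def count_site_support(bases_at_rhd, bases_at_rhce, rhd_base, rhce_base):
    """Count reads supporting the RHD-specific and RHCE-specific base.

    bases_at_rhd  : bases seen at the site in reads aligned to RHD
    bases_at_rhce : bases seen at the paralogous site in reads aligned to RHCE
    rhd_base      : RHD-specific base as it appears in RHD-aligned reads
    rhce_base     : RHCE-specific base as it appears in RHCE-aligned reads
    """
    rhd_support = 0
    rhce_support = 0
    for base in bases_at_rhd:
        if base == rhd_base:
            rhd_support += 1
        elif base == COMPLEMENT[rhce_base]:
            # RHCE-like sequence sitting in RHD shows up as the complement
            rhce_support += 1
    for base in bases_at_rhce:
        if base == rhce_base:
            rhce_support += 1
        elif base == COMPLEMENT[rhd_base]:
            # RHD-like sequence sitting in RHCE shows up as the complement
            rhd_support += 1
    return rhd_support, rhce_support
```

With the bases of FIG. 1B (RHD-specific "C", RHCE-specific "A"), a "G" observed in a read aligned to RHCE is counted as support for the RHD-specific base, which matches the prediction described for FIG. 1C. Bases that match neither expectation (for example "N") are ignored.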

Embodiments of a sequencing system

[0086] FIG. 3A illustrates a diagram of an environment in which an RHCE*CE-D(2)-CE detection system 3106 can operate according to one or more implementations. In the following paragraphs, the RHCE*CE-D(2)-CE detection system is described with respect to exemplary implementations and diagrams showing examples of those implementations. For example, FIG. 3A illustrates a schematic diagram of a computing system 3000 in which the RHCE*CE-D(2)-CE detection system 3106 operates according to one or more implementations. As illustrated, the computing system 3000 includes one or more server devices 3102 connected to a user client device 3108, a local device 3118, and a sequencing device 3114 via a network 3112. The network 3112 can include any suitable network over which computing devices can communicate.

[0087] As shown in FIG. 3A, computing system 3000 includes server device(s) 3102. In various implementations, server device(s) 3102 can generate, receive, analyze, store, and transmit digital data, such as nucleic acid base calls or data of sequenced nucleic acid polymers. In some implementations, server device(s) 3102 receive various data, such as data from a sample genome and / or sequence reads, from sequencing device 3114. Further, server device(s) 3102 can communicate with user client device 3108. In particular, server device(s) 3102 can transmit data regarding sequence reads, direct nucleic acid base calls, nucleic acid base calls, and / or sequencing metrics to user client device 3108.

[0088] As shown, server device(s) 3102 includes sequencing application 3110. Generally, sequencing application 3110 analyzes data (such as call data) received from sequencing device 3114 or other locations to determine the nucleic acid base sequence of a nucleic acid polymer. For example, sequencing application 3110 can receive raw data from sequencing device 3114 and determine the nucleic acid base sequence for a sample genome or nucleic acid segment. In some implementations, sequencing application 3110 determines the sequence of nucleic acid bases in DNA and / or RNA segments, or oligonucleotides.

[0089] As further shown, sequencing application 3110 includes the RHCE*CE-D(2)-CE detection system 3106. As described below, the RHCE*CE-D(2)-CE detection system 3106 can detect RHCE*CE-D(2)-CE gene conversion events in a nucleic acid sample. For example, in some embodiments, the RHCE*CE-D(2)-CE detection system 3106 receives sequence reads obtained from a nucleic acid sample. The RHCE*CE-D(2)-CE detection system 3106 further estimates the copy numbers of the RHD gene and the RHCE gene in the nucleic acid sample. The RHCE*CE-D(2)-CE detection system 3106 further estimates the copy numbers of the RHD-specific bases and the RHCE-specific bases at each of a plurality of predetermined differentiation sites of the RHD gene and the RHCE gene. Based on the estimated copy numbers of the RHD gene and the RHCE gene, and the estimated copy numbers of the RHD-specific bases and the RHCE-specific bases at each of the plurality of predetermined differentiation sites, the RHCE*CE-D(2)-CE detection system 3106 can calculate the probability of RHCE*CE-D(2)-CE gene conversion in the nucleic acid sample.

[0090] Furthermore, although the RHCE*CE-D(2)-CE detection system 3106 is described as being implemented on the server device(s) 3102 as part of the sequencing application 3110, in some implementations, the RHCE*CE-D(2)-CE detection system 3106 is implemented (wholly or partially) on the user client device 3108, the sequencing device 3114, and / or the local device 3118. As described above, in some implementations, the RHCE*CE-D(2)-CE detection system 3106 is implemented by one or more other components of the computing system 3000, such as the sequencing device 3114. In particular, the RHCE*CE-D(2)-CE detection system 3106 can be implemented in a variety of different ways across the server device(s) 3102, the network 3112, the user client device 3108, the local device 3118, and the sequencing device 3114.

[0091] As can be further seen in FIG. 3A, the computing system 3000 includes a user client device 3108. In various implementations, the user client device 3108 can generate, store, receive, and transmit digital data. In particular, the user client device 3108 can receive data from the sequencing device 3114. As a further example, the user client device 3108 includes a sequencing application 3110. The sequencing application 3110 can be a web application or a native application (e.g., a mobile application or a desktop application) stored and executed on the user client device 3108. The user client device 3108 can receive data from the sequencing application 3110 and / or the RHCE*CE-D(2)-CE detection system 3106. For example, the user client device 3108 can receive a variant call file and / or an alignment file from the sequencing application 3110.

[0092] Furthermore, the sequencing application 3110 can (at runtime) cause the user client device 3108 to receive data from the RHCE*CE-D(2)-CE detection system 3106 and cause the user client device 3108 to present data from the sequencing device 3114 and / or the server device(s) 3102. Furthermore, the sequencing application 3110 can instruct the user client device 3108 to display data of variant calls, such as nucleic acid base calls or an indication of a calculated probability regarding RHCE*CE-D(2)-CE gene conversion events. Indeed, the user client device 3108 can display nucleic acid base call results of a genomic sample and / or an indication of a predicted RHCE*CE-D(2)-CE gene conversion.

[0093] As further seen in FIG. 3A, computing system 3000 includes a sequencing device 3114. In various implementations, sequencing device 3114 can sequence a genomic sample or other nucleic acid polymers. For example, sequencing device 3114 can analyze nucleic acid segments or oligonucleotides extracted from a genomic sample and generate data either directly or indirectly on sequencing device 3114. More specifically, sequencing device 3114 receives and analyzes nucleic acid sequences extracted from a genomic sample within a nucleotide sample slide (such as a flow cell). In one or more implementations, sequencing device 3114 can sequence a genomic sample or other nucleic acid polymers using SBS. In addition to, or alternatively to, communication via network 3112, in some implementations, sequencing device 3114 bypasses network 3112 and communicates directly with user client device 3108.

[0094] As further depicted in FIG. 3A, in some implementations, server device(s) 3102 includes a collection of distributed servers, and server device(s) 3102 is distributed across network 3112 and includes several server devices located at the same physical location or different physical locations. For example, server device(s) 3102 can be implemented, in whole or in part, on local device 3118. By way of illustration, local device 3118 can implement sequencing application 3110 and / or the RHCE*CE-D(2)-CE detection system 3106. Further, server device(s) 3102 and / or local device 3118 can include a content server, an application server, a communication server, a web hosting server, or other types of servers.

[0095] The user client device 3108 illustrated in FIG. 3A can include various types of client devices. For example, in some implementations, the user client device 3108 includes non-mobile devices such as a desktop computer or a server, or other types of client devices. In various implementations, the user client device 3108 includes mobile devices such as a laptop, a tablet, a mobile phone, or a smart phone.

[0096] Although FIG. 3A illustrates components of the computing system 3000 communicating via the network 3112, in certain implementations, components of the computing system 3000 can also communicate directly with each other, bypassing the network 3112. For example, in some implementations, the user client device 3108 communicates directly with the sequencing device 3114. Further, in some implementations, the user client device 3108 communicates directly with the RHCE*CE-D(2)-CE detection system 3106 and / or the server device(s) 3102. In some implementations, the user client device 3108 communicates directly with the local device 3118. Further, the RHCE

[0097] FIG. 3B is a block diagram of an exemplary server device 3102 that can be used in conjunction with the exemplary computing system 3000 of FIG. 3A. The server device 3102 can be configured to detect RHCE*CE-D(2)-CE gene conversion in a nucleic acid sample. The general architecture of the server device 3102 depicted in FIG. 3B includes an arrangement of computer hardware and software components. The server device 3102 may include more (or fewer) elements than those shown in FIG. 3B. However, not all of these generally conventional elements need to be shown to provide an effective disclosure. As illustrated, the server device 3102 includes a processing unit 310, a network interface 320, a computer-readable media drive 330, an input / output device interface 340, a display 350, and an input device 360, all of which can communicate with each other via a communication bus. The network interface 320 can provide connectivity to one or more networks or computing systems. Thus, the processing unit 310 can receive information and instructions from other computing systems or services via the network. Further, the processing unit 310 can communicate with the memory 370 and can also provide output information for any display 350 via the input / output device interface 340. Additionally, the input / output device interface 340 can receive input from any input device 360, such as a keyboard, mouse, digital pen, microphone, touch screen, gesture recognition system, voice recognition system, game pad, accelerometer, gyroscope, or other input device.

[0098] Memory 370 may store computer program instructions (grouped as modules or components in some embodiments) that are executed by processing unit 310 to implement one or more embodiments. Generally, memory 370 includes RAM, ROM, and / or other persistent, auxiliary, or non-transitory computer-readable media. Memory 370 can store operating system 372 that provides computer program instructions used by processing unit 310 during general management and operation of server device 3102. Memory 370 can store reference genome 373, such as one used by sequencing application 3110. Memory 370 may further include computer program instructions for implementing aspects of the present disclosure and other information.

[0099] For example, in one embodiment, memory 370 may include sequencing application 3110, which may include the RHCE*CE-D(2)-CE detection system 3106. The RHCE*CE-D(2)-CE detection system 3106 can execute the methods disclosed herein. Additionally, memory 370 may include or communicate with data store 390 and / or one or more other data stores that store one or more inputs, one or more outputs, and / or one or more results (including intermediate results) related to detecting RHCE*CE-D(2)-CE gene conversion in nucleic acid samples of the present disclosure, such as aligned reads, determined candidate haplotypes, and determined variant calls (e.g., detection of RHCE*CE-D(2)-CE gene conversion).

[0100] In some embodiments, the disclosed systems and methods may involve an approach for shifting or distributing certain sequence data analysis functions and sequence data storage devices to a cloud computing environment or a cloud-based network. User interactions with sequencing data, genomic data, or other types of biological data may be mediated through a central hub that stores the data and controls access to various interactions with the data. In some embodiments, the cloud computing environment may also provide for the sharing of protocols, analysis methods, libraries, sequence data, and distributed processing for sequencing, analysis, and reporting. In some embodiments, the cloud computing environment facilitates the modification or annotation of sequence data by a user. In some embodiments, the systems and methods may be implemented on a computer browser, on-demand, or online.

[0101] In some embodiments, software written to execute the methods described herein is stored on some form of computer-readable medium such as a memory, CD-ROM, DVD-ROM, memory stick, flash drive, hard drive, SSD hard drive, server, mainframe storage system, and the like.

[0102] In some embodiments, the method may be written in any of various suitable programming languages, for example, compiled languages such as C, C#, C++, Fortran, and Java. Other programming languages may include scripting languages such as Perl, MatLab, SAS, SPSS, Python, Ruby, Pascal, Delphi, R, and PHP. In some embodiments, the method is written in C, C#, C++, Fortran, Java, Perl, R, or Python. In some embodiments, the method may be an independent application having data input and data display modules. Alternatively, the method may be a computer software product and may include classes in which distributed objects comprise applications that include the computational methods described herein.

[0103] In some embodiments, the method can be incorporated into existing data analysis software, such as that found in a sequencing instrument. Software that includes the computer-implemented methods described herein can be installed directly on a computer system or indirectly held on a computer-readable medium and loaded onto the computer system as needed. Further, the method can be located on a computer that is remote from where the data is generated, such as software found on a server that maintains the data in a location separate from where it is generated, such as that provided by a third-party service provider.

[0104] An assay instrument, desktop computer, laptop computer, or server can include a processor that operably communicates with an accessible memory that includes instructions for implementation of the system and method. In some embodiments, a desktop computer or laptop computer operably communicates with one or more computer-readable storage media or devices and / or output devices. An assay instrument, desktop computer, and laptop computer can operate under many different computer-based operating languages, such as those utilized by an Apple-based computer system or a PC-based computer system. An assay instrument, desktop, and / or laptop computer and / or server system can further provide a computer interface for creating or modifying experimental definitions and / or conditions, viewing data results, and monitoring experiment progress. In some embodiments, the output device can be a graphical user interface, such as a computer monitor or computer screen, a printer, a handheld device such as a personal digital assistant (e.g., a PDA, Blackberry®, or iPhone®), a tablet computer (e.g., iPad), a hard drive, a server, a memory stick, a flash drive, etc.

[0105] A computer-readable memory device or medium can be any device such as a server, mainframe, supercomputer, magnetic tape system, etc. In some embodiments, the memory device can be located in a location proximate to the assay device, e.g., adjacent to or in close proximity to the assay device. For example, the memory device can be located in the same room, the same building, an adjacent building, on the same floor within a building, on different floors within a building, etc., in relation to the assay device. In some embodiments, the memory device can be located outside or distal to the assay device. For example, the memory device can be located in a different location within the city, a different city, a different state, a different country, etc., relative to the assay device. In embodiments where the memory device is located distal to the assay device, communication between the assay device and one or more of a desktop, laptop, or server typically occurs via an Internet connection either wirelessly through an access point or by means of a network cable. In some embodiments, the memory device can be maintained and managed by an individual or entity directly associated with the assay device, while in other embodiments, the memory device can typically be maintained and managed by a third party at a location distal to the individual or entity associated with the assay device. In the embodiments described herein, the output device can be any device for visualizing data.

[0106] An assay device, desktop, laptop, and / or server system may be used to store and / or retrieve a computer-implemented software program incorporating computer code for performing and implementing the calculation methods described herein, data for use in the implementation of the calculation methods, and the like. One or more of the assay device, desktop, laptop, and / or server may include one or more computer-readable storage media for storing and / or retrieving a software program incorporating computer code for performing and implementing the calculation methods described herein, data for use in the implementation of the calculation methods, and the like. The computer-readable storage media may include, but are not limited to, one or more of a hard drive, SSD hard drive, CD-ROM drive, DVD-ROM drive, floppy disk, tape, or flash memory stick or card. Further, a network including the Internet may be a computer-readable storage medium. In some embodiments, computer-readable storage media refers to computational resource storage that is accessible over a computer network, via the Internet or a company network offered by a service provider, rather than, for example, storage on a local desktop or laptop computer at a location distal to the assay device.

[0107] In some embodiments, the computer-implemented software program incorporating computer code for performing and implementing the calculation methods described herein, and the computer-readable storage media for storing and / or retrieving data used in the implementation of the calculation methods, are operated and maintained by a service provider that is operably communicable with the assay device, desktop, laptop, and / or server system via an Internet connection or network connection.

[0108] In some embodiments, the hardware platform for providing a computing environment includes a processor (i.e., a CPU), where processor time and memory layout, such as random access memory (i.e., RAM), are system considerations. For example, smaller computer systems provide inexpensive, high-speed processors and large memory and storage capabilities. In some embodiments, a graphics processing unit (GPU) can be used. In some embodiments, the hardware platform for executing the computing methods described herein includes one or more computer systems having one or more processors. In some embodiments, smaller computers are clustered together to form a supercomputer network.

[0109] In some embodiments, the computing methods described herein are executed on an aggregate of inter- or intra-connected computer systems (i.e., grid technology) that can cooperatively execute various operating systems. For example, the CONDOR framework (University of Wisconsin-Madison) and systems available from United Devices are examples of the cooperation of multiple independent computer systems for the purpose of handling large amounts of data. These systems can provide a Perl interface for submitting, monitoring, and managing large sequence analysis jobs on clusters in a serial or parallel configuration.

Example

[0110] Some aspects of the above-described embodiments are disclosed in more detail in the following examples, which are not intended to limit the scope of the present disclosure. Those skilled in the art will understand that many other embodiments are also within the scope of the present disclosure, as described above and in the claims.

Example 1

[0111] For the RHD gene and the RHCE gene, the reference genome sequences were aligned with each other, and all sites with single nucleotide differences between the two gene sequences were selected. The positions of these differentiation sites in both the RHD and RHCE genes were recorded.
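A minimal sketch of this site-selection step is given here. It assumes that the two gene sequences have already been brought into the same orientation and aligned into gap-padded strings of equal length (with "-" as the gap character), and it reports positions as 1-based offsets within each input sequence rather than genome coordinates. The function name and input representation are hypothetical.

```python
def single_nucleotide_differences(aligned_rhd, aligned_rhce):
    """Return (rhd_pos, rhce_pos, rhd_base, rhce_base) for each mismatch column.

    Columns containing a gap in either sequence are skipped, so only
    single nucleotide differences are reported.
    """
    if len(aligned_rhd) != len(aligned_rhce):
        raise ValueError("aligned sequences must have equal length")
    sites = []
    rhd_pos = 0   # 1-based offset of the last RHD base consumed
    rhce_pos = 0  # 1-based offset of the last RHCE base consumed
    for d, e in zip(aligned_rhd, aligned_rhce):
        if d != "-":
            rhd_pos += 1
        if e != "-":
            rhce_pos += 1
        if d != "-" and e != "-" and d != e:
            sites.append((rhd_pos, rhce_pos, d, e))
    return sites
```

Because the position counters advance independently, an insertion or deletion upstream of a site shifts the RHD and RHCE offsets relative to each other, which is why both positions are recorded for every site.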

[0112] In the 1000 Genomes Project, nucleic acid samples from a diverse population cohort of approximately 3200 individuals were profiled by Illumina® sequencing. Using short sequence reads from these samples, it was determined whether each single nucleotide difference between RHD and RHCE was fixed across the population. For this purpose, a subset of samples with an estimated total copy number of RHD+RHCE equal to 4 (the diploid state) was selected, limiting the analysis to samples without copy number variation. A further set of samples was then filtered out: a sample was excluded if, at a significant proportion (10% or more) of the differentiation sites between RHD and RHCE, the proportions of reads supporting the RHD-specific base (RHD allele) and the RHCE-specific base (RHCE allele) conflicted with the hypothesis that the sample had two copies of each gene (the diploid hypothesis). This step excluded samples for which the diploid hypothesis broke down for either gene, as well as samples with large-scale gene conversion events.

[0113] Using the filtered subset of samples, each differentiation site between the RHD gene and the RHCE gene was then filtered based on how consistent the proportion of reads supporting the RHD allele or the RHCE allele was across the entire selected set of samples for each gene. A site was selected as a "fixed differentiation site" if at least 98% of the population samples had a similar proportion of reads supporting the RHD allele and the RHCE allele. If these proportions did not match, the site was excluded from the list of fixed differentiation sites. In total, 793 differentiation sites were identified in the RHCE and RHD genes; each was a single base pair difference in the homologous regions of RHCE and RHD that was found to be fixed in the population (occurring in more than 98% of the population).
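The two filtering steps of Example 1 (first samples, then sites) can be sketched as follows. The 10% and 98% thresholds come from the text; everything else is an assumption of this sketch, including the data layout, the function names, and in particular the test for whether a site is consistent with the diploid hypothesis (here, an RHD read fraction within an arbitrary tolerance of 0.5).

```python
def site_is_diploid_consistent(rhd_reads, rhce_reads, tolerance=0.15):
    """Assumed test: with two copies of each gene, about half the reads
    at a site should support each allele. The tolerance is illustrative."""
    total = rhd_reads + rhce_reads
    if total == 0:
        return False
    return abs(rhd_reads / total - 0.5) <= tolerance


def filter_samples(samples, max_conflict_fraction=0.10):
    """Keep samples with total copy number 4 whose conflicting sites
    make up less than 10% of all differentiation sites.

    samples: {sample_id: {"total_cn": int,
                          "counts": {site_id: (rhd_reads, rhce_reads)}}}
    """
    kept = {}
    for sample_id, sample in samples.items():
        counts = sample["counts"]
        if sample["total_cn"] != 4 or not counts:
            continue
        conflicts = sum(
            not site_is_diploid_consistent(*pair) for pair in counts.values()
        )
        if conflicts / len(counts) < max_conflict_fraction:
            kept[sample_id] = sample
    return kept


def select_fixed_sites(kept, min_fraction=0.98):
    """Select sites that are diploid-consistent in at least 98% of kept samples."""
    if not kept:
        return []
    site_ids = sorted({site for s in kept.values() for site in s["counts"]})
    selected = []
    for site in site_ids:
        consistent = sum(
            site in s["counts"] and site_is_diploid_consistent(*s["counts"][site])
            for s in kept.values()
        )
        if consistent / len(kept) >= min_fraction:
            selected.append(site)
    return selected
```

Filtering samples before sites matters here: a sample carrying a large gene conversion would otherwise make every site inside the converted region look unfixed, so such samples are removed before per-site consistency is measured.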

Example 2

[0114] Sequence reads from the HG002 reference sample that aligned to the RHCE gene or the RHD gene were taken as input. The copy number of both the RHCE and RHD genes was estimated from the depth of the reads aligned in the RH gene region, normalized by the read depth of 3000 normalization regions.
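One simple way to express such a depth-based estimate is sketched here. The text does not give the formula, so the details are assumptions: reads aligned to either paralog are pooled and divided by the length of one gene, the baseline is the median per-base depth of the normalization regions, and those regions are taken to be present in two copies. All names are hypothetical.

```python
from statistics import median


def estimate_total_copy_number(rh_read_count, rh_region_length,
                               norm_regions, ploidy=2):
    """Estimate the combined RHD+RHCE copy number from read depth.

    rh_read_count    : reads aligned to RHD or RHCE (pooled)
    rh_region_length : length in bases of one paralog's region
    norm_regions     : iterable of (read_count, length) pairs, one per
                       normalization region assumed to be at `ploidy` copies
    """
    norm_depths = [count / length for count, length in norm_regions]
    baseline = median(norm_depths)        # depth corresponding to `ploidy` copies
    rh_depth = rh_read_count / rh_region_length
    return ploidy * rh_depth / baseline   # unrounded estimate
```

The function returns an unrounded value; a caller would typically round it to the nearest integer, with a value near 4 corresponding to two copies of each gene.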

[0115] A file containing two pairs of differentiation sites adjacent to the RHCE*CE-D(2)-CE breakpoint sites, and a file containing the potential haplotypes of the RHCE*CE-D(2)-CE variant in the population, were provided as input. For the RHCE*CE-D(2)-CE gene conversion event, two breakpoints were identified, and their corresponding differentiation sites were located at chr1:25405587 and chr1:25405596 (hg38) for the first breakpoint, and chr1:25409676 and chr1:25409958 (hg38) for the second breakpoint.

[0116] Candidate haplotypes were formed by a series of extension steps using all reads overlapping a given differentiation site between the gene and its paralog, and the total number of haplotypes was obtained from the copy number of the RHCE gene and the RHD gene. The initial set of candidate haplotypes was formed from all candidate bases at the first differentiation site. The haplotypes were then extended to the next differentiation site by considering all reads that could be uniquely assigned to a single candidate haplotype. If such reads supported only a single base at the next differentiation site for a given candidate haplotype, the haplotype was extended with that base. If a candidate haplotype could be extended by both bases at the next differentiation site, both extended haplotypes were included in the set of candidate haplotypes, so that the set grew by one. Subsequent extension steps were performed at adjacent differentiation sites until all differentiation sites had been processed.
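The extension procedure can be illustrated with a small sketch. This is a simplified reading of the paragraph above, not the disclosed implementation: reads are represented as dictionaries mapping a site index to the label observed there ("CE" or "D"), the copy-number constraint on the total number of haplotypes is omitted, and a haplotype that no uniquely assigned read can extend receives a placeholder. The placeholder rule and all names are assumptions.

```python
UNKNOWN = "?"  # placeholder when no read informs the extension (assumption)


def compatible(hap, read):
    """True if the read agrees with the haplotype at every
    already-processed site that the read covers (at least one)."""
    shared = [i for i in read if i < len(hap)]
    return bool(shared) and all(hap[i] in (read[i], UNKNOWN) for i in shared)


def extend_haplotypes(reads, n_sites):
    """Build candidate haplotypes site by site.

    reads: list of {site_index: label} dictionaries
    Returns a list of tuples, one label per site.
    """
    # Initial candidates: every label observed at the first site.
    haps = sorted({(read[0],) for read in reads if 0 in read})
    for site in range(1, n_sites):
        extended = []
        for hap in haps:
            labels = set()
            for read in reads:
                if site not in read:
                    continue
                matches = [h for h in haps if compatible(h, read)]
                if matches == [hap]:      # read is uniquely assigned to hap
                    labels.add(read[site])
            if not labels:
                labels = {UNKNOWN}
            # One label extends the haplotype; two labels split it in two.
            extended.extend(hap + (label,) for label in sorted(labels))
        haps = extended
    return haps
```

In this sketch a haplotype that is supported by both labels at the next site is replaced by two extended haplotypes, so the candidate set grows by one, as described in the text.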

[0117] To detect the RHCE*CE-D(2)-CE recombinant variant, a haplotype supporting the recombinant variant at the first breakpoint and a haplotype supporting the recombinant variant adjacent to the second breakpoint were identified. Since both of the identified candidate haplotypes supported the recombinant variant at their respective breakpoints, the RHCE*CE-D(2)-CE recombinant variant was detected.
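The breakpoint check can be sketched as follows, under stated assumptions: haplotypes are tuples of "CE" or "D" labels indexed by differentiation site, each breakpoint is given as a pair of site indices flanking it, and the outer site of each pair is expected to carry the RHCE label while the inner site carries the RHD label. The function names and this labeling convention are hypothetical.

```python
def supports_breakpoint(hap, left_site, right_site, left_label, right_label):
    """True if the haplotype switches labels across the breakpoint as expected."""
    return hap[left_site] == left_label and hap[right_site] == right_label


def detect_ce_d_ce(haplotypes, first_breakpoint, second_breakpoint):
    """Report the conversion when some haplotype supports each breakpoint.

    first_breakpoint  : (site_index, site_index) flanking the CE -> D switch
    second_breakpoint : (site_index, site_index) flanking the D -> CE switch
    """
    first = any(
        supports_breakpoint(h, *first_breakpoint, "CE", "D") for h in haplotypes
    )
    second = any(
        supports_breakpoint(h, *second_breakpoint, "D", "CE") for h in haplotypes
    )
    return first and second
```

The two breakpoints are tested independently because short reads may not span the whole converted region, so the supporting evidence can come from different candidate haplotypes.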

[0118] After the RHCE*CE-D(2)-CE gene conversion was detected, the copy number at each predetermined differentiation site within the gene conversion region was evaluated based on the number of reads carrying the RHCE-specific base at that site. When the estimated RHCE copy number at a predetermined differentiation site was 0, a homozygous variant was called for that site. When the estimated RHCE copy number at a predetermined differentiation site was 1, a heterozygous variant was called for that site. A VCF format file containing the variant calls was stored.
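The calling rule in this paragraph maps directly onto a small function. The sketch below takes an already estimated integer RHCE copy number for a site and returns a VCF-style genotype; the record formatter emits the eight fixed VCF columns plus a FORMAT column and one sample column. The function names are hypothetical, and returning no call for copy numbers of 2 or more is an assumption, since the text only describes the 0 and 1 cases.

```python
def genotype_from_rhce_copy_number(rhce_copy_number):
    """Map the RHCE-specific copy number at a site to a genotype string."""
    if rhce_copy_number == 0:
        return "1/1"   # no RHCE base left: homozygous variant
    if rhce_copy_number == 1:
        return "0/1"   # one RHCE base left: heterozygous variant
    return None        # assumption: no variant call otherwise


def vcf_record(chrom, pos, ref, alt, genotype):
    """Format one minimal VCF data line with a single GT sample column."""
    fields = [chrom, str(pos), ".", ref, alt, ".", ".", ".", "GT", genotype]
    return "\t".join(fields)
```

For example, `vcf_record("chr1", 12345, "A", "G", "0/1")` produces a heterozygous record; the position and bases in this call are placeholders, not values from the example.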

[0119] A VCF format file containing the variant calls generated by the method described in this example, as well as variant calls from other common variant calling methods, was compared with a "true VCF" file containing the variant calls considered to be the most representative of the HG002 sample. The "true VCF" file was also compared with a variant call file generated using a variant calling method not specific to RHCE*CE-D(2)-CE gene conversion. As shown in Figure 4, implementing the embodiments of the systems and methods for detecting RHCE*CE-D(2)-CE gene conversion eliminated 66 false negative variant calls; that is, 66 additional SNPs were correctly called as variants.

Other considerations

[0120] The embodiments described in this specification are exemplary. Modifications, rearrangements, alternative processes, etc. may be made to these embodiments and still be encompassed within the teachings described in this specification. One or more of the steps, processes, or methods described herein may be preferably performed by one or more suitably programmed processes and / or digital devices.

[0121] The various illustrative imaging or data processing techniques described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or a combination of both. To illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. The described functionality may be implemented in various manners for each particular application, but such implementation decisions should not be construed as causing a departure from the scope of the present disclosure.

[0122] The various exemplary detection systems described in connection with the embodiments disclosed herein can be implemented or executed by a machine device such as a processor configured with specific instructions, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. The processor may be a microprocessor, but alternatively, the processor may be a controller, a microcontroller, or a state machine, or a combination thereof. Also, the processor may be implemented as a combination of computing devices, such as, for example, a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or other such configurations. For example, the systems described herein can be implemented using discrete memory chips, a portion of the memory within a microprocessor, flash, EPROM, or other types of memory.

[0123] The elements of the methods, processes, or algorithms described in connection with the embodiments disclosed in this specification may be embodied directly in hardware, in software modules executed by a processor, or in a combination of the two. The software modules can reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of computer-readable storage medium known in the art. An exemplary storage medium can be coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium can be integral to the processor. The processor and the storage medium can reside in an ASIC. The software modules can include computer-executable instructions that cause a hardware processor to execute the computer-executable instructions.

[0124] In particular, conditional language used in this specification such as "can", "might", "may", "for example", etc., generally conveys that, unless otherwise specified or understood differently within the context in which it is used, a particular embodiment includes a particular feature, element, and / or state, while other embodiments do not include that particular feature, element, and / or state. Thus, such conditional language generally does not mean that a feature, element, and / or state is necessary in any way for one or more embodiments, or that one or more embodiments necessarily include logic for determining whether these features, elements, and / or states are included or are implemented in any particular embodiment, with or without author input or prompting. Terms such as "comprises", "including", "has", "involving", etc. are synonymous and are used in an inclusive open-ended manner, not excluding additional elements, features, acts, operations, etc. Also, the term "or", when used, for example, to connect a list of elements, is used in its inclusive sense (not its exclusive sense) such that the term "or" means one, some, or all of the elements in the list.

[0125] Unless otherwise stated, disjunctive language such as "at least one of X, Y, or Z" is generally understood within the context to present that an item, term, etc. may be any of X, Y, or Z, or a combination thereof, as would normally be used to indicate such. Thus, such disjunctive language generally does not and is not intended to imply that a particular embodiment requires the presence of at least one of X, at least one of Y, or at least one of Z, respectively.

[0126] Terms such as "about" or "approximately" are synonymous and are used to indicate that the value modified by such terms has an understood range associated therewith, which may be ±20%, ±15%, ±10%, ±5%, or ±1%. The term "substantially" is used to indicate that the result (such as a measured value) is close to the target value, where close, for example, can mean that the result is within 80% of the value, within 90% of the value, within 95% of the value, or within 99% of the value.

[0127] Unless otherwise specified, articles such as "a" or "an" should generally be construed to include one or more of the items described. Thus, phrases such as "a device configured to" or "a device for" are intended to include one or more of the recited devices. Such one or more recited devices may also be collectively configured to carry out the stated recitations. For example, "a processor configured to carry out recitations A, B, and C" can include a first processor configured to carry out recitation A operating in conjunction with a second processor configured to carry out recitations B and C.

[0128] The above detailed description has shown, described, and pointed out novel features applicable to exemplary embodiments, but it will be understood that various omissions, substitutions, and changes in the form and details of the devices or algorithms shown can be made without departing from the spirit of the present disclosure. As will be recognized, certain embodiments described herein may be embodied in forms that do not provide all of the features and advantages described herein, as some features may be used or implemented separately from others. All changes that fall within the meaning and equivalent scope of the claims are included within their scope.

[0129] It should be understood that all combinations of the foregoing concepts (subject to the condition that such concepts are not mutually inconsistent) are intended to be part of the subject matter of the invention disclosed herein. Specifically, all combinations of the claimed subject matter that appear at the end of this disclosure are intended to be part of the subject matter of the invention disclosed herein.

[0130] The scope of this disclosure is not intended to be limited by the specific disclosure of the examples in this section or elsewhere in this specification, but may be defined by the claims presented in this section or elsewhere in this specification, or claims to be presented in the future. The language of the claims should be broadly construed based on the language used in the claims and not limited to the examples described in this specification or during the prosecution of this application, and the examples should be construed as non-exclusive.

Explanation of Reference Signs

[0131]
310 Processing Unit
320 Network Interface
330 Computer-Readable Medium
340 Input / Output Device Interface
350 Display
360 Input Device
370 Memory
372 Operating System
373 Reference Genome
390 Data Store
3000 Computing System
3102 Server Device(s)
3106 RHCE*CE-D(2)-CE Detection System
3108 User Client Device
3110 Array Determination Application
3112 Network
3114 Array Determination Device
3118 Local Device

Claims

1. A computer-based method for detecting an RHCE*CE-D(2)-CE gene conversion event in a nucleic acid sample, the method comprising:
receiving sequence reads that align to the RHD gene or the RHCE gene;
estimating a composite copy number of the RHD gene and the RHCE gene in the nucleic acid sample by counting the sequence reads that align to the RHD gene or the RHCE gene;
estimating copy numbers of RHD-specific bases and RHCE-specific bases at each of a plurality of predetermined differentiation sites of the RHD gene and the RHCE gene by counting sequence reads containing RHD-specific bases at the plurality of predetermined differentiation sites and counting sequence reads containing RHCE-specific bases at the same predetermined differentiation sites, wherein the gene-specific copy number at each of the plurality of predetermined differentiation sites is estimated based on a value obtained by multiplying the proportion of sequence reads containing the RHD-specific base or the RHCE-specific base at that site by the estimated composite copy number of the RHD gene and the RHCE gene; and
calculating a probability of RHCE*CE-D(2)-CE gene conversion in the nucleic acid sample based on the estimated composite copy number of the RHD gene and the RHCE gene and the estimated copy numbers of the RHD-specific bases and the RHCE-specific bases at each of the plurality of predetermined differentiation sites.
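By way of non-limiting illustration, the per-site arithmetic recited in claim 1 (read proportion multiplied by the composite copy number) might be sketched as follows. The function name, argument names, and read counts are hypothetical and are not taken from this specification.

```python
def site_copy_numbers(rhd_reads, rhce_reads, composite_cn):
    """Split the composite copy number at one differentiation site.

    Each gene-specific copy number is the fraction of reads carrying that
    gene's base at the site, multiplied by the composite RHD+RHCE copy number.
    """
    total = rhd_reads + rhce_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    rhd_cn = composite_cn * rhd_reads / total
    rhce_cn = composite_cn * rhce_reads / total
    return rhd_cn, rhce_cn


# Hypothetical site: 30 reads carry the RHD-specific base, 10 carry the
# RHCE-specific base, and the composite copy number was estimated as 4.
print(site_copy_numbers(30, 10, 4))  # (3.0, 1.0)
```

In this hypothetical example the site shows three copies of the RHD-specific base and one copy of the RHCE-specific base, rather than two of each.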

2. The method according to claim 1, wherein the RHCE*CE-D(2)-CE gene conversion creates a first breakpoint, and the plurality of predetermined differentiation sites include at least two predetermined differentiation sites adjacent to the first breakpoint.

3. The method according to claim 2, further comprising identifying one or more sequence reads that span the first breakpoint and contain an RHD-specific base in a first predetermined differentiation site adjacent to the first breakpoint, and contain an RHCE-specific base in a second predetermined differentiation site adjacent to the first breakpoint.
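By way of non-limiting illustration, identifying reads that span a breakpoint as recited in claim 3 might be sketched as follows. Representing a read as a mapping from position to observed base, and the positions and bases used, are assumptions made only for this example.

```python
def supports_breakpoint(read_bases, first_site, second_site):
    """Return True if a read carries the RHD-specific base at the first site
    and the RHCE-specific base at the second site.

    read_bases: dict mapping position -> base observed in the read.
    first_site, second_site: (position, rhd_base, rhce_base) tuples.
    """
    pos_a, rhd_base_a, _ = first_site
    pos_b, _, rhce_base_b = second_site
    # A read that does not cover a site returns None from .get() and fails.
    return (read_bases.get(pos_a) == rhd_base_a
            and read_bases.get(pos_b) == rhce_base_b)


# Hypothetical sites flanking a breakpoint (positions and bases are made up).
site_a = (100, "G", "A")
site_b = (180, "T", "C")
reads = [
    {100: "G", 180: "C"},  # RHD base, then RHCE base: spans the breakpoint
    {100: "A", 180: "C"},  # RHCE base at both sites
    {100: "G"},            # does not reach the second site
]
supporting = [r for r in reads if supports_breakpoint(r, site_a, site_b)]
print(len(supporting))  # 1
```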

4. The method according to claim 3, wherein the RHCE*CE-D(2)-CE gene conversion creates a second breakpoint, and the plurality of predetermined differentiation sites include at least two predetermined differentiation sites adjacent to the second breakpoint, the method further comprising identifying one or more sequence reads that span the second breakpoint and contain an RHD-specific base in a first predetermined differentiation site adjacent to the second breakpoint, and contain an RHCE-specific base in a second predetermined differentiation site adjacent to the second breakpoint.

5. The method according to claim 1, wherein calculating the probability of RHCE*CE-D(2)-CE gene conversion includes detecting changes in gene-specific copy number at consecutive predetermined differentiation sites.
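By way of non-limiting illustration, detecting changes in gene-specific copy number at consecutive differentiation sites might be sketched as follows. Rounding each estimate to the nearest integer before comparing neighbours is an assumption of this example; the claim does not specify how a change is detected.

```python
def copy_number_changes(site_cns):
    """Report where the integer copy-number call changes between neighbours.

    site_cns: gene-specific copy-number estimates ordered by position.
    Returns (index, previous_call, new_call) for each change.
    """
    calls = [round(cn) for cn in site_cns]  # assumption: nearest-integer call
    return [
        (i, calls[i - 1], calls[i])
        for i in range(1, len(calls))
        if calls[i] != calls[i - 1]
    ]


# Hypothetical RHD-specific copy numbers along five consecutive sites.
rhd_cns = [2.1, 1.9, 3.0, 2.9, 2.0]
print(copy_number_changes(rhd_cns))  # [(2, 2, 3), (4, 3, 2)]
```

In this hypothetical profile the call rises from 2 to 3 at the third site and returns to 2 at the fifth, so two change points bound a stretch with an extra copy of the RHD-specific bases.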

6. The method according to claim 1, wherein estimating the composite copy number includes:
normalizing the count of sequence reads aligned to the RHD gene or the RHCE gene; and
applying a Gaussian mixture model.
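By way of non-limiting illustration, the normalization and Gaussian mixture steps of claim 6 might be sketched as follows. The baseline region, the component means (copy number divided by two), the shared standard deviation, and the equal component weights are all assumptions of this example. The mixture parameters are fixed here rather than fitted, whereas in practice they may be fitted across many samples.

```python
import math


def normalize(read_count, region_len, baseline_count, baseline_len):
    """Depth in the RHD/RHCE region relative to a baseline region assumed
    to be present in two copies (1.0 corresponds to two copies)."""
    return (read_count / region_len) / (baseline_count / baseline_len)


def copy_number_posteriors(depth, max_cn=6, sd=0.1):
    """Posterior over copy numbers 0..max_cn under a Gaussian mixture with
    equal weights, component means cn / 2, and a shared standard deviation."""
    logs = [-0.5 * ((depth - cn / 2) / sd) ** 2 for cn in range(max_cn + 1)]
    top = max(logs)  # subtract the maximum to avoid underflow
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical counts: the RHD+RHCE region has twice the baseline depth.
depth = normalize(4000, 100_000, 2000, 100_000)
post = copy_number_posteriors(depth)
best = max(range(len(post)), key=post.__getitem__)
print(best)  # 4
```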

7. The method according to any one of claims 1 to 6, wherein the orientation of the RHD gene and the RHCE gene is reversed.

8. The method according to any one of claims 1 to 6, wherein the plurality of predetermined differentiation sites are identified by:
identifying single base differences between the sequences of the RHD gene and the RHCE gene in the reference sequence; and
selecting, as a differentiation site, a single base difference that is fixed across the entire population.

9. The method according to claim 8, wherein selecting, as a differentiation site, a single base difference that is fixed across the entire population includes:
receiving, for a plurality of nucleic acid samples, a plurality of sequence reads that align to the RHD gene and the RHCE gene;
estimating, for each of the plurality of nucleic acid samples, the gene-specific copy number of the RHD gene and the copy number of the RHCE gene;
selecting a subset of nucleic acid samples from the plurality of nucleic acid samples, wherein the subset of nucleic acid samples includes nucleic acid samples that are presumed to be diploid with respect to the RHD gene and presumed to be diploid with respect to the RHCE gene; and
selecting a single base difference that, in at least 90% of the nucleic acid samples of the subset of nucleic acid samples, has a copy number consistent with the diploid RHD gene and the diploid RHCE gene.
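By way of non-limiting illustration, the 90% selection criterion of claim 9 might be sketched as follows. The data layout (per-site lists of integer copy-number calls across the diploid-presumed subset) and the site labels are hypothetical.

```python
def select_fixed_sites(site_calls, min_fraction=0.9):
    """Keep candidate sites whose calls are (2, 2) in enough samples.

    site_calls: dict mapping a site label to a list of (rhd_cn, rhce_cn)
    integer calls, one per sample in the diploid-presumed subset.
    """
    selected = []
    for site, calls in site_calls.items():
        if not calls:
            continue  # no samples to judge this site
        consistent = sum(1 for rhd, rhce in calls if rhd == 2 and rhce == 2)
        if consistent / len(calls) >= min_fraction:
            selected.append(site)
    return selected


# Hypothetical calls across 20 samples for two candidate sites.
calls = {
    "site_1": [(2, 2)] * 19 + [(3, 1)],      # 95% consistent
    "site_2": [(2, 2)] * 15 + [(1, 3)] * 5,  # 75% consistent
}
print(select_fixed_sites(calls))  # ['site_1']
```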

10. The method according to any one of claims 1 to 6, further comprising constructing one or more candidate haplotypes by phasing the predetermined differentiation sites, based on linkage information between predetermined differentiation sites in the plurality of predetermined differentiation sites, using the sequence reads aligned to the RHD region or the RHCE region.

11. The method according to claim 10, wherein the one or more candidate haplotypes cover the breakpoint region of the RHCE*CE-D(2)-CE gene conversion.

12. The method according to claim 10, wherein phasing the predetermined differentiation sites includes:
constructing one or more candidate haplotypes based on all bases sequenced at a first predetermined differentiation site; and
extending the one or more candidate haplotypes to a second predetermined differentiation site using the sequence reads aligned to the RHD gene or the RHCE gene.
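By way of non-limiting illustration, seeding candidate haplotypes at a first differentiation site and extending them to a second site using read linkage might be sketched as follows. The read representation, the minimum-support threshold, and the example bases are assumptions of this example and are not taken from this specification.

```python
from collections import Counter


def extend_haplotypes(reads, first_site, second_site, min_support=2):
    """Seed one candidate haplotype per base seen at the first site, then
    extend each seed with the bases linked to it at the second site.

    reads: list of dicts mapping position -> observed base.
    min_support: hypothetical minimum number of linking reads per extension.
    """
    seeds = {r[first_site] for r in reads if first_site in r}
    links = Counter(
        (r[first_site], r[second_site])
        for r in reads
        if first_site in r and second_site in r
    )
    haplotypes = []
    for seed in sorted(seeds):
        extensions = sorted(
            b for (a, b), n in links.items() if a == seed and n >= min_support
        )
        if extensions:
            haplotypes.extend(seed + b for b in extensions)
        else:
            haplotypes.append(seed)  # no linking reads: leave the seed as is
    return haplotypes


# Hypothetical reads covering sites labelled 1 and 2.
reads = [
    {1: "A", 2: "C"}, {1: "A", 2: "C"},
    {1: "G", 2: "T"}, {1: "G", 2: "T"},
    {1: "G"},
]
print(extend_haplotypes(reads, 1, 2))  # ['AC', 'GT']
```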

13. The method according to any one of claims 1 to 6, further comprising performing variant calling at the predetermined differentiation sites in the plurality of predetermined differentiation sites.

14. The method according to any one of claims 1 to 6, further comprising performing variant calling for the RHCE*CE-D(2)-CE gene conversion.

15. The method according to any one of claims 1 to 6, wherein the predetermined differentiation site includes a site corresponding to a position selected from chr1:25405587, chr1:25405596, chr1:25409676, or chr1:25409958 of the reference genome hg38.