A method for structural variation calling and typing suitable for long read family sample sequencing
Long-read sequencing of family samples, combined with Mendel's laws of inheritance and linkage information, removes the dependence on high-coverage sequencing found in existing techniques, enabling accurate structural variation detection and typing at reduced cost.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HARBIN INST OF TECH
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-26
AI Technical Summary
Existing family-based structural variation detection methods heavily rely on high-coverage sequencing, resulting in insufficient utilization of genetic characteristics, inability to accurately detect and genotype structural variations, and high costs associated with sequencing multiple samples.
By employing long-read family sample sequencing, the genome sequences of family members are obtained. Then, using Mendel's laws of inheritance and linkage information from long-read sequencing fragments, structural variations are clustered and genotypes are corrected to achieve accurate structural variation detection and typing.
It reduces sequencing costs and enables accurate detection and typing of structural variations, while improving detection efficiency by utilizing pedigree information.
Figure CN122290708A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a method for identifying and classifying structural variations.
Background Technology
[0002] Structural variations (SVs) are genomic alterations exceeding 50 nucleotides in length, encompassing various molecular mechanisms such as insertions (INS), deletions (DEL), inversions (INV), duplications (DUP), translocations (TRA), and complex combinatorial rearrangements. SVs are characterized by widespread nucleotide-level variations within their genomic regions, profoundly impacting gene coding, clinical diseases, and phenotypic traits; their functional significance often surpasses that of other types of genetic variations. New evidence from long-read sequencing studies further reveals the widespread sharing of SV loci in human populations, particularly among directly related individuals.
[0003] Currently, SV detection methods based on long-read sequencing data have reached their performance limits: existing mature tools have pushed individual-level SV detection as far as the methodology allows. Likewise, SV haplotype typing strategies based on single nucleotide variants (SNVs), which combine SNV haplotype typing with intra-sequence linkage information, have reached a bottleneck. To address these limitations, emerging methods advocate the use of family trio sequencing data, and family-based methods have shown promise in SNV detection. In contrast, family-based SV detection remains underdeveloped: it generally relies heavily on high-coverage sequencing, makes insufficient use of genetic characteristics, and incurs the high cost of sequencing multiple samples. Therefore, a computationally efficient framework is needed that systematically leverages the biological correlations in family trio data to achieve accurate SV detection.
Summary of the Invention
[0004] The purpose of this invention is to address the problems of existing family-based SV detection, which heavily relies on high-coverage sequencing, resulting in insufficient utilization of genetic characteristics and thus inaccurate SV detection and typing, as well as the high cost of sequencing multiple samples. In response, this invention proposes a structural variation identification and typing method applicable to long-read family sample sequencing.
[0005] The specific process of a structural variant identification and typing method applicable to long-read family sample sequencing is as follows:
[0006] Step 1: Obtain the genome sequences of the three members of each family; compare each member's genome sequence with the human reference genome to obtain that member's structural variant genes and non-variant genes; each structural variant gene records its length, location, variation type, and origin; each non-variant gene records its length, location, and origin; place the structural variant genes of the three members into the corresponding family's structural variation feature set;
[0007] Step 2: Cluster the structural variation genes in each family based on their location and length in the structural variation feature set to obtain the clustering results for each family;
[0008] Based on the number of sequencing fragments supporting the alternative allele and the number supporting the reference allele within each cluster of each family, calculate, for each member of each family in each cluster, the likelihood of each of the three genotypes (no variation, heterozygous variation, and homozygous variation);
[0009] Select the genotype with the maximum likelihood as the preliminary genotype of each member of each family in each cluster;
[0010] Step 3: Based on the preliminary genotypes of each member of each family in each cluster, determine whether the preliminary genotype of each cluster belongs to the first type of error, the second type of error, the third type of error, or no error. A cluster with an error of one of the three types is corrected by the procedure for that type; a cluster with no error is left unchanged. This yields, for each cluster, the final three-element probability tuple over the three genotypes;
[0011] The final three-element probability tuple of each cluster is converted into a two-element probability tuple describing the occurrence of the SV on the two haplotypes; solving for the two-element probability tuples of the offspring yields the set of two-element probability tuples;
[0012] One element of the tuple is the probability that the SV occurs on the first haplotype, and the other is the probability that it occurs on the second haplotype;
[0013] Step 4: Perform the first handshake on the two-element probability tuples of the offspring obtained in Step 3 to obtain the anchored variants;
[0014] For each anchored variant, extend backward as far as possible to obtain the haplotype block corresponding to that anchored variant;
[0015] Starting from that haplotype block, extend forward as far as possible for each anchored variant to obtain its final haplotype block;
[0016] Using the final haplotype blocks, determine whether adjacent haplotype blocks share duplicate sequencing fragments; if so, merge the adjacent blocks into a new block and retain the duplicated region only once; if not, leave them unmerged;
[0017] Within each block, the two alleles of every non-anchored variant have the same parental origin as the two alleles of the anchored variant.
[0018] The beneficial effects of this invention are as follows:
[0019] This invention proposes a method for structural variation (SV) identification and genotyping in long-read family pedigree samples. The method takes the individual sequencing data of all family members as input, extracts the variant features of each member, clusters the family feature set, assigns the features back to their respective members, and then applies three family-based signal correction methods to correct detection errors. Finally, the method locates and anchors SVs using Mendel's laws of inheritance and performs haplotype typing of the SVs using the linkage information in the long-read sequencing fragments. Together, these techniques enable accurate detection and genotyping of family SVs while reducing sequencing costs.
Attached Figure Description
[0020] Figure 1 is a flowchart of the present invention.
Detailed Implementation
[0021] Specific Implementation Method 1: The specific process of this implementation method for structural variation identification and typing of long-read family sample sequencing is as follows:
[0022] Step 1: Obtain the genome sequences of the three members of each family; compare each member's genome sequence with the human reference genome to obtain that member's structural variant genes and non-variant genes; each structural variant gene records its length, location, variation type, and origin; each non-variant gene records its length, location, and origin; place the structural variant genes of the three members into the corresponding family's structural variation feature set;
[0023] Step 2: Cluster the structural variation genes in each family based on their location and length in the structural variation feature set to obtain the clustering results for each family;
[0024] Based on the number of sequencing fragments supporting the alternative allele and the number supporting the reference allele within each cluster of each family, calculate, for each member of each family in each cluster, the likelihood of each of the three genotypes (no variation, heterozygous variation, and homozygous variation);
[0025] Select the genotype with the maximum likelihood as the preliminary genotype of each member of each family in each cluster;
[0026] Step 3: Based on the preliminary genotypes of each member of each family in each cluster, determine whether the preliminary genotype of each cluster belongs to the first type of error, the second type of error, the third type of error, or no error. A cluster with an error of one of the three types is corrected by the procedure for that type; a cluster with no error is left unchanged. This yields, for each cluster, the final three-element probability tuple over the three genotypes;
[0027] The final three-element probability tuple of each cluster is converted into a two-element probability tuple describing the occurrence of the SV on the two haplotypes; solving for the two-element probability tuples of the offspring yields the set of two-element probability tuples;
[0028] One element of the tuple is the probability that the SV occurs on the first haplotype, and the other is the probability that it occurs on the second haplotype;
[0029] Step 4: Perform the first handshake on the two-element probability tuples of the offspring obtained in Step 3 to obtain the anchored variants;
[0030] For each anchored variant, extend backward as far as possible to obtain the haplotype block corresponding to that anchored variant (second handshake);
[0031] Starting from that haplotype block, extend the range of the block forward as far as possible for each anchored variant to obtain its final haplotype block (third handshake);
[0032] Using the final haplotype blocks, determine whether adjacent haplotype blocks share duplicate sequencing fragments; if so, merge the adjacent blocks into a new block and retain the duplicated region only once; if not, leave them unmerged;
[0033] Within each block (both the final haplotype blocks and the newly merged blocks), the two alleles of every non-anchored variant have the same parental origin as the two alleles of the anchored variant.
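The block-merging rule in Step 4 can be sketched as follows. This is only an illustration under stated assumptions: the `Block` class, its field names, and the use of read IDs to detect shared sequencing fragments are hypothetical and are not specified by the source.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Block:
    """Hypothetical haplotype block: ordered variant IDs plus supporting read IDs."""
    variants: List[str]
    reads: Set[str] = field(default_factory=set)


def merge_adjacent_blocks(blocks):
    """Merge neighbouring blocks that share at least one sequencing fragment.

    Assumes `blocks` is already ordered along the genome.
    """
    merged = []
    for block in blocks:
        if merged and merged[-1].reads & block.reads:
            prev = merged[-1]
            # Keep variants from the overlapping region only once.
            seen = set(prev.variants)
            prev.variants.extend(v for v in block.variants if v not in seen)
            prev.reads |= block.reads
        else:
            # Copy so the caller's blocks are not mutated.
            merged.append(Block(list(block.variants), set(block.reads)))
    return merged
```

Because each block is compared with the most recently merged block, a chain of overlapping neighbours collapses into a single block in one pass.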
[0034] Specific Implementation Method Two: This implementation method differs from Specific Implementation Method One in that, in step one, the genome sequences of three members of each family are obtained; the genome sequence of each member is compared with the human reference genome to obtain the structurally mutated genes and the unmutated genes of each member; the structurally mutated genes include the length, location, mutation type, and origin of the mutated genes; the unmutated genes include the length, location, and origin of the genes; the structurally mutated genes of the three members are placed into the corresponding family structural variation feature set;
[0035] The specific process is as follows:
[0036] Step 1.1: Obtain the genome sequences of the three members of family A;
[0037] Step 1.2:
[0038] The genome sequence of the father of family A is compared with the human reference genome to obtain the genes with structural variations and the genes without variations in the father.
[0039] The genome sequence of the mother in family A is compared with the human reference genome to obtain the genes with structural variations and the genes without variations in the mother.
[0040] The genome sequence of the offspring of family A is compared with the human reference genome to obtain the genes with structural variations and the genes without variations in the offspring.
[0041] Structural variations in genes include the length, location, type of variation (such as insertion, deletion, inversion, duplication, and translocation), and origin (from father, mother, or offspring).
[0042] Unmutated genes include the gene's length, location, and origin (from the father, mother, or offspring).
[0043] Record the number of unmutated genes in family A;
[0044] The number of unmutated genes in family A is the sum of the number of unmutated genes in the father, the number of unmutated genes in the mother, and the number of unmutated genes in the offspring.
[0045] Step 1.3:
[0046] The structural variant genes of the father, the mother, and the offspring are placed into the structural variation feature set of family A.
[0047] Record the number of genes with structural variations in the structural variation feature set of family A.
[0048] The other steps and parameters are the same as in Specific Implementation Method 1.
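As a rough illustration of the records collected in Step 1, the feature set of one family could be represented as below. The class name, field names, and the split of "location" into chromosome and position are assumptions made for this sketch, not details given by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVSignature:
    """Hypothetical record for one structural variant gene."""
    chrom: str    # assumed: chromosome part of the location
    pos: int      # assumed: start coordinate on the reference
    length: int
    sv_type: str  # e.g. "INS", "DEL", "INV", "DUP", "TRA"
    origin: str   # "father", "mother", or "offspring"


def build_family_feature_set(father, mother, offspring):
    """Pool the three members' records and report how many were collected."""
    feature_set = [*father, *mother, *offspring]
    return feature_set, len(feature_set)
```

Keeping the `origin` field on every record is what later allows clustered features to be assigned back to individual members.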
[0049] Specific Implementation Method 3: This implementation method differs from Specific Implementation Method 1 or 2 in that, in step 2, clustering is performed based on the location and length of genes with structural variations in the structural variation feature set of each family to obtain the clustering results for each family.
[0050] Based on the number of sequencing fragments supporting the alternative allele and the number supporting the reference allele within each cluster of each family, the maximum likelihood estimate (MLE) of each of the three genotypes (no variation, heterozygous variation, and homozygous variation) is calculated for each member of each family in each cluster;
[0051] The genotype with the maximum likelihood is selected as the preliminary genotype of each member of each family in each cluster;
[0052] The specific process is as follows:
[0053] Step 2.1: Perform the first clustering based on the location of the structural variant genes in the structural variation feature set of each family to obtain the first clustering result for each family;
[0054] Step 2.2: Based on the first clustering result, perform a second clustering within each cluster of each family based on the length of each structural variant gene to obtain the second clustering result;
[0055] Step 2.3: Based on the second clustering result, obtain the number of sequencing fragments supporting the alternative allele and the number supporting the reference allele within each cluster;
[0056] From these two counts, calculate the maximum likelihood estimate (MLE) of each of the three genotypes for each member of each family in each cluster;
[0057] The three genotypes denote, respectively, no variation, heterozygous variation, and homozygous variation;
[0058] Select the genotype with the largest of the three likelihoods as the preliminary genotype of each member of each family in each cluster.
[0059] Other steps and parameters are the same as in specific implementation method one or two.
[0060] Specific Implementation Method Four: This implementation method differs from Specific Implementation Methods One to Three in that, in step 2.1, the first clustering is performed based on the location of the structural variant genes in the structural variation feature set of each family to obtain the first clustering result for each family; the specific process is as follows:
[0061] Step 2.1.1:
[0062] Sort all structural variant genes in the structural variation feature set of each family by position, from front to back;
[0063] After sorting, each structural variant gene, from the first through the last, is identified by its rank and its position;
[0064] Step 2.1.2: Initialize the ordinal index, and place the sorted structural variant gene at that index in the first cluster;
[0065] Step 2.1.3: Compare the position of the sorted structural variant gene at the current index with the position of the sorted structural variant gene at the next index;
[0066] If the two positions meet the following requirement, the two structural variant genes belong to the same cluster; proceed to step 2.1.4;
[0067] If the two positions do not meet the following requirement, the two structural variant genes do not belong to the same cluster; construct a new cluster and proceed to step 2.1.4;
[0068]
[0069] Step 2.1.4: Advance the index and repeat step 2.1.3 until the positions of all sorted structural variant genes have been processed, which yields the clusters of the first clustering result.
[0070] The other steps and parameters are the same as those in one of the specific implementation methods one to three.
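The sort-and-sweep procedure of steps 2.1.1 to 2.1.4 can be sketched as follows. The clustering requirement itself is not reproduced in the text, so this sketch assumes, purely for illustration, that two neighbouring genes belong together when their positions differ by at most a hypothetical `max_gap`.

```python
def cluster_by_position(positions, max_gap):
    """First clustering: group sorted positions whose neighbours are close.

    `max_gap` is an assumed stand-in for the source's position requirement.
    """
    clusters = []
    for pos in sorted(positions):              # step 2.1.1: sort front to back
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)           # same cluster as the previous gene
        else:
            clusters.append([pos])             # requirement not met: new cluster
    return clusters
```

For example, `cluster_by_position([1000, 5000, 1040, 5100, 9000], 200)` groups the positions as `[[1000, 1040], [5000, 5100], [9000]]`.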
[0071] Specific Implementation Method 5: This implementation method differs from Specific Implementation Methods 1 to 4 in that, in step 2.2, based on the first clustering result, a second clustering is performed within each cluster of each family based on the length of each structural variant gene to obtain the second clustering result;
[0072] The specific process is as follows:
[0073] Step 2.2.1: Initialize the cluster index, which identifies a cluster in the first clustering result;
[0074] Initialize the gene index, which identifies a structural variant gene within each cluster;
[0075] Step 2.2.2: Within the current cluster, compare the length of the structural variant gene at the current gene index with the length of the structural variant gene at the next gene index;
[0076] If the two lengths meet the following requirement, proceed to step 2.2.3;
[0077] If the two lengths do not meet the following requirement, the structural variant genes from the point of the failed comparison through the last structural variant gene in the cluster are assigned to a new cluster (at this point, the number of clusters exceeds that of the first clustering result);
[0078]
[0079] Step 2.2.3: Advance the gene index and repeat step 2.2.2 until all structural variant genes within the current cluster have been processed, which yields the new clusters derived from that cluster;
[0080] Step 2.2.4: Advance the cluster index and repeat steps 2.2.2 to 2.2.3 until all clusters have been processed, which yields the second clustering result (it contains more clusters than the first clustering result because of the newly added clusters).
[0081] The other steps and parameters are the same as those in one of the specific implementation methods one to four.
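The length-based second clustering of steps 2.2.1 to 2.2.4 can be sketched in the same way. The length requirement is not reproduced in the text; this sketch assumes a hypothetical minimum ratio between neighbouring lengths (`min_ratio`) and assumes the genes of a cluster are visited in the order given.

```python
def split_by_length(lengths, min_ratio):
    """Second clustering inside one position cluster.

    Neighbouring genes stay together while shorter/longer >= min_ratio;
    `min_ratio` is an assumed stand-in for the source's length requirement.
    Lengths are assumed to be positive.
    """
    clusters = []
    for length in lengths:
        if clusters:
            prev = clusters[-1][-1]
            if min(prev, length) / max(prev, length) >= min_ratio:
                clusters[-1].append(length)
                continue
        clusters.append([length])  # requirement not met: open a new cluster
    return clusters


def second_clustering(first_result, min_ratio):
    """Apply the length split to every cluster of the first clustering result."""
    return [sub for cluster in first_result for sub in split_by_length(cluster, min_ratio)]
```

Since a cluster can only be split and never merged, the second clustering result always has at least as many clusters as the first.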
[0082] Specific Implementation Method Six: This implementation method differs from Specific Implementation Methods One to Five in that, in step 2.3, the number of sequencing fragments supporting the alternative allele and the number supporting the reference allele within each cluster are obtained based on the second clustering result;
[0083] From these two counts, the maximum likelihood estimate (MLE) of each of the three genotypes is calculated for each member of each family in each cluster;
[0084] The three genotypes denote, respectively, no variation, heterozygous variation, and homozygous variation;
[0085] The genotype with the largest of the three likelihoods is selected as the preliminary genotype of each member of each family in each cluster;
[0086] The specific process is as follows:
[0087] Step 2.3.1: Based on the second clustering result, obtain the number of sequencing fragments supporting the alternative allele and the number supporting the reference allele within each cluster;
[0088] From these two counts, calculate the maximum likelihood of each of the three genotypes for each member of each family in each cluster;
[0089] The specific process is as follows:
[0090] The number of sequencing fragments supporting the alternative allele within each cluster equals the total number of structural variant genes within the cluster.
[0091] The number of sequencing fragments supporting the reference allele within each cluster equals the total number of non-variant genes within the cluster.
[0092] The maximum likelihood estimate (MLE) of each of the three genotypes for each member of each family in each cluster is expressed as:
[0093]
[0094] where the three likelihood terms are the maximum likelihoods of the no-variation, heterozygous, and homozygous genotypes, respectively;
[0095] the error term is the preset probability that a sequencing fragment is assigned to the wrong haplotype;
[0096] the two count terms are the number of sequencing fragments supporting the alternative allele and the number supporting the reference allele;
[0097] Step 2.3.2: Select the genotype with the largest of the three likelihoods as the preliminary genotype of each member of each family in each cluster.
[0098] The other steps and parameters are the same as those in one of the specific implementation methods one to five.
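The likelihood expression is not reproduced in the text. The sketch below uses a commonly used read-count genotype model as a stand-in, which is an assumption and may differ from the patent's actual formula: each fragment supports the wrong allele with probability `err` under a homozygous genotype and supports either allele with probability 0.5 under the heterozygous genotype. The labels `0/0`, `0/1`, and `1/1` and the default `err=0.05` are also illustrative.

```python
import math


def genotype_log_likelihoods(n_ref, n_alt, err=0.05):
    """Log-likelihoods of the three genotypes from read counts (assumed model)."""
    return {
        "0/0": n_ref * math.log(1 - err) + n_alt * math.log(err),  # no variation
        "0/1": (n_ref + n_alt) * math.log(0.5),                    # heterozygous
        "1/1": n_ref * math.log(err) + n_alt * math.log(1 - err),  # homozygous
    }


def call_genotype(n_ref, n_alt, err=0.05):
    """Step 2.3.2: pick the genotype with the largest likelihood."""
    scores = genotype_log_likelihoods(n_ref, n_alt, err)
    return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when a cluster is supported by many fragments.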
[0099] Specific Implementation Method Seven: This implementation method differs from Specific Implementation Methods One through Six in that, in step 3, based on the preliminary genotype of each member of each family in each cluster, it is determined whether the preliminary genotype of each cluster belongs to the first type of error, the second type of error, the third type of error, or no error. A cluster with an error of one of the three types is corrected by the procedure for that type; a cluster with no error is left unchanged. This yields, for each cluster, the final three-element probability tuple over the three genotypes;
[0100] The final three-element probability tuple of each cluster is converted into a two-element probability tuple describing the occurrence of the SV on the two haplotypes; solving for the two-element probability tuples of the offspring yields the set of two-element probability tuples;
[0101] One element of the tuple is the probability that the SV occurs on the first haplotype, and the other is the probability that it occurs on the second haplotype;
[0102] The specific process is as follows:
[0103] Step 3.1: Based on the preliminary genotypes of each member of each family in each cluster, determine whether the preliminary genotype of each cluster belongs to the first type of error. If it does, correct the preliminary genotype of the cluster to obtain the final three-element probability tuple over the three genotypes for the cluster, then proceed to step 3.4 with that tuple; if it does not, proceed to step 3.2.
[0104] The specific process is as follows:
[0105] Determine if each cluster satisfies the following formula:
[0106]
[0107] in, This indicates the number of genes within a specific cluster for a particular member of each family (genes include the number of sequencing fragments supporting the reference allele and the number of sequencing fragments supporting the alternative allele).
[0108] This indicates that the number of sequencing fragments supporting the reference allele in a certain cluster for the first member of family A is 0, and the number of sequencing fragments supporting the alternative allele in the cluster is also 0.
[0109] This represents the average length of all structurally variable genes within the cluster corresponding to the second member of family A;
[0110] This represents the average length of all structurally variable genes within the cluster corresponding to the 3rd member of family A;
[0111] If the condition is met, the corresponding cluster belongs to the first type of error, and the preliminary genotype of the corresponding cluster is modified to the correct genotype; proceed to step 3.4;
[0112] If the condition is not met, proceed to step 3.2.
[0113] The first error in structural variant detection is that the variant length is too long, causing sequencing fragments from different sequencing platforms to fail to completely cover the variant region near the variant location, resulting in a complete break in the alignment results obtained by the alignment tool at that location. To address this situation, this technique uses Mendelian laws of inheritance to determine the most probable genotype of a member. The first type of error occurs when a member of a family has no variant, while other family members are identified as heterozygous or homozygous at that SV locus.
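The Mendelian reasoning mentioned above can be illustrated for the simplest case, in which the member without usable alignments is the offspring and both parental genotypes are known. This is a sketch of standard Mendelian transmission only; the patent's actual decision rule is not reproduced in the text, and the function names and the allele encoding (0 for reference, 1 for the SV) are assumptions.

```python
from fractions import Fraction
from itertools import product


def child_genotype_distribution(father_gt, mother_gt):
    """Probability of the offspring carrying 0, 1, or 2 SV alleles.

    Each parent genotype is a pair of alleles, e.g. (0, 1) for heterozygous.
    Every parental allele is transmitted with probability 1/2.
    """
    dist = {0: Fraction(0), 1: Fraction(0), 2: Fraction(0)}
    for a, b in product(father_gt, mother_gt):
        dist[a + b] += Fraction(1, 4)
    return dist


def most_probable_child_genotype(father_gt, mother_gt):
    """Number of SV alleles with the highest transmission probability."""
    dist = child_genotype_distribution(father_gt, mother_gt)
    return max(dist, key=dist.get)
```

For instance, a homozygous father `(1, 1)` and a non-carrier mother `(0, 0)` can only produce a heterozygous offspring, so a "no variation" call for the offspring at that locus would be inconsistent with Mendelian inheritance.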
[0114] Step 3.2: Based on the preliminary genotypes of each member of each family in each cluster, determine whether the preliminary genotype of each cluster belongs to the second type of error. If it does, correct the preliminary genotype of the cluster to obtain the final three-element probability tuple over the three genotypes for the cluster, then proceed to step 3.4 with that tuple; if it does not, proceed to step 3.3.
[0115] The specific process is as follows:
[0116] Determine if each cluster satisfies the following formula:
[0117]
[0118] in,
[0119] This represents the number of structurally variable genes within a certain cluster in the first member of family A;
[0120] This indicates the number of structural variant genes preset by the user;
[0121] This indicates that the first member of family A has no variation within a certain cluster;
[0122] This indicates that the other two members in family A are identified as heterozygous or homozygous at the SV locus in the same cluster;
[0123] Indicates genotype The maximum likelihood probability; Indicates genotype The maximum likelihood probability; Indicates genotype The maximum likelihood probability;
[0124] If the condition is met, the corresponding cluster belongs to the second type of error. The preliminary genotype of the corresponding cluster is retained, and step 3.4 is executed.
[0125] If the condition is not met, proceed to step 3.3.
[0126] The second error arises from insufficient features supporting the SV, leading to cases where genotype probabilities closer to heterozygous variants are misjudged as having no variant. In such cases, the distribution of features within the pedigree at the SV locus is used to determine whether the structural variant is considered a high-confidence variant. This second type of error occurs when a member of a family has no variant, while other family members are identified as heterozygous or homozygous at the SV locus.
[0127] Step 3.3: Based on the preliminary genotypes of each member of each family in each cluster, determine whether the preliminary genotype of each cluster belongs to the third type of error. If it does, correct the preliminary genotype of the cluster to obtain the final three-element probability tuple over the three genotypes for the cluster, then proceed to step 3.4 with that tuple; if it does not, the preliminary genotype of the cluster is correct, and step 3.4 is executed.
[0128] The specific process is as follows:
[0129] Determine if each cluster satisfies the following formula:
[0130]
[0131] in,
[0132] This represents the number of structurally variable genes within a certain cluster in the first member of family A;
[0133] This indicates the number of structural variant genes preset by the user;
[0134] This indicates the adjustment percentage for the minimum number of supports, used to retain more variants;
[0135] This indicates that the other two members of family A have no variation within the same cluster;
[0136] Indicates genotype The maximum likelihood probability; Indicates genotype The maximum likelihood probability; Indicates genotype The maximum likelihood probability;
[0137] If the condition is met, the corresponding cluster belongs to the third type of error. The preliminary genotype of the cluster is corrected to obtain the final three-element probability tuple over the three genotypes for the cluster, and step 3.4 is executed with that tuple;
[0138] If the condition is not met, the corresponding cluster has no error, and step 3.4 is executed.
[0139] The third error is that even when the number of supporting SV signals exceeds a minimum threshold, the genotype probability remains close to no variation. This problem may be caused by sequencing imbalance or alignment errors. This technique calculates the pedigree likelihood values of structural variant sites to form correction parameters, and corrects the genotype by integrating the likelihood values of deletion variants and the correction parameters:
[0140] Step 3.4:
[0141] Based on the final three-element probability tuple over the three genotypes of each member of each family in each cluster, convert the tuple into a two-element probability tuple describing the occurrence of the SV on the two haplotypes; solving for the two-element probability tuples of the offspring yields the set of two-element probability tuples;
[0142] One element of the tuple is the probability that the SV occurs on the first haplotype, and the other is the probability that it occurs on the second haplotype.
[0143] The other steps and parameters are the same as those in one of the specific implementation methods one to six.
[0144] Specific Implementation Method Eight: This implementation method differs from Specific Implementation Methods One through Seven in that, in step 3.3, if the condition is met, the corresponding cluster belongs to the third type of error, and the preliminary genotype of the cluster is corrected to obtain the final three-element probability tuple over the three genotypes for the cluster. The process is as follows:
[0145]
[0146] where the three terms denote the corrected maximum likelihood probabilities of the three genotypes, respectively;
[0147] The correction ratio is typically set to 0.5;
[0148] One ternary probability tuple holds the maximum likelihood values of the genotypes before correction;
[0149] The other ternary probability tuple holds the maximum likelihood values of the family genotypes;
[0150] The maximum likelihood values of the family genotypes are calculated as follows:
[0151]
[0152] where the three terms denote the maximum likelihood probabilities of the three family genotypes, respectively;
[0153] The reference count is the total number of sequencing fragments supporting the reference allele across the three family members;
[0154] The alternative count is the total number of sequencing fragments supporting the alternative allele across the three family members;
[0155] The error parameter is the preset probability that a sequencing fragment is aligned to the wrong haplotype.
[0156] The other steps and parameters are the same as in any one of Specific Implementation Methods One to Seven.
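The third-error correction described above can be illustrated with a minimal Python sketch. The original formulas are not reproduced in this text, so everything below is an assumption made for illustration: the binomial likelihood model, the alignment error value `err=0.05`, the function names, and in particular the linear blend controlled by `ratio` (standing in for the correction ratio of 0.5). It is not the patented formula.

```python
from math import comb

def genotype_likelihoods(n_ref, n_alt, err=0.05):
    """Assumed binomial likelihoods for (no variation, heterozygous, homozygous).

    n_ref / n_alt: counts of fragments supporting the reference / alternative allele.
    err: assumed probability that a fragment is aligned to the wrong haplotype.
    """
    n = n_ref + n_alt
    ways = comb(n, n_alt)  # number of combinations
    hom_ref = ways * (1 - err) ** n_ref * err ** n_alt
    het = ways * 0.5 ** n
    hom_alt = ways * err ** n_ref * (1 - err) ** n_alt
    return (hom_ref, het, hom_alt)

def normalize(values):
    """Scale a tuple so that its entries sum to 1."""
    total = sum(values)
    return tuple(v / total for v in values)

def correct_genotype(member, family, ratio=0.5):
    """Hypothetical correction: blend member and family likelihoods by `ratio`."""
    member_p = normalize(member)
    family_p = normalize(family)
    blended = tuple((1 - ratio) * m + ratio * f
                    for m, f in zip(member_p, family_p))
    return normalize(blended)

# One member with few alternative-supporting fragments, and the pooled family counts.
child = genotype_likelihoods(n_ref=14, n_alt=2)
family = genotype_likelihoods(n_ref=30, n_alt=20)
corrected = correct_genotype(child, family)
```

In this toy case the member alone favors no variation, while the pooled family evidence pulls the corrected tuple toward the heterozygous genotype, which is the behavior the third-error correction is meant to achieve.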
[0157] Specific Implementation Method Nine: This implementation method differs from Specific Implementation Methods One through Eight in that, in step 34:
[0158] For each cluster of each member of each family, the ternary probability tuple of the three final genotypes is converted into a binary probability tuple describing the occurrence of the SV on the two haplotypes; solving the binary probability tuple for the offspring yields the binary probability tuple of the offspring.
[0159] The first element of the binary probability tuple is the probability that the SV occurs on the first haplotype; the second element is the probability that the SV occurs on the second haplotype;
[0160] The specific process is as follows:
[0161] Step 341: For each cluster of the offspring of each family, convert the ternary probability tuple of the three final genotypes into a system of two equations;
[0162] The system of two equations is expressed as:
[0163]
[0164] Step 342: Solve the system of two equations;
[0165] If a solution exists, the binary probability tuple of the occurrence of the SV on the two haplotypes is obtained;
[0166] If no solution exists, proceed to step 343;
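Steps 341 and 342 can be sketched as follows. Because the system of equations itself is not reproduced in this text, the sketch assumes the two haplotypes carry the SV independently with probabilities `p1` and `p2`, so that the homozygous probability equals `p1 * p2` and the heterozygous probability equals `p1 * (1 - p2) + p2 * (1 - p1)`. These equations, the function name, and the tolerance are illustrative assumptions, not the patent's stated system.

```python
import math

def ternary_to_binary(p_hom_ref, p_het, p_hom_alt, tol=1e-9):
    """Solve p1 * p2 = p_hom_alt and p1 + p2 = p_het + 2 * p_hom_alt.

    Returns (p1, p2) with p1 >= p2, or None when no real solution exists.
    p_hom_ref is implied by the other two values and is kept only for clarity.
    """
    s = p_het + 2.0 * p_hom_alt          # sum of the two haplotype probabilities
    disc = s * s - 4.0 * p_hom_alt       # discriminant of x^2 - s*x + p_hom_alt = 0
    if disc < -tol:
        return None                      # no solution: fall back to the Bayesian step
    root = math.sqrt(max(disc, 0.0))
    p1, p2 = (s + root) / 2.0, (s - root) / 2.0
    if p2 < -tol or p1 > 1.0 + tol:
        return None
    clamp = lambda x: min(max(x, 0.0), 1.0)
    return (clamp(p1), clamp(p2))

certain_het = ternary_to_binary(0.0, 1.0, 0.0)      # SV on exactly one haplotype
balanced = ternary_to_binary(0.25, 0.5, 0.25)       # each haplotype carries it with 0.5
unsolvable = ternary_to_binary(0.5, 0.0, 0.5)       # no independent solution
```

The last call shows a tuple for which the assumed system has no real solution, which is the situation in which the method proceeds to step 343.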
[0167] Step 343: Bayes' theorem is applied to convert the ternary probability tuple of the offspring into the binary probability tuple of the offspring.
[0168] The specific process is as follows:
[0169] Let one event denote that the offspring inherits the SV from the father;
[0170] Let a second event denote that the offspring inherits the SV from the mother;
[0171] The two events are generally assumed to be independent;
[0172] For inheritance from the father to the offspring, the conditional probability of paternal inheritance given the sequencing data is calculated as follows:
[0173]
[0174]
[0175]
[0176]
[0177]
[0178]
[0179]
[0180] where the terms denote, in order: the probability that the offspring inherits the SV from the father, conditional on the current sequencing data feature; the current sequencing data feature itself (a probabilistically computed value representing the numbers of signals supporting and not supporting the variant at this SV, computed separately for each of the three family members in an earlier step); the probability that the current sequencing data feature occurs and the SV is inherited from the father; the probability that the current sequencing data feature occurs; the probability that the current sequencing data feature occurs and the SV is inherited from both the mother and the father; the probability that the current sequencing data feature occurs and the SV is inherited from the father but not from the mother; the probability of the current sequencing data feature given that the SV is inherited from both the mother and the father; the probability of inheriting the variant from the father, obtained from the father's genotype probabilities; the probability of inheriting the variant from the mother, obtained from the mother's genotype probabilities; the probability of the current sequencing data feature given that the SV is inherited from the father but not from the mother; the event that the variant is not inherited from the mother; the probability of not inheriting the variant from the mother, obtained from the mother's genotype probabilities; the probability that a sequencing fragment is aligned to the wrong haplotype; the probabilities of the three final genotypes of the cluster for the offspring; the probabilities of two final genotypes of the cluster for the father; the probabilities of the three final genotypes of the cluster for the mother; and the number of combinations;
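The order of the definitions above suggests a chain of standard identities, written out below as a sketch. The symbols are labels assumed here for illustration, since the original formulas are not reproduced in this text: $F$ (the SV is inherited from the father), $M$ (the SV is inherited from the mother), $\bar{M}$ (the SV is not inherited from the mother), and $D$ (the current sequencing data feature). Only the definition of conditional probability, the law of total probability, and the independence of $F$ and $M$ stated in [0171] are used.

```latex
\begin{aligned}
P(F \mid D) &= \frac{P(D, F)}{P(D)} \\
P(D, F) &= P(D, F, M) + P(D, F, \bar{M}) \\
P(D, F, M) &= P(D \mid F, M)\, P(F)\, P(M) \\
P(D, F, \bar{M}) &= P(D \mid F, \bar{M})\, P(F)\, P(\bar{M})
\end{aligned}
```

How $P(F)$, $P(M)$, and $P(\bar{M})$ are obtained from the parental genotype probabilities, and how the conditional terms use the alignment error probability and the number of combinations, is not reconstructed here.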
[0181] For inheritance from the mother to the offspring, the conditional probability of maternal inheritance given the sequencing data is calculated as follows:
[0182]
[0183]
[0184]
[0185]
[0186]
[0187]
[0188]
[0189] where the terms denote, in order: the probability that the offspring inherits the SV from the mother, conditional on the current sequencing data feature; the current sequencing data feature itself (a probabilistically computed value representing the numbers of signals supporting and not supporting the variant at this SV, computed separately for each of the three family members in an earlier step); the probability that the current sequencing data feature occurs and the SV is inherited from the mother; the probability that the current sequencing data feature occurs; the probability that the current sequencing data feature occurs and the SV is inherited from both the mother and the father; the probability that the current sequencing data feature occurs and the SV is inherited from the mother but not from the father; the probability of the current sequencing data feature given that the SV is inherited from both the mother and the father; the probability of inheriting the variant from the father, obtained from the father's genotype probabilities; the probability of inheriting the variant from the mother, obtained from the mother's genotype probabilities; the probability of the current sequencing data feature given that the SV is inherited from the mother but not from the father; the event that the variant is not inherited from the father; the probability of not inheriting the variant from the father, obtained from the father's genotype probabilities; the probability that a sequencing fragment is aligned to the wrong haplotype; and the probabilities of the final genotypes of the cluster for the father;
[0190] Based on the two conditional probabilities, the binary probability tuple of the occurrence of the SV on the two haplotypes is obtained;
[0191] in, , .
[0192] The other steps and parameters are the same as in any one of Specific Implementation Methods One to Eight.
[0193] Specific Implementation Method Ten: This implementation method differs from Specific Implementation Methods One through Nine in that, in step four, a first handshake is performed on the binary probability tuples of the offspring obtained in step three to identify each anchored variant among the offspring SVs.
[0194] For each anchored variant among the offspring SVs, the block is extended backward as far as possible to obtain the haplotype block corresponding to that anchored variant (second handshake).
[0195] Starting from the haplotype block corresponding to each anchored variant, the block is then extended forward as far as possible to obtain the final haplotype block corresponding to that anchored variant (third handshake).
[0196] Based on the final haplotype blocks, it is determined whether adjacent haplotype blocks share duplicate sequencing fragments; if so, the adjacent blocks are merged into a new block and the duplicate region is retained only once; if not, the blocks are not merged.
[0197] Within each block (whether a final haplotype block or a new merged block), the two alleles of each non-anchored variant have the same parental origin as the two alleles of the anchored variant;
[0198] The specific process is as follows:
[0199] Step 41: Perform the first handshake on the binary probability tuples of the offspring obtained in step three to identify each anchored variant among the offspring SVs.
[0200] The specific process is as follows:
[0201] Based on the binary probability tuples of the offspring obtained in step three, all offspring SVs are screened for variants whose two alleles can be assigned directly to their parental origins.
[0202] For example, at an SV locus where the father is homozygous for the variant, the mother carries no variant, and the offspring is heterozygous, the sequencing fragments carrying the variant must originate from the haplotype inherited from the father, while the sequencing fragments without the variant must originate from the haplotype inherited from the mother.
[0203] Each offspring SV whose two alleles can be assigned directly to their parental origins is used as an anchored variant;
[0204] At the same time, the sequencing fragments belonging to each such offspring SV are assigned to the corresponding haplotype;
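The first handshake can be sketched in Python as below. This is a minimal illustration under stated assumptions: genotypes are encoded as hypothetical alternative-allele counts (0 for no variation, 1 for heterozygous, 2 for homozygous), only heterozygous offspring sites are treated as anchor candidates, and hard genotype calls are used instead of the binary probability tuples of the method. The function names are invented for this example.

```python
def alt_origin(father, mother, child):
    """Return which parent transmitted the alternative allele, if determinable.

    Arguments are alternative-allele counts (0, 1, or 2). Returns "paternal",
    "maternal", or None when the origin is ambiguous or Mendel-inconsistent.
    """
    if child != 1:
        return None  # assumption: only heterozygous offspring sites can anchor
    # The alternative allele came from the father if he can transmit it
    # and the mother can transmit the reference allele, and vice versa.
    from_father = father >= 1 and mother <= 1
    from_mother = mother >= 1 and father <= 1
    if from_father and not from_mother:
        return "paternal"
    if from_mother and not from_father:
        return "maternal"
    return None

def assign_reads(origin, alt_reads, ref_reads):
    """Assign supporting fragments of an anchored SV to parental haplotypes."""
    if origin == "paternal":
        return {"paternal": set(alt_reads), "maternal": set(ref_reads)}
    if origin == "maternal":
        return {"paternal": set(ref_reads), "maternal": set(alt_reads)}
    return None

# The example from the text: father homozygous, mother no variation, child heterozygous.
origin = alt_origin(father=2, mother=0, child=1)
haps = assign_reads(origin, alt_reads={"r1", "r2"}, ref_reads={"r3"})
```

A site where both parents are heterozygous returns `None`, since either parent could have transmitted the variant; such SVs are left for the later handshakes.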
[0205] Step 42: For each anchored variant among the offspring SVs, extend backward as far as possible to obtain the haplotype block corresponding to that anchored variant (second handshake).
[0206] The specific process is as follows:
[0207] Step 421: Initialize the index of the anchored variant;
[0208] initialize the index of the non-anchored SV;
[0209] Step 422: Starting from the current anchored variant, connect backward: for the first non-anchored SV after the anchored variant, determine whether the following formula is satisfied;
[0210] If the formula is not satisfied, the non-anchored SV is not connected after the anchored variant, the extension stops, and the haplotype block of the current anchored variant is obtained;
[0211] If the formula is satisfied, the non-anchored SV is connected after the anchored variant, and step 423 is performed;
[0212] Step 423: Advance the index of the non-anchored SV;
[0213] Step 424: Continue connecting backward from the current anchored variant: for the next non-anchored SV, determine whether the following formula is satisfied;
[0214] If the formula is not satisfied, the non-anchored SV is not connected after the preceding non-anchored SV, the extension stops, and the haplotype block of the current anchored variant is obtained;
[0215] If the formula is satisfied, the non-anchored SV is connected after the preceding non-anchored SV, and steps 423 to 424 are repeated until all non-anchored SVs after the current anchored variant have been evaluated, which yields the haplotype block of the current anchored variant;
[0216] Step 425: Advance the index of the anchored variant and repeat steps 422 to 425 until the haplotype blocks of all anchored variants are obtained;
[0217]
[0218] where the formula involves the following quantities:
[0219] the haplotype block corresponding to the current anchored variant;
[0220] the extended haplotype block corresponding to the current anchored variant;
[0221] the set of sequencing fragments supporting the alternative allele in the haplotype block corresponding to the current anchored variant;
[0222] the set of sequencing fragments supporting the alternative allele in the extended haplotype block corresponding to the current anchored variant;
[0223] the information of the indexed variant, which includes its genotype, length, and position;
[0224] under one condition of the formula, all sequencing fragments within the span of the variant that support the reference allele;
[0225] under the other condition of the formula, all sequencing fragments within the span of the variant that support the alternative allele;
[0226] The invention takes each anchored variant as a starting point and uses the coverage relationship of sequencing fragments between different variants to extend the range of each block as far as possible.
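The block extension can be sketched in Python as follows. The connection formula is not reproduced in this text, so the sketch substitutes an assumed criterion: a non-anchored SV is connected when its supporting fragments overlap the fragments already assigned to the block, and the extension stops at the first SV with no (or tied) linking evidence. The `SV` class, the `extend` function, and the stopping rule at the next anchored variant are all hypothetical. A `step` of `+1` walks toward later positions (the backward extension of step 42) and `-1` toward earlier positions (the forward extension of step 43).

```python
from dataclasses import dataclass

@dataclass
class SV:
    name: str
    alt_reads: set   # fragments supporting the alternative allele
    ref_reads: set   # fragments supporting the reference allele
    anchored: bool = False

def extend(svs, anchor_idx, step):
    """Grow a haplotype block from svs[anchor_idx] in direction `step` (+1 or -1)."""
    anchor = svs[anchor_idx]
    hap1 = set(anchor.alt_reads)   # haplotype carrying the anchor's alternative allele
    hap2 = set(anchor.ref_reads)   # the other haplotype
    members = [anchor.name]
    j = anchor_idx + step
    while 0 <= j < len(svs) and not svs[j].anchored:
        sv = svs[j]
        # Count fragments linking this SV to the block in each orientation.
        same = len(sv.alt_reads & hap1) + len(sv.ref_reads & hap2)
        flip = len(sv.alt_reads & hap2) + len(sv.ref_reads & hap1)
        if same == flip:
            break                  # no linking fragments, or ambiguous: stop
        if same > flip:
            hap1 |= sv.alt_reads
            hap2 |= sv.ref_reads
        else:
            hap1 |= sv.ref_reads
            hap2 |= sv.alt_reads
        members.append(sv.name)
        j += step
    return members, hap1, hap2

svs = [
    SV("A", {"r1", "r2"}, {"r3"}, anchored=True),
    SV("B", {"r2", "r4"}, {"r3", "r5"}),
    SV("C", {"r9"}, {"r10"}),
]
members, hap1, hap2 = extend(svs, anchor_idx=0, step=1)
```

Here SV `B` shares fragments `r2` and `r3` with the anchored variant `A` and is connected, while `C` shares none and ends the extension.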
[0227] Step 43: Starting from the haplotype block corresponding to each anchored variant among the offspring SVs obtained in step 42, extend the block forward as far as possible to obtain the final haplotype block corresponding to that anchored variant (third handshake); the specific process is as follows:
[0228] Step 431: Initialize the index of the anchored variant;
[0229] initialize the index of the non-anchored SV;
[0230] Step 432: Starting from the current anchored variant, connect forward: for the first non-anchored SV before the anchored variant, determine whether the following formula is satisfied;
[0231] If the formula is not satisfied, the non-anchored SV is not connected before the anchored variant, the extension stops, and the final haplotype block of the current anchored variant is obtained;
[0232] If the formula is satisfied, the non-anchored SV is connected before the anchored variant, and step 433 is performed;
[0233] Step 433: Advance the index of the non-anchored SV;
[0234] Step 434: Continue connecting forward from the current anchored variant: for the next non-anchored SV, determine whether the following formula is satisfied;
[0235] If the formula is not satisfied, the non-anchored SV is not connected before the preceding non-anchored SV, the extension stops, and the final haplotype block of the current anchored variant is obtained;
[0236] If the formula is satisfied, the non-anchored SV is connected before the preceding non-anchored SV, and steps 433 to 434 are repeated until all non-anchored SVs before the current anchored variant have been evaluated, which yields the final haplotype block of the current anchored variant;
[0237] Step 435: Advance the index of the anchored variant and repeat steps 432 to 435 until the final haplotype blocks of all anchored variants are obtained;
[0238]
[0239] where the formula involves the following quantities:
[0240] the haplotype block corresponding to the current anchored variant;
[0241] the extended haplotype block corresponding to the current anchored variant;
[0242] the set of sequencing fragments supporting the alternative allele in the haplotype block corresponding to the current anchored variant;
[0243] the set of sequencing fragments supporting the alternative allele in the extended haplotype block corresponding to the current anchored variant;
[0244] the information of the indexed variant, which includes its genotype, length, and position;
[0245] under one condition of the formula, all sequencing fragments within the span of the variant that support the reference allele;
[0246] under the other condition of the formula, all sequencing fragments within the span of the variant that support the alternative allele;
[0247] The invention takes each anchored variant as a starting point and uses the coverage relationship of sequencing fragments between different variants to extend the range of each block as far as possible.
[0248] Step 44:
[0249] Based on the final haplotype blocks corresponding to each anchored variant among the offspring SVs obtained in step 43, determine whether adjacent haplotype blocks share duplicate sequencing fragments; if so, merge the adjacent blocks into a new block and retain the duplicate region only once; if not, do not merge them.
[0250] Within each block (whether a final haplotype block or a new merged block), the two alleles of each non-anchored variant have the same parental origin as the two alleles of the anchored variant;
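The merging of adjacent blocks can be sketched as below. This is a minimal illustration that assumes each block is represented as a dictionary with a position-ordered list of SV names and a set of sequencing fragment identifiers; this representation and the function name `merge_blocks` are hypothetical. Taking the set union keeps fragments from the duplicate region only once.

```python
def merge_blocks(blocks):
    """Merge position-ordered adjacent blocks that share sequencing fragments.

    Each block is {"svs": [names in order], "reads": set of fragment ids}.
    """
    merged = []
    for block in blocks:
        current = {"svs": list(block["svs"]), "reads": set(block["reads"])}
        if merged and merged[-1]["reads"] & current["reads"]:
            last = merged[-1]
            # Append SVs not already present; the set union de-duplicates fragments.
            last["svs"] += [s for s in current["svs"] if s not in last["svs"]]
            last["reads"] |= current["reads"]
        else:
            merged.append(current)
    return merged

blocks = [
    {"svs": ["A", "B"], "reads": {"r1", "r2"}},
    {"svs": ["C"], "reads": {"r2", "r3"}},
    {"svs": ["D"], "reads": {"r7"}},
]
result = merge_blocks(blocks)
```

The first two blocks share fragment `r2` and collapse into one block; the third shares nothing with its neighbor and stays separate.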
[0251] Even after the three-handshake method is applied, some undetermined SVs may remain. These SVs cannot be resolved by Mendel's laws of inheritance, nor can they be linked to existing blocks using the linkage information of sequencing fragments. Such variants can be haplotyped using the nearest-match method, which assigns the SVs to the two parental haplotypes so as to minimize the difference between their probabilities and those of the existing genes in the parents.
[0252] Haplotyping of the parents with the three-handshake method is largely the same as haplotyping of the offspring. The only difference is that, when determining the anchored SVs, Mendel's laws of inheritance are used to determine which allele must lie on the haplotype inherited by the offspring, while the other allele belongs to the haplotype not inherited by the offspring.
[0253] The other steps and parameters are the same as in any one of Specific Implementation Methods One to Nine.
[0254] This invention may have other embodiments. Without departing from the spirit and essence of this invention, those skilled in the art can make various corresponding changes and modifications according to this invention, but these corresponding changes and modifications should all fall within the protection scope of the appended claims.
Claims
1. A method for structural variant calling and genotyping suitable for long-read family sample sequencing, characterized in that the specific process of the method is as follows:
Step 1: Obtain the genome sequences of the three members of each family; compare the genome sequence of each member with the human reference genome to obtain the genes with structural variations and the genes without variations for each member; each gene with a structural variation is recorded with its length, location, variation type, and origin; each gene without variation is recorded with its length, location, and origin; the structural variation genes of the three members are placed into the structural variation feature set of the corresponding family;
Step 2: Cluster the structural variation genes in the structural variation feature set of each family based on their location and length to obtain the clustering results for each family; based on the number of sequencing fragments supporting the alternative allele and the number of sequencing fragments supporting the reference allele within each cluster of each family, calculate the likelihood probabilities of the three genotypes within each cluster for each member of each family; select the genotype with the maximum likelihood probability as the preliminary genotype of each cluster for each member of each family;
Step 3: Based on the preliminary genotype of each cluster for each member of each family, determine whether the preliminary genotype of each cluster exhibits the first type of error, the second type of error, the third type of error, or no error; correct the first type of error, the second type of error, or the third type of error accordingly, and make no correction if there is no error, thereby obtaining the ternary probability tuple of the three final genotypes of each cluster; convert the ternary probability tuple of the three final genotypes of each cluster into a binary probability tuple describing the occurrence of the SV on the two haplotypes, and solve the binary probability tuple for the offspring to obtain the binary probability tuple of the offspring, in which the first element is the probability that the SV occurs on the first haplotype and the second element is the probability that the SV occurs on the second haplotype;
Step 4: Perform the first handshake on the binary probability tuples of the offspring obtained in Step 3 to identify each anchored variant; for each anchored variant, extend backward as far as possible to obtain the haplotype block corresponding to that anchored variant; starting from the haplotype block corresponding to each anchored variant, extend forward as far as possible to obtain the final haplotype block corresponding to that anchored variant; based on the final haplotype blocks, determine whether adjacent haplotype blocks share duplicate sequencing fragments; if so, merge the adjacent blocks into a new block and retain the duplicate region only once; if not, do not merge them; within each block, the two alleles of each non-anchored variant have the same parental origin as the two alleles of the anchored variant.
2. The method for structural variant calling and genotyping suitable for long-read family sample sequencing according to claim 1, characterized in that: in Step 1, the genome sequences of the three members of each family are obtained; the genome sequence of each member is compared with the human reference genome to obtain the genes with structural variations and the genes without variations for each member; each gene with a structural variation is recorded with its length, location, variation type, and origin; each gene without variation is recorded with its length, location, and origin; the structural variation genes of the three members are placed into the structural variation feature set of the corresponding family; the specific process is as follows:
Step 11: Obtain the genome sequences of the three members of the A-th family;
Step 12: Compare the genome sequence of the father of the A-th family with the human reference genome to obtain the genes with structural variations and the genes without variations in the father; compare the genome sequence of the mother of the A-th family with the human reference genome to obtain the genes with structural variations and the genes without variations in the mother; compare the genome sequence of the offspring of the A-th family with the human reference genome to obtain the genes with structural variations and the genes without variations in the offspring; each gene with a structural variation is recorded with its length, location, variation type, and origin; each gene without variation is recorded with its length, location, and origin; record the number of genes without variations in the A-th family, which is the sum of the numbers of genes without variations in the father, the mother, and the offspring;
Step 13: Place the structural variation genes of the father, the mother, and the offspring into the structural variation feature set of the A-th family, and record the number of genes with structural variations in the structural variation feature set of the A-th family.
3. The method for structural variant calling and genotyping suitable for long-read family sample sequencing according to claim 2, characterized in that: in Step 2, clustering is performed based on the location and length of the genes with structural variations in the structural variation feature set of each family to obtain the clustering results for each family; based on the number of sequencing fragments supporting the alternative allele and the number of sequencing fragments supporting the reference allele within each cluster of each family, the likelihood probabilities of the three genotypes within each cluster for each member of each family are calculated; the genotype with the maximum likelihood probability is selected as the preliminary genotype of each cluster for each member of each family; the specific process is as follows:
Step 21: Perform the first clustering based on the location of the genes with structural variations in the structural variation feature set of each family to obtain the first clustering results for each family;
Step 22: Based on the first clustering results, perform a second clustering within each cluster of each family based on the length of each structural variation gene to obtain the second clustering results;
Step 23: Based on the second clustering results, obtain the number of sequencing fragments supporting the alternative allele and the number of sequencing fragments supporting the reference allele within each cluster; based on these two numbers, calculate the maximum likelihood estimates (MLE) of the three genotypes within each cluster for each member of each family, the three genotypes denoting no variation, heterozygous variation, and homozygous variation, respectively; select the genotype with the largest of the three values as the preliminary genotype of each cluster for each member of each family.
4. The method for structural variation identification and genotyping of long-read family sample sequencing according to claim 3, characterized in that: In step two, the first clustering is performed based on the location of genes with structural variations in the structural variation feature set of each family, resulting in the first clustering result for each family; the specific process is as follows: Step Two 11 Collect all the structural variation characteristics of each family Genes with structural variations are arranged from front to back according to their positions; The position of the first structural variant gene after sorting is The location of the second structural variant gene after sorting is The location of the third structural variant gene after sorting is The position of the fourth structural variant gene after sorting is And so on, after sorting, the first... The location of the structurally variable genes is as follows: After sorting, the first The location of the structurally variable genes is as follows: ; Step 212, Order , Indicates the ordinal index; after sorting, the first... Each structurally variable gene is placed in the first cluster; Steps 2, 1, and 3: Compare and sort the results. Location of structurally variable genes And after sorting the first Location of structurally variable genes ; like If the following requirement is met, then The corresponding sorted number The structural variant gene and the sequenced first If the structurally variable genes belong to the same cluster, proceed to step two, one, and four. like If the following requirement is not met, then The corresponding sorted number The structural variant gene and the sequenced first Since the structural variant genes do not belong to the same cluster, construct a new cluster and proceed to step two, one, and four. Step 214, Order Repeat steps two through three until the sorting is complete. 
The location of each structurally variable gene was used to obtain the first clustering result. Clusters, , This indicates the first clustering result. Clusters.
5. The method for identifying and genotyping structural variations in long-read family pedigree samples according to claim 4, characterized in that: In step two, based on the results of the first clustering, a second clustering is performed within each cluster of each family based on the length of each structural variant gene, to obtain the second clustering results; The specific process is as follows: Step 221, Order , This represents the index number of the cluster in the first clustering result; make , An index representing the structurally variable genes within each cluster; Step 222, compare the first Within the cluster, the first The length of each structural variant gene and the Within the cluster, the first The length of each structural variant gene ; like If the following requirements are met, then proceed to steps two, two, and three; like If the following requirement is not met, then The corresponding structural variant genes all the way up to the 1st The last structurally variable gene within a cluster belongs to a new cluster; Steps two, two, three, command Repeat step two until the first step is completed. All structurally variable genes within a cluster are used to obtain the target gene for the first cluster. A new cluster of clusters; Step 224, Order Repeat steps 222 to 223 until all clusters have been identified, and obtain the second clustering result.
6. The method for structural variation identification and genotyping of long-read family sample sequencing according to claim 5, characterized in that: In steps two and three, the number of sequencing fragments supporting alternative alleles and the number of sequencing fragments supporting reference alleles within each cluster are obtained based on the results of the second clustering. Based on the number of sequencing fragments supporting the alternative allele and the number of sequencing fragments supporting the reference allele within each cluster, the three genotypes within each cluster for each member of each family are calculated. , and The maximum likelihood probabilities of each; genotype This indicates no mutation; Indicates heterozygous variation; Indicates homozygous variation; choose , , The genotype corresponding to the maximum value is used as the initial genotype for each member of each family and each cluster; The specific process is as follows: Step 231: Based on the results of the second clustering, obtain the number of sequencing fragments supporting alternative alleles and the number of sequencing fragments supporting reference alleles within each cluster; Based on the number of sequencing fragments supporting the alternative allele and the number of sequencing fragments supporting the reference allele within each cluster, the three genotypes within each cluster for each member of each family are calculated. , and The maximum likelihood probabilities of each; The specific process is as follows: The number of sequencing fragments supporting alternative alleles within each cluster is equal to the total number of structurally variable genes within the cluster. The number of sequencing fragments supporting the reference allele within each cluster is equal to the total number of structurally unvariated genes within the cluster. Calculate the three genotypes within each cluster for each member of each family. 
, and The maximum likelihood probabilities of each are expressed as: in, Indicates genotype The maximum likelihood probability; Indicates genotype The maximum likelihood probability; Indicates genotype The maximum likelihood probability; This indicates the probability that a sequencing fragment is incorrectly aligned to a haploid. Indicates the number of sequencing fragments that support alternative alleles; Indicates the number of sequencing fragments supporting the reference allele; Step Two Three Two, Select , , The genotype corresponding to the maximum value is used as the initial genotype for each member of each family and each cluster.
7. The method for structural variation identification and genotyping of long-read family sample sequencing according to claim 6, characterized in that: in Step 3, based on the preliminary genotype of each member of each family in each cluster, it is determined whether the preliminary genotype of each cluster belongs to the first type of error, the second type of error, the third type of error, or no error; an error of the first, second, or third type is corrected according to its type, and a cluster with no error is left uncorrected, thereby obtaining the final ternary probability tuple over the three genotypes for each cluster; the final ternary probability tuple of each cluster is converted into a binary probability tuple of the SV occurring on the two haplotypes, and the binary probability tuple of the offspring is solved for, its first element being the probability that the SV appears on the first haplotype and its second element the probability that the SV appears on the second haplotype. The specific process is as follows:
Step 3.1: Based on the preliminary genotype of each member of each family in each cluster, determine whether the preliminary genotype of each cluster belongs to the first type of error. If so, correct the preliminary genotype of the cluster to obtain the final ternary probability tuple over the three genotypes for the cluster, and proceed to Step 3.4 with that tuple; if not, proceed to Step 3.2. Specifically, determine whether each cluster satisfies the following formula: [formula missing] where the symbols denote: the number of genes in a given cluster for a given member of each family; the condition that, for the first member of family A, the number of sequencing fragments supporting the reference allele in the cluster is 0 and the number of sequencing fragments supporting the alternative allele in the cluster is also 0; the average length of all structural variant genes within the cluster for the second member of family A; and the average length of all structural variant genes within the cluster for the third member of family A. If the condition is met, the cluster belongs to the first type of error, the preliminary genotype of the cluster is changed to the correct genotype, and Step 3.4 is executed; if the condition is not met, proceed to Step 3.2.
Step 3.2: Based on the preliminary genotype of each member of each family in each cluster, determine whether the preliminary genotype of each cluster belongs to the second type of error. If so, correct the preliminary genotype of the cluster to obtain the final ternary probability tuple over the three genotypes for the cluster, and proceed to Step 3.4 with that tuple; if not, proceed to Step 3.3. Specifically, determine whether each cluster satisfies the following formula: [formula missing] where the symbols denote: the number of structural variant genes within a given cluster for the first member of family A; the number of structural variant genes preset by the user; the condition that the first member of family A has no variation within the cluster; the condition that the other two members of family A are identified as heterozygous or homozygous at the SV locus in the same cluster; and the maximum likelihood probabilities of the three genotypes. If the condition is met, the cluster belongs to the second type of error, the preliminary genotype of the cluster is retained, and Step 3.4 is executed; if the condition is not met, proceed to Step 3.3.
Step 3.3: Based on the preliminary genotype of each member of each family in each cluster, determine whether the preliminary genotype of each cluster belongs to the third type of error. If so, correct the preliminary genotype of the cluster to obtain the final ternary probability tuple over the three genotypes for the cluster, and proceed to Step 3.4 with that tuple; if not, the preliminary genotype of the cluster is correct, and Step 3.4 is executed. Specifically, determine whether each cluster satisfies the following formula: [formula missing] where the symbols denote: the number of structural variant genes within a given cluster for the first member of family A; the number of structural variant genes preset by the user; the adjustment ratio for the minimum number of supports; the condition that the other two members of family A have no variation within the same cluster; and the maximum likelihood probabilities of the three genotypes. If the condition is met, the cluster belongs to the third type of error, the preliminary genotype of the cluster is corrected to obtain the final ternary probability tuple over the three genotypes, and Step 3.4 is executed; if the condition is not met, the cluster has no error, and Step 3.4 is executed.
Step 3.4: Based on the final ternary probability tuple over the three genotypes for each member of each family in each cluster, convert the ternary tuple into a binary probability tuple of the SV occurring on the two haplotypes, and solve for the binary probability tuple of the offspring, its first element being the probability that the SV appears on the first haplotype and its second element the probability that the SV appears on the second haplotype.
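The order of the checks in Steps 3.1 to 3.3 can be sketched as a simple cascade. The three test formulas are not reproduced in this text, so the predicates below are deliberately simplified placeholders (they ignore the length, preset-count, and adjustment-ratio terms the claim mentions). `MemberCall`, the predicate names, and the genotype labels are hypothetical and serve only to illustrate the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class MemberCall:
    n_ref: int      # fragments supporting the reference allele
    n_alt: int      # fragments supporting the alternative allele
    genotype: str   # preliminary genotype: "0/0", "0/1" or "1/1"


Trio = Sequence[MemberCall]  # index 0 is the member under test


def no_coverage(trio: Trio) -> bool:
    """Placeholder for the first test: no fragments at all for the member."""
    return trio[0].n_ref == 0 and trio[0].n_alt == 0


def lone_non_carrier(trio: Trio) -> bool:
    """Placeholder for the second test: member has no variation, others do."""
    return trio[0].genotype == "0/0" and all(m.genotype != "0/0" for m in trio[1:])


def lone_carrier(trio: Trio) -> bool:
    """Placeholder for the third test: member has a variant, others do not."""
    return trio[0].genotype != "0/0" and all(m.genotype == "0/0" for m in trio[1:])


def classify_error(
    trio: Trio,
    checks: Sequence[Callable[[Trio], bool]] = (no_coverage, lone_non_carrier, lone_carrier),
) -> int:
    """Return 1, 2 or 3 for the first matching error type, or 0 for no error."""
    for number, check in enumerate(checks, start=1):
        if check(trio):
            return number
    return 0
```

Because the checks run in order and stop at the first match, a cluster with no fragments for the tested member is reported as type 1 even if it would also satisfy a later test, which mirrors the "if not, proceed to the next step" wording of the claim.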
8. The method for structural variation identification and genotyping of long-read family sample sequencing according to claim 7, characterized in that: when the condition in Step 3.3 is met and the cluster belongs to the third type of error, the preliminary genotype of the cluster is corrected to obtain the final ternary probability tuple over the three genotypes for the cluster as follows: [formula missing] where the symbols denote: the corrected maximum likelihood probabilities of the three genotypes; the correction ratio; the ternary probability tuple of genotype maximum likelihood values before correction; and the ternary probability tuple of family genotype maximum likelihood values. The family genotype maximum likelihood values are calculated as follows: [formula missing] where the symbols denote: the maximum likelihood probabilities of the three family genotypes; the sum, over the three family members, of the number of sequencing fragments supporting the reference allele; the sum, over the three family members, of the number of sequencing fragments supporting the alternative allele; and the probability that a sequencing fragment is incorrectly aligned to a haplotype.
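Since neither formula of this claim survives in the text, the sketch below shows just one plausible reading: family likelihoods are computed from the fragment counts pooled over the three members, and the corrected tuple is a weighted mix of the individual and family tuples. The linear mixing rule, the binomial read model, and all names (`family_likelihoods`, `corrected_likelihoods`, `ratio`) are assumptions made for illustration.

```python
import math


def likelihoods(n_ref, n_alt, err=0.05):
    """Normalized likelihoods for (no variation, heterozygous, homozygous).

    Assumed binomial model: a fragment supports the alternative allele
    with probability err, 0.5 or 1 - err under the three genotypes.
    """
    log_l = [n_alt * math.log(p) + n_ref * math.log(1.0 - p)
             for p in (err, 0.5, 1.0 - err)]
    peak = max(log_l)
    weights = [math.exp(v - peak) for v in log_l]
    total = sum(weights)
    return tuple(w / total for w in weights)


def family_likelihoods(members, err=0.05):
    """Likelihoods from counts pooled over all members.

    `members` is a list of (n_ref, n_alt) pairs, one per family member.
    """
    total_ref = sum(n_ref for n_ref, _ in members)
    total_alt = sum(n_alt for _, n_alt in members)
    return likelihoods(total_ref, total_alt, err)


def corrected_likelihoods(individual, family, ratio):
    """Mix individual and family tuples; ratio is the assumed correction ratio."""
    mixed = [(1.0 - ratio) * i + ratio * f for i, f in zip(individual, family)]
    total = sum(mixed)
    return tuple(v / total for v in mixed)
```

With `ratio = 0` the individual call is unchanged, and with `ratio = 1` the pooled family evidence fully replaces it; intermediate values pull a weakly supported call toward what the rest of the family shows.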
9. The method for structural variation identification and genotyping of long-read family sample sequencing according to claim 8, characterized in that: in Step 3.4, based on the final ternary probability tuple over the three genotypes for each member of each family in each cluster, the ternary tuple is converted into a binary probability tuple of the SV occurring on the two haplotypes, and the binary probability tuple of the offspring is solved for, its first element being the probability that the SV appears on the first haplotype and its second element the probability that the SV appears on the second haplotype. The specific process is as follows:
Step 3.4.1: Based on the final ternary probability tuple over the three genotypes for the offspring member of each family in each cluster, convert the ternary tuple into a system of two equations, expressed as: [formula missing]
Step 3.4.2: Solve the system of two equations. If a solution exists, obtain the binary probability tuple of the SV occurring on the two haplotypes; if no solution exists, proceed to Step 3.4.3.
Step 3.4.3: Introduce Bayes' theorem to convert the ternary probability tuple of the offspring into a binary probability tuple, and solve for the binary probability tuple of the offspring. The specific process is as follows: let one event [symbol missing] be that the offspring inherits the SV from the father, and let another event [symbol missing] be that the offspring inherits the SV from the mother.
For inheritance from the father to the offspring, the conditional probability is calculated as: [formula missing] where the symbols denote: the probability of inheriting the SV from the father, conditional on the current sequencing data features; the current sequencing data features; the probability of observing the current sequencing data features and inheriting the SV from the father; the probability of observing the current sequencing data features; the probability of observing the current sequencing data features with the SV inherited from both the mother and the father; the probability of observing the current sequencing data features with the SV inherited from the father but not from the mother; the probability of the current sequencing data features given that the SV is inherited from both the mother and the father; the probability of inheriting the variant from the father; the probability of inheriting the variant from the mother; the probability of the current sequencing data features given that the SV is inherited from the father but not from the mother; the event that the variant is not inherited from the mother; the probability of not inheriting the variant from the mother; the probability that a sequencing fragment is incorrectly aligned to a haplotype; the probabilities of the three final genotypes of each cluster in the offspring; the probabilities of two final genotypes of each cluster in the father; the probabilities of the three final genotypes of each cluster in the mother; and the number of combinations.
For inheritance from the mother to the offspring, the conditional probability is calculated as: [formula missing] where the symbols denote: the probability of inheriting the SV from the mother, conditional on the current sequencing data features; the current sequencing data features; the probability of observing the current sequencing data features and inheriting the SV from the mother; the probability of observing the current sequencing data features; the probability of observing the current sequencing data features with the SV inherited from both the mother and the father; the probability of observing the current sequencing data features with the SV inherited from the mother but not from the father; the probability of the current sequencing data features given that the SV is inherited from both the mother and the father; the probability of inheriting the variant from the father; the probability of inheriting the variant from the mother, obtained from the probabilities of the mother's genotypes; the probability of the current sequencing data features given that the SV is inherited from the mother but not from the father; the event that the variant is not inherited from the father; the probability of not inheriting the variant from the father; the probability that a sequencing fragment is incorrectly aligned to a haplotype; and the probability of a final genotype of each cluster in the father.
Based on the two conditional probabilities, obtain the binary probability tuple of the SV occurring on the two haplotypes, where [formula missing].
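The system of two equations in Step 3.4.1 is not reproduced here. One natural reading, used below purely as an assumption, is that the two haplotypes carry the SV independently with probabilities h1 and h2, so that the homozygous probability equals h1·h2 and the heterozygous probability equals h1(1 − h2) + h2(1 − h1). Under that assumption h1 and h2 are the roots of a quadratic, and a negative discriminant corresponds to the "no solution" branch of Step 3.4.2. The function name and tolerance are illustrative.

```python
import math
from typing import Optional, Tuple


def ternary_to_binary(
    p00: float, p01: float, p11: float, tol: float = 1e-9
) -> Optional[Tuple[float, float]]:
    """Solve for (h1, h2) from genotype probabilities, or return None.

    Assumed system (not taken from the source):
        p11 = h1 * h2
        p01 = h1 * (1 - h2) + h2 * (1 - h1)
    which gives h1 + h2 = p01 + 2 * p11, so h1 and h2 are the roots of
    x**2 - (p01 + 2 * p11) * x + p11 = 0.
    """
    if abs(p00 + p01 + p11 - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    total = p01 + 2.0 * p11      # h1 + h2
    product = p11                # h1 * h2
    disc = total * total - 4.0 * product
    if disc < -tol:
        return None              # no real solution: use the Bayesian fallback
    root = math.sqrt(max(disc, 0.0))
    h1 = (total + root) / 2.0
    h2 = (total - root) / 2.0
    if h2 < -tol or h1 > 1.0 + tol:
        return None              # roots are not valid probabilities
    return (min(h1, 1.0), max(h2, 0.0))
```

A confidently heterozygous call such as (0, 1, 0) maps to (1, 0), meaning the SV sits on exactly one haplotype, whereas a tuple like (0.5, 0, 0.5) has no consistent pair and would be handed to Step 3.4.3.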
10. The method for structural variation identification and genotyping of long-read family sample sequencing according to claim 9, characterized in that: in Step 4, a first handshake is performed on the binary probability tuples of the offspring obtained in Step 3 to obtain each anchored variant among the offspring SVs; each anchored variant is extended backward as far as possible to obtain the haplotype block corresponding to that anchored variant; starting from those haplotype blocks, each anchored variant is extended forward as far as possible to obtain the final haplotype block corresponding to that anchored variant; based on the final haplotype blocks, it is determined whether adjacent haplotype blocks contain duplicate sequencing fragments; if so, the adjacent blocks are merged into a new block and the duplicate region is retained only once; if not, they are not merged; the two alleles of each non-anchored variant within a block have the same parental origin as the two alleles of the anchored variant. The specific process is as follows:
Step 4.1: Perform the first handshake on the binary probability tuples of the offspring obtained in Step 3 to obtain each anchored variant among the offspring SVs. Specifically: based on the binary probability tuples of the offspring obtained in Step 3, screen all offspring SVs for variants for which the parental origin of the two alleles can be determined directly; each such offspring SV is used as an anchored variant; at the same time, the sequencing fragments belonging to each such offspring SV are assigned to the corresponding haplotype.
Step 4.2: Extend each anchored variant among the offspring SVs backward as far as possible to obtain the haplotype block corresponding to each anchored variant. The specific process is as follows:
Step 4.2.1: Let [assignment missing], where the assigned symbol denotes the index of the anchored variant; let [assignment missing], where the assigned symbol denotes the index of the non-anchored SV.
Step 4.2.2: Starting from the current anchored variant, connect backward. For the current non-anchored SV, determine whether the following formula is satisfied: [formula missing] If the formula is not satisfied, stop; the current non-anchored SV is not connected after the current anchored variant, and the haplotype block for the current anchored variant is obtained. If the formula is satisfied, connect the current non-anchored SV after the current anchored variant and proceed to Step 4.2.3.
Step 4.2.3: Let [assignment missing].
Step 4.2.4: Continue connecting backward from the current anchored variant. For the current non-anchored SV, determine whether the formula is satisfied. If it is not satisfied, stop; the current non-anchored SV is not connected after the preceding non-anchored variant, and the haplotype block for the current anchored variant is obtained. If it is satisfied, connect the current non-anchored SV after the preceding non-anchored variant, and repeat Steps 4.2.3 to 4.2.4 until all non-anchored SVs after the current anchored variant have been evaluated, obtaining the haplotype block for the current anchored variant.
Step 4.2.5: Let [assignment missing], and repeat Steps 4.2.2 to 4.2.5 until the haplotype blocks of all anchored variants are obtained.
In the formula, the symbols denote: the haplotype block corresponding to the current anchored variant; the extended haplotype block corresponding to the current anchored variant; the set of sequencing fragments supporting the alternative allele in the haplotype block corresponding to the current anchored variant; the set of sequencing fragments supporting the alternative allele in the extended haplotype block; and the information on the current variant gene, which includes its genotype, length, and position; when [condition missing], the fragment set denotes all sequencing fragments supporting the reference allele within the span; when [condition missing], the fragment set denotes all sequencing fragments supporting the alternative allele within the span.
Step 4.3: Based on the haplotype blocks obtained in Step 4.2, extend each anchored variant among the offspring SVs forward as far as possible to obtain the final haplotype block corresponding to each anchored variant. The specific process is as follows:
Step 4.3.1: Let [assignment missing], where the assigned symbol denotes the index of the anchored variant; let [assignment missing], where the assigned symbol denotes the index of the non-anchored SV.
Step 4.3.2: Starting from the current anchored variant, connect forward. For the current non-anchored SV, determine whether the following formula is satisfied: [formula missing] If the formula is not satisfied, stop; the current non-anchored SV is not connected before the current anchored variant, and the haplotype block for the current anchored variant is obtained. If the formula is satisfied, connect the current non-anchored SV before the current anchored variant and proceed to Step 4.3.3.
Step 4.3.3: Let [assignment missing].
Step 4.3.4: Continue connecting forward from the current anchored variant. For the current non-anchored SV, determine whether the formula is satisfied. If it is not satisfied, stop; the current non-anchored SV is not connected before the preceding non-anchored variant, and the haplotype block for the current anchored variant is obtained. If it is satisfied, connect the current non-anchored SV before the preceding non-anchored variant, and repeat Steps 4.3.3 to 4.3.4 until all non-anchored SVs before the current anchored variant have been evaluated, obtaining the final haplotype block for the current anchored variant.
Step 4.3.5: Let [assignment missing], and repeat Steps 4.3.2 to 4.3.5 until the final haplotype blocks of all anchored variants are obtained.
In the formula, the symbols denote: the haplotype block corresponding to the current anchored variant; the extended haplotype block corresponding to the current anchored variant; the set of sequencing fragments supporting the alternative allele in the haplotype block corresponding to the current anchored variant; the set of sequencing fragments supporting the alternative allele in the extended haplotype block; and the information on the current variant gene, which includes its genotype, length, and position; when [condition missing], the fragment set denotes all sequencing fragments supporting the reference allele within the span; when [condition missing], the fragment set denotes all sequencing fragments supporting the alternative allele within the span.
Step 4.4: Based on the final haplotype blocks obtained in Step 4.3, determine whether adjacent haplotype blocks contain duplicate sequencing fragments; if so, merge the adjacent blocks into a new block and retain the duplicate region only once; if not, do not merge them. The two alleles of each non-anchored variant within a block have the same parental origin as the two alleles of the anchored variant.
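The merging rule of Step 4.4 can be illustrated with a short sketch. It assumes each haplotype block is represented simply as a set of sequencing-fragment identifiers and that blocks are supplied in genomic order; both the representation and the function name `merge_adjacent_blocks` are illustrative choices, not details taken from the claims.

```python
from typing import List, Set


def merge_adjacent_blocks(blocks: List[Set[str]]) -> List[Set[str]]:
    """Merge neighbouring blocks that share at least one fragment.

    `blocks` must be in genomic order. Using a set union means a
    fragment that appears in both blocks is kept only once.
    """
    merged: List[Set[str]] = []
    for block in blocks:
        if merged and merged[-1] & block:
            # Shared fragment(s) found: fuse into the previous block.
            merged[-1] = merged[-1] | block
        else:
            # No shared fragments: start a new block (copy the input set).
            merged.append(set(block))
    return merged
```

For example, blocks `{"r1", "r2"}`, `{"r2", "r3"}`, and `{"r4"}` collapse into two blocks, because only the first two share a fragment. Merging is transitive along the chain: a block that overlaps an already merged block is absorbed into it.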