ITD mutation detection method and system based on NGS sequencing data
By analyzing NGS sequencing data, soft-clipped sequencing reads are classified by type and counted, and thresholds are applied to determine the ITD detection result. This addresses the lack of ITD detection methods in the existing technology, allows the ITD location to be confirmed accurately, and is compatible with multiple analysis workflows.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI TISSUEBANK GENE TECH CO LTD
- Filing Date
- 2026-05-12
- Publication Date
- 2026-06-12
Figure CN122201433A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of biological detection technology, and in particular relates to an ITD mutation detection method and system based on NGS sequencing data.
Background Technology
[0002] Internal tandem duplication (ITD) is a type of mutation in which a segment of DNA sequence within a gene or gene fragment is arranged repeatedly in the same direction. This duplication is usually located in the coding or regulatory regions of a gene and can alter the gene's reading frame, leading to abnormal protein structures or affecting gene expression regulation. Particularly in cancer and other genetic diseases, it often results in significant changes in protein structure and function. The length and location of ITDs can vary, but their core characteristic is the internal repetitive sequence.
[0003] The formation mechanisms of ITDs include: DNA replication slippage, in which DNA polymerase slips along the template strand during DNA replication, leading to the insertion of repetitive sequences; non-homologous end joining (NHEJ), in which double-strand breaks are rejoined through an error-prone repair pathway during DNA repair, resulting in the insertion of repetitive sequences; and unequal crossing over between chromosomes during meiosis, which generates repetitive sequences. Based on the location and function of the repetitive sequences, ITDs can be classified into the following types: coding region ITDs, which are located within the coding region (exon) of a gene and typically cause significant changes in protein structure, such as the FLT3-ITD mutation, which leads to persistent activation of the receptor tyrosine kinase; and non-coding region ITDs, which are located within a non-coding region of a gene (such as an intron, promoter, or enhancer region) and may affect transcriptional regulation or RNA processing.
[0004] The role and mechanism of ITDs in various diseases, especially cancer, are of clinical importance. For example, in acute myeloid leukemia (AML), FLT3-ITD is one of the most common mutations and leads to abnormal activation of the receptor tyrosine kinase, which promotes the proliferation and survival of leukemia cells; in myelodysplastic syndrome (MDS), ITDs are also found and affect cell differentiation and proliferation; and ITDs occur in other cancers, such as PDGFRA ITD mutations in gastrointestinal stromal tumors (GIST).
[0005] ITDs therefore have important clinical significance. An ITD can serve as a diagnostic marker for certain diseases; for example, FLT3-ITD is an important molecular marker for AML. ITDs are often associated with disease prognosis; for example, FLT3-ITD predicts a poor prognosis in AML. Gene products carrying ITD mutations can serve as targets for targeted therapy; for example, FLT3 inhibitors are used to treat AML patients with FLT3-ITD mutations.
[0006] However, there is currently no ITD detection method based on NGS sequencing data specifically for multiple gene mutations.
Summary of the Invention
[0007] The purpose of this invention is to provide an ITD mutation detection method based on NGS sequencing data, which aims to solve the problems mentioned in the background art.
[0008] The present invention is implemented as follows: an ITD mutation detection method based on NGS sequencing data includes the following steps: aligning the NGS sequencing data of the sample to be tested to the reference genome sequence and extracting multiple soft-clipped sequencing reads; classifying the soft-clipped sequencing reads as start type, insert type, or end type based on the characteristics of ITDs in the sequencing data; determining the start and / or end point of each ITD based on the type of the soft-clipped sequencing reads; counting all supporting sequencing reads of each ITD and their corresponding types; and determining, based on preset thresholds and the number and types of the supporting sequencing reads of each ITD, the detection result of each ITD, which is output as negative, positive, or uncertain.
[0009] Preferably, in the step of determining the start and / or end point of each ITD according to the type of the soft-clipped sequencing reads: for the start type, the breakpoint at which the soft clip is generated indicates the end point of the ITD; for the insert type, the breakpoint of the soft-clipped sequencing read indicates the start and end points of the ITD; and for the end type, the breakpoint at which the soft clip is generated indicates the start point of the ITD.
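The mapping from read type to breakpoint can be illustrated with a minimal sketch, assuming each alignment is available as a SAM-style record with a 1-based leftmost position and a CIGAR string. Following the description of Figure 2, a soft clip at the 3' end is treated as start type and a soft clip at the 5' end as end type. The function name `classify_read`, the `min_len` cutoff, and the reading of "3' end" as the right-hand end of the alignment in reference coordinates are assumptions made for illustration only.

```python
import re
from typing import Optional, Tuple

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def classify_read(pos: int, cigar: str, min_len: int = 5) -> Optional[Tuple[str, int]]:
    """Return (read_type, breakpoint) for one aligned read, or None.

    pos     : 1-based leftmost aligned reference position (SAM POS).
    cigar   : CIGAR string of the alignment.
    min_len : hypothetical minimum clip/insertion length to count as evidence.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return None
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)

    first_n, first_op = ops[0]
    last_n, last_op = ops[-1]
    if last_op == "S" and last_n >= min_len:
        # Start type: the last aligned base marks the ITD end point.
        return ("start", pos + ref_len - 1)
    if first_op == "S" and first_n >= min_len:
        # End type: the first aligned base marks the ITD start point.
        return ("end", pos)

    ref = pos
    for n, op in ops:
        if op == "I" and n >= min_len:
            # Insert type: the insertion sits right after reference base ref - 1.
            return ("insert", ref - 1)
        if op in REF_CONSUMING:
            ref += n
    return None


print(classify_read(100, "60M40S"))     # ('start', 159)
print(classify_read(200, "40S60M"))     # ('end', 200)
print(classify_read(100, "50M30I20M"))  # ('insert', 149)
```

A read with soft clips at both ends is reported as start type here; that tie-break is a simplification of this sketch and is not taken from the text.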
[0010] Preferably, the step of determining, based on preset thresholds and the number and types of the supporting sequencing reads of each ITD, the detection result of each ITD as negative, positive, or uncertain specifically includes: determining whether the number of supporting sequencing reads of each ITD is higher than the base level threshold; if so, setting the corresponding ITD as a candidate ITD; otherwise, outputting the detection result of the corresponding ITD as negative; for each candidate ITD, matching the supporting sequencing reads by type; if a candidate ITD is supported by two or more types of supporting sequencing reads that match, setting the candidate ITD as a paired ITD and outputting the detection result of the paired ITD as positive; if a candidate ITD is supported by only one type of supporting sequencing read and cannot be matched, setting the candidate ITD as an isolated ITD; determining whether the number of supporting sequencing reads of the isolated ITD is greater than the uncertainty threshold; when it is greater than the uncertainty threshold, further determining whether it is greater than the significance level threshold, and if so, outputting the detection result of the isolated ITD as positive, otherwise outputting it as uncertain; and when it is less than or equal to the uncertainty threshold, determining whether the supporting sequencing reads of the isolated ITD are of the insert type, and if so, outputting the detection result of the isolated ITD as uncertain, otherwise outputting it as negative.
[0011] Preferably, the base level threshold is 10.
[0012] Preferably, the uncertainty threshold is 10.
[0013] Preferably, the significance level threshold is 20.
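The threshold-based decision described above can be sketched as a small function. This is a minimal illustration, assuming the per-type read counts passed in have already been matched to a single ITD; the function name `call_itd` and the dictionary input are hypothetical, while the default values 10, 10, and 20 and the strict "greater than" comparisons follow the text.

```python
from typing import Dict


def call_itd(type_counts: Dict[str, int],
             base_level: int = 10,
             uncertain_level: int = 10,
             prominent_level: int = 20) -> str:
    """Return 'negative', 'positive', or 'uncertain' for one ITD.

    type_counts maps a read type ('start', 'insert', 'end') to the number
    of supporting reads of that type for this ITD.
    """
    total = sum(type_counts.values())
    # Too little support: not even a candidate ITD.
    if total <= base_level:
        return "negative"

    types = {t for t, n in type_counts.items() if n > 0}
    # Paired ITD: two or more evidence types agree on the same ITD.
    if len(types) >= 2:
        return "positive"

    # Isolated ITD: only one evidence type is present.
    if total > uncertain_level:
        return "positive" if total > prominent_level else "uncertain"
    return "uncertain" if types == {"insert"} else "negative"


print(call_itd({"start": 8, "end": 6}))  # positive
print(call_itd({"start": 15}))           # uncertain
print(call_itd({"insert": 25}))          # positive
```

With the stated values, every candidate ITD already has more than 10 supporting reads, so the final insert-type check is only reached when the uncertainty threshold is configured above the base level threshold, for example `call_itd({"insert": 12}, uncertain_level=15)`.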
[0014] Another objective of this invention is to provide an ITD mutation detection system based on NGS sequencing data, for implementing the above method, comprising: a sequence alignment and extraction module, used to align the NGS sequencing data of the sample to be tested to the reference genome sequence and extract multiple soft-clipped sequencing reads; a sequence classification module, used to classify the soft-clipped sequencing reads into three types (start type, insert type, and end type) based on the characteristics of ITDs in the sequencing data; an ITD start and end point determination module, used to determine the start and / or end point of each ITD based on the type of the soft-clipped sequencing reads; a sequence information statistics module, used to count all supporting sequencing reads of each ITD and their corresponding types; and a judgment output module, used to make a judgment based on preset thresholds and the number and types of the supporting sequencing reads obtained by the sequence information statistics module, and to output the detection result of each ITD as negative, positive, or uncertain.
[0015] This invention provides an ITD mutation detection method based on NGS sequencing data. Starting from NGS sequencing data, it can identify ITDs that may be present, further analyze and confirm suspected ITD regions, and report a more accurate ITD location. Moreover, the system provided by this invention does not interfere with other data analyses and can be run alongside other analysis software or embedded in multiple analysis workflows.
Attached Figure Description
[0016] Figure 1 is a flowchart of an ITD mutation detection method based on NGS sequencing data provided in an embodiment of the present invention; Figure 2 is a schematic diagram of the principle of locating the start and end points of an ITD provided in an embodiment of the present invention, in which panel A shows the start-type principle, which produces a 3' end soft clip, panel B shows the insert-type principle, which produces an insertion, and panel C shows the end-type principle, which produces a 5' end soft clip; Figure 3 is an SNP linkage diagram provided in an embodiment of the present invention; Figure 4 is a structural block diagram of an ITD mutation detection system based on NGS sequencing data provided in an embodiment of the present invention.
Detailed Implementation
[0017] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0018] The specific implementation of the present invention will be described in detail below with reference to specific embodiments.
[0019] Figure 1 shows a flowchart of an ITD mutation detection method based on NGS sequencing data provided in an embodiment of the present invention, which includes the following steps:
Step S1: Align the NGS sequencing data of the sample to be tested to the reference genome sequence, and extract multiple soft clip reads. The data generated by NGS consists of sequencing reads. Because an ITD makes the ITD region inconsistent with the reference genome sequence, a large number of soft clip reads are generated when sequencing reads carrying the ITD are aligned to the reference genome sequence. Therefore, the presence of a large number of soft clip reads in a region indicates that an ITD has occurred there.
Step S2: Based on the characteristics of ITDs in the sequencing data, classify the soft clip reads as start type, insert type, or end type. The sequencing reads generated by an ITD are classified according to the situation. A sequencing read that falls on the boundary between the two components of the ITD can form a soft clip read in several possible ways when aligned to the human genome reference sequence (hg19) (as shown in A and C of Figure 2); a sequencing read that completely covers both components of the ITD can form an insertion in several possible ways when aligned to hg19 (as shown in B of Figure 2). These possibilities are named as three types: start type (as shown in A of Figure 2), insert type (as shown in B of Figure 2), and end type (as shown in C of Figure 2). For the start type, the breakpoint at which the soft clip is produced indicates the end point of the ITD (as shown in ② of A in Figure 2). This is because the soft-clipped portion of the read (yellow, as shown in ① of A in Figure 2) corresponds to the front end of the second component of the ITD.
Since these two components are completely identical, this soft-clipped portion can necessarily be aligned to the front end of the first component (as shown in ③ of A in Figure 2), from which the end point of the ITD can be determined. For the insert type, since the ITD is an extra duplicated segment, it appears as an insertion after alignment to the human genome reference (hg19). The start and end points can be found simply by aligning the inserted sequence near the insertion point. For the end type, the breakpoint at which the soft clip is produced indicates the start point of the ITD (as shown in ② of C in Figure 2). This is because the soft-clipped portion of the read (green, as shown in ① of C in Figure 2) corresponds to the rear end of the first component of the ITD, and since the two components are completely identical, this soft-clipped portion can necessarily be aligned to the rear end of the second component (as shown in ③ of C in Figure 2), from which the start point of the ITD can be determined.
Step S3: Determine the start and / or end point of each ITD based on the type of the soft clip reads.
Step S4: Count all supporting sequencing reads of each ITD and their corresponding types.
Step S5: Based on the preset thresholds and the number and types of supporting sequencing reads obtained in Step S4, determine the detection result of each ITD as negative, positive, or uncertain. As shown in Figure 3, this includes the following sub-steps:
Step S5-1: Determine whether the number of supporting sequencing reads of each ITD is higher than the base level threshold. If so, set the corresponding ITD as a candidate ITD; otherwise, output the detection result of the corresponding ITD as negative. Specifically, each sequencing read can identify the start and end points of an ITD, which can be regarded as a candidate ITD, and the sequencing read can be regarded as evidence for this candidate ITD.
Next, all candidate ITDs are traversed, and the amount of evidence for each candidate ITD (the number of sequencing reads supporting it) is counted. The more evidence a candidate ITD has, the more likely it is to be a real ITD. In this embodiment, the base level threshold is 10.
Step S5-2: For each candidate ITD, match the supporting sequencing reads by type. If a candidate ITD is supported by two or more types of supporting sequencing reads that match, the candidate ITD is set as a paired ITD, and the detection result of the paired ITD is output as positive. If a candidate ITD is supported by only one type of supporting sequencing read and cannot be matched, the candidate ITD is set as an isolated ITD, and the method proceeds to step S5-3. Specifically, sequencing reads randomly cover various positions of the ITD, so all three types of evidence will appear. If two or more different types of evidence point to the same ITD, it is highly likely that a true ITD is present at this location. In addition, when a chance sequence similarity causes only the head or only the tail of a putative ITD to align, the resulting match usually has one-sided evidence only, which is recorded in the result file as only start or only end. A candidate with only one-sided evidence is very likely not a real ITD.
Step S5-3: Determine whether the number of supporting sequencing reads of the isolated ITD is greater than the uncertainty threshold. If so, proceed to step S5-4; otherwise, proceed to step S5-5. In this embodiment, the uncertainty threshold is 10.
Step S5-4: Determine whether the number of supporting sequencing reads of the isolated ITD is greater than the significance level threshold (prominent_level). If so, output the detection result of the isolated ITD as positive; otherwise, output the detection result of the isolated ITD as uncertain.
In this embodiment, the significance level threshold is 20.
Step S5-5: Determine whether the supporting sequencing reads of the isolated ITD are of the insert type. If so, output the detection result of the isolated ITD as uncertain; otherwise, output the detection result of the isolated ITD as negative.
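The step in which the soft-clipped portion of a start-type read is aligned back to the front end of the first component can be sketched as follows. This is a simplified illustration, not the alignment procedure of the embodiment: it assumes an exact match between the two components, uses a plain substring search instead of an aligner, and takes the nearest upstream match. The function name `locate_itd_from_start_read`, the 0-based coordinates, and the parameters `seed_len` and `max_itd_len` are hypothetical.

```python
from typing import Optional, Tuple


def locate_itd_from_start_read(reference: str,
                               breakpoint: int,
                               clipped: str,
                               seed_len: int = 6,
                               max_itd_len: int = 400) -> Optional[Tuple[int, int]]:
    """Return 0-based inclusive (itd_start, itd_end), or None if not found.

    reference  : reference sequence of the region (plain string).
    breakpoint : 0-based index of the last reference base aligned before
                 the 3' soft clip; this is taken as the ITD end point.
    clipped    : soft-clipped bases of the read (front of the second copy).
    """
    if len(clipped) < seed_len:
        return None
    # Only a short prefix is used, because the clipped bases may run past
    # the end of the second copy into downstream sequence.
    seed = clipped[:seed_len]
    window_start = max(0, breakpoint + 1 - max_itd_len)
    window = reference[window_start:breakpoint + 1]
    # The nearest upstream occurrence marks the front of the first copy.
    hit = window.rfind(seed)
    if hit == -1:
        return None
    return (window_start + hit, breakpoint)


# Toy reference: the segment GATTACAGGCTA occupies positions 8..19.
reference = "ACGTTGCA" + "GATTACAGGCTA" + "TTCCGGAA"
# A read crossing the junction of the two copies is clipped after position 19,
# and its clipped bases start with the front of the duplicated segment.
print(locate_itd_from_start_read(reference, 19, "GATTACA"))  # (8, 19)
```

The end-type case is symmetrical: the breakpoint gives the start point, and the tail of the clipped bases would be searched downstream to find the end point.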
[0020] Figure 4 shows a structural block diagram of an ITD mutation detection system based on NGS sequencing data according to an embodiment of the present invention, comprising: a sequence alignment and extraction module 100, used to align the NGS sequencing data of the sample to be tested to the reference genome sequence and extract multiple soft clip reads; a sequence classification module 200, used to classify the soft clip reads into three types (start, insert, and end) based on the characteristics of ITDs in the sequencing data; an ITD start and end point determination module 300, used to determine the start and / or end point of each ITD based on the type of the soft clip reads; a sequence information statistics module 400, used to count all supporting sequencing reads of each ITD and their corresponding types; and a judgment output module 500, used to judge the detection result of each ITD as negative, positive, or uncertain based on the preset thresholds and the number and types of the supporting sequencing reads obtained by the sequence information statistics module.
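As an illustration of what the sequence information statistics module 400 hands to the judgment output module 500, the following minimal sketch groups per-read evidence by ITD and counts reads per type. The tuple layout, the function name `summarize_evidence`, and the coordinates in the example are made up for illustration and are not taken from the embodiment.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# One piece of evidence: (chromosome, itd_start, itd_end, read_type).
Evidence = Tuple[str, int, int, str]
ItdKey = Tuple[str, int, int]


def summarize_evidence(evidence: Iterable[Evidence]) -> Dict[ItdKey, Counter]:
    """Group evidence by ITD and count supporting reads per read type."""
    summary: Dict[ItdKey, Counter] = defaultdict(Counter)
    for chrom, start, end, read_type in evidence:
        summary[(chrom, start, end)][read_type] += 1
    return dict(summary)


# Hypothetical evidence for two candidate ITDs.
evidence = [
    ("chr13", 1000, 1059, "start"),
    ("chr13", 1000, 1059, "start"),
    ("chr13", 1000, 1059, "end"),
    ("chr13", 2000, 2029, "insert"),
]
for key, counts in summarize_evidence(evidence).items():
    # Total support and the per-type breakdown are both available per ITD.
    print(key, sum(counts.values()), dict(counts))
```

Keeping the per-type breakdown, rather than only a total, is what allows the later judgment to distinguish paired ITDs from isolated ones.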
[0021] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for detecting ITD mutations based on NGS sequencing data, characterized in that it includes the following steps: aligning the NGS sequencing data of the sample to be tested to the reference genome sequence and extracting multiple soft-clipped sequencing reads; classifying the soft-clipped sequencing reads as start type, insert type, or end type based on the characteristics of ITDs in the sequencing data; determining the start and / or end point of each ITD based on the type of the soft-clipped sequencing reads; counting all supporting sequencing reads of each ITD and their corresponding types; and determining, based on preset thresholds and the number and types of the supporting sequencing reads of each ITD, the detection result of each ITD, which is output as negative, positive, or uncertain.
2. The ITD mutation detection method based on NGS sequencing data according to claim 1, characterized in that, in the step of determining the start and / or end point of each ITD based on the type of the soft-clipped sequencing reads: for the start type, the breakpoint at which the soft clip is generated indicates the end point of the ITD; for the insert type, the breakpoint of the soft-clipped sequencing read indicates the start and end points of the ITD; and for the end type, the breakpoint at which the soft clip is generated indicates the start point of the ITD.
3. The ITD mutation detection method based on NGS sequencing data according to claim 1, characterized in that the step of determining, based on preset thresholds and the number and types of the supporting sequencing reads of each ITD, the detection result of each ITD as negative, positive, or uncertain includes: determining whether the number of supporting sequencing reads of each ITD is higher than the base level threshold; if so, setting the corresponding ITD as a candidate ITD; otherwise, outputting the detection result of the corresponding ITD as negative; for each candidate ITD, matching the supporting sequencing reads by type; if a candidate ITD is supported by two or more types of supporting sequencing reads that match, setting the candidate ITD as a paired ITD and outputting the detection result of the paired ITD as positive; if a candidate ITD is supported by only one type of supporting sequencing read and cannot be matched, setting the candidate ITD as an isolated ITD; determining whether the number of supporting sequencing reads of the isolated ITD is greater than the uncertainty threshold; when it is greater than the uncertainty threshold, further determining whether it is greater than the significance level threshold, and if so, outputting the detection result of the isolated ITD as positive, otherwise outputting it as uncertain; and when it is less than or equal to the uncertainty threshold, determining whether the supporting sequencing reads of the isolated ITD are of the insert type, and if so, outputting the detection result of the isolated ITD as uncertain, otherwise outputting it as negative.
4. The ITD mutation detection method based on NGS sequencing data according to claim 3, characterized in that the base level threshold is 10.
5. The ITD mutation detection method based on NGS sequencing data according to claim 3, characterized in that the uncertainty threshold is 10.
6. The ITD mutation detection method based on NGS sequencing data according to claim 3, characterized in that the significance level threshold is 20.
7. An ITD mutation detection system based on NGS sequencing data, used to implement the method as described in any one of claims 1-6, characterized in that it includes: a sequence alignment and extraction module, used to align the NGS sequencing data of the sample to be tested to the reference genome sequence and extract multiple soft-clipped sequencing reads; a sequence classification module, used to classify the soft-clipped sequencing reads into three types (start type, insert type, and end type) based on the characteristics of ITDs in the sequencing data; an ITD start and end point determination module, used to determine the start and / or end point of each ITD based on the type of the soft-clipped sequencing reads; a sequence information statistics module, used to count all supporting sequencing reads of each ITD and their corresponding types; and a judgment output module, used to make a judgment based on preset thresholds and the number and types of the supporting sequencing reads obtained by the sequence information statistics module, and to output the detection result of each ITD as negative, positive, or uncertain.