Method and device for detecting hemolysis contamination of a sample
By constructing a cluster analysis model and using insert fragment length analysis to detect gDNA contamination in NGS technology, the problem of signal dilution caused by leukocyte contamination was solved, thus improving the accuracy of early tumor diagnosis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGZHOU BURNING ROCK DX CO LTD
- Filing Date
- 2022-12-06
- Publication Date
- 2026-06-23
AI Technical Summary
Existing NGS technology is subject to genomic DNA contamination from leukocyte rupture and release when detecting cfDNA in blood, which dilutes the signal and can lead to screening or diagnostic errors, especially in early tumor diagnosis. Current methods cannot achieve a contamination detection sensitivity of <0.001.
By analyzing the length of the inserted fragments in the sequencing data, a clustering analysis model was constructed. A Gaussian mixture model was used for clustering and discriminant analysis to determine whether gDNA contamination existed in the cfDNA sample.
It achieves highly efficient and sensitive detection of sample hemolysis contamination, which can improve detection accuracy and reduce the error rate in early tumor diagnosis.
Abstract
Description
TECHNICAL FIELD
[0001] The present application relates to the field of biotechnology, and in particular to a method for detecting hemolysis contamination of a sample, which monitors the length difference between cfDNA and gDNA to determine whether the sample is contaminated by hemolysis.
BACKGROUND
[0002] The rise of next-generation sequencing technology (NGS) is crucial for the detection of analytes and its application in molecular biology and medicine. For example, as the gold standard for methylation sequencing, bisulfite sequencing (BS-seq) has a single-base resolution and high-throughput characteristics, and its role in cancer screening, diagnosis, and monitoring is increasingly recognized.
[0003] For NGS liquid biopsy, there is a risk of contamination of genomic DNA released by white blood cells during the extraction of cfDNA (cell-free DNA) from blood. If genomic DNA is mixed into the subsequent sequencing process, it will dilute the signal and reduce the accuracy of the results. This risk is even more serious in the diagnosis and screening of early tumors, because the tumor component in early tumor blood samples is usually very low (<0.001), and trace contamination of white blood cells can cause screening or diagnostic errors. However, current NGS contamination detection methods often cannot achieve the detection sensitivity of <0.001 contamination ratio.
[0004] The present application monitors the fragment length of cfDNA. Because DNA from WBCs (white blood cells) is not completely fragmented, its fragments appear longer in sequencing data (such as bisulfite sequencing data). Analyzing the insert length in the sequencing data can therefore be used to monitor whether a cfDNA sample is at risk of WBC contamination. The method disclosed in the present application can detect sample hemolysis contamination with high efficiency and high sensitivity.
SUMMARY
[0005] The present application relates to a method, device, equipment and storage medium for detecting sample hemolysis contamination, which can achieve high efficiency and high sensitivity in the application of detecting sample hemolysis contamination.
[0006] In one aspect, the present application provides a method for detecting genomic DNA (gDNA) contamination in a cell-free DNA (cfDNA) sample, comprising:
[0007] (1) obtaining the insert length of all reads in the sequencing results;
[0008] (2) constructing a cluster analysis model based on the insert lengths obtained in step (1) and obtaining the cluster analysis results;
[0009] (3) performing discriminant analysis based on the cluster analysis results obtained in step (2) to determine whether gDNA contamination exists in the cfDNA sample.
[0010] In another aspect, the present invention provides an apparatus for detecting genomic DNA (gDNA) contamination in cell-free DNA (cfDNA) samples, comprising:
[0011] a read length acquisition module configured to acquire the insert length of all reads in the sequencing results;
[0012] an analysis module configured to perform cluster analysis on the insert lengths acquired by the read length acquisition module;
[0013] a determination module configured to perform discriminant analysis based on the cluster analysis results obtained by the analysis module to determine whether gDNA contamination exists in the cfDNA sample.
[0014] In another aspect, the present invention provides an apparatus for detecting genomic DNA (gDNA) contamination in cell-free DNA (cfDNA) samples, comprising:
[0015] One or more processors;
[0016] A storage device on which one or more programs are stored;
[0017] When the above one or more programs are executed by the above one or more processors, the above one or more processors implement the method as described above.
[0018] In another aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method as described in any of the above aspects.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this specification and, together with the description, serve to explain the principles of this specification.
[0020] Figure 1 shows the insert length distribution in the simulated data. Figure 1A shows the insert length distribution of the background sample. Figure 1B shows the insert length distribution when 20% WBC contamination is mixed in.
DETAILED DESCRIPTION
[0021] I. Definition
[0022] In this invention, unless otherwise stated, the scientific and technical terms used herein have the meanings commonly understood by those skilled in the art. Furthermore, the terms and laboratory procedures related to protein and nucleic acid chemistry, molecular biology, cell and tissue culture, microbiology, and immunology used herein are all widely used terms and routine procedures in their respective fields. To better understand this invention, definitions and explanations of relevant terms are provided below.
[0023] As used herein, the terms "cell-free DNA sample" or "cfDNA sample" include blood or plasma samples containing cfDNA obtained from an individual, as well as samples obtained by processing such blood or plasma samples in any way to extract cfDNA, such as samples that have been treated with reagents, eluted, adsorbed, or enriched for specific types of molecules (e.g., nucleic acid molecules).
[0024] As used herein, the term "read" refers to sequence data describing a fragment of a nucleotide sample or reference. A read can be a sample read and/or a reference read. Typically, although not necessarily, a read represents a short sequence of contiguous base pairs in a sample or reference. A read can be represented symbolically by the base sequence (in A, T, C, G) of the sample or reference fragment. It can be stored in a storage device and processed as appropriate to determine whether the read matches a reference sequence or meets other criteria. Reads can be obtained directly from sequencing equipment or indirectly from stored sequence information associated with the sample. In some cases, a read is a DNA sequence of sufficient length (e.g., at least about 25 bp) to identify a larger sequence or region, for example so that it can be aligned and specifically assigned to a chromosome, genomic region, or gene.
[0025] Next-generation sequencing methods include, for example, sequencing by synthesis (Illumina), pyrosequencing (454), ion semiconductor sequencing (Ion Torrent), single-molecule real-time sequencing (Pacific Biosciences), and sequencing by ligation (SOLiD). Depending on the sequencing method, the length of each read can range from approximately 30 bp to greater than 10,000 bp. For example, sequencing by ligation using a SOLiD sequencer produces nucleic acid reads of approximately 50 bp. As another example, Ion Torrent sequencing produces nucleic acid reads of up to 400 bp, while 454 pyrosequencing produces nucleic acid reads of approximately 700 bp. As yet another example, single-molecule real-time sequencing can produce reads of 10,000 to 15,000 bp. Therefore, in some implementations, the length of nucleic acid sequence reads is 30-100 bp, 50-200 bp, or 50-400 bp.
[0026] As used herein, the term "insert" refers to the DNA fragment that actually originates from the sample being tested, that is, what remains of a sequenced DNA molecule after removing sequences such as adapters and tags added for sequencing purposes. "Insert size" refers to the length of this sample-derived DNA fragment, usually measured in base pairs (bp) or nucleotides (nt).
[0027] II. Detailed Implementation Plan
[0028] In one aspect, the present invention provides a method for detecting genomic DNA (gDNA) contamination in cell-free DNA (cfDNA) samples, comprising:
[0029] (1) obtaining the insert length of all reads in the sequencing results;
[0030] (2) constructing a cluster analysis model based on the insert lengths obtained in step (1), performing cluster analysis, and obtaining the cluster analysis results;
[0031] (3) performing discriminant analysis based on the cluster analysis results obtained in step (2) to determine whether gDNA contamination exists in the cfDNA sample.
[0032] In some implementations, the clustering analysis model in step (2) adopts a Gaussian model.
[0033] In some preferred embodiments, the clustering analysis model described above employs a Gaussian mixture model, which is composed of K single Gaussian models.
[0034] In some preferred embodiments, the K setting is selected from 2 or 3.
[0035] In some preferred embodiments, K is set to 3.
[0036] In some implementations, the discriminant analysis in step (3) includes: sorting the Gaussian distributions included in the clustering analysis model constructed in step (2) according to their expected values $\mu_k$, and setting the first peak based on the sorting result.
[0037] In some preferred embodiments, the Gaussian distributions included in the clustering analysis model are sorted by expected value $\mu_k$ in ascending order, and, based on the sorting result, the component with the smallest expected value $\mu_k$ is set as the first peak.
[0038] In some implementations, the discriminant analysis in step (3) includes: calculating the 90th or 95th percentile of the first peak, and determining whether the cfDNA sample is contaminated by gDNA by judging whether the 90th or 95th percentile of the first peak deviates from a preset threshold.
[0039] In some preferred embodiments, the 90th percentile of the first peak is calculated, and whether the cfDNA sample is contaminated with gDNA is determined by judging whether the 90th percentile of the first peak deviates from a preset threshold.
[0040] In some implementations, the preset threshold is selected from 180 to 190.
[0041] In some preferred embodiments, the threshold is 180.
[0042] In some implementations, the discriminant analysis in step (3) includes: calculating the 90th or 95th percentile of the first peak, and determining whether the 90th or 95th percentile of the first peak is higher than 180. If the 90th or 95th percentile of the first peak is higher than 180, the cfDNA sample is determined to be contaminated with gDNA.
[0043] In some preferred embodiments, the 90th percentile of the first peak is calculated, and it is determined whether the 90th percentile of the first peak is higher than 180. If the 90th percentile of the first peak is higher than 180, the cfDNA sample is determined to be contaminated with gDNA.
[0044] In some preferred embodiments, the 95th percentile of the first peak is calculated, and it is determined whether the 95th percentile of the first peak is higher than 190. If the 95th percentile of the first peak is higher than 190, the cfDNA sample is determined to be contaminated with gDNA.
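The threshold comparison described in the embodiments above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `flag_gdna_contamination` and the `CUTOFFS_BP` table are hypothetical, the cutoffs are taken to be in base pairs, and the pairing of the 90th percentile with 180 and the 95th percentile with 190 follows the preferred embodiments rather than the broader variant that compares either percentile against 180.

```python
from typing import Dict

# Percentile level -> cutoff (assumed to be in bp), per the preferred embodiments.
CUTOFFS_BP: Dict[int, float] = {90: 180.0, 95: 190.0}


def flag_gdna_contamination(first_peak_percentile_bp: float, level: int = 90) -> bool:
    """Return True when the first peak's percentile lies above the cutoff for that level."""
    if level not in CUTOFFS_BP:
        raise ValueError(f"unsupported percentile level: {level}")
    return first_peak_percentile_bp > CUTOFFS_BP[level]


# Illustrative inputs only: a value just below and a value well above each cutoff.
below_90 = flag_gdna_contamination(178.74, level=90)
above_90 = flag_gdna_contamination(199.78, level=90)
below_95 = flag_gdna_contamination(189.6, level=95)
above_95 = flag_gdna_contamination(216.63, level=95)
```

Keeping the cutoffs in a single table makes it straightforward to substitute a baseline-derived threshold for a different assay.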
[0045] In some implementations, the aforementioned gDNA contamination originates from ruptured white blood cells in blood samples.
[0046] In another aspect, the present invention provides an apparatus for detecting genomic DNA (gDNA) contamination in cell-free DNA (cfDNA) samples, comprising:
[0047] a read length acquisition module configured to acquire the insert length of all reads in the sequencing results;
[0048] an analysis module configured to perform cluster analysis on the insert lengths acquired by the read length acquisition module;
[0049] a determination module configured to perform discriminant analysis based on the cluster analysis results obtained by the analysis module to determine whether gDNA contamination exists in the cfDNA sample.
[0050] In some implementations, the analysis module further includes a clustering unit configured as a Gaussian mixture model, which is composed of K single Gaussian models.
[0051] In some preferred embodiments, the clustering unit is further configured such that the K setting is selected from 2 or 3.
[0052] In some preferred embodiments, the clustering unit is further configured such that K is set to 3.
[0053] In some implementations, the determination module further includes a sorting unit configured to sort the Gaussian distributions included in the clustering analysis model according to their expected values $\mu_k$ and to set the first peak based on the sorting result.
[0054] In some preferred embodiments, the sorting unit is further configured to sort the Gaussian distributions included in the clustering analysis model by expected value $\mu_k$ in ascending order and, based on the sorting result, to set the component with the smallest expected value $\mu_k$ as the first peak.
[0055] In some preferred embodiments, the determination module further includes a contamination determination unit configured to calculate the 90th or 95th percentile of the first peak, and determine whether the cfDNA sample is contaminated by gDNA by determining whether the 90th or 95th percentile of the first peak deviates from a preset threshold.
[0056] In some preferred embodiments, the contamination determination unit is further configured to calculate the 90th percentile of the first peak, and determine whether the cfDNA sample is contaminated by gDNA by determining whether the 90th percentile of the first peak deviates from a preset threshold.
[0057] In some preferred embodiments, the contamination determination unit is further configured to calculate the 90th percentile of the first peak, and determine whether the 90th percentile of the first peak is higher than 180. If the 90th percentile of the first peak is higher than 180, the cfDNA sample is determined to be contaminated with gDNA.
[0058] In some preferred embodiments, the contamination determination unit is further configured to calculate the 95th percentile of the first peak, and determine whether the 95th percentile of the first peak is higher than 190. If the 95th percentile of the first peak is higher than 190, the cfDNA sample is determined to be contaminated with gDNA.
[0059] In another aspect, the present invention provides an apparatus for detecting genomic DNA (gDNA) contamination in cell-free DNA (cfDNA) samples, comprising:
[0060] One or more processors;
[0061] A storage device on which one or more programs are stored;
[0062] When the above one or more programs are executed by the above one or more processors, the above one or more processors implement the method as described above.
[0063] In another aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method as described in any of the above aspects.
[0064] For the purpose of clarity and concise description, features are described herein as part of some identical or separate embodiments; however, it will be understood that the scope of the invention may include some embodiments having a combination of all or some of the features described.
[0065] Example
[0066] Experimental Procedure
[0067] cfDNA can be extracted from blood as follows: the DSP Circulating DNA Kit is used with the accompanying QIAsymphony SP instrument. The process includes the following steps: transferring the sample, binding buffer, proteinase K, and magnetic particles to the sample preparation kit; binding DNA to the magnetic particles; magnetic separation; washing; a second magnetic separation; and elution to obtain high-quality, pure cfDNA.
[0068] After extraction, the cfDNA is subjected to bisulfite treatment, followed by ELSA-seq single-strand library construction technology (see Chinese invention patent application publication: CN110892097A) to complete the construction of the cfDNA pre-library. Then, through hybridization capture targeting the pre-library, as well as capture elution and construction of the final library, the cfDNA is finally sequenced using a sequencer (e.g., Illumina platform) to obtain the sequencing results.
[0069] Example 1: Algorithm Construction
[0070] 1. Data Preparation
[0071] The first step of this method is to obtain the insert length of all reads in the cfDNA sequencing results. After the reads are aligned to a reference sequence to produce the alignment file (BAM file), the `CollectInsertSizeMetrics` tool of the Picard software is used to count the insert length of all reads in the BAM file. The Picard output file lists each insert length value with its corresponding read count, and these statistics are used in subsequent modeling and analysis.
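As a rough illustration of the data preparation step, the following Python sketch reads the insert length histogram from the text of a Picard insert-size output file. The layout assumed here (a `## HISTOGRAM` marker line, a header row, then tab-separated insert size and count columns) is an assumption about the file format, and the function name `parse_insert_size_histogram` and the tiny example text are hypothetical, not taken from the application.

```python
from typing import Dict


def parse_insert_size_histogram(metrics_text: str) -> Dict[int, int]:
    """Return {insert_size: read_count} from the HISTOGRAM section of the text."""
    histogram: Dict[int, int] = {}
    in_histogram = False
    for line in metrics_text.splitlines():
        if line.startswith("## HISTOGRAM"):
            in_histogram = True
            continue
        if not in_histogram:
            continue
        fields = line.strip().split("\t")
        # Skip the column header row and blank lines.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        # Sum all count columns (assumed: one column per pair orientation).
        histogram[int(fields[0])] = sum(int(float(c)) for c in fields[1:] if c)
    return histogram


# Made-up example in the assumed layout (not real sequencing data).
example = (
    "## METRICS CLASS\tpicard.analysis.InsertSizeMetrics\n"
    "MEDIAN_INSERT_SIZE\tMODE_INSERT_SIZE\n"
    "147\t146\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Integer\n"
    "insert_size\tAll_Reads.fr_count\n"
    "145\t10\n"
    "146\t30\n"
    "147\t25\n"
)
hist = parse_insert_size_histogram(example)
```

The resulting dictionary of length-to-count pairs is a compact input for the clustering step, since the model can be fitted on distinct lengths weighted by their counts rather than on every read.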
[0072] 2. Algorithm Construction
[0073] 2.1 Fragment Classification
[0074] The second step is to perform cluster analysis on the inserted fragment lengths based on the output file from the data preparation. The clustering model is a Gaussian Mixture Model (GMM). A GMM consists of K Gaussian distributions, each of which is a component of the GMM. Since the inserted fragment lengths are one-dimensional data, the Gaussian distribution follows the following probability density function:
[0075] $$\varphi(x \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad \theta = (\mu, \sigma^2)$$
[0076] The probability density function of the GMM is as follows:
[0077] $$P(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \varphi(x \mid \theta_k)$$
[0078] where $0 \le \pi_k \le 1$;
[0079] $\varphi(x \mid \theta_k)$ is the Gaussian density function of the k-th component, with $\theta_k = (\mu_k, \sigma_k^2)$;
[0080] $\pi_k$ can be viewed as the weight of each Gaussian distribution. For WBC contamination monitoring, a univariate Gaussian distribution is used, with the initial number of components K set to 3. Assuming there are N reads with insert lengths $x_1, \dots, x_N$ and that the reads are independent, the likelihood function $L(\theta) = \prod_{i=1}^{N} P(x_i \mid \theta)$ can be used. Maximum likelihood estimation is performed to solve for $\theta$, and the insert lengths are assigned to the different Gaussian distributions based on the optimization results.
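The text above states only that the parameters are obtained by maximum likelihood estimation. One common way to do this for a Gaussian mixture is the expectation-maximization (EM) algorithm, sketched below in plain Python for a histogram of insert lengths. The choice of EM, the function names, the initial means and standard deviation, the iteration count, and the synthetic histogram are all assumptions made for illustration and are not taken from the application.

```python
import math
from typing import Dict, List, Tuple


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(hist: Dict[int, int], init_means: List[float], init_sigma: float = 20.0,
               n_iter: int = 200, min_sigma: float = 1.0) -> List[Tuple[float, float, float]]:
    """Fit a K-component 1-D GMM to {length: count} by EM.

    Returns (weight, mean, sigma) triples sorted by ascending mean.
    """
    xs = sorted(hist)
    ws = [hist[x] for x in xs]
    total = float(sum(ws))
    k = len(init_means)
    pis = [1.0 / k] * k
    mus = list(init_means)
    sigmas = [init_sigma] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each distinct length.
        resp = []
        for x in xs:
            dens = [pis[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            s = sum(dens) or 1e-300
            resp.append([d / s for d in dens])
        # M-step: count-weighted updates of weight, mean, and standard deviation.
        for j in range(k):
            nj = sum(w * r[j] for w, r in zip(ws, resp))
            if nj <= 0.0:
                continue
            mu = sum(w * r[j] * x for x, w, r in zip(xs, ws, resp)) / nj
            var = sum(w * r[j] * (x - mu) ** 2 for x, w, r in zip(xs, ws, resp)) / nj
            pis[j], mus[j], sigmas[j] = nj / total, mu, max(math.sqrt(var), min_sigma)
    return sorted(zip(pis, mus, sigmas), key=lambda c: c[1])


# Synthetic histogram (not real data): three well-separated Gaussian bumps.
truth = [(0.7, 150.0, 10.0), (0.2, 300.0, 15.0), (0.1, 450.0, 20.0)]
synthetic = {x: round(100000 * sum(w * normal_pdf(x, m, s) for w, m, s in truth))
             for x in range(50, 600)}
components = fit_gmm_1d(synthetic, init_means=[140.0, 280.0, 430.0])
first_peak = components[0]  # component with the smallest mean
```

Working on the histogram rather than on individual reads keeps each iteration proportional to the number of distinct insert lengths. EM converges to a local optimum, so the result can depend on the initial means; a library implementation could be substituted where one is available.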
[0081] 2.2 Scoring Model
[0082] Based on the modeling results, the Gaussian distributions are sorted by their $\mu_k$ values in ascending order, and the component with the smallest $\mu_k$ is designated as the first peak, which is usually the main peak. From the mean and variance of the main peak, the 90th percentile of the main peak can be calculated. Whether the sample is at risk of WBC contamination is inferred by judging whether the 90th percentile of the main peak deviates from the expected value. A baseline is established using data from uncontaminated samples from healthy individuals, and a threshold is set based on the baseline results. If the 90th percentile of the main peak of a sample is greater than the threshold set in this invention, the sample is inferred to be very likely contaminated by WBC.
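For a Gaussian component with mean μ and standard deviation σ, the p-th quantile is μ + σ·Φ⁻¹(p), where Φ⁻¹ is the inverse standard normal CDF (about 1.2816 for p = 0.90). The sketch below computes this for the lowest-mean component using Python's standard library. The function name `first_peak_quantile`, the `(weight, mean, sigma)` tuple layout, and the example component values are hypothetical and chosen only to illustrate the calculation.

```python
from statistics import NormalDist
from typing import List, Tuple


def first_peak_quantile(components: List[Tuple[float, float, float]], p: float = 0.90) -> float:
    """Return the p-quantile of the component with the smallest mean.

    components: (weight, mean, sigma) triples in any order.
    """
    _, mu, sigma = min(components, key=lambda c: c[1])
    return NormalDist(mu, sigma).inv_cdf(p)


# Hypothetical fitted components (not taken from the application's data).
fitted = [(0.2, 300.0, 30.0), (0.7, 150.0, 20.0), (0.1, 450.0, 40.0)]
q90 = first_peak_quantile(fitted, 0.90)  # 150 + 20 * 1.2816, about 175.63
q95 = first_peak_quantile(fitted, 0.95)  # 150 + 20 * 1.6449, about 182.90
```

Because the quantile is derived from the fitted mean and variance of the first peak rather than from the raw histogram, reads assigned to the higher-mean components do not shift it directly.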
[0083] Example 2: Simulated Data
[0084] WBC contamination of cfDNA sequencing results was artificially simulated. A healthy volunteer sample with a median insert length of 147 bp was selected as the baseline sample. An unfragmented, bisulfite-treated human reference sample, NA12878, with a median insert length of 204 bp was selected to simulate the admixture of long reads caused by WBC contamination. Its reads were mixed into the baseline sample by read count at a series of ratios ranging from 0.01% to 20%, and each ratio was simulated five times. The insert length distribution at a ratio of 20% is shown in Figure 1B; relative to the background sample shown in Figure 1A, the mixed sample contains more large fragments and shows obvious tailing.
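A mixing step of this kind can be approximated on insert length histograms. The sketch below is a simplified illustration, not the procedure used in the application: it scales the contaminant histogram deterministically instead of randomly sampling reads, it assumes the ratio means the contaminant's share of the final mixture, and the function name `mix_histograms` and the toy counts are hypothetical.

```python
from typing import Dict


def mix_histograms(base: Dict[int, int], contaminant: Dict[int, int],
                   fraction: float) -> Dict[int, float]:
    """Blend two {insert_size: count} histograms so that `fraction` of the
    mixed reads come from `contaminant` (all base reads are kept)."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    n_base = sum(base.values())
    n_cont = sum(contaminant.values())
    # Scale the contaminant so that it makes up `fraction` of the mixed total.
    scale = fraction / (1.0 - fraction) * n_base / n_cont
    mixed = {size: float(count) for size, count in base.items()}
    for size, count in contaminant.items():
        mixed[size] = mixed.get(size, 0.0) + scale * count
    return mixed


# Toy counts (not real data): 80 baseline reads and a 20% long-fragment admixture.
mixed = mix_histograms({147: 80}, {204: 10}, fraction=0.20)
share = mixed[204] / sum(mixed.values())
```

Random subsampling of reads, repeated several times per ratio, would additionally capture sampling noise, which matters most at the lowest ratios.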
[0085] The insert length results are shown in Table 1, which lists the insert length at different percentiles and different spike-in ratios. At lower percentiles, the insert length does not clearly reflect changes in the spike-in ratio. Above the 80th percentile, the insert length increases with increasing spike-in ratio. At the 90th percentile, WBC contamination of 10% or higher can be clearly distinguished from the 90th percentile of the background sample; therefore, the 90th percentile was selected as the final statistical measure. Based on the simulation results, the cutoff for the 90th percentile was set to 180.
[0086] Table 1. Spike-in gradient results for simulated data
[0087]
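To compare several percentile levels for a given histogram, an empirical percentile can be read directly from the cumulative read counts. The sketch below shows one simple convention (the smallest insert length whose cumulative count reaches the requested share). It is an assumption for illustration: the application does not state which percentile convention was used for its tables, and the function name `histogram_percentile` and the toy histogram are hypothetical.

```python
from typing import Dict


def histogram_percentile(hist: Dict[int, float], q: float) -> int:
    """Smallest insert size whose cumulative count reaches q percent of all reads."""
    if not 0.0 < q <= 100.0:
        raise ValueError("q must be in (0, 100]")
    total = sum(hist.values())
    target = total * q / 100.0
    running = 0.0
    for size in sorted(hist):
        running += hist[size]
        if running >= target:
            return size
    return max(hist)


# Toy histogram (not real data) with a small long-fragment tail.
toy = {140: 50, 150: 30, 180: 15, 220: 5}
levels = {q: histogram_percentile(toy, q) for q in (50, 80, 90, 99)}
```

In this toy case the median stays at the short-fragment mode while the upper percentiles move into the tail, which mirrors the qualitative observation that higher percentiles are more responsive to long-fragment admixture.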
[0088] Example 3: Experimental Verification
[0089] To verify the feasibility of this methodology, a real-world gradient spike-in experiment was designed. During library construction, DNA extracted from WBCs was spiked into the blood of healthy individuals, creating four contamination gradients with spike-in ratios of 10%, 25%, and 50%. The spike-in gradients were constructed based on the cfDNA mass ratio. Sequencing and insert length analysis were performed on the spiked samples, baseline healthy samples, and WBC samples. Each gradient was replicated twice. As shown in Table 2 below, 0% represents the unspiked baseline healthy samples, and 100% represents the WBC samples. The statistical values are the average of the two replicates.
[0090] Table 2 shows a significant difference in insert length between healthy individuals and WBCs. The 95th and 90th percentiles of insert length in healthy individuals are 189.6 and 178.74, respectively, while the 95th and 90th percentiles for WBCs are 216.63 and 199.78, respectively. When the spike-in ratio is 10%, the insert length increases slightly. However, when the spike-in ratio exceeds 25%, the values change significantly and exceed the set thresholds of 180 (90th percentile) or 190 (95th percentile). In conclusion, the experimental results suggest that the overall WBC contamination monitoring strategy is feasible and effective at contamination levels exceeding 25%.
[0091] Table 2. Spike-in gradient results for real data
[0092]
[0093]
Claims
1. A method for detecting genomic DNA contamination in a cell-free DNA sample, comprising: (1) obtaining the insert length of all reads in the sequencing results; (2) constructing a cluster analysis model based on the insert lengths obtained in step (1) and obtaining the cluster analysis results; (3) performing discriminant analysis based on the cluster analysis results obtained in step (2) to determine whether there is genomic DNA contamination in the cell-free DNA sample; wherein in step (2), the clustering analysis model is a Gaussian mixture model composed of K single Gaussian models, and K is set to 3; the probability density function of the Gaussian mixture model is as follows: $P(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \varphi(x \mid \theta_k)$, where $0 \le \pi_k \le 1$, $\varphi(x \mid \theta_k)$ is the Gaussian density function of the k-th component, $\theta_k = (\mu_k, \sigma_k^2)$, and $\pi_k$ is regarded as the weight of each Gaussian distribution; the discriminant analysis in step (3) includes: sorting the Gaussian distributions included in the clustering analysis model constructed in step (2) by their expected values $\mu_k$ in ascending order, and, based on the sorting result, setting the component with the smallest expected value $\mu_k$ as the first peak; the discriminant analysis in step (3) further includes: calculating the 90th or 95th percentile of the first peak, and determining whether the cell-free DNA sample is contaminated by genomic DNA by judging whether the 90th or 95th percentile of the first peak deviates from the preset threshold.
2. The method according to claim 1, wherein the discriminant analysis in step (3) includes calculating the 90th percentile of the first peak and determining whether the cell-free DNA sample is contaminated by genomic DNA by judging whether the 90th percentile of the first peak deviates from the preset threshold.
3. The method according to claim 1, wherein the preset threshold is selected from 180 to 190.
4. The method according to claim 3, wherein the threshold is 180.
5. The method according to claim 3, wherein the discriminant analysis in step (3) includes: calculating the 90th or 95th percentile of the first peak; determining whether the 90th or 95th percentile of the first peak is higher than 180; and, if the 90th or 95th percentile of the first peak is higher than 180, determining that the cell-free DNA sample is contaminated with genomic DNA.
6. The method according to claim 5, wherein the discriminant analysis in step (3) includes: calculating the 90th percentile of the first peak; determining whether the 90th percentile of the first peak is higher than 180; and, if the 90th percentile of the first peak is higher than 180, determining that the cell-free DNA sample is contaminated with genomic DNA.
7. The method according to claim 5, wherein the discriminant analysis in step (3) includes: calculating the 95th percentile of the first peak; determining whether the 95th percentile of the first peak is higher than 190; and, if the 95th percentile of the first peak is higher than 190, determining that the cell-free DNA sample is contaminated with genomic DNA.
8. The method according to any one of claims 1-7, wherein the genomic DNA contamination originates from ruptured white blood cells in the blood sample.
9. An apparatus for detecting genomic DNA contamination in cell-free DNA samples, comprising: a read length acquisition module configured to acquire the insert length of all reads in the sequencing results; an analysis module configured to construct a clustering analysis model based on the insert lengths obtained by the read length acquisition module and to obtain the clustering analysis results; a determination module configured to perform discriminant analysis based on the clustering analysis results obtained by the analysis module, thereby determining whether there is genomic DNA contamination in the cell-free DNA sample; wherein the analysis module includes a clustering unit configured as a Gaussian mixture model composed of K single Gaussian models, where K is set to 3; the probability density function of the Gaussian mixture model is as follows: $P(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \varphi(x \mid \theta_k)$, where $0 \le \pi_k \le 1$, $\varphi(x \mid \theta_k)$ is the Gaussian density function of the k-th component, $\theta_k = (\mu_k, \sigma_k^2)$, and $\pi_k$ is regarded as the weight of each Gaussian distribution; the determination module includes a sorting unit configured to sort the Gaussian distributions included in the clustering analysis model by their expected values $\mu_k$ in ascending order and, based on the sorting result, to set the component with the smallest expected value $\mu_k$ as the first peak; the determination module includes a contamination determination unit configured to calculate the 90th or 95th percentile of the first peak and to determine whether the cell-free DNA sample is contaminated by genomic DNA by determining whether the 90th or 95th percentile of the first peak deviates from a preset threshold.
10. The apparatus according to claim 9, wherein the contamination determination unit is further configured to calculate the 90th percentile of the first peak, and to determine whether the cell-free DNA sample is contaminated by genomic DNA by determining whether the 90th percentile of the first peak deviates from a preset threshold.
11. The apparatus according to claim 9, wherein the contamination determination unit is further configured to calculate the 90th percentile of the first peak, and to determine whether the 90th percentile of the first peak is higher than 180; if the 90th percentile of the first peak is higher than 180, the cell-free DNA sample is determined to be contaminated with genomic DNA.
12. The apparatus according to claim 9, wherein the contamination determination unit is further configured to calculate the 95th percentile of the first peak, and to determine whether the 95th percentile of the first peak is higher than 190; if the 95th percentile of the first peak is higher than 190, the cell-free DNA sample is determined to be contaminated with genomic DNA.
13. A device for detecting genomic DNA contamination in cell-free DNA samples, comprising: one or more processors; and a storage device on which one or more programs are stored; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any one of claims 1-8.
14. A computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by one or more processors, it implements the method as described in any one of claims 1-8.