Processing method and device for CNV interpretation, electronic equipment and storage medium
By calculating the probe region information and target function of the sample to be tested, and combining it with the sample standard set, the problem of distinguishing depth changes and noise in CNV samples with low tumor content is solved, thereby improving the detection rate and interpretation accuracy of CNV.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- 3D BIOMEDICINE SCI & TECH CO LTD
- Filing Date
- 2022-01-28
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies struggle to effectively distinguish depth variations and noise in CNV samples with low tumor content, resulting in low CNV detection rates.
By acquiring probe region information of the sample to be tested, tumor sequence count assessment information is calculated using pre-defined specified relationships, including the number of sequences, tumor cells and normal cells within the corresponding probe region. CNV interpretation is then performed by combining the objective function and the sample standard set.
It improves the detection rate of CNVs, especially in cases with low tumor content, and can significantly and stably depict the length characteristics of cfDNA fragments, thus improving the accuracy and sensitivity of interpretation.
Figure CN116137178B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of gene detection, and more particularly to a processing method, apparatus, electronic device, and storage medium for CNV interpretation.
Background Technology
[0002] CNV stands for Copy Number Variations, which can be understood as the amplification or deletion of DNA fragments longer than 1kb. CNVs often lead to significant impacts such as birth defects and cancer, and research on CNVs can assist in the study of disease development, diagnosis, and treatment.
[0003] In existing related technologies, CNVs of a gene / region are usually interpreted based on probe coverage depth. However, in CNV samples with low tumor content, this approach has difficulty distinguishing depth changes caused by tumor copy number changes from noise.
Summary of the Invention
[0004] This invention provides a CNV interpretation processing method, apparatus, electronic device, and storage medium to solve the problem of low CNV detection rate.
[0005] According to a first aspect of the present invention, a method for CNV interpretation is provided, comprising:
[0006] Obtain the sample to be tested;
[0007] By substituting the information of each probe region in the sample to be tested into a pre-defined specified relationship, tumor sequence count assessment information of each probe region in the sample to be tested is obtained; the specified relationship characterizes the quantitative relationship between the tumor sequence count assessment information of the corresponding probe region and the following information:
[0008] The number of sequences in each box within the corresponding probe area;
[0009] The number of tumor cell sequences in each box within the corresponding probe area;
[0010] The number of normal cell sequences in each box within the corresponding probe region;
[0011] Based on the tumor sequence count assessment information of some or all of the test samples, determine whether the corresponding test sample is a positive variant of CNV.
[0012] Optionally, the function value of the objective function is related in a first manner to the insertion sequence size of normal cells in the corresponding bin of the corresponding probe region and the insertion sequence size of tumor cells in the corresponding bin of the corresponding probe region; the function value of the objective function is related in a second manner to the number of sequences in the corresponding bin of the corresponding probe region; if the first manner is positively correlated, then the second manner is negatively correlated, and if the first manner is negatively correlated, then the second manner is positively correlated.
[0013] By substituting the information of each probe region in the sample to be tested into a pre-defined specified relationship, tumor sequence count assessment information of each probe region in the sample to be tested is obtained, including:
[0014] For any probe region of the sample to be tested, the information of any probe region is substituted into the objective function to calculate the function value of the objective function for each bin of any probe region;
[0015] Based on the function values of each bin in any probe region, tumor sequence counting assessment information for any probe region is determined.
[0016] Optionally, based on the function values of each bin in any probe region, tumor sequence counting assessment information for that probe region is determined, including:
[0017] Among the function values of all bins in any probe region, the smallest function value is selected as the tumor sequence count evaluation information for that probe region.
[0018] Optionally, the objective function is characterized as:
[0019] $X = \| c - n_{normal} f_{normal-d} - n_{tumor} f_{tumor-d} \|_2$;
[0020] where:
[0021] X is the function value of the objective function;
[0022] c is the number of sequences in the corresponding box within the corresponding probe region;
[0023] n_normal is the number of sequences of normal cells in the corresponding box within the corresponding probe region;
[0024] n_tumor is the number of sequences of tumor cells in the corresponding box within the corresponding probe region;
[0025] f_normal-d is the pre-calibrated insertion sequence size of normal cells in the corresponding box within the corresponding probe region;
[0026] f_tumor-d is the pre-calibrated insertion sequence size of tumor cells in the corresponding box within the corresponding probe region.
[0027] Optionally, the CNV interpretation processing method further includes:
[0028] Obtain a standard set of samples; the standard set of samples includes samples of normal cells and samples of tumor cells.
[0029] For the normal cell samples of each human body in the aforementioned sample standard set, the insertion sequence size distribution information of each bin in each probe region is calculated, and f_normal-d is determined based on the insertion sequence size distribution information of each sample;
[0030] For the tumor cell samples of each human body in the aforementioned sample standard set, the insertion sequence size distribution information of each bin in each probe region is calculated, and f_tumor-d is determined based on the insertion sequence size distribution information of each sample.
[0031] Optionally, based on tumor sequence count assessment information corresponding to some or all of the test samples, determine whether the corresponding test sample is a positive variant of CNV, including:
[0032] The tumor sequence count assessment information of the sample to be tested is denoised;
[0033] Samples that do not meet the quality requirements are screened out to determine the remaining samples to be tested;
[0034] Based on the tumor sequence count assessment information of the target gene region of the remaining test samples, it is determined whether the remaining test samples are positive variants of CNV.
[0035] Optionally, the tumor sequence count assessment information of the sample to be tested is denoised, including:
[0036] Obtain variant samples of CNV-positive variants that are not located in the target gene region;
[0037] Calculate the tumor sequence count assessment information of the variant sample;
[0038] SVD dimensionality reduction was performed on the tumor sequence count assessment information of the variant samples;
[0039] Based on the principal component regression analysis results in the SVD dimensionality reduction, the tumor sequence count assessment information of the test sample is denoised.
[0040] Optionally, samples that do not meet the quality requirements are screened out to determine the remaining samples to be tested, including:
[0041] Calculate the correlation coefficient between the regression analysis results and the tumor sequence count assessment information of the sample to be tested;
[0042] The correlation coefficient is compared with the quality control threshold, and the test samples with a correlation coefficient less than the quality control threshold are screened out to obtain the remaining test samples.
[0043] Optionally, based on tumor sequence count assessment information of some or all of the test samples, determine whether the corresponding test sample is a positive variant of CNV, including:
[0044] Calculate the first statistical value of the tumor sequence count assessment information of the target gene region of the remaining test samples and the second statistical value of the tumor sequence count assessment information of the target gene region of the negative samples, and compare the difference between the first statistical value and the second statistical value.
[0045] The differences are compared with the judgment threshold, and based on the comparison results, it is determined whether the corresponding test sample has positive CNV lesions.
[0046] Optionally, the first statistical value is the first T-statistic, and the second statistical value is the second T-statistic.
[0047] According to a second aspect of the present invention, a CNV interpretation processing apparatus is provided, comprising:
[0048] The test sample acquisition module is used to acquire the test sample;
[0049] The calculation module is used to substitute the information of each probe region in the sample to be tested into a pre-defined specified relationship to obtain the tumor sequence count assessment information of each probe region in the sample to be tested; the specified relationship represents the quantitative relationship between the tumor sequence count assessment information of each probe region and the following information:
[0050] The number of sequences in each box within the corresponding probe area;
[0051] The number of tumor cell sequences in each box within the corresponding probe area;
[0052] The number of normal cell sequences in each box within the corresponding probe region;
[0053] The interpretation module is used to interpret whether the corresponding test sample is a CNV positive variant based on the tumor sequence count assessment information of some or all of the test samples.
[0054] According to a third aspect of the present invention, an electronic device is provided, comprising a processor and a memory.
[0055] The memory is used to store code;
[0056] The processor is configured to execute code in the memory to implement the method relating to the first aspect and its alternatives.
[0057] According to a fourth aspect of the invention, a storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the methods involved in the first aspect and its alternatives.
[0058] The CNV interpretation processing method, apparatus, electronic device, and storage medium provided by this invention, targeting the characteristics of short cfDNA fragments generated by tumors, realizes CNV interpretation based on tumor sequence counting assessment information. This tumor sequence counting assessment information reflects the number of tumor cells and normal cell sequences in the corresponding probe region, and a specified relationship (e.g., their respective size distributions) has been defined. Therefore, even for samples with low tumor content, because this specified relationship is definite, the calculated tumor sequence counting assessment information can still significantly and stably depict characteristics such as "short cfDNA fragments generated by tumors." Thus, this invention can still achieve a good CNV detection rate even for samples with low tumor content.
[0059] In particular, in a further scheme, the tumor sequence count assessment information of a probe region is represented as the minimum value of $\| c - n_{normal} f_{normal-d} - n_{tumor} f_{tumor-d} \|_2$ over the bins of that probe region. Using the minimum value to reflect the length of the tumor sequences in the probe region further improves the CNV detection rate.
Attached Figure Description
[0060] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0061] Figure 1 is a flowchart illustrating a CNV interpretation processing method in one embodiment of the present invention;
[0062] Figure 2 is a flowchart illustrating step S12 in one embodiment of the present invention;
[0063] Figure 3 is a flowchart illustrating the CNV interpretation processing method in another embodiment of the present invention;
[0064] Figure 4 is a flowchart illustrating step S13 in one embodiment of the present invention;
[0065] Figure 5 is a flowchart illustrating step S131 in one embodiment of the present invention;
[0066] Figure 6 is a flowchart illustrating step S132 in one embodiment of the present invention;
[0067] Figure 7 is a flowchart illustrating step S133 in one embodiment of the present invention;
[0068] Figure 8 is a schematic diagram of the program modules of the CNV interpretation processing device in one embodiment of the present invention;
[0069] Figure 9 is a schematic diagram of the program modules of the CNV interpretation processing device in another embodiment of the present invention;
[0070] Figure 10 is a schematic diagram of the structure of an electronic device according to an embodiment of the present invention.
Detailed Implementation
[0071] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0072] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0073] The technical solution of the present invention will be described in detail below with reference to specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.
[0074] Referring to Figure 1, an embodiment of this invention provides a CNV interpretation processing method, including:
[0075] S11: Obtain the sample to be tested;
[0076] S12: Substitute the information of each probe region in the sample to be tested into the pre-calibrated specified relationship to obtain the tumor sequence count assessment information of each probe region in the sample to be tested;
[0077] S13: Based on the tumor sequence count assessment information of some or all of the test samples, determine whether the corresponding test sample is a positive variant of CNV.
[0078] The sample to be tested can be understood as the gene sequence to be tested. For example, it can be a partial or complete gene sequence extracted for a specific object (such as a human). Specifically, the sample to be tested can have undergone other preprocessing, such as (but not limited to) noise reduction, filtering, etc. In some examples, it can also be unprocessed.
[0079] The probe region can be understood as a probe.
[0080] The box mentioned here can be understood as a bin;
[0081] The sequence can be understood as reads or read;
[0082] The specified relationship characterizes the quantitative relationship between tumor sequence count assessment information of the corresponding probe region and the following information:
[0083] The number of sequences in each box within the corresponding probe area;
[0084] The number of tumor cell sequences in each box within the corresponding probe area;
[0085] The corresponding number of normal cell sequences in each box within the probe region; correspondingly, the tumor sequence count assessment information can be understood as any information that can describe the tumor sequence count results in the probe region and form the above-specified relationship with the number of sequences in each box, the number of tumor cell sequences in each box, and the number of normal cell sequences in each box.
[0086] Furthermore, the tumor sequence count assessment information reflects the number of tumor cell sequences and normal cell sequences in the corresponding probe region, and the specified relationship (such as their respective size distributions) has been defined in advance. Therefore, even for samples with low tumor content, because the specified relationship is fixed, the calculated tumor sequence count assessment information can still depict characteristics such as "shorter length of cfDNA fragments generated by tumors" significantly and stably; the characteristic does not become impossible (or hard) to depict merely because the tumor content is reduced.
[0087] It is evident that this invention can still achieve a good CNV detection rate even for samples with low tumor content.
[0088] In a specific example, since the specified relationship is calibrated on standard samples and remains unchanged afterwards, even samples with low tumor content still conform to the standard sample situation. The specified relationship is not affected (or is only minimally affected) by the reduction in tumor content. Furthermore, because the cfDNA fragments generated by tumors are relatively short, for samples with low tumor content the above process can assign a relatively higher weight to short fragments when calculating depth (higher than the weight implied by the actual low content), thereby increasing the difference between regions with copy number variation and regions without variation and improving the detection rate.
[0089] The specified relationship can be characterized by a combination of one or more functions. For example, the function value of the objective function is related in a first manner to the insertion sequence size of normal cells in the corresponding bin of the corresponding probe region and the insertion sequence size of tumor cells in the corresponding bin of the corresponding probe region. The function value of the objective function is also related in a second manner to the number of sequences in the corresponding bin of the corresponding probe region. If the first manner is positively correlated, then the second manner is negatively correlated; if the first manner is negatively correlated, then the second manner is positively correlated. Thus, the function value of the objective function can characterize the distribution of the insertion sequence size of normal cells and tumor cells in a bin of the probe region.
[0090] Referring to Figure 2, step S12 may include:
[0091] S121: For any probe region of the sample to be tested, substitute the information of any probe region into the objective function, and calculate the function value of the objective function for each bin of any probe region;
[0092] S122: Based on the function values of each bin in any probe region, determine the tumor sequence count assessment information for any probe region.
[0093] In one embodiment, the objective function is characterized as follows:
[0094] $X = \| c - n_{normal} f_{normal-d} - n_{tumor} f_{tumor-d} \|_2$;
[0095] where:
[0096] X is the function value of the objective function;
[0097] c represents the number of sequences in the corresponding box within the corresponding probe region;
[0098] n_normal represents the number of sequences of normal cells in the corresponding box within the corresponding probe region;
[0099] n_tumor represents the number of sequences of tumor cells in the corresponding box within the corresponding probe region;
[0100] f_normal-d represents the pre-calibrated insertion sequence size of normal cells in the corresponding box within the corresponding probe region; it can be calibrated manually based on experience or based on a sample standard set;
[0101] f_tumor-d represents the pre-calibrated insertion sequence size of tumor cells in the corresponding box within the corresponding probe region; it can be calibrated manually based on experience or based on a sample standard set.
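The objective function above can be evaluated for a single bin as in the following minimal sketch. It assumes, for illustration only, that c, f_normal-d and f_tumor-d are vectors over the same set of insert-size classes within the bin, and that n_normal and n_tumor are scalars; the function name `objective_value` and the numbers are hypothetical and do not come from the source.

```python
import math


def objective_value(c, n_normal, n_tumor, f_normal_d, f_tumor_d):
    """Return X = || c - n_normal * f_normal_d - n_tumor * f_tumor_d ||_2.

    Assumption: all three vectors are indexed by the same insert-size
    classes of one bin; n_normal and n_tumor are scalar read counts.
    """
    if not (len(c) == len(f_normal_d) == len(f_tumor_d)):
        raise ValueError("c, f_normal_d and f_tumor_d must have equal length")
    residual = [
        ci - n_normal * fn - n_tumor * ft
        for ci, fn, ft in zip(c, f_normal_d, f_tumor_d)
    ]
    return math.sqrt(sum(r * r for r in residual))


# Hypothetical bin with three insert-size classes (short, medium, long).
f_normal_d = [0.2, 0.5, 0.3]   # assumed calibrated normal distribution
f_tumor_d = [0.6, 0.3, 0.1]    # assumed calibrated tumor distribution
c = [24.0, 48.0, 28.0]         # observed counts: 90 normal + 10 tumor reads

x_mixed = objective_value(c, 90, 10, f_normal_d, f_tumor_d)
x_normal_only = objective_value(c, 100, 0, f_normal_d, f_tumor_d)
```

In this toy case the mixture of 90 normal and 10 tumor reads explains the counts almost exactly, while a purely normal explanation leaves a visibly larger residual.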
[0102] The insertion sequence size can be understood as the Insert size. Correspondingly, the insertion sequence sizes can be combined to form the insertion sequence size distribution information, which can be represented as the Insert size distribution.
[0103] In the above implementation, the difference between c and n_normal f_normal-d + n_tumor f_tumor-d is calculated. In other examples, the ratio of c to n_normal f_normal-d + n_tumor f_tumor-d, or the ratio of n_normal f_normal-d + n_tumor f_tumor-d to c, can be calculated instead.
[0104] Furthermore, in a specific example, the tumor sequence count assessment information can be denoted Tumor reads (n_tumor), or simply Tumor reads.
[0105] The above is just one example of calculating tumor sequence count assessment information. In other examples, the objective function can also be constructed using the 1-norm, a weighted 2-norm, the KL distance, or the Wasserstein distance.
[0106] In one example of step S122, when the first manner is negatively correlated and the second manner is positively correlated, the smallest function value among the function values of all bins in a probe region can be selected as the tumor sequence count assessment information of that probe region. In this case, the specified relationship also includes a minimum value function that implements the selection of the smallest function value.
[0107] In another example, if the first manner is positively correlated and the second manner is negatively correlated, the largest function value among the function values of all bins in a probe region can be selected as the tumor sequence count assessment information of that probe region. In this case, the specified relationship also includes a maximum value function that implements the selection of the largest function value.
[0108] In other examples, the statistical values (e.g., mean, median, etc.) of the function values of all bins in any probe region can also be used as the tumor sequence count assessment information for any probe region.
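The three selection options for step S122 (minimum, maximum, or a statistic such as the mean or median) can be sketched as a single helper. The function name `probe_assessment` and the `mode` parameter are illustrative assumptions, not part of the source.

```python
import statistics


def probe_assessment(bin_values, mode="min"):
    """Collapse per-bin objective values of one probe region into one value."""
    if not bin_values:
        raise ValueError("a probe region needs at least one bin value")
    if mode == "min":
        return min(bin_values)
    if mode == "max":
        return max(bin_values)
    if mode == "median":
        return statistics.median(bin_values)
    if mode == "mean":
        return statistics.fmean(bin_values)
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical objective values for the three bins of one probe region.
values = [4.9, 1.2, 3.3]
smallest = probe_assessment(values)             # minimum-value variant
largest = probe_assessment(values, "max")       # maximum-value variant
middle = probe_assessment(values, "median")     # statistic-based variant
```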
[0109] In one implementation in which f_normal-d and f_tumor-d are calibrated using the sample standard set, referring to Figure 3, the CNV interpretation processing method further includes:
[0110] S14: Obtain the sample standard set;
[0111] S15: For the normal cell samples of each human body in the sample standard set, calculate the insertion sequence size distribution information of each bin in each probe region, and determine f_normal-d based on the insertion sequence size distribution information of each sample;
[0112] S16: For the tumor cell samples of each human body in the sample standard set, calculate the insertion sequence size distribution information of each bin in each probe region, and determine f_tumor-d based on the insertion sequence size distribution information of each sample.
[0113] The standard set of samples can include samples of normal cells and samples of tumor cells;
[0114] In a specific example, for a standard sample set, non-negative matrix decomposition can be used to distinguish the insert size distribution of cfDNA from normal cells (i.e., samples from normal cells) and ctDNA from tumor cells (i.e., samples from tumor cells).
[0115] For example, based on cfDNA samples from multiple healthy individuals, the insert size distribution of reads that intersect with each probe region can be extracted (i.e., the insertion sequence size distribution information of normal cells).
[0116] Then, for each healthy human sample, a vector F_normal = (f_normal(1), …, f_normal(m)) can be formed. For the i-th bin, the median of the f_normal(i) values across the samples is taken as f_normal-d(i);
[0117] It can be seen that the process of determining f_normal-d based on the insertion sequence size distribution information of each sample can be, for example, as follows: for any given bin, take a statistical value (e.g., median, mean, etc.) of the insert size of all human samples of normal cells in that bin as f_normal-d.
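The per-bin median described above can be sketched as follows. For simplicity this sketch treats each f_normal(i) as a single number per bin (for example a representative insert size); if each bin instead carries a full insert size histogram, the same median could be taken element by element. The function name and the values are hypothetical.

```python
import statistics


def calibrate_per_bin(samples):
    """Return the per-bin median across samples.

    samples: list of per-sample lists F = (f(1), ..., f(m)), one value per bin.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    m = len(samples[0])
    if any(len(sample) != m for sample in samples):
        raise ValueError("every sample must cover the same m bins")
    return [statistics.median(sample[i] for sample in samples) for i in range(m)]


# Hypothetical insert sizes (bp) for three healthy samples over three bins.
healthy = [
    [166, 168, 165],
    [167, 170, 164],
    [165, 169, 166],
]
f_normal_d = calibrate_per_bin(healthy)
```

The median is used here because the source names it first; swapping in the mean only requires replacing `statistics.median` with `statistics.fmean`.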
[0118] On this basis, f_tumor-d can be calculated in the same way as f_normal-d above, or in another way. For example, f_normal-d can also be used to estimate the insert size distribution of ctDNA in multiple LBP non-CNV tumor samples, from which the corresponding f_tumor-d is calculated.
[0119] In one example, let f_i denote the sequence size distribution in the i-th sample. f_tumor-d can then be determined by minimizing a specific function, whose function value is denoted Y. The specific function can also be replaced with a function constructed using the 1-norm, a weighted 2-norm, the KL distance, the Wasserstein distance, etc.
[0120] Here, a_i and f_tumor-d can be changed iteratively to find the f_tumor-d that minimizes the value of the specific function; this is the f_tumor-d required in step S16.
[0121] After obtaining the tumor sequence count assessment information in step S12, since the tumor sequence count assessment information will inevitably differ depending on whether a CNV positive variant has occurred or not, any statistical method can be used to process the tumor sequence count assessment information to determine whether the corresponding sample to be tested has a CNV positive variant. Regardless of the method used to implement step S13, it does not depart from the scope of the embodiments of the present invention.
[0122] The embodiment shown in Figure 4 illustrates one implementation process of step S13, but the actual solution is not limited to this.
[0123] Referring to Figure 4, step S13 may include:
[0124] S131: Denoise the tumor sequence count assessment information of the sample to be tested;
[0125] S132: Screen out the test samples that do not meet the quality requirements and determine the remaining test samples;
[0126] S133: Based on the tumor sequence count assessment information of the target gene region of the remaining test samples, determine whether the remaining test samples are positive variants of CNV.
[0127] The target gene region can refer to any gene region of interest, such as the ERBB2 region.
[0128] In the above process, noise reduction and screening of low-quality samples can avoid the influence of noise and low-quality samples on the interpretation results, effectively improving the accuracy of the interpretation. Furthermore, any noise reduction method in this field can be used as an option for step S131 above, and any low-quality sample screening method in this field can be used as an option for step S132 above.
[0129] In some examples of embodiments of the present invention, the above noise reduction and / or filtering may not be implemented.
[0130] In one implementation, referring to Figure 5, step S131 may include:
[0131] S1311: Obtain variant samples of CNV-positive variants that are not located in the target gene region;
[0132] S1312: Calculate the tumor sequence count assessment information of the variant sample;
[0133] S1313: Perform SVD dimensionality reduction on the tumor sequence count assessment information of the variant sample;
[0134] S1314: Based on the regression analysis results of principal components in the SVD dimensionality reduction, the tumor sequence count assessment information of the test sample is denoised.
[0135] The process of calculating the tumor sequence count assessment information of the variant sample in step S1312 can be understood by referring to the method of calculating the tumor sequence count assessment information in the previous text, and will not be repeated here.
[0136] SVD stands for Singular Value Decomposition. SVD dimensionality reduction can therefore be understood as a way of achieving dimensionality reduction through principal component analysis.
[0137] In a specific example, 150 samples with CNV-positive variants in non-target gene regions can be taken, and the tumor reads of each probe calculated. SVD dimensionality reduction (one way of performing PCA dimensionality reduction) is then applied, and the principal components are treated as background noise. The robust regression result on the principal components is subtracted from the tumor reads of the samples with CNV-positive variants in non-target gene regions to obtain the denoised tumor reads.
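The denoising idea in steps S1311 to S1314 can be sketched as below, assuming NumPy is available. Several simplifications are assumptions of this sketch rather than statements from the source: ordinary least squares stands in for the robust regression mentioned above, the number of retained components `n_components` is a free parameter, and the function name and data are hypothetical.

```python
import numpy as np


def denoise_tumor_reads(reference, sample, n_components=2):
    """Remove the background captured by the top principal components.

    reference: (n_samples, n_probes) tumor reads of the variant samples.
    sample:    (n_probes,) tumor reads of the sample to be denoised.
    Returns (denoised, background), both centered on the reference mean.
    """
    reference = np.asarray(reference, dtype=float)
    sample = np.asarray(sample, dtype=float)
    mean = reference.mean(axis=0)
    # Rows of vt are the principal directions across probes.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    components = vt[:n_components].T              # shape (n_probes, k)
    centered = sample - mean
    # Ordinary least squares fit of the sample onto the components
    # (assumption: a robust regression would replace this call).
    coef, *_ = np.linalg.lstsq(components, centered, rcond=None)
    background = components @ coef
    return centered - background, background


# Toy data: the reference varies along one noise direction v only.
v = np.array([1.0, 1.0, -1.0, -1.0]) / 2.0
base = np.array([10.0, 12.0, 9.0, 11.0])
reference = np.array([base + a * v for a in (-2.0, -1.0, 0.0, 1.0, 2.0)])
signal = np.array([0.5, -0.5, 0.0, 0.0])          # orthogonal to v
sample = base + 3.0 * v + signal
denoised, background = denoise_tumor_reads(reference, sample, n_components=1)
```

In the toy data the noise direction is recovered from the reference set, so subtracting the fitted background leaves only the component of the sample that the reference set cannot explain.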
[0138] The screening in step S132 can be performed by quality control analysis to exclude low-quality samples caused by factors such as tumor chromosomal instability or changes in experimental conditions. Based on this objective, any screening method can be selected; one optional implementation method is given below.
[0139] In one implementation, referring to Figure 6, step S132 may include:
[0140] S1321: Calculate the correlation coefficient between the regression analysis results and the tumor sequence count assessment information of the sample to be tested;
[0141] S1322: Compare the correlation coefficient with the quality control threshold, and filter out the test samples whose correlation coefficient is less than the quality control threshold to obtain the remaining test samples.
[0142] The quality control threshold can be any preset threshold (it can be designed based on experience or based on statistical results);
[0143] Since the regression analysis results can fully reflect the characteristics of the noise, the smaller the correlation coefficient, the more correlated the sample is with the noise. In the above process, the samples that are highly correlated with the noise, i.e., those whose correlation coefficient is below the quality control threshold, are screened out to ensure the quality of the remaining test samples.
[0144] In a specific example, one can calculate the correlation coefficient between tumor reads and principal components, set a quality control threshold, and filter out samples with correlation coefficients below the threshold.
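The quality control rule of steps S1321 and S1322 can be sketched with a plain Pearson correlation coefficient. The choice of Pearson correlation, the function names, and the threshold of 0.8 are assumptions made for illustration; the source only requires some correlation coefficient and some preset threshold.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def passes_qc(regression_result, tumor_reads, threshold):
    """Keep a sample only if its coefficient is not below the threshold."""
    return pearson(regression_result, tumor_reads) >= threshold


# Hypothetical per-probe values for two test samples.
fit = [1.0, 2.0, 3.0]
kept = passes_qc(fit, [2.0, 4.0, 6.0], threshold=0.8)      # coefficient 1
dropped = passes_qc(fit, [3.0, 2.0, 1.0], threshold=0.8)   # coefficient -1
```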
[0145] In another embodiment, the method of checking whether the tumor sequence count assessment information after noise reduction is related to the GC content of the probe can be used to screen out unqualified test samples. In yet another embodiment, the variance of the mean value of the tumor sequence count assessment information of each chromosome can be calculated, and it can be determined whether the variance exceeds the corresponding threshold. Then, by screening out samples with variance exceeding the threshold, unqualified test samples can be screened out.
[0146] In one implementation, referring to Figure 7, step S133 may include:
[0147] S1331: Calculate the first statistical value of the tumor sequence count assessment information of the target gene region of the remaining test samples and the second statistical value of the tumor sequence count assessment information of the target gene region of the negative samples, and compare the difference between the first statistical value and the second statistical value.
[0148] S1332: Compare the difference with the judgment threshold, and based on the comparison result, determine whether the corresponding test sample has a positive CNV lesion.
[0149] The first and second statistical values can be T-statistics (abbreviated as T-stat); in other examples, the p-value of the normal distribution can also be used as the statistical value.
[0150] In addition, in one example using the p-value, a p-value is obtained by comparing the first statistic, corresponding to the test sample, with the second statistic, corresponding to the negative samples. This p-value represents the difference and can be understood as the comparison result. The p-value is then compared with the decision threshold (e.g., 0.05 or 0.01; other values can also be used): a p-value below the threshold means the difference is significant, and therefore that a positive lesion has occurred.
[0151] In some schemes, a p-value is calculated directly from a statistic (e.g., a t-statistic), and a p-value below a given threshold is considered significant. This determines whether the sample itself is significant relative to a given distribution (e.g., a normal distribution). In reality, however, the statistic may deviate from the theoretical distribution. Therefore, the above schemes can adopt a horizontal comparison: the statistic of the test sample (the first statistic) is compared with the statistics of the negative samples (the second statistic) according to the above process to obtain a p-value, which is then used to determine significance.
[0152] In addition, if the p-value is used, it can also be corrected for multiple testing (e.g., Bonferroni correction).
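The horizontal comparison and the multiple-testing correction can be sketched as follows. The text does not specify how the p-value is derived from the negative samples, so fitting a normal distribution to their statistics is an assumption here, as are the function names, gene names and numbers.

```python
from statistics import NormalDist

def horizontal_p_value(test_stat, negative_stats):
    """Upper-tail p-value of a test statistic against the negative samples.

    Assumption: the negative-sample statistics are summarized by a fitted
    normal distribution instead of a theoretical null distribution.
    """
    null = NormalDist.from_samples(negative_stats)
    return 1.0 - null.cdf(test_stat)

def bonferroni(p_values):
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical statistics from negative samples and two tested genes.
negative_stats = [-0.8, 0.3, 1.1, -0.2, 0.6, -1.0, 0.1, 0.4]
test_stats = {"GENE_A": 6.5, "GENE_B": 0.5}

raw = [horizontal_p_value(t, negative_stats) for t in test_stats.values()]
adjusted = bonferroni(raw)
calls = {gene: p < 0.05 for gene, p in zip(test_stats, adjusted)}
```

An empirical rank-based p-value would be a reasonable alternative when many negative samples are available.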
[0153] In step S1332, if the difference is less than the judgment threshold, the corresponding test sample is judged to have no positive lesions; if it is greater than the judgment threshold, the corresponding test sample is judged to have positive lesions.
[0154] In designing this decision threshold, one example is to compare the tumor reads of the target gene region of the sample with the background noise and with the tumor reads of the entire genome, compute the corresponding T-stat, and then determine the decision threshold for each target gene from the T-stat values obtained in this way.
[0155] In addition, samples that fail the quality control in step S132 (i.e., the rejected samples) can still be evaluated. If they are strongly positive samples, they can also be determined to have CNV positive lesions. For example, the judgment can be based on their T-statistic value: if it is greater than a threshold, the sample is determined to have a CNV positive lesion.
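A possible shape for the T-stat computation, the per-gene threshold and the final call is sketched below. The Welch form of the T-statistic, the empirical-quantile rule for the threshold, and the separate `strong_threshold` for rejected samples are assumptions for illustration; the text leaves these choices open.

```python
import math
from statistics import mean, variance

def welch_t_stat(target, background):
    """Welch T-statistic of target-region values against background values."""
    se = math.sqrt(variance(target) / len(target)
                   + variance(background) / len(background))
    return (mean(target) - mean(background)) / se

def threshold_from_negatives(negative_t_stats, quantile=0.99):
    """Empirical quantile of the T-stats observed in negative samples."""
    ordered = sorted(negative_t_stats)
    idx = math.ceil(quantile * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, idx))]

def call_cnv(t_stat, threshold, passed_qc=True, strong_threshold=None):
    """Positive when the T-stat exceeds the applicable threshold."""
    if passed_qc:
        return t_stat > threshold
    # Samples rejected by QC are only called when strongly positive.
    return strong_threshold is not None and t_stat > strong_threshold

# Hypothetical tumor reads for one target gene and for the genome background.
target = [30.0, 32.0, 31.0, 29.0]
background = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 9.0, 11.0]
t = welch_t_stat(target, background)
gene_threshold = threshold_from_negatives([0.4, -1.2, 1.9, 0.7, -0.3])
```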
[0156] The paper published in 2019 with doi:10.3390/genes10110926 (titled: CNV Detection from Circulating Tumor DNA in Late Stage Non-Small Cell Lung Cancer Patients) is used as an example of existing technology. Its workflow includes the following steps: 1. Using the sample's BAM file, the capture regions' BED file, and the human reference genome's FASTA file as input, the average depth of each probe region is calculated. 2. The GC content of each capture region, the length of the overlap between each capture region and other capture regions, and the depth of the healthy human samples used for comparison in each region are calculated, followed by noise reduction. 3. The amplification threshold is determined: the median depth of the noise-reduced sample plus three times the standard deviation of the depth is used as the threshold; if the average depth of the target region is greater than this threshold, the region is considered amplified. The present method differs from that workflow as follows: 1. That workflow uses depth as the feature for detecting copy number changes, while the present method uses tumor reads. 2. The threshold used in that workflow is arbitrary, while the threshold in the present method is derived statistically.
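For comparison, the thresholding rule of the literature workflow (step 3 above) can be written in a few lines. This sketch only restates the rule as described in this text; it is not the cited paper's code, and the use of the sample standard deviation and the example depths are assumptions.

```python
from statistics import median, stdev

def amplification_threshold(denoised_depths):
    """Median depth plus three standard deviations of the depth."""
    return median(denoised_depths) + 3 * stdev(denoised_depths)

def is_amplified(target_mean_depth, denoised_depths):
    """Amplification is called when the target depth exceeds the threshold."""
    return target_mean_depth > amplification_threshold(denoised_depths)

# Hypothetical noise-reduced depths of one sample.
depths = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0]
```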
[0157] Blood samples were collected from 57 patients with advanced gastrointestinal tumors, and all tissue samples underwent IHC testing. With 2 cases IHC negative and the remaining 55 IHC positive (1+ was considered negative; 2+ and above were considered positive):
[0158] If the processing procedure in this literature is used for CNV identification, 35 out of 55 IHC-positive samples were detected, and none of the 2 IHC-negative samples were detected, with a sensitivity of 63% and a specificity of 100%. If the processing procedure of this invention is used for CNV identification, 44 out of 55 IHC-positive samples were detected, and none of the 2 IHC-negative samples were detected, with a sensitivity of 80% and a specificity of 100%. This invention therefore achieves a sensitivity about 17 percentage points higher while maintaining the same high specificity. The results are shown in the table below:
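The sensitivity and specificity figures quoted above follow directly from the reported counts, as the short computation below shows. The function and variable names are illustrative; the counts are those stated in the text. The literature workflow works out to about 63.6%, which the text reports as 63%.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of IHC-positive samples that were detected."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of IHC-negative samples that were not detected."""
    return true_neg / (true_neg + false_pos)

# Literature workflow: 35 of 55 positives detected, 0 of 2 negatives detected.
lit_sens = sensitivity(35, 55 - 35)
lit_spec = specificity(2, 0)

# Present method: 44 of 55 positives detected, 0 of 2 negatives detected.
inv_sens = sensitivity(44, 55 - 44)
inv_spec = specificity(2, 0)
```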
[0164] Referring to Figure 8, this invention also provides a CNV interpretation processing device 2, comprising:
[0165] The test sample acquisition module 21 is used to acquire the test sample;
[0166] Calculation module 22 is used to substitute the information of each probe region in the sample to be tested into a pre-defined specified relationship to obtain the tumor sequence count assessment information of each probe region in the sample to be tested; the specified relationship represents the quantitative relationship between the tumor sequence count assessment information of each probe region and the following information:
[0167] The number of sequences in each bin within the corresponding probe region;
[0168] The number of tumor cell sequences in each bin within the corresponding probe region;
[0169] The number of normal cell sequences in each bin within the corresponding probe region;
[0170] The interpretation module 23 is used to interpret, based on the tumor sequence count assessment information of some or all of the test samples, whether the corresponding test sample is a CNV positive variant.
[0171] Optionally, the specified relationship includes a preset objective function, and:
[0172] The value of the objective function is correlated in a first manner with the insertion sequence size of normal cells in the corresponding bin within the corresponding probe region and the insertion sequence size of tumor cells in the corresponding bin within the corresponding probe region; the value of the objective function is correlated in a second manner with the number of sequences in the corresponding bin within the corresponding probe region; if the first manner is positively correlated, then the second manner is negatively correlated, and if the first manner is negatively correlated, then the second manner is positively correlated.
[0173] The calculation module 22 is specifically used for:
[0174] For any probe region of the sample to be tested, substituting the information of that probe region into the objective function to calculate the function value of the objective function for each bin of that probe region;
[0175] Based on the function values of each bin in that probe region, determining the tumor sequence count assessment information of that probe region.
[0176] Optionally, the calculation module 22 is specifically used for:
[0177] When the first manner is negatively correlated and the second manner is positively correlated, the smallest function value among all bins in a probe region is selected as the tumor sequence count assessment information for that probe region.
[0178] Optionally, the objective function is characterized as:
[0179] $\lVert c - n_{\text{normal}} f_{\text{normal-d}} - n_{\text{tumor}} f_{\text{tumor-d}} \rVert_2$;
[0180] where:
[0181] $c$ is the number of sequences in the corresponding bin within the corresponding probe region;
[0182] $n_{\text{normal}}$ is the number of sequences of normal cells in the corresponding bin within the corresponding probe region;
[0183] $n_{\text{tumor}}$ is the number of sequences of tumor cells in the corresponding bin within the corresponding probe region;
[0184] $f_{\text{normal-d}}$ is the pre-calibrated insertion sequence size of normal cells in the corresponding bin within the corresponding probe region;
[0185] $f_{\text{tumor-d}}$ is the pre-calibrated insertion sequence size of tumor cells in the corresponding bin within the corresponding probe region.
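The objective function and the minimum-over-bins rule can be sketched as below, following the form given in claim 4 (c minus the two weighted terms). Several points are assumptions: the trailing 2 is read as the L2 norm, c and the two f terms are treated as vectors indexed by insertion sequence size d, and all names and numbers are hypothetical.

```python
import math

def objective(c, n_normal, n_tumor, f_normal_d, f_tumor_d):
    """L2 norm of c - n_normal * f_normal_d - n_tumor * f_tumor_d.

    c, f_normal_d and f_tumor_d are equal-length lists with one entry per
    insertion sequence size class; n_normal and n_tumor are scalars.
    """
    return math.sqrt(sum(
        (c_d - n_normal * fn_d - n_tumor * ft_d) ** 2
        for c_d, fn_d, ft_d in zip(c, f_normal_d, f_tumor_d)
    ))

def region_assessment(bins):
    """Smallest objective value among all bins of one probe region."""
    return min(objective(**b) for b in bins)

# Hypothetical probe region with two bins and two size classes.
f_normal = [0.7, 0.3]
f_tumor = [0.2, 0.8]
bins = [
    {"c": [60.0, 40.0], "n_normal": 80, "n_tumor": 20,
     "f_normal_d": f_normal, "f_tumor_d": f_tumor},
    {"c": [50.0, 50.0], "n_normal": 80, "n_tumor": 20,
     "f_normal_d": f_normal, "f_tumor_d": f_tumor},
]
value = region_assessment(bins)
```

If the norm is in fact squared, the same bin attains the minimum and only the reported value changes.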
[0186] Optionally, referring to Figure 9, the CNV interpretation processing device 2 also includes:
[0187] The standard sample acquisition module 24 is used to acquire a standard sample set; the standard sample set includes samples of normal cells and samples of tumor cells.
[0188] The $f_{\text{normal-d}}$ calculation module 25 is used to calculate, for each human sample of normal cells in the sample standard set, the insertion sequence size distribution information of each bin in each probe region, and to determine $f_{\text{normal-d}}$ based on the insertion sequence size distribution information of each sample;
[0189] The $f_{\text{tumor-d}}$ calculation module 26 is used to calculate, for each human sample of tumor cells in the sample standard set, the insertion sequence size distribution information of each bin in each probe region, and to determine $f_{\text{tumor-d}}$ based on the insertion sequence size distribution information of each sample.
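One way the calibration performed by modules 25 and 26 could work is sketched below for a single bin of a single probe region. The text does not say how the per-sample distributions are combined, so averaging them is an assumption, as are the function names, the size classes and the example insert sizes.

```python
from collections import Counter

def insert_size_distribution(insert_sizes, size_classes):
    """Normalized frequency of each size class in one sample.

    Assumes at least one read falls into the listed size classes.
    """
    counts = Counter(insert_sizes)
    total = sum(counts[d] for d in size_classes)
    return [counts[d] / total for d in size_classes]

def calibrate_profile(samples, size_classes):
    """Average the per-sample distributions into one calibrated profile."""
    dists = [insert_size_distribution(s, size_classes) for s in samples]
    return [sum(col) / len(dists) for col in zip(*dists)]

# Hypothetical insert sizes (bp) observed in two normal-cell samples.
size_classes = [145, 166]
normal_samples = [[166, 166, 166, 145], [166, 166, 145, 145]]
f_normal_d = calibrate_profile(normal_samples, size_classes)
```

The same routine applied to the tumor-cell samples would give the tumor profile.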
[0190] Optionally, the interpretation module 23 is specifically used for:
[0191] The tumor sequence count assessment information of the sample to be tested is denoised;
[0192] Samples that do not meet the quality requirements are screened out to determine the remaining samples to be tested;
[0193] Based on the tumor sequence count assessment information of the target gene region of the remaining test samples, it is determined whether the remaining test samples are positive variants of CNV.
[0194] Optionally, the interpretation module 23 is specifically used for:
[0195] Obtain variant samples of CNV-positive variants that are not located in the target gene region;
[0196] Calculate the tumor sequence count assessment information of the variant sample;
[0197] Perform SVD dimensionality reduction on the tumor sequence count assessment information of the variant samples;
[0198] Based on the principal component regression analysis results in the SVD dimensionality reduction, the tumor sequence count assessment information of the test sample is denoised.
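The denoising steps listed above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the matrix layout (samples by probe regions), the centering step, the number of retained components and all names are hypothetical, and the regression is carried out by projecting onto the orthonormal principal directions.

```python
import numpy as np

def denoise_with_svd(reference, test_sample, n_components=2):
    """Remove the noise captured by the leading principal components.

    reference:   array (n_samples, n_regions) of tumor sequence count
                 values from the variant samples.
    test_sample: array (n_regions,) for the sample to be tested.
    Returns (denoised, noise_fit), both centered on the reference mean.
    """
    center = reference.mean(axis=0)
    # Rows of vt are orthonormal principal directions across probe regions.
    _, _, vt = np.linalg.svd(reference - center, full_matrices=False)
    components = vt[:n_components]
    centered = test_sample - center
    # With orthonormal components, least-squares regression coefficients
    # are simply the projections onto each component.
    coeffs = components @ centered
    noise_fit = coeffs @ components
    return centered - noise_fit, noise_fit

# Hypothetical reference set whose only variation lies along direction u.
base = np.array([10.0, 12.0, 9.0, 11.0])
u = np.array([0.5, 0.5, 0.5, 0.5])
reference = base + np.outer([-2.0, -1.0, 0.0, 1.0, 2.0], u)
signal = np.array([1.0, -1.0, 0.0, 0.0])
denoised, noise_fit = denoise_with_svd(reference, base + 3.0 * u + signal,
                                       n_components=1)
```

The returned `noise_fit` is the regression result that a later quality control step could correlate with the original values.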
[0199] Optionally, the interpretation module 23 is specifically used for:
[0200] Calculate the correlation coefficient between the regression analysis results and the tumor sequence count assessment information of the sample to be tested;
[0201] The correlation coefficient is compared with the quality control threshold, and the test samples with a correlation coefficient less than the quality control threshold are screened out to obtain the remaining test samples.
[0202] Optionally, the interpretation module 23 is specifically used for:
[0203] Calculate the first statistical value of the tumor sequence count assessment information of the target gene region of the remaining test samples, and the second statistical value of the tumor sequence count assessment information of the target gene region of the negative samples, and compare the difference between the first statistical value and the second statistical value;
[0204] The differences are compared with the judgment threshold, and based on the comparison results, it is determined whether the corresponding test sample has positive CNV lesions.
[0205] Optionally, both the first statistical value and the second statistical value are T-statistics.
[0206] Referring to Figure 10, an electronic device 3 is provided, comprising:
[0207] Processor 31; and,
[0208] Memory 32, used to store instructions executable by the processor;
[0209] The processor 31 is configured to execute the methods described above by executing the executable instructions.
[0210] The processor 31 can communicate with the memory 32 via the bus 33.
[0211] This invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the methods described above.
[0212] Those skilled in the art will understand that all or part of the steps of the above-described method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above-described method embodiments; and the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
[0213] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for CNV interpretation, characterized in that it comprises: obtaining the sample to be tested; substituting the information of each probe region in the sample to be tested into a pre-defined specified relationship to obtain the tumor sequence count assessment information of each probe region in the sample to be tested, wherein the specified relationship characterizes the quantitative relationship between the tumor sequence count assessment information of the corresponding probe region and the following information: the number of sequences in each bin within the corresponding probe region; the number of tumor cell sequences in each bin within the corresponding probe region; the number of normal cell sequences in each bin within the corresponding probe region; and, based on the tumor sequence count assessment information of some or all of the test samples, determining whether the corresponding test sample is a CNV positive variant.
2. The CNV interpretation processing method according to claim 1, characterized in that the specified relationship includes a preset objective function, and: the value of the objective function is correlated in a first manner with the insertion sequence size of normal cells in the corresponding bin within the corresponding probe region and the insertion sequence size of tumor cells in the corresponding bin within the corresponding probe region; the value of the objective function is correlated in a second manner with the number of sequences in the corresponding bin within the corresponding probe region; if the first manner is positively correlated, then the second manner is negatively correlated, and if the first manner is negatively correlated, then the second manner is positively correlated. Substituting the information of each probe region in the sample to be tested into the pre-defined specified relationship to obtain the tumor sequence count assessment information of each probe region in the sample to be tested includes: for any probe region of the sample to be tested, substituting the information of that probe region into the objective function to calculate the function value of the objective function for each bin of that probe region; and, based on the function values of each bin in that probe region, determining the tumor sequence count assessment information of that probe region.
3. The CNV interpretation processing method according to claim 2, characterized in that determining the tumor sequence count assessment information of a probe region based on the function values of each bin in that probe region includes: when the first manner is negatively correlated and the second manner is positively correlated, selecting the smallest function value among all bins in the probe region as the tumor sequence count assessment information for that probe region.
4. The CNV interpretation processing method according to claim 2, characterized in that the objective function is characterized as follows: $X = \lVert c - n_{\text{normal}} f_{\text{normal-d}} - n_{\text{tumor}} f_{\text{tumor-d}} \rVert_2$; where: $X$ is the function value of the objective function; $c$ represents the number of sequences in the corresponding bin within the corresponding probe region; $n_{\text{normal}}$ represents the number of sequences of normal cells in the corresponding bin within the corresponding probe region; $n_{\text{tumor}}$ represents the number of sequences of tumor cells in the corresponding bin within the corresponding probe region; $f_{\text{normal-d}}$ represents the pre-calibrated insertion sequence size of normal cells in the corresponding bin within the corresponding probe region; $f_{\text{tumor-d}}$ represents the pre-calibrated insertion sequence size of tumor cells in the corresponding bin within the corresponding probe region.
5. The CNV interpretation processing method according to claim 4, characterized in that it also includes: obtaining a sample standard set, the sample standard set including samples of normal cells and samples of tumor cells; for each human sample of normal cells in the sample standard set, calculating the insertion sequence size distribution information of each bin in each probe region, and determining $f_{\text{normal-d}}$ based on the insertion sequence size distribution information of each sample; for each human sample of tumor cells in the sample standard set, calculating the insertion sequence size distribution information of each bin in each probe region, and determining $f_{\text{tumor-d}}$ based on the insertion sequence size distribution information of each sample.
6. The CNV interpretation processing method according to any one of claims 1 to 5, characterized in that, Based on tumor sequence count assessment information corresponding to some or all of the test samples, determine whether the corresponding test sample is a positive variant of CNV, including: The tumor sequence count assessment information of the sample to be tested is denoised; Samples that do not meet the quality requirements are screened out to determine the remaining samples to be tested; Based on the tumor sequence count assessment information of the target gene region of the remaining test samples, it is determined whether the remaining test samples are positive variants of CNV.
7. The CNV interpretation processing method according to claim 6, characterized in that denoising the tumor sequence count assessment information of the sample to be tested includes: obtaining variant samples of CNV-positive variants that are not located in the target gene region; calculating the tumor sequence count assessment information of the variant samples; performing SVD dimensionality reduction on the tumor sequence count assessment information of the variant samples; and, based on the principal component regression analysis results in the SVD dimensionality reduction, denoising the tumor sequence count assessment information of the test sample.
8. The CNV interpretation processing method according to claim 7, characterized in that, Samples that do not meet the quality requirements are screened out, and the remaining samples to be tested are determined, including: Calculate the correlation coefficient between the regression analysis results and the tumor sequence count assessment information of the sample to be tested; The correlation coefficient is compared with the quality control threshold, and the test samples with a correlation coefficient less than the quality control threshold are screened out to obtain the remaining test samples.
9. The CNV interpretation processing method according to claim 6, characterized in that, Based on tumor sequence count assessment information from some or all of the test samples, determine whether the corresponding test sample has a positive CNV variant, including: Calculate the first statistical value of the tumor sequence count assessment information of the target gene region of the remaining test samples and the second statistical value of the tumor sequence count assessment information of the target gene region of the negative samples, and compare the difference between the first statistical value and the second statistical value. The differences are compared with the judgment threshold, and based on the comparison results, it is determined whether the corresponding test sample has a positive CNV lesion.
10. The CNV interpretation processing method according to claim 9, characterized in that, Both the first statistical value and the second statistical value are T-statistics.
11. A CNV interpretation processing device, characterized in that it comprises: a test sample acquisition module, used to acquire the test sample; a calculation module, used to substitute the information of each probe region in the sample to be tested into a pre-defined specified relationship to obtain the tumor sequence count assessment information of each probe region in the sample to be tested, wherein the specified relationship represents the quantitative relationship between the tumor sequence count assessment information of each probe region and the following information: the number of sequences in each bin within the corresponding probe region; the number of tumor cell sequences in each bin within the corresponding probe region; the number of normal cell sequences in each bin within the corresponding probe region; and an interpretation module, used to interpret whether the corresponding test sample is a CNV positive variant based on the tumor sequence count assessment information of some or all of the test samples.
12. An electronic device, characterized in that it comprises a processor and a memory; the memory is used to store code; the processor is configured to execute the code in the memory to implement the method according to any one of claims 1 to 11.
13. A storage medium having a computer program stored thereon, which, when executed by a processor, implements the method of any one of claims 1 to 11.