Method of determining the gestational status of a pregnant woman
By constructing a prediction model from parameters such as fetal cell-free nucleic acid concentration and gestational age at sampling, the method addresses the limited accuracy of fetal cfDNA concentration as a predictor of preterm birth, enabling rapid and accurate prediction of pregnancy status while reducing the number of blood collections and the associated costs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BGI GENOMICS CO LTD
- Filing Date
- 2020-06-04
- Publication Date
- 2026-06-12
AI Technical Summary
In existing technologies, the correlation between fetal cfDNA concentration and preterm birth has not been effectively utilized, resulting in a lack of accuracy in preterm birth prediction methods, and the fetal fibronectin molecular diagnosis suffers from an excessively high false positive rate.
A predictive model is constructed using parameters such as the concentration of cell-free fetal nucleic acid in maternal peripheral blood, gestational age at sampling, and maternal physical characteristics (such as height, weight, and age). The model is then used to make predictions through linear regression, logistic regression, or random forest models to improve its accuracy.
It enables rapid and accurate prediction of a pregnant woman's pregnancy status, including the probability of premature birth, gestational age at delivery, and other related complications, through a single blood sample in early pregnancy, reducing the risks and costs of multiple blood samples.
Figure CN115516103B_ABST
Description
Technical Field
[0001] This invention relates to the field of biotechnology, particularly to non-invasive prenatal genetic testing, and specifically to methods and apparatus for determining the gestational status of pregnant women and to corresponding methods and apparatus for constructing machine learning prediction models.
Background Technology
[0002] Fetal cfDNA is present in the cell-free DNA (cfDNA) in the plasma of pregnant women. This fetal cfDNA mainly comes from the placenta, but also comes from hematopoietic stem cells or directly from the exchange between the fetus and the mother. Studies have confirmed that the concentration of fetal cfDNA in the plasma of pregnant women is related to various pregnancy complications such as premature birth, intrauterine growth retardation, and eclampsia.
[0003] In recent years, numerous articles have been published on the correlation between fetal cfDNA concentration in maternal plasma and preterm birth. However, there is still no definitive conclusion on whether there is a correlation between fetal cfDNA concentration and preterm birth, and different research papers have yielded contradictory conclusions.
[0004] Currently, effective methods for predicting preterm birth based on fetal cfDNA concentration are still under development.
Summary of the Invention
[0005] This application is based on the inventor's discoveries and understanding of the following facts and problems:
[0006] To date, most clinical methods for predicting threatened preterm labor involve detecting fetal fibronectin levels in the pregnant woman's vagina. However, this method is only an auxiliary tool and cannot be used as a final diagnostic criterion. Currently, there is no effective clinical method for diagnosing preterm labor.
[0007] Multiple reports have shown a correlation between fetal cfDNA concentration in maternal plasma and various pregnancy complications such as preterm birth and preeclampsia. Some studies attempted to predict preterm birth using fetal cfDNA concentration as a marker, but ultimately failed due to insufficient correlation. To date, there is no effective method for predicting preterm birth using fetal cfDNA concentration.
[0008] The clinical method of using fetal fibronectin molecules to assist in the diagnosis of preterm birth has a high false positive rate. Statistics show that among pregnant women who test positive for fetal fibronectin molecules, less than 3% of the samples are ultimately diagnosed with preterm birth. The high false positive rate has made this diagnostic method highly questionable.
[0009] Previous methods that used only the concentration of fetal cfDNA in maternal plasma as a single factor for predicting preterm birth had insufficient correlation and failed to establish an effective predictive model.
[0010] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
[0011] According to one aspect of the present invention, a method for constructing a predictive model is provided. According to an embodiment of the present invention, the predictive model is used to determine the gestational status of a pregnant woman, comprising: (i) constructing a training set and an optional validation set, wherein the training set and the validation set are each composed of multiple pregnant woman samples, the pregnant woman samples having known gestational status; (ii) for each pregnant woman sample in the training set, determining predetermined parameters for the pregnant woman, the predetermined parameters including the fetal cell-free nucleic acid concentration in the pregnant woman's peripheral blood and the gestational age at which the peripheral blood was sampled; and (iii) constructing the predictive model based on the known gestational status and the predetermined parameters. The method according to an embodiment of the present invention utilizes the fetal cell-free nucleic acid concentration, gestational age at sampling, pregnant woman's physical characteristics at the time of sampling (e.g., height, weight, body mass index, age), and the pregnant woman's gestational status (e.g., preterm birth, gestational age at delivery) obtained from a single blood sample of multiple pregnant women to construct a predictive model for the pregnant woman's gestational status. The method includes two key factors: fetal cell-free nucleic acid concentration and gestational age at sampling, thus improving the accuracy of the model.
[0012] According to embodiments of the present invention, the above method may further include at least one of the following additional technical features:
[0013] According to embodiments of the present invention, the pregnancy status includes the delivery range of the pregnant woman, that is, the gestational age at delivery. The method according to embodiments of the present invention can predict the probability of preterm birth, gestational age at delivery, intrauterine growth restriction, and other pregnancy complications correlated with fetal cell-free nucleic acid concentration.
[0014] According to embodiments of the present invention, the gestational age for blood sampling is 13 to 25 weeks. The inventors found that the correlation between fetal cell-free nucleic acid concentration and preterm birth is weaker when the gestational age for blood sampling is less than or equal to 12 weeks or between 26 and 30 weeks, while the correlation is stronger when the gestational age for blood sampling is between 13 and 25 weeks.
[0015] According to embodiments of the present invention, the prediction model is at least one of a linear regression model, a logistic regression model, and a random forest. According to the method of the present invention, the prediction model can theoretically be any statistical model that generalizes to different differential distributions.
[0016] According to an embodiment of the present invention, the predetermined parameters further include the pregnant woman's height, weight, and age.
[0017] According to an embodiment of the present invention, in step (iii), the following formula is used:
$l_i = \beta_0 + \beta_{icff} x_{icff} + \beta_{isample} x_{isample} + \beta_{iheight} x_{iheight} + \beta_{iweight} x_{iweight} + \beta_{iage} x_{iage} + \varepsilon_i$, for $i = 1, \dots, p$.
Using the training set and the validation set, the values of $\beta_0$, $\beta_{icff}$, $\beta_{isample}$, $\beta_{iheight}$, $\beta_{iweight}$, $\beta_{iage}$ and $\varepsilon_i$ are determined, where $i$ represents the number of the pregnant woman sample in the training set; $l_i$ is a value determined based on the known gestational status of the i-th pregnant woman sample, where $l_i$ is 1 for a preterm birth sample and 0 for a full-term sample; $x_{icff}$ represents the concentration of cell-free fetal nucleic acid in the i-th pregnant woman sample; $x_{isample}$ represents the gestational week at which blood was collected from the i-th pregnant woman sample; $x_{iheight}$ represents the height of the i-th pregnant woman sample; $x_{iweight}$ represents the weight of the i-th pregnant woman sample; $x_{iage}$ represents the age of the i-th pregnant woman sample; and $\varepsilon_i$ represents the sequencing error of the peripheral blood of the i-th pregnant woman sample.
[0018] In a second aspect, the present invention proposes a system for constructing a predictive model. According to an embodiment of the present invention, the predictive model is used to determine the gestational status of a pregnant woman, and the system comprises: a training set construction module, wherein the training set consists of multiple pregnant woman samples having a known gestational status; a predetermined parameter determination module, connected to the training set construction module and configured to determine, for each pregnant woman sample in the training set, predetermined parameters of the pregnant woman, the predetermined parameters including the concentration of fetal cell-free nucleic acid in the pregnant woman's peripheral blood and the gestational age at which the peripheral blood was sampled; and a predictive model construction module, connected to the predetermined parameter determination module and configured to construct the predictive model based on the known gestational status and the predetermined parameters. According to embodiments of the present invention, the system constructs a predictive model for the pregnant woman's pregnancy status based on the fetal cell-free nucleic acid concentration, gestational age at sampling, the pregnant woman's physical characteristics (e.g., height, weight, body mass index, age), and pregnancy status (e.g., preterm birth, gestational age at delivery) obtained from a single blood sample collection from each of multiple pregnant women. The system uses the fetal cell-free nucleic acid concentration and the gestational age at sampling as the two key parameters for constructing the model, thereby improving the accuracy of the constructed model.
[0019] According to embodiments of the present invention, the above system may further include at least one of the following additional technical features:
[0020] According to embodiments of the present invention, the pregnancy status includes the delivery range of the pregnant woman, that is, the gestational age at delivery. The system according to embodiments of the present invention can predict the probability of preterm birth, gestational age at delivery, intrauterine growth restriction, and other pregnancy complications correlated with fetal cell-free nucleic acid concentration.
[0021] According to an embodiment of the present invention, the gestational age for blood sampling is 13 to 25 weeks. The inventors found that the correlation between fetal cell-free nucleic acid concentration and preterm birth is weaker when the gestational age for blood sampling is less than or equal to 12 weeks or between 26 and 30 weeks, while the correlation is stronger when the gestational age for blood sampling is between 13 and 25 weeks.
[0022] According to embodiments of the present invention, the prediction model can theoretically be any statistical model that generalizes to different differential distributions. In specific embodiments of the present invention, the prediction model is at least one of a linear regression model, a logistic regression model, and a random forest.
[0023] According to an embodiment of the present invention, the predetermined parameters further include the pregnant woman's height, weight, and age.
[0024] According to an embodiment of the present invention, the prediction model construction module is configured for the following formula:
[0025] $l_i = \beta_0 + \beta_{icff} x_{icff} + \beta_{isample} x_{isample} + \beta_{iheight} x_{iheight} + \beta_{iweight} x_{iweight} + \beta_{iage} x_{iage} + \varepsilon_i$. Using the training set and the validation set, the values of $\beta_0$, $\beta_{icff}$, $\beta_{isample}$, $\beta_{iheight}$, $\beta_{iweight}$, $\beta_{iage}$ and $\varepsilon_i$ are determined, where $i$ represents the number of the pregnant woman sample in the training set; $l_i$ is a value determined based on the known gestational status of the i-th pregnant woman sample, where $l_i$ is 1 for a preterm birth sample and 0 for a full-term sample; $x_{icff}$ represents the concentration of cell-free fetal nucleic acid in the i-th pregnant woman sample; $x_{isample}$ represents the gestational week at which blood was collected from the i-th pregnant woman sample; $x_{iheight}$ represents the height of the i-th pregnant woman sample; $x_{iweight}$ represents the weight of the i-th pregnant woman sample; $x_{iage}$ represents the age of the i-th pregnant woman sample; and $\varepsilon_i$ represents the sequencing error of the peripheral blood of the i-th pregnant woman sample.
[0026] In a third aspect, the present invention provides a method for determining the gestational status of a pregnant woman. According to an embodiment of the present invention, the method includes: (1) determining predetermined parameters of the pregnant woman, the predetermined parameters including the concentration of fetal cell-free nucleic acid in the pregnant woman's peripheral blood and the gestational age at which the peripheral blood was sampled; and (2) determining the gestational status of the pregnant woman based on the predetermined parameters and the prediction model, the prediction model being constructed according to the method for establishing the prediction model. The method according to an embodiment of the present invention enables a single blood sample taken from a pregnant woman in early pregnancy, and rapid and accurate prediction of the pregnant woman's gestational status based on information on the concentration of fetal cell-free nucleic acid obtained from the pregnant woman's peripheral blood, the gestational age at which the blood was collected, and the pregnant woman's vital signs. The gestational status includes gestational age at delivery, probability of preterm birth, intrauterine growth restriction, and other pregnancy complications correlated with the concentration of fetal cell-free nucleic acid.
[0027] According to embodiments of the present invention, the above method may further include at least one of the following additional technical features:
[0028] According to embodiments of the present invention, the pregnancy status includes the labor range of the pregnant woman. The labor range refers to the gestational age at delivery. The method according to embodiments of the present invention can effectively predict the gestational age at delivery and the probability of preterm birth. In addition, the method according to embodiments of the present invention can also effectively predict pregnancy complications related to fetal cell-free nucleic acid concentration, such as the probability of preterm birth, gestational age at delivery, and intrauterine growth restriction.
[0029] According to an embodiment of the present invention, the gestational age for blood sampling is 13 to 25 weeks. The inventors found that the correlation between fetal cell-free nucleic acid concentration and preterm birth is weaker when the gestational age for blood sampling is less than or equal to 12 weeks or between 26 and 30 weeks, while the correlation is stronger when the gestational age for blood sampling is between 13 and 25 weeks.
[0030] According to embodiments of the present invention, the prediction model can theoretically be any statistical model that generalizes to different differential distributions. In a specific embodiment of the present invention, the predetermined prediction model is at least one of a linear regression model, a logistic regression model, and a random forest.
[0031] According to an embodiment of the present invention, the predetermined parameters further include the pregnant woman's height, weight, and / or age, and the prediction model is adapted to calculate the pregnant woman's delivery range based on the following formula:
[0032] $l = \beta_0 + \beta_{cff} x_{cff} + \beta_{sample} x_{sample} + \beta_{height} x_{height} + \beta_{weight} x_{weight} + \beta_{age} x_{age} + \varepsilon$, where $l$ is a parameter determined based on the probability of premature birth in the pregnant woman; $\beta_0$, $\beta_{cff}$, $\beta_{sample}$, $\beta_{height}$, $\beta_{weight}$ and $\beta_{age}$ are independently predetermined coefficients; $x_{cff}$ represents the concentration of cell-free fetal nucleic acid in the pregnant woman; $x_{sample}$ represents the gestational age at which the blood was collected from the pregnant woman; $x_{height}$ represents the pregnant woman's height; $x_{weight}$ represents the pregnant woman's weight; $x_{age}$ represents the pregnant woman's age; and $\varepsilon$ represents the sequencing error of the pregnant woman's peripheral blood sample. According to an embodiment of the present invention, the coefficients $\beta_0$, $\beta_{cff}$, $\beta_{sample}$, $\beta_{height}$, $\beta_{weight}$ and $\beta_{age}$ can be derived from a preset training set. One or more of these terms can be selected, or the pregnant woman's body mass index (BMI) can be added as one of the parameters.
[0033] According to an embodiment of the present invention, l is determined based on the following formula: $l = \log_b \frac{p}{1 - p}$, where b is the base of the logarithm, usually taken as the constant e, and p is the probability of the pregnant woman having a premature birth.
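The relationship between l and p can be illustrated with a minimal sketch. It assumes that l is the log-odds of preterm birth, l = log_b(p / (1 - p)), so that p = 1 / (1 + b^(-l)). The function names and every coefficient value below are hypothetical and chosen only for illustration; none of them come from the patent.

```python
import math


def linear_predictor(params, coefs, intercept, eps=0.0):
    """Compute l = beta0 + sum(beta_k * x_k) + eps over the terms listed in coefs."""
    return intercept + sum(coefs[name] * params[name] for name in coefs) + eps


def preterm_probability(l, base=math.e):
    """Invert l = log_b(p / (1 - p)), giving p = 1 / (1 + b**(-l))."""
    return 1.0 / (1.0 + base ** (-l))


# Hypothetical coefficients and inputs (assumptions, not values from the patent).
coefs = {"cff": -0.10, "sample": 0.05, "height": -0.01, "weight": 0.02, "age": 0.03}
params = {"cff": 10.0, "sample": 16.0, "height": 160.0, "weight": 55.0, "age": 30.0}

l = linear_predictor(params, coefs, intercept=-1.0)
p = preterm_probability(l)
print(f"l = {l:.3f}, p = {p:.3f}")
```

Dropping a term from `coefs` or adding a BMI entry to both dictionaries changes the set of parameters without changing the functions, which mirrors the statement that terms can be selected as needed.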
[0034] In a fourth aspect, the present invention provides an apparatus for determining the gestational status of a pregnant woman. According to an embodiment of the invention, the apparatus includes: a parameter determination module for determining predetermined parameters of the pregnant woman, the predetermined parameters including the concentration of fetal cell-free nucleic acid in the pregnant woman's peripheral blood and the gestational age at which the peripheral blood was sampled; and a gestational status determination module connected to the parameter determination module for determining the gestational status of the pregnant woman based on the predetermined parameters and the prediction model. The apparatus according to the embodiment of the invention can rapidly and accurately predict the gestational status of a pregnant woman based on information on the concentration of fetal cell-free nucleic acid, the gestational age at which the blood was collected, and the pregnant woman's vital signs obtained from a single blood sample taken in early pregnancy. This includes gestational age at delivery, probability of preterm birth, intrauterine growth restriction, and other pregnancy complications correlated with the concentration of fetal cell-free nucleic acid.
[0035] According to embodiments of the present invention, the above-described apparatus may further have the following additional technical features:
[0036] According to embodiments of the present invention, the pregnancy status includes the delivery range of the pregnant woman, that is, the gestational age at delivery. The apparatus according to embodiments of the present invention can predict the probability of preterm birth, gestational age at delivery, intrauterine growth restriction, and other pregnancy complications correlated with fetal cell-free nucleic acid concentration.
[0037] According to an embodiment of the present invention, the gestational age for blood sampling is 13 to 25 weeks. The inventors found that the correlation between fetal cell-free nucleic acid concentration and preterm birth is weak when the gestational age for blood sampling is less than or equal to 12 weeks or between 26 and 30 weeks, while the correlation is strong when the gestational age for blood sampling is between 13 and 25 weeks.
[0038] According to an embodiment of the present invention, the predetermined prediction model is at least one of a linear regression model, a logistic regression model, and a random forest. According to a specific embodiment of the present invention, the prediction model can theoretically be any statistical model that generalizes to different differential distributions.
[0039] According to an embodiment of the present invention, the predetermined parameters further include the pregnant woman's height, weight, and age, and the prediction model is adapted to calculate the pregnant woman's delivery range based on the following formula:
[0040] $l = \beta_0 + \beta_{cff} x_{cff} + \beta_{sample} x_{sample} + \beta_{height} x_{height} + \beta_{weight} x_{weight} + \beta_{age} x_{age} + \varepsilon$, where $l$ is a parameter determined based on the probability of premature birth in the pregnant woman; $\beta_0$, $\beta_{cff}$, $\beta_{sample}$, $\beta_{height}$, $\beta_{weight}$ and $\beta_{age}$ are independently predetermined coefficients; $x_{cff}$ represents the concentration of cell-free fetal nucleic acid in the pregnant woman; $x_{sample}$ represents the gestational age at which the blood was collected from the pregnant woman; $x_{height}$ represents the pregnant woman's height; $x_{weight}$ represents the pregnant woman's weight; $x_{age}$ represents the pregnant woman's age; and $\varepsilon$ represents the sequencing error of the pregnant woman's peripheral blood sample. According to an embodiment of the present invention, the terms associated with $\beta_0$, $\beta_{cff}$, $\beta_{sample}$, $\beta_{height}$, $\beta_{weight}$ and $\beta_{age}$ can be freely selected as needed; for example, the pregnant woman's BMI can be added as one of the parameters.
[0041] According to an embodiment of the present invention, l is determined based on the following formula: $l = \log_b \frac{p}{1 - p}$, where b is the base of the logarithm, usually taken as the constant e, and p is the probability of the pregnant woman having a premature birth.
[0042] In a fifth aspect of the invention, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps for constructing a predictive model as described above. Thus, the method for constructing a predictive model described above can be effectively implemented, thereby enabling the efficient construction of a predictive model, which can then be used to predict unknown samples to determine the gestational status of a pregnant woman to be tested.
[0043] In a sixth aspect, the present invention provides an electronic device comprising the aforementioned computer-readable storage medium; and one or more processors for executing a program in the computer-readable storage medium.
Attached Figure Description
[0044] The above and / or additional aspects and advantages of the present invention will become apparent and readily understood from the description of the embodiments taken in conjunction with the following drawings, in which:
[0045] Figure 1 shows the correlation between fetal cfDNA concentration and preterm birth at different gestational weeks of blood collection according to an embodiment of the present invention;
[0046] Figure 2 shows the changes in specificity, sensitivity, and accuracy when a test dataset is used to predict preterm birth at different preterm birth probability thresholds according to an embodiment of the present invention;
[0047] Figure 3 shows the distribution of the predicted gestational age at delivery and the actual gestational age at delivery according to an embodiment of the present invention;
[0048] Figure 4 is a flowchart of a method for constructing a prediction model according to an embodiment of the present invention;
[0049] Figure 5 is a block diagram of a system for constructing a prediction model according to an embodiment of the present invention;
[0050] Figure 6 is a flowchart of a method for determining the pregnancy status of a pregnant woman according to an embodiment of the present invention;
[0051] Figure 7 is a block diagram of an apparatus for determining the pregnancy status of a pregnant woman according to an embodiment of the present invention.
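The threshold-dependent metrics named in the description of Figure 2 (specificity, sensitivity, and accuracy at a given preterm birth probability threshold) can be computed as in the following sketch. The function name, the predicted probabilities, and the labels are hypothetical illustration data, not results from the patent.

```python
def threshold_metrics(probs, labels, threshold):
    """Classify a sample as preterm when its predicted probability >= threshold,
    then compare against the known labels (1 = preterm, 0 = full-term)."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    tn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / len(labels),
    }


# Hypothetical predicted probabilities and true outcomes for six test samples.
probs = [0.9, 0.7, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]

metrics_at_050 = threshold_metrics(probs, labels, 0.50)
metrics_at_025 = threshold_metrics(probs, labels, 0.25)
print(metrics_at_050)
print(metrics_at_025)
```

Lowering the threshold in this toy data raises sensitivity and lowers specificity, which is the trade-off a threshold sweep makes visible.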
[0052] Detailed description of the invention
[0053] Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain the present invention, and should not be construed as limiting the present invention.
[0054] Terminology Explanation
[0055] Unless otherwise specified, the terms "first," "second," "third," and similar terms used in this document are for the purpose of distinguishing for ease of description and do not imply or indicate any difference in order or importance among them, nor do they mean that the content defined by "first," "second," "third," or similar terms consists of only one component.
[0056] In this invention, unless otherwise explicitly specified and limited, the terms "installation," "connection," "linking," and "fixing," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two components or the interaction between two components, unless otherwise explicitly limited. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.
[0057] According to one aspect of the present invention, a method for constructing a predictive model is provided. According to an embodiment of the present invention, and with reference to Figure 4, the prediction model is used to determine the pregnancy status of a pregnant woman, and the method includes:
[0058] S1000, construct a training set and an optional validation set, wherein the training set and the validation set are each composed of multiple pregnant women samples, and the pregnant women samples have known pregnancy status;
[0059] S2000, for each of the pregnant women samples in the training set, determine predetermined parameters for the pregnant woman, the predetermined parameters including the concentration of cell-free fetal nucleic acid in the pregnant woman's peripheral blood and the gestational age at which the peripheral blood was sampled; and
[0060] S3000, based on the known pregnancy status and the predetermined parameters, construct the prediction model. According to the method of this embodiment, a prediction model for the pregnancy status of a pregnant woman is constructed based on the fetal cell-free nucleic acid concentration, gestational age at sampling, maternal physical characteristics (e.g., height, weight, BMI, age), and pregnancy status (e.g., preterm birth, gestational age at delivery) obtained from a single blood sample collection from each of multiple pregnant women. The method includes two key factors, fetal cell-free nucleic acid concentration and gestational age at sampling, thus improving the accuracy of the model. According to an embodiment of the present invention, the fetal cell-free nucleic acid concentration is obtained by processing cell-free nucleic acid sequencing data from maternal plasma as input data. Specifically: after quality control of the raw sequencing data (fq format), alignment software (such as the samse mode in BWA) is used to align the sequencing data to a human reference genome; sequencing data quality control software (such as Picard) is used to remove duplicate reads from the alignment results and to calculate the duplication rate; a function of a variant detection toolkit (such as the BQSR function in GATK) is used to locally correct the alignment results; and coverage depth calculation software (such as the Depth of Coverage function in GATK) is used to calculate the average depth of each chromosome in each sample. For male fetal samples, the average depth of coverage of uniquely aligned reads mapped to the non-homologous region of the Y chromosome is calculated. The ratio of this average depth to the average depth of uniquely aligned reads on the autosomes is the fetal cell-free nucleic acid concentration.
For female fetal samples, existing methods for calculating fetal cell-free nucleic acid concentration based on low-depth sequencing data from maternal plasma can be used.
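The male-fetus ratio described above can be sketched as follows. This is a minimal illustration that assumes the depth values have already been produced by the upstream alignment and coverage steps; the function names and the numeric depths are hypothetical, and the sketch computes only the ratio as stated in the text, without any further scaling or correction.

```python
def mean_depth(depths):
    """Average a list of per-region (or per-chromosome) coverage depths."""
    return sum(depths) / len(depths)


def male_fetal_cff(y_nonhomologous_depths, autosome_depths):
    """Ratio of the mean unique-read depth on the non-homologous region of
    chromosome Y to the mean unique-read depth on the autosomes."""
    autosomal = mean_depth(autosome_depths)
    if autosomal <= 0:
        raise ValueError("autosomal mean depth must be positive")
    return mean_depth(y_nonhomologous_depths) / autosomal


# Hypothetical low-depth coverage values, for illustration only.
y_depths = [0.004, 0.006]
autosome_depths = [0.10, 0.10, 0.10]

cff = male_fetal_cff(y_depths, autosome_depths)
print(f"fetal cell-free nucleic acid concentration (ratio): {cff:.3f}")
```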
[0061] According to a specific embodiment of the present invention, the method of the present invention selects pregnant women samples as training set and validation set, constructs a prediction model based on the known pregnancy status, fetal cell-free nucleic acid concentration, height, weight, age, BMI and gestational week (13-25 weeks) at the time of blood collection in the training set, and then determines the magnitude of each fixed coefficient in the prediction model formula to predict the pregnancy status of the pregnant women to be tested.
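Selecting samples drawn within the 13 to 25 week window and dividing them into a training set and a validation set might look like the sketch below. The record layout, the field names, the 20% validation fraction, and the fixed random seed are all assumptions made for illustration; the text does not specify how the split is performed.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    cff: float           # fetal cell-free nucleic acid concentration
    sample_week: float   # gestational week at blood collection
    height_cm: float
    weight_kg: float
    age: float
    preterm: bool        # known pregnancy status

    @property
    def bmi(self):
        """Body mass index in kg/m^2, derived from height and weight."""
        return self.weight_kg / (self.height_cm / 100.0) ** 2


def in_sampling_window(sample, lo=13, hi=25):
    """Keep only samples whose blood was drawn at 13 to 25 gestational weeks."""
    return lo <= sample.sample_week <= hi


def split_train_validation(samples, validation_fraction=0.2, seed=0):
    """Filter by sampling window, shuffle reproducibly, and split."""
    eligible = [s for s in samples if in_sampling_window(s)]
    rng = random.Random(seed)
    rng.shuffle(eligible)
    n_val = int(len(eligible) * validation_fraction)
    return eligible[n_val:], eligible[:n_val]


# Hypothetical cohort: one sample per gestational week from 11 to 20.
samples = [
    Sample(cff=0.08 + 0.001 * w, sample_week=w, height_cm=160.0,
           weight_kg=55.0, age=30.0, preterm=(w % 4 == 0))
    for w in range(11, 21)
]
train, val = split_train_validation(samples)
print(len(train), len(val))
```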
[0062] According to embodiments of the present invention, the pregnancy status includes the delivery range of the pregnant woman, that is, the gestational age at delivery. The method according to embodiments of the present invention can predict the probability of preterm birth, gestational age at delivery, intrauterine growth restriction, and other pregnancy complications correlated with fetal cell-free nucleic acid concentration.
[0063] According to embodiments of the present invention, the gestational age for sampling is 13 to 25 weeks. The inventors found that the correlation between fetal cell-free nucleic acid concentration and preterm birth is weak when the gestational age for blood sampling is less than or equal to 12 weeks or between 26 and 30 weeks, while the correlation is strong when the gestational age for blood sampling is between 13 and 25 weeks. Used on its own, fetal cell-free nucleic acid concentration typically correlates only weakly with the gestational status of pregnant women. According to the method of the present invention, incorporating the gestational age at sampling as one of the parameters for constructing the predictive model improves the accuracy of the prediction. Each pregnant woman only needs to have her blood drawn once within gestational weeks 13 to 25 to serve as a sample for model construction, avoiding the risks and costs associated with repeated blood sampling during sample collection.
[0064] According to embodiments of the present invention, the prediction model is at least one of a linear regression model, a logistic regression model, and a random forest. According to embodiments of the present invention, the prediction model can theoretically be any statistical model that generalizes to different differential distributions.
[0065] According to an embodiment of the present invention, the predetermined parameters further include the pregnant woman's height, weight, and age.
[0066] According to an embodiment of the present invention, in step (iii), the following formula is used:
[0067] $l_i = \beta_0 + \beta_{icff} x_{icff} + \beta_{isample} x_{isample} + \beta_{iheight} x_{iheight} + \beta_{iweight} x_{iweight} + \beta_{iage} x_{iage} + \varepsilon_i$, for $i = 1, \dots, p$. Using the training set and the validation set, the values of $\beta_0$, $\beta_{icff}$, $\beta_{isample}$, $\beta_{iheight}$, $\beta_{iweight}$, $\beta_{iage}$ and $\varepsilon_i$ are determined, where $i$ represents the number of the pregnant woman sample in the training set; $l_i$ is a value determined based on the known gestational status of the i-th pregnant woman sample, where $l_i$ is 1 for a preterm birth sample and 0 for a full-term sample; $x_{icff}$ represents the concentration of cell-free fetal nucleic acid in the i-th pregnant woman sample; $x_{isample}$ represents the gestational week at which blood was collected from the i-th pregnant woman sample; $x_{iheight}$ represents the height of the i-th pregnant woman sample; $x_{iweight}$ represents the weight of the i-th pregnant woman sample; $x_{iage}$ represents the age of the i-th pregnant woman sample; and $\varepsilon_i$ represents the sequencing error of the peripheral blood of the i-th pregnant woman sample. It should be noted that ε is the random error generated by the sequencer during the sequencing process. This value is related to the sequencing batch but not to the individual pregnant woman sample; it is generated directly by the sequencer when the sequencing data is processed.
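Determining the coefficients from labelled samples can be illustrated with a small logistic regression fit. The sketch below uses plain batch gradient descent, which is a choice made here for illustration and is not stated in the text as the fitting procedure. It also assumes that the coefficients are shared across samples, that the features have been centred and scaled beforehand, and that ε is left as an unmodelled residual. The toy data contain only two features (fetal cell-free nucleic acid concentration and sampling week) and are entirely hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns (beta0, betas)."""
    n, k = len(X), len(X[0])
    beta0, betas = 0.0, [0.0] * k
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * k
        for xi, yi in zip(X, y):
            err = sigmoid(beta0 + sum(b * v for b, v in zip(betas, xi))) - yi
            g0 += err
            for j in range(k):
                g[j] += err * xi[j]
        beta0 -= lr * g0 / n
        betas = [b - lr * gj / n for b, gj in zip(betas, g)]
    return beta0, betas


def predict_probability(beta0, betas, x):
    """Predicted probability of preterm birth for one feature vector."""
    return sigmoid(beta0 + sum(b * v for b, v in zip(betas, x)))


# Hypothetical standardized features [cff, sampling week]; label 1 = preterm.
X = [[-1.0, 0.2], [-0.8, -0.1], [-0.5, 0.3], [0.4, -0.2], [0.9, 0.1], [1.2, 0.0]]
y = [1, 1, 1, 0, 0, 0]

beta0, betas = fit_logistic(X, y)
p_first = predict_probability(beta0, betas, X[0])
p_last = predict_probability(beta0, betas, X[-1])
print(beta0, betas, p_first, p_last)
```

In practice a tested library implementation of linear regression, logistic regression, or random forest would normally replace this hand-written loop; the sketch only makes the role of the labels $l_i$ and the coefficients concrete.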
[0068] In a second aspect, the present invention proposes a system for constructing a predictive model. According to an embodiment of the invention, the predictive model is used to determine the pregnancy status of a pregnant woman. With reference to Figure 5, the system includes: a training set construction module 1000, the training set consisting of multiple pregnant women samples, the pregnant women samples having known gestational status; a predetermined parameter determination module 2000, connected to the training set construction module 1000 and used, for each pregnant woman sample in the training set, to determine predetermined parameters for the pregnant woman, the predetermined parameters including the fetal cell-free nucleic acid concentration in the pregnant woman's peripheral blood and the gestational age at which the pregnant woman's peripheral blood was sampled; and a prediction model construction module 3000, connected to the predetermined parameter determination module 2000 and used to construct the prediction model based on the known gestational status and the predetermined parameters. According to an embodiment of the present invention, the system constructs a predictive model for the pregnant woman's pregnancy status based on the fetal cell-free nucleic acid concentration, gestational age at sampling, the pregnant woman's physical signs (e.g., height, weight, BMI, age) and pregnancy status (e.g., preterm birth, gestational age at delivery) obtained from a single blood sample collection from each of multiple pregnant women. The system uses fetal cell-free nucleic acid concentration and gestational age at sampling as the two key parameters for constructing the model, thereby improving the accuracy of the constructed model.
[0069] According to embodiments of the present invention, the pregnancy status includes the labor phase of the pregnant woman. The method according to embodiments of the present invention can predict the probability of preterm birth, gestational age at delivery, intrauterine growth restriction, and other pregnancy complications correlated with fetal cell-free nucleic acid concentration.
[0070] According to an embodiment of the present invention, the gestational age for sampling is 13-25 weeks. The inventors found that the correlation between fetal concentration and preterm birth is weak when the gestational age for blood sampling is less than or equal to 12 weeks and between 26-30 weeks, while the correlation is strong when the gestational age for blood sampling is between 13 and 25 weeks. Typically, using fetal cell-free nucleic acid concentration to predict the gestational status of pregnant women has problems such as weak correlation. According to the system of the present invention, incorporating the gestational age for sampling as one of the parameters for constructing the prediction model improves the accuracy of the prediction. Different pregnant women only need to have their blood drawn once within the gestational age of 13-25 weeks to become samples for model construction, avoiding the risks and costs associated with repeated blood sampling during the sample collection process.
[0071] According to embodiments of the present invention, the prediction model is at least one of a linear regression model, a logistic regression model, and a random forest. According to the system of the present invention, the prediction model can theoretically be any statistical model that generalizes to different differential distributions.
[0072] According to an embodiment of the present invention, the predetermined parameters further include the pregnant woman's height, weight, and age.
[0073] According to an embodiment of the present invention, the prediction model construction module is configured for the following formula:
[0074] For the following formula:
[0075] $$l_i = \beta_0 + \beta_{icff}\,x_{icff} + \beta_{isample}\,x_{isample} + \beta_{iheight}\,x_{iheight} + \beta_{iweight}\,x_{iweight} + \beta_{iage}\,x_{iage} + \varepsilon_i, \quad i = 1, \ldots, p$$
[0076] Using the training and validation sets, the values of $\beta_0$, $\beta_{icff}$, $\beta_{isample}$, $\beta_{iheight}$, $\beta_{iweight}$, $\beta_{iage}$ and $\varepsilon_i$ are determined, where i represents the number of the pregnant woman sample in the training set; $l_i$ is a value determined based on the known gestational status of the i-th pregnant woman sample, wherein $l_i$ is 1 for a preterm birth sample and 0 for a full-term sample; $x_{icff}$ represents the concentration of cell-free fetal nucleic acid in the i-th pregnant woman sample; $x_{isample}$ represents the gestational week at which blood was collected from the i-th pregnant woman sample; $x_{iheight}$ represents the height of the i-th pregnant woman sample; $x_{iweight}$ represents the weight of the i-th pregnant woman sample; $x_{iage}$ represents the age of the i-th pregnant woman sample; and $\varepsilon_i$ represents the sequencing error of the peripheral blood sample from the i-th pregnant woman sample.
[0077] In a third aspect, the present invention provides a method for determining the gestational status of a pregnant woman. According to embodiments of the present invention, with reference to Figure 6, the method includes:
[0078] S100 determines predetermined parameters for the pregnant woman, the predetermined parameters including the concentration of cell-free fetal nucleic acid in the pregnant woman's peripheral blood and the gestational age at which the peripheral blood was sampled; and
[0079] S200 determines the gestational status of the pregnant woman based on the predetermined parameters and the prediction model. According to the method of this embodiment, the fetal cell-free nucleic acid concentration is obtained by processing the cell-free nucleic acid sequencing data in the pregnant woman's plasma as input data. Specifically, this includes: after quality control of the raw sequencing data (fq format), aligning the sequencing data to a human reference chromosome using alignment software (such as samse mode in BWA); using sequencing data quality control software (such as Picard) to remove duplicate reads in the alignment results and calculate the duplication rate; using a variant detection algorithm (such as the base quality value correction BQSR function in GATK) to perform local correction of the alignment results; and using coverage depth calculation software (such as the Depth of Coverage function in GATK) to calculate the average depth of different chromosomes in each sample. For male fetal samples, the average depth of coverage of the unique aligned reads aligned to the non-homologous region of the Y chromosome is calculated. The ratio of this average depth to the average depth of the unique aligned reads on autosomes is the fetal cell-free nucleic acid concentration. For female fetal samples, existing methods for calculating fetal cell-free nucleic acid concentration based on low-depth sequencing data of maternal plasma can be used.
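The final step for male fetal samples, dividing the mean depth on the non-homologous region of the Y chromosome by the mean autosomal depth, can be sketched as follows. The function name `fetal_fraction_male`, its arguments, and the depth values in the usage note are hypothetical; the sketch assumes the per-chromosome mean depths have already been produced by the upstream alignment, duplicate-removal, and coverage steps, and it applies the ratio exactly as stated in the text, without any further scaling.

```python
from statistics import mean


def fetal_fraction_male(chr_y_depth: float,
                        autosome_depths: dict[str, float]) -> float:
    """Fetal cell-free nucleic acid concentration for a male fetus.

    chr_y_depth: mean depth of uniquely aligned reads on the
        non-homologous region of the Y chromosome.
    autosome_depths: mean depth of uniquely aligned reads per autosome,
        keyed by chromosome name (e.g. "chr1").
    Returns the ratio of the Y depth to the mean autosomal depth.
    """
    if not autosome_depths:
        raise ValueError("no autosomal depths supplied")
    autosome_mean = mean(autosome_depths.values())
    if autosome_mean <= 0:
        raise ValueError("mean autosomal depth must be positive")
    return chr_y_depth / autosome_mean
```

For example, a Y depth of 0.01 against a mean autosomal depth of 0.10 gives a concentration of about 0.1 under this definition.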
[0080] According to embodiments of the present invention, the pregnancy status includes the labor phase of the pregnant woman. The method according to embodiments of the present invention can predict the probability of preterm birth, gestational age at delivery, intrauterine growth restriction, and other pregnancy complications correlated with fetal cell-free nucleic acid concentration.
[0081] According to an embodiment of the present invention, the gestational age for sampling is 13-25 weeks. The inventors found that the correlation between fetal concentration and preterm birth is weak when the gestational age for blood sampling is less than or equal to 12 weeks and between 26-30 weeks, while the correlation is strong when the gestational age for blood sampling is between 13 and 25 weeks. Typically, using fetal cell-free nucleic acid concentration to predict the gestational status of pregnant women has problems such as weak correlation. According to the method of the present invention, incorporating the gestational age for sampling as one of the parameters in constructing the prediction model improves the accuracy of the prediction. Furthermore, pregnant women only need to undergo one blood sample within the 13-25 week gestational period for prediction, reducing the cost and risk of multiple blood samplings.
[0082] According to embodiments of the present invention, the predetermined prediction model is at least one of a linear regression model, a logistic regression model, and a random forest. According to embodiments of the present invention, the prediction model can theoretically be any statistical model that generalizes to different differential distributions.
[0083] According to a specific embodiment of the present invention, the method establishes a predictive model based on the known gestational status of the pregnant woman's sample, fetal cell-free nucleic acid concentration, height, weight, age, BMI, and gestational age at blood collection (13-25 weeks). The method also determines the values of fixed coefficients in the predictive model formula to predict the gestational status of the pregnant woman to be tested. At 13-25 weeks of gestation, peripheral blood is collected from the pregnant woman to be tested, and the fetal cell-free nucleic acid concentration is measured. The fetal cell-free nucleic acid concentration, height, weight, age, BMI, and gestational age information of the pregnant woman to be tested are input into the predictive model to obtain the predicted gestational status information of the pregnant woman to be tested.
[0084] According to an embodiment of the present invention, the predetermined parameters further include the pregnant woman's height, weight, and age, and the prediction model is adapted to calculate the pregnant woman's delivery range based on the following formula:
[0085] $$l = \beta_0 + \beta_{cff}\,x_{cff} + \beta_{sample}\,x_{sample} + \beta_{height}\,x_{height} + \beta_{weight}\,x_{weight} + \beta_{age}\,x_{age} + \varepsilon$$ where l is a parameter determined based on the probability of premature birth in the pregnant woman; $\beta_0$, $\beta_{cff}$, $\beta_{sample}$, $\beta_{height}$, $\beta_{weight}$ and $\beta_{age}$ are each independently predetermined coefficients; $x_{cff}$ represents the concentration of cell-free fetal nucleic acid in the pregnant woman; $x_{sample}$ represents the gestational age at which the blood was collected from the pregnant woman; $x_{height}$ represents the height of the pregnant woman; $x_{weight}$ represents the pregnant woman's weight; $x_{age}$ represents the pregnant woman's age; and ε represents the sequencing error of the pregnant woman's peripheral blood sample. According to the method of an embodiment of the present invention, the coefficients $\beta_0$, $\beta_{cff}$, $\beta_{sample}$, $\beta_{height}$ and $\beta_{weight}$ can be freely selected as needed; for example, a coefficient for the pregnant woman's BMI can be added.
[0086] According to an embodiment of the present invention, l is determined based on the following formula: $$l = \log_b \frac{P}{1-P}$$ where b is the base of the logarithm, usually taken as the constant e, and P is the probability of premature birth in the pregnant woman.
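The relationship between the linear predictor l and the probability can be sketched in Python as follows, assuming the log-odds form l = log_b(p / (1 - p)) implied by the "log-odds transformation" described in Example 1. The function names and the dictionary keys (`"intercept"`, `"cff"`, and so on) are hypothetical, the error term ε is omitted, and no coefficient values from the patent are used.

```python
import math


def linear_predictor(beta: dict[str, float], x: dict[str, float]) -> float:
    """l = beta_0 + sum over covariates of beta_k * x_k (error term omitted).

    beta must contain an "intercept" entry plus one entry per key of x.
    """
    return beta["intercept"] + sum(beta[name] * value
                                   for name, value in x.items())


def log_odds(p: float, base: float = math.e) -> float:
    """Log-odds transformation l = log_b(p / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p), base)


def probability_from_log_odds(l: float, base: float = math.e) -> float:
    """Inverse transformation p = 1 / (1 + b**(-l))."""
    return 1.0 / (1.0 + base ** (-l))
```

Given fitted coefficients, a prediction for one woman is obtained by passing her covariates to `linear_predictor` and converting the result with `probability_from_log_odds`.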
[0087] In a fourth aspect, the present invention provides an apparatus for determining the gestational status of a pregnant woman. According to an embodiment of the invention, with reference to Figure 7, the apparatus includes: a parameter determination module 100, used to determine predetermined parameters of the pregnant woman, the predetermined parameters including the fetal cell-free nucleic acid concentration in the pregnant woman's peripheral blood and the gestational age at which the peripheral blood was sampled; and a pregnancy status determination module 200, connected to the parameter determination module 100 and used to determine the pregnancy status of the pregnant woman based on the predetermined parameters and the prediction model. According to an embodiment of the present invention, the apparatus can quickly and accurately predict the pregnancy status of a pregnant woman, including gestational age at delivery, probability of preterm birth, intrauterine growth restriction, and other pregnancy complications correlated with fetal cell-free nucleic acid concentration, based on the fetal cell-free nucleic acid concentration, the gestational age at blood collection, and the pregnant woman's physical sign data obtained from a single blood sample taken in early pregnancy. According to the apparatus of this invention, the fetal cell-free nucleic acid concentration is obtained by processing cell-free nucleic acid sequencing data from maternal plasma as input data.
Specifically, this includes: after quality control of the raw sequencing data (fq format), aligning the sequencing data to a human reference chromosome using alignment software (such as samse mode in BWA); using sequencing data quality control software (such as Picard) to remove duplicate reads from the alignment results and calculate the duplication rate; using a variant detection algorithm (such as the BQSR function in GATK) to perform local correction of the alignment results; and using coverage depth calculation software (such as the Depth of Coverage function in GATK) to calculate the average depth of different chromosomes for each sample. For male fetal samples, the average depth of coverage of unique aligned reads aligned to the non-homologous region of the Y chromosome is calculated. The ratio of this average depth to the average depth of unique aligned reads on autosomes is the fetal cell-free nucleic acid concentration. For female fetal samples, existing methods for calculating fetal concentration based on low-depth sequencing data from maternal plasma can be used.
[0088] According to embodiments of the present invention, the pregnancy status includes the labor phase of the pregnant woman. The device according to embodiments of the present invention can predict the probability of preterm birth, gestational age at delivery, intrauterine growth restriction, and other pregnancy complications correlated with fetal cell-free nucleic acid concentration.
[0089] According to an embodiment of the present invention, the gestational age for sampling is 13-25 weeks. The inventors found that the correlation between fetal concentration and preterm birth is weak when the gestational age for blood sampling is less than or equal to 12 weeks and between 26-30 weeks, while the correlation is strong when the gestational age for blood sampling is between 13 and 25 weeks. Typically, using fetal cell-free nucleic acid concentration to predict the gestational status of pregnant women has problems such as weak correlation. The device according to the present invention incorporates the gestational age for sampling as one of the parameters for constructing the prediction model, improving the accuracy of the prediction. Furthermore, pregnant women only need to undergo one blood sample within the 13-25 week gestational period for prediction, reducing the cost and risk of multiple blood samplings.
[0090] According to embodiments of the present invention, the predetermined prediction model is at least one of a linear regression model, a logistic regression model, and a random forest. According to the apparatus of the present invention, the prediction model can theoretically be any statistical model that generalizes to different differential distributions.
[0091] According to an embodiment of the present invention, the predetermined parameters further include the pregnant woman's height, weight, and age, and the prediction model is adapted to calculate the pregnant woman's delivery range based on the following formula:
[0092] $$l = \beta_0 + \beta_{cff}\,x_{cff} + \beta_{sample}\,x_{sample} + \beta_{height}\,x_{height} + \beta_{weight}\,x_{weight} + \beta_{age}\,x_{age} + \varepsilon$$ where l is a parameter determined based on the probability of premature birth in the pregnant woman; $\beta_0$, $\beta_{cff}$, $\beta_{sample}$, $\beta_{height}$, $\beta_{weight}$ and $\beta_{age}$ are each independently predetermined coefficients; $x_{cff}$ represents the concentration of cell-free fetal nucleic acid in the pregnant woman; $x_{sample}$ represents the gestational age at which the blood was collected from the pregnant woman; $x_{height}$ represents the height of the pregnant woman; $x_{weight}$ represents the pregnant woman's weight; $x_{age}$ represents the pregnant woman's age; and ε represents the sequencing error of the pregnant woman's peripheral blood sample. According to an embodiment of the present invention, the coefficients $\beta_0$, $\beta_{cff}$, $\beta_{sample}$, $\beta_{height}$ and $\beta_{weight}$ can be freely selected as needed; for example, a coefficient for the pregnant woman's BMI can be added.
[0093] According to an embodiment of the present invention, l is determined based on the following formula: $$l = \log_b \frac{p}{1-p}$$ where b is the base of the logarithm, usually taken as the constant e, and p is the probability of premature birth in the pregnant woman.
[0094] In a fifth aspect of the invention, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps for constructing a predictive model as described above. Thus, the method for constructing a predictive model described above can be effectively implemented, thereby enabling the efficient construction of a predictive model, which can then be used to predict unknown samples to determine the gestational status of a pregnant woman to be tested.
[0095] In a sixth aspect, the present invention provides an electronic device comprising the computer-readable storage medium and one or more processors for executing a program in the computer-readable storage medium.
[0096] The present invention will be further explained and described below with reference to specific embodiments. Unless otherwise specified, the experimental methods used in the following embodiments are conventional methods. Unless otherwise specified, the materials and reagents used in the following embodiments are commercially available.
[0097] The present invention will be explained below with reference to embodiments. Those skilled in the art will understand that the following embodiments are for illustrative purposes only and should not be considered as limiting the scope of the invention. Where specific techniques or conditions are not specified in the embodiments, they shall be performed in accordance with the techniques or conditions described in the literature in the art (e.g., refer to J. Sambrook et al., *Molecular Cloning: A Laboratory Manual*, 3rd edition, Science Press, translated by Huang Peitang et al.) or according to the product instructions. Reagents or instruments whose manufacturers are not specified are all commercially available conventional products, such as those purchased from Illumina.
[0098] Example 1: Establishment and Application of Preterm Birth and Gestational Age Prediction Model
[0099] The 38,964 samples were categorized according to gestational week at blood collection, and the correlation between plasma fetal cfDNA concentration and preterm birth was calculated (see Figure 1). Statistical analysis revealed that the correlation between fetal concentration and preterm birth varied with the gestational week at blood collection. The correlation was weaker when the gestational week at blood collection was less than or equal to 12 weeks or between 26 and 30 weeks, and stronger when it was between 13 and 25 weeks.
[0100] The training set was composed of plasma cfDNA data from 38,964 pregnant women, combined with information on gestational age, maternal age, height, and weight.
[0101] (1) In the prediction of gestational age at delivery, gestational age at delivery is treated as a continuous variable, and a linear regression model is established.
[0102] Specifically, using gestational age at delivery as the Y-value, and incorporating fetal cfDNA concentration, gestational age at blood collection, maternal height, weight, age, and BMI as covariates, a predictive model was established:
[0103] $$y_i = \beta_0 + \beta_{icff}\,x_{icff} + \beta_{isample}\,x_{isample} + \beta_{iheight}\,x_{iheight} + \beta_{iweight}\,x_{iweight} + \beta_{iage}\,x_{iage} + \varepsilon_i, \quad i = 1, \ldots, p$$
[0104] Here, $y_i$ is the gestational week at delivery corresponding to sample i; $x_{icff}$ is the fetal cfDNA concentration corresponding to sample i; $x_{isample}$ is the gestational week at blood collection corresponding to sample i; $x_{iheight}$ is the height of the pregnant woman corresponding to sample i; $x_{iweight}$ is the weight of the pregnant woman corresponding to sample i; $x_{iage}$ is the age of the pregnant woman corresponding to sample i; $x_{ibmi}$ is the BMI of the pregnant woman corresponding to sample i; and p is the total number of samples in the training set, where p = 38964.
[0105] The estimated values of the coefficients β for the different variables in the final prediction model are shown in the gestational age at delivery column of Table 1.
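A linear regression of this form can be fitted by ordinary least squares. The following is a minimal, dependency-free sketch that solves the normal equations by Gaussian elimination; the function names are hypothetical, the approach is a textbook method chosen for illustration rather than the inventors' stated procedure, and a numerical library would normally be preferred for a data set of this size.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_ols(X, y):
    """Least-squares coefficients [beta_0, beta_1, ...].

    X: list of covariate tuples (e.g. fetal cfDNA concentration,
       sampling week, height, weight, age); y: gestational week at delivery.
    Solves (X^T X) beta = X^T y after prepending an intercept column.
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


def predict(beta, x):
    """Predicted response for one covariate tuple x."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))
```

Note that height, weight, and BMI are strongly related, so including all three can make the normal equations poorly conditioned.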
[0106] (2) In the prediction of preterm birth, the preterm birth event is defined as Y=0 and the full-term event is defined as Y=1, and a logistic regression model is established.
[0107] Specifically, the probability of a sample being full-term is set as p = P(Y = 1), so that the probability of preterm birth is 1 − p = P(Y = 0). This probability p is then transformed using the log-odds transformation, i.e.
[0108] $$l = \log_b \frac{p}{1-p}$$
[0109] Where b is the base of the logarithm, which is usually taken as a constant e.
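As a worked illustration, assuming the log-odds form $l = \log_b \frac{p}{1-p}$ with $b = e$ (the form implied by the term "log-odds transformation"), the transformation and its inverse can be written out as follows. The value p = 0.9 is an arbitrary example and not a result from the study.

```latex
\begin{align*}
l &= \ln\frac{p}{1-p} \\
e^{l} &= \frac{p}{1-p} \\
p &= \frac{e^{l}}{1+e^{l}} = \frac{1}{1+e^{-l}} \\
\text{Example: } p = 0.9 \;\Rightarrow\; l &= \ln\frac{0.9}{0.1} = \ln 9 \approx 2.197
\end{align*}
```

The inverse form is what allows a fitted value of l to be read back as a probability between 0 and 1.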
[0110] Substituting the transformed l into the linear regression model, and similarly using fetal cfDNA concentration, gestational age at blood collection, maternal height, weight, and age as covariates, a predictive model was established:
[0111] Specifically, using gestational age at delivery as the Y-value, and incorporating fetal cfDNA concentration, gestational age at blood collection, maternal height, weight, age, and BMI as covariates, a predictive model was established:
[0112] $$l_i = \beta_0 + \beta_{icff}\,x_{icff} + \beta_{isample}\,x_{isample} + \beta_{iheight}\,x_{iheight} + \beta_{iweight}\,x_{iweight} + \beta_{iage}\,x_{iage} + \varepsilon_i, \quad i = 1, \ldots, p$$
[0113] Here, $l_i$ is the result of the log-odds (logistic) transformation of the delivery gestational week corresponding to sample i; $x_{icff}$ is the fetal cfDNA concentration corresponding to sample i; $x_{isample}$ is the gestational week at blood collection corresponding to sample i; $x_{iheight}$ is the height of the pregnant woman corresponding to sample i; $x_{iweight}$ is the weight of the pregnant woman corresponding to sample i; $x_{iage}$ is the age of the pregnant woman corresponding to sample i; $x_{ibmi}$ is the BMI of the pregnant woman corresponding to sample i; and p is the total number of samples in the training set, where p = 38964.
[0114] The estimated values of the coefficient β for different variables in the final prediction model are shown in the preterm birth column of Table 1.
[0115] Table 1. Statistical results of maternal phenotype-related data in the regression models of gestational age at delivery and preterm birth.
[0116]
[0117] After obtaining the predictive models for preterm birth and gestational age at delivery, an additional 32,049 samples were used as a test set. The fetal concentration, gestational age at blood collection, maternal age, height, weight, and BMI of each sample were respectively substituted into the linear regression model to predict gestational age at delivery and into the logistic regression model to predict preterm birth.
[0118] The accuracy of the final preterm birth prediction results is shown in Figure 2, and the distribution of predicted gestational age at delivery versus actual gestational age at delivery is shown in Figure 3. The preterm birth prediction results were significantly correlated with the actual results, with a correlation of -0.13. The probability threshold for filtering can be determined according to the sensitivity and specificity requirements of the actual scenario. The correlation between the predicted gestational age at delivery and the actual gestational age at delivery reached 0.12.
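The two evaluation quantities mentioned here, a correlation between predicted and actual values and a probability threshold chosen by sensitivity and specificity, can be sketched as follows. The function names are hypothetical, Pearson correlation is assumed because the text does not name the correlation measure, and the label encoding (1 = preterm, 0 = full-term) and the rule "positive when the predicted probability is at or above the threshold" are assumptions of this sketch.

```python
import math


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def sensitivity_specificity(probs, labels, threshold):
    """Sensitivity and specificity at a given probability threshold.

    probs: predicted probabilities of preterm birth.
    labels: 1 = preterm, 0 = full-term (assumed encoding).
    A sample is called positive when its probability >= threshold.
    """
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 0)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return tp / (tp + fn), tn / (tn + fp)
```

Sweeping the threshold over a grid and inspecting the resulting pairs is one simple way to pick an operating point for a given scenario.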
[0119] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
[0120] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.
Claims
1. A method for constructing a predictive model, characterized in that the prediction model is used to determine the pregnancy status of a pregnant woman, the method including: (i) constructing a training set and an optional validation set, wherein the training set and the optional validation set are each composed of multiple pregnant women samples, and the pregnant women samples have known pregnancy status; (ii) for each of the pregnant women samples in the training set, determining predetermined parameters for the pregnant woman sample, the predetermined parameters including the concentration of cell-free fetal nucleic acid in the pregnant woman's peripheral blood and the gestational age at which the pregnant woman's peripheral blood was sampled; and (iii) constructing the prediction model based on the known pregnancy status and the predetermined parameters; wherein the prediction model is at least one of linear regression, logistic regression, and random forest, and the gestational age for sampling is 13-25 weeks.
2. The method according to claim 1, characterized in that the pregnancy status includes the pregnant woman's delivery zone.
3. The method according to claim 1, characterized in that the predetermined parameters further include the pregnant woman's height, weight, and/or age.
4. The method according to claim 1, characterized in that, in step (iii), the following formula is applied: $$l_i = \beta_0 + \beta_{icff}\,x_{icff} + \beta_{isample}\,x_{isample} + \beta_{iheight}\,x_{iheight} + \beta_{iweight}\,x_{iweight} + \beta_{iage}\,x_{iage} + \varepsilon_i, \quad i = 1, \ldots, p$$ Using the training set and the optional validation set, the values of $\beta_0$, $\beta_{icff}$, $\beta_{isample}$, $\beta_{iheight}$, $\beta_{iweight}$, $\beta_{iage}$ and $\varepsilon_i$ are determined, where i represents the number of the pregnant woman sample in the training set; $l_i$ is a value determined based on the known gestational status of the i-th pregnant woman sample, wherein $l_i$ = 1 for a preterm birth sample and $l_i$ = 0 for a full-term sample; $x_{icff}$ represents the concentration of cell-free fetal nucleic acid in the i-th pregnant woman sample; $x_{isample}$ represents the gestational age at which blood was collected from the i-th pregnant woman sample; $x_{iheight}$ represents the height of the i-th pregnant woman sample; $x_{iweight}$ represents the weight of the i-th pregnant woman sample; $x_{iage}$ represents the age of the i-th pregnant woman sample; and $\varepsilon_i$ represents the sequencing error of the peripheral blood sample from the i-th pregnant woman sample.
5. A system for constructing a predictive model, characterized in that the prediction model is used to determine the pregnancy status of a pregnant woman, the system including: a training set construction module, wherein the training set consists of multiple pregnant women samples, and the pregnant women samples have known pregnancy status; a predetermined parameter determination module, connected to the training set construction module and used to determine predetermined parameters for each pregnant woman sample in the training set, the predetermined parameters including the concentration of cell-free fetal nucleic acid in the pregnant woman's peripheral blood and the gestational age at which the peripheral blood was sampled; and a prediction model construction module, connected to the predetermined parameter determination module and used to construct the prediction model based on the known pregnancy status and the predetermined parameters; wherein the prediction model is at least one of linear regression, logistic regression, and random forest, and the gestational age for sampling is 13-25 weeks.
6. The system according to claim 5, characterized in that the pregnancy status includes the pregnant woman's delivery zone.
7. The system according to claim 5, wherein the predetermined parameters further include the pregnant woman's height, weight, and age.
8. The system according to claim 5, characterized in that the prediction model construction module is configured for the following formula: $$l_i = \beta_0 + \beta_{icff}\,x_{icff} + \beta_{isample}\,x_{isample} + \beta_{iheight}\,x_{iheight} + \beta_{iweight}\,x_{iweight} + \beta_{iage}\,x_{iage} + \varepsilon_i, \quad i = 1, \ldots, p$$ Using the training and validation sets, the values of $\beta_0$, $\beta_{icff}$, $\beta_{isample}$, $\beta_{iheight}$, $\beta_{iweight}$, $\beta_{iage}$ and $\varepsilon_i$ are determined, where i represents the number of the pregnant woman sample in the training set; $l_i$ is a value determined based on the known gestational status of the i-th pregnant woman sample, wherein $l_i$ = 1 for a preterm birth sample and $l_i$ = 0 for a full-term sample; $x_{icff}$ represents the concentration of cell-free fetal nucleic acid in the i-th pregnant woman sample; $x_{isample}$ represents the gestational age at which blood was collected from the i-th pregnant woman sample; $x_{iheight}$ represents the height of the i-th pregnant woman sample; $x_{iweight}$ represents the weight of the i-th pregnant woman sample; $x_{iage}$ represents the age of the i-th pregnant woman sample; and $\varepsilon_i$ represents the sequencing error of the peripheral blood sample from the i-th pregnant woman sample.
9. A device for determining the pregnancy status of a pregnant woman, characterized in that it includes: a parameter determination module, used to determine predetermined parameters of the pregnant woman, the predetermined parameters including the concentration of fetal cell-free nucleic acid in the pregnant woman's peripheral blood and the gestational age at which the pregnant woman's peripheral blood was sampled; and a pregnancy status determination module, connected to the parameter determination module and used to determine the pregnancy status of the pregnant woman based on the predetermined parameters and the prediction model constructed by the method described in any one of claims 1 to 4 or the system described in any one of claims 5 to 8; wherein the prediction model is at least one of linear regression, logistic regression, and random forest, and the gestational age for sampling is 13-25 weeks.
10. The apparatus according to claim 9, characterized in that the pregnancy status includes the pregnant woman's delivery zone.
11. The apparatus according to claim 9, characterized in that the predetermined parameters further include the pregnant woman's height, weight, and/or age, and the prediction model is adapted to calculate the pregnant woman's delivery range based on the following formula: $$l = \beta_0 + \beta_{cff}\,x_{cff} + \beta_{sample}\,x_{sample} + \beta_{height}\,x_{height} + \beta_{weight}\,x_{weight} + \beta_{age}\,x_{age} + \varepsilon$$ where l is a parameter determined based on the probability of premature birth in the pregnant woman; $\beta_0$, $\beta_{cff}$, $\beta_{sample}$, $\beta_{height}$, $\beta_{weight}$ and $\beta_{age}$ are each independently predetermined coefficients; $x_{cff}$ is the concentration of cell-free fetal nucleic acid in the pregnant woman; $x_{sample}$ is the gestational age at which the blood was drawn from the pregnant woman; $x_{height}$ is the height of the pregnant woman; $x_{weight}$ is the weight of the pregnant woman; $x_{age}$ is the pregnant woman's age; and ε is the sequencing error of the pregnant woman's peripheral blood sample.
12. The apparatus according to claim 9, characterized in that l is determined based on the following formula: $$l = \log_b \frac{p}{1-p}$$ where b is the base of the logarithm, which is usually taken as the constant e, and p represents the probability that the pregnant woman will have a premature birth.
13. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the program implements the steps of the method described in any one of claims 1-4.
14. An electronic device, characterized in that it includes: the computer-readable storage medium as described in claim 13; and one or more processors for executing a program in the computer-readable storage medium.