An ischemic stroke risk assessment model fusing environmental exposomics and metabolomics biomarkers and a model construction method
By integrating environmental exposomics and metabolomics biomarkers into an ischemic stroke risk assessment model, key pollutants are identified and their impact on stroke risk is quantified. This addresses the inability of existing approaches to adequately reflect the co-exposure patterns of multiple pollutants, enabling more accurate risk assessment and prevention strategies.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA INST OF ENVIRONMENTAL SCI MEP
- Filing Date
- 2026-04-14
- Publication Date
- 2026-06-19
AI Technical Summary
Existing research is insufficient to fully reflect the complex patterns of co-exposure to multiple pollutants in the real environment, and the specific biological pathways by which pollutant exposure leads to stroke are not yet clear, which limits the development of targeted prevention strategies.
By integrating environmental exposomics and metabolomics biomarkers, a risk assessment model for ischemic stroke was constructed. Machine learning models were used to identify key pollutants and quantify their impact on stroke risk. Training and validation sample sets were established by combining urine and plasma biomarkers, and the primary diagnostic model with the highest accuracy was selected as the final prediction model.
The invention systematically evaluates the association between exposure to mixed organic pollutants and the risk of ischemic stroke, reveals the mediating role of metabolic pathway disorders in pollutant-induced stroke risk, provides a new tool for precise disease prevention, and improves the comprehensiveness and accuracy of risk assessment.
Figure CN122050846B_ABST
Description
Technical Field
[0001] This invention belongs to the field of environmental biomonitoring and risk prediction, and specifically relates to an ischemic stroke risk assessment model that integrates environmental exposomics and metabolomics biomarkers, and to a method for constructing the model.
Background Technology
[0002] Stroke is the leading cause of disability and death in adults worldwide, with ischemic stroke accounting for the majority of new cases. Given its heavy disease burden, ischemic stroke has become a key target for non-communicable disease prevention and control. Although traditional risk factors such as hypertension, obesity, abnormal glucose and lipid metabolism, and unhealthy behaviors are widely recognized and can prevent stroke to some extent through intervention, a significant portion of the risk cannot be fully explained by these traditional factors. Therefore, identifying and quantifying novel risk factors outside the traditional risk factor framework has become a key scientific issue for improving the effectiveness of primary stroke prevention and achieving personalized and accurate risk assessment.
[0003] Recent studies have shown that environmental exposure contributes significantly to the burden of chronic diseases, with chemical pollutant-related deaths accounting for a large proportion of global mortality, surpassing the impact of genetic factors. The view that environmental pollution is a major contributing factor to ischemic stroke is gaining increasing support. Epidemiological evidence shows that ambient air pollutants (such as fine particulate matter and nitrogen dioxide) are clearly independent risk factors for stroke morbidity, hospitalization, and death. Meanwhile, the association between exposure to organic pollutants (including volatile organic compounds, polycyclic aromatic hydrocarbons, perfluoroalkyl substances, polychlorinated biphenyls, and bisphenols) and stroke risk is also receiving increasing attention, with related population studies accumulating evidence. Living near persistent organic pollutant sources has been proven to increase the risk of ischemic stroke. However, existing studies are mostly limited to evaluating the effects of single or a few pollutants, failing to truly reflect the complex patterns of co-exposure to multiple pollutants in the actual environment; more importantly, the specific biological pathways by which pollutant exposure leads to stroke are not yet clear, hindering the development of targeted prevention strategies.
[0004] In real life, long-term, low-level co-exposure to multiple pollutants can synergistically produce adverse health effects, even if the concentration of each pollutant is below its individual safety limit. However, the current chemical risk assessment and regulatory system is mainly based on single compounds, leaving a significant gap in the assessment of the health risks of complex mixtures.
Summary of the Invention
[0005] Based on the screening and identification of environmental exposure omics and metabolomics biomarkers associated with ischemic stroke, this invention provides an ischemic stroke risk assessment model and model construction method that integrates environmental exposure omics and metabolomics biomarkers, improving the comprehensiveness, accuracy and convenience of prediction.
[0006] The objective of this invention is achieved through the following technical solution:
[0007] Biomarkers of environmental exposure associated with ischemic stroke mainly include mono(2-ethyl-5-hydroxyhexyl) phthalate (MEHHP), N-acetyl-S-(3-hydroxypropyl)-L-cysteine (3-HPMA), D-neopterin (D-Neop), 2-hydroxynaphthalene (2-OHNap), 2,4,6-tribromophenol (2,4,6-Tri-BP), 3-hydroxyfluorene (3-OHFlu), trifluoroacetic acid (TFA), trifluoromethanesulfonic acid (TFMS), 1-hydroxypyrene (1-OHPyr), and mandelic acid (MA).
[0008] Plasma metabolomic biomarkers associated with ischemic stroke mainly include phenylethylamine, L-(+)-leucine, N,N-dimethylarginine, dodecylcarnitine, (9Z)-9-octadecenamide, palmitamide (hexadecamide), isopentenyl adenosine, gallic acid, γ-glutamyl alanine and pregabalin.
[0009] Biomarkers for ischemic stroke risk that integrate environmental exposomics and metabolomics mainly include phenethylamine, L-(+)-leucine, dodecanoic acid, N,N-dimethylarginine, taurine, (9Z)-9-octadecenamide, linolenic acid, 3-hydroxyfluorene, caffeine, and mono(2-ethyl-5-hydroxyhexyl) phthalate.
[0010] A method for constructing an ischemic stroke risk assessment model includes the following steps:
[0011] S1: Obtain multiple biomarkers in urine and plasma from the case group and the healthy control group;
[0012] The biomarkers are integrated environmental exposomics and metabolomics biomarkers, including phenethylamine, L-(+)-leucine, dodecanoic acid, N,N-dimethylarginine, taurine, (9Z)-9-octadecenamide, linolenic acid, 3-hydroxyfluorene, caffeine and mono(2-ethyl-5-hydroxyhexyl) phthalate;
[0013] Alternatively, the biomarkers may be environmental exposome biomarkers or plasma metabolomics biomarkers;
[0014] The environmental exposure biomarkers include mono(2-ethyl-5-hydroxyhexyl) phthalate (MEHHP), N-acetyl-S-(3-hydroxypropyl)-L-cysteine (3-HPMA), D-neopterin (D-Neop), 2-hydroxynaphthalene (2-OHNap), 2,4,6-tribromophenol (2,4,6-Tri-BP), 3-hydroxyfluorene (3-OHFlu), trifluoroacetic acid (TFA), trifluoromethanesulfonic acid (TFMS), 1-hydroxypyrene (1-OHPyr), and mandelic acid (MA);
[0015] The plasma metabolomics biomarkers mentioned include phenylethylamine, L-(+)-leucine, N,N-dimethylarginine, dodecylcarnitine, (9Z)-9-octadecenamide, palmitamide (hexadecamide), isopentenyl adenosine, gallic acid, γ-glutamyl alanine and pregabalin;
[0016] S2: Establish training and validation sample sets based on various biomarkers; (the training samples in each training sample set belong to all biomarkers measured in urine and plasma)
[0017] S3: Train machine learning models using the training sample sets corresponding to the various biomarkers to obtain multiple primary diagnostic models; during training, the biomarker concentrations are used as the input features of the machine learning model, and the disease status of the subjects from whom the biomarkers were collected is used as the label; the disease status is either diseased or non-diseased;
[0018] The machine learning model mentioned can be a random forest model, support vector machine, neural network model, or others;
[0019] S4: For various primary diagnostic models, the accuracy is verified using validation sample sets corresponding to various biomarkers;
[0020] S5: Select the primary diagnostic model with the highest accuracy as the final exposure risk prediction model.
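Steps S2 to S5 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions and not the implementation of the invention: a nearest-centroid classifier stands in for the random forest, support vector machine, or neural network named above, the data are synthetic, and the names `train_nearest_centroid`, `select_best_model`, and the three panel names are hypothetical.

```python
import random
from statistics import mean


def train_nearest_centroid(X, y):
    """Fit a nearest-centroid classifier (a simple stand-in for the models in S3)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def predict(x):
        # Assign the label whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )

    return predict


def accuracy(predict, X, y):
    """Fraction of validation samples whose predicted label matches the true label."""
    return sum(predict(x) == lab for x, lab in zip(X, y)) / len(y)


def select_best_model(panels, labels, train_idx, valid_idx, trainer=train_nearest_centroid):
    """S2-S5: train one primary model per biomarker panel, keep the most accurate."""
    scores = {}
    for name, X in panels.items():
        model = trainer([X[i] for i in train_idx], [labels[i] for i in train_idx])
        scores[name] = accuracy(
            model, [X[i] for i in valid_idx], [labels[i] for i in valid_idx]
        )
    best = max(scores, key=scores.get)
    return best, scores


# Synthetic demonstration data (hypothetical; not taken from the study).
rng = random.Random(0)
labels = [(i // 2) % 2 for i in range(40)]  # 0 = non-diseased, 1 = diseased
exposome = [[rng.gauss(0.5 * lab, 1.0) for _ in range(2)] for lab in labels]
metabolome = [[rng.gauss(1.0 * lab, 1.0) for _ in range(2)] for lab in labels]
panels = {
    "exposome": exposome,
    "metabolome": metabolome,
    "integrated": [e + m for e, m in zip(exposome, metabolome)],
}
train_idx = list(range(0, 40, 2))  # S2: training sample set
valid_idx = list(range(1, 40, 2))  # S2: validation sample set
best_panel, validation_accuracy = select_best_model(panels, labels, train_idx, valid_idx)
```

Any trainer that takes feature rows and labels and returns a prediction function could be passed as `trainer`, so the selection logic stays the same when the stand-in classifier is replaced.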
[0021] A device for an ischemic stroke risk assessment model, comprising:
[0022] The acquisition module is used to acquire multiple biomarkers in urine and plasma;
[0023] The biomarkers are integrated environmental exposomics and metabolomics biomarkers, including phenethylamine, L-(+)-leucine, dodecanoic acid, N,N-dimethylarginine, taurine, (9Z)-9-octadecenamide, linolenic acid, 3-hydroxyfluorene, caffeine and mono(2-ethyl-5-hydroxyhexyl) phthalate;
[0024] Alternatively, the biomarkers may be environmental exposome biomarkers or plasma metabolomics biomarkers;
[0025] The environmental exposure biomarkers include mono(2-ethyl-5-hydroxyhexyl) phthalate (MEHHP), N-acetyl-S-(3-hydroxypropyl)-L-cysteine (3-HPMA), D-neopterin (D-Neop), 2-hydroxynaphthalene (2-OHNap), 2,4,6-tribromophenol (2,4,6-Tri-BP), 3-hydroxyfluorene (3-OHFlu), trifluoroacetic acid (TFA), trifluoromethanesulfonic acid (TFMS), 1-hydroxypyrene (1-OHPyr), and mandelic acid (MA);
[0026] The plasma metabolomics biomarkers mentioned include phenylethylamine, L-(+)-leucine, N,N-dimethylarginine, dodecylcarnitine, (9Z)-9-octadecenamide, palmitamide (hexadecamide), isopentenyl adenosine, gallic acid, γ-glutamyl alanine and pregabalin;
[0027] The sample construction module is used to build training and validation sample sets based on various biomarkers.
[0028] The training module trains machine learning models using the training sample sets corresponding to the various biomarkers to obtain multiple primary diagnostic models. During training, the biomarker concentrations are used as the input features of the machine learning model, and the disease status of the subjects from whom the biomarkers were collected is used as the label. The disease status is either diseased or non-diseased.
[0029] The machine learning model mentioned can be a random forest model, support vector machine, neural network model, or others;
[0030] Validation module: For various primary diagnostic models, the accuracy is verified using validation sample sets corresponding to various biomarkers;
[0031] Selection module: selects the primary diagnostic model with the highest accuracy as the final exposure risk prediction model.
[0032] The present invention has the following advantages and effects compared with the prior art:
[0033] 1. This invention systematically evaluates the association between exposure to mixed organic pollutants and the risk of ischemic stroke, identifies key pollutants, quantifies the impact of pollutant mixtures on the systemic metabolic network and stroke risk, reveals the mediating role of metabolic pathway disorders in pollutant-induced stroke risk, and explores the potential of incorporating pollutant exposure information into clinical prediction models to improve the effectiveness of ischemic stroke risk stratification, providing new technical tools for precise disease prevention.
[0034] 2. This invention integrates environmental exposure biomarkers into disease risk assessment, providing a new paradigm for elucidating the relationship between mixed pollutant exposures and specific diseases. Metabolomics can also reveal, in an unbiased manner, the downstream biomolecular perturbations induced by exposure, thus strongly supporting exposomics research. Metabolomics serves as a crucial bridge connecting the exposome and disease endpoints, and integrating high-throughput multi-omics technologies holds promise for comprehensively revealing the overall picture and molecular mechanisms by which environmental pollutants affect health.
Attached Figure Description
[0035] Figure 1 is a typical chromatogram of a standard solution, showing the simultaneous determination of the urinary organic pollutant exposome and oxidative damage markers.
[0036] Figure 2 shows the association between urinary organic pollutants, oxidative damage, and the risk of ischemic stroke. A: comparison of urinary pollutant exposure and oxidative damage levels between the case and control groups, expressed as fold changes after creatinine correction. B: association between each indicator and the risk of ischemic stroke obtained by logistic regression analysis. ns: not statistically significant; *p<0.05; **p<0.01.
[0037] Figure 3 shows the key pollutants contributing to ischemic stroke risk, identified with the Bayesian kernel machine regression (BKMR) model.
[0038] Figure 4 shows the conditional posterior inclusion probability of each pollutant, calculated with the BKMR model.
[0039] Figure 5 shows the combined effect of pollutants associated with an increased risk of ischemic stroke, characterized by the weighted quantile sum (WQS) regression model.
[0040] Figure 6 shows the importance of demographic, clinical, and environmental exposure factors for the risk of ischemic stroke, assessed with a random forest classifier.
[0041] Figure 7 is a principal component analysis (PCA) score plot of the study samples (blue) and quality control samples (red) from the metabolomics data.
[0042] Figure 8 shows the metabolic features identified in the metabolomics analysis and the distribution of their retention times and mass-to-charge ratios.
[0043] Figure 9 shows the differences in metabolic profiles between the control group and the ischemic stroke group, based on orthogonal partial least squares discriminant analysis.
[0044] Figure 10 is a volcano plot showing the changes in plasma metabolite levels in patients with ischemic stroke (colored dots represent metabolites that meet the following thresholds: fold change >1.1 or <0.9, false discovery rate corrected p-value <0.05, and variable importance in projection value >1.0).
[0045] Figure 11 is a chemical classification of the significantly differential metabolites associated with ischemic stroke identified in the metabolomics analysis (only major categories are shown); OHCs: organic heterocyclic compounds; OADs: organic acids and their derivatives; LLLMs: lipids and lipid-like molecules; BZDs: benzene compounds; OOCs: organic oxygen compounds; the same applies below.
[0046] Figure 12 shows the explanatory power of urinary organic pollutants for plasma metabolome variability (blue) and ischemic stroke severity (purple), assessed by hierarchical partitioning and variance partitioning analysis based on redundancy analysis.
[0047] Figure 13 shows, by partial least squares discriminant analysis, how the metabolic profile changes across the quartile levels of different pollutants.
[0048] Figure 14 shows the metabolome-wide association analysis with the urinary exposome and with the severity of ischemic stroke (NIHSS); NS: non-significant compounds; Others: other categories.
[0049] Figure 15 is a Venn diagram of the metabolites significantly associated with both the exposome and the severity of ischemic stroke (NIHSS).
[0050] Figure 16 is a polar heatmap with a dendrogram, showing the β coefficients of metabolites significantly associated with both the exposome and the severity of ischemic stroke in the metabolomics analysis.
[0051] Figure 17 is a structural equation model showing the mediation analysis of the metabolome in the relationship between the urinary exposome and the severity of ischemic stroke (NIHSS).
[0052] Figure 18 shows the predictive accuracy of the ischemic stroke prediction models built using exposome and metabolome data.
[0053] Figure 19 shows the top ten features in each of the predictive sub-models for ischemic stroke.
Detailed Implementation
[0054] The present invention will be further described in detail below with reference to the embodiments and accompanying drawings, but the embodiments of the present invention are not limited thereto.
[0055] Example
[0056] Screening and identification of biomarkers for ischemic stroke integrating environmental exposomics and metabolomics, and a risk assessment model and model construction method for ischemic stroke.
[0057] (1) The research protocol of this invention was reviewed and approved by the Research Ethics Committee of the Third People's Hospital of Huizhou City (Approval No.: 2024-KY-022-01) and the Clinical Research Ethics Committee of the Affiliated Hospital of Guangdong Medical University (Approval No.: PJKT2025-124). From 2024 to 2025, a total of 510 eligible participants were enrolled in two prefecture-level cities (prefecture-level cities A and B) to form the research cohort. Of these, 164 participants were enrolled in prefecture-level city A (131 patients with ischemic stroke and 33 healthy controls), and 346 participants were enrolled in prefecture-level city B (234 patients with ischemic stroke and 112 healthy controls). Herein, the ischemic stroke patient group is also referred to as the case group.
[0058] Inclusion criteria for patients with ischemic stroke included: 1) diagnosis by magnetic resonance imaging or other relevant imaging techniques; 2) enrollment within 24 hours of onset; and 3) age between 40 and 80 years. Exclusion criteria included: severe cardiac, pulmonary, hepatic, or renal insufficiency; confirmed malignancy; active infection; or inability to provide complete biological samples.
[0059] Healthy controls must be confirmed to have no cerebral infarction lesions by magnetic resonance imaging and / or computed tomography, and be matched to the patient group in age.
[0060] After signing a written informed consent form, all participants completed a standardized questionnaire on demographic characteristics, environmental exposure information, and family medical history. Biosample collection included 10 mL of urine, 2 mL of serum, and 2 mL of plasma. Samples were collected the day after admission, after an overnight fast. Urine samples were immediately stored at -20°C; plasma samples were aliquoted and frozen at -80°C until subsequent analysis. Urine samples were used for environmental pollutant exposome analysis, serum for clinical biochemical marker detection, and plasma for metabolomics analysis.
[0061] Stroke severity was assessed using the National Institutes of Health Stroke Scale (NIHSS), a standardized neurological function assessment tool comprising 15 items that has been widely clinically validated. The NIHSS assessment includes aspects such as level of consciousness, cranial nerve function, motor function, sensation, coordination, language, and attention. The total score ranges from 0 to 42 points, with higher scores indicating more severe neurological deficits.
[0062] Based on prior evidence, this invention selected ten physiological indicators related to the risk of ischemic stroke for measurement, including eight serum biochemical indicators and two blood pressure indicators. All subjects fasted for at least 12 hours, and venous blood samples were collected in the morning by trained blood collection personnel. The eight serum biochemical indicators, including fasting blood glucose (GLU), serum uric acid (SUA), total cholesterol (CHOL), triglycerides (TG), high-density lipoprotein cholesterol (HDL), low-density lipoprotein cholesterol (LDL), homocysteine (HCY), and glycated hemoglobin (HbA1c), were all measured on a fully automated biochemical analyzer, and the operation process strictly followed the standardized procedures provided by the reagent manufacturer.
[0063] Systolic blood pressure (SBP) and diastolic blood pressure (DBP) were measured according to guidelines published by the American Heart Association and performed using standardized operating procedures to ensure the reliability and comparability of the data.
[0064] The baseline characteristics of the participants are shown in Table 1. The case group and the control group were well matched in terms of sex, body mass index, smoking and alcohol consumption status, and family history of cardiovascular and cerebrovascular diseases. The mean age of the case group was significantly lower than that of the control group, and their education level was also lower. Regarding physiological risk indicators, the case group exhibited more significant cerebrovascular risk characteristics, manifested by significantly elevated fasting blood glucose, diastolic blood pressure, and glycated hemoglobin levels, while significantly decreased high-density lipoprotein cholesterol levels (p<0.05).
[0065]
[0066] (2) Ultra-high performance liquid chromatography-quadrupole-Orbitrap high-resolution mass spectrometry (UPLC-Orbitrap-HRMS) was used for targeted screening of more than 200 organic pollutants and their metabolites in the urine of the case group and the healthy control group:
[0067] The solid-phase extraction pretreatment procedure for urine samples is as described in “Targeted High-Throughput Quantitative Method 2” (paragraph 46 of the specification) of CN121253735A.
[0068] Analytical procedure: Chromatographic separation was performed on a Hypersil GOLD C18 column (150 × 2.1 mm, 1.9 μm) at a column temperature of 35°C. The mobile phase consisted of water containing 0.05% acetic acid (A) and methanol (B) at a flow rate of 0.25 mL / min. The gradient elution program was as follows: 0–1.5 min, 5–10% B; 1.5–4 min, 10–45% B; 4–19 min, 45–78% B; 19–21 min, 78–100% B; 21–25 min, 100% B; 25.01–28.5 min, 5% B. Mass spectrometric detection was performed in negative electrospray ionization mode with a spray voltage of –3.2 kV, an ion transfer tube temperature of 320°C, and a vaporizer temperature of 350°C. The sheath gas, auxiliary gas, and sweep gas flow rates were set to 45, 8, and 1 arbitrary units, respectively. Data were acquired in targeted selected ion monitoring mode with an Orbitrap mass resolution of 45,000 and an RF lens setting of 70%. All analytes except 8-OHdG and 8-OHG were well separated chromatographically (Figure 1), and the internal standard method was used for quantification.
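The gradient program for the urine method can be written as a breakpoint table. The sketch below returns the percentage of mobile phase B at a given time; it assumes linear ramps between the stated endpoints (the text gives only the endpoints of each segment), and the names `GRADIENT_BREAKPOINTS` and `percent_b` are hypothetical.

```python
# (time in min, % mobile phase B) breakpoints taken from the gradient program above.
GRADIENT_BREAKPOINTS = [
    (0.0, 5.0),
    (1.5, 10.0),
    (4.0, 45.0),
    (19.0, 78.0),
    (21.0, 100.0),
    (25.0, 100.0),
    (25.01, 5.0),  # return to initial conditions for re-equilibration
    (28.5, 5.0),
]


def percent_b(t_min):
    """Return %B at time t_min, assuming linear interpolation between breakpoints."""
    first_t, last_t = GRADIENT_BREAKPOINTS[0][0], GRADIENT_BREAKPOINTS[-1][0]
    if not first_t <= t_min <= last_t:
        raise ValueError(f"time must be within {first_t}-{last_t} min")
    for (t0, b0), (t1, b1) in zip(GRADIENT_BREAKPOINTS, GRADIENT_BREAKPOINTS[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    raise AssertionError("unreachable: breakpoints cover the whole run")
```

For example, `percent_b(11.5)` lies halfway through the 4–19 min ramp from 45% to 78% B.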
[0069] Quality Assurance and Control: To confirm the reliability of the analytical methods, a systematic quality assurance and control procedure was implemented. Isotope-labeled internal standards were added to ultrapure water and process blank samples to monitor for potential background contamination. Only trace levels of CEL, D-Neop, DMP, and MBP were detected in the process blank (Table 2), but their concentrations were much lower than those in actual urine samples (1–2 orders of magnitude lower), and therefore had no significant impact on the quantitative results. The method limits of quantitation for each analyte ranged from 0.001 to 92.8 μg / L. Accuracy and precision were assessed by analyzing quality control samples (N = 11) with spiked concentrations ranging from 0.5 to 2000 μg / L. The recoveries of most analytes fell within the acceptable range of 70% to 130%; the recoveries of a few compounds, such as 1-OHPhe, D2EHPA, PGA, PP, MU, 1,2-DB, and MECPP, deviated slightly (Table 3). Precision assessment showed that, except for 1,2-DB and IPMA3, the relative standard deviations (RSDs) of all analytes in the quality control samples were below 20%. Further reproducibility verification was performed through repeated analysis of 11 mixed urine samples, and the RSDs of most analytes were less than 25% (Table 3). These results demonstrate that this method possesses good accuracy, precision, and reproducibility, making it suitable for the simultaneous and robust determination of multiple contaminants in large-scale biological samples.
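The accuracy and precision criteria above reduce to two standard quantities: recovery (mean measured concentration divided by spiked concentration) and relative standard deviation (sample standard deviation divided by the mean). A minimal sketch follows; the function names and the example replicate values are hypothetical, and the thresholds are the 70% to 130% recovery range and 20% RSD limit stated in the text.

```python
from statistics import mean, stdev


def recovery_percent(measured, spiked_conc):
    """Mean measured concentration as a percentage of the spiked concentration."""
    return 100.0 * mean(measured) / spiked_conc


def rsd_percent(values):
    """Relative standard deviation (sample standard deviation / mean), in percent."""
    return 100.0 * stdev(values) / mean(values)


def passes_qc(measured, spiked_conc, recovery_range=(70.0, 130.0), max_rsd=20.0):
    """Check one analyte against the recovery and precision criteria."""
    low, high = recovery_range
    return low <= recovery_percent(measured, spiked_conc) <= high and rsd_percent(measured) <= max_rsd


# Hypothetical replicate measurements (μg/L) for one analyte spiked at 10 μg/L.
replicates = [9.0, 10.0, 11.0]
analyte_ok = passes_qc(replicates, spiked_conc=10.0)
```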
[0070]
[0071]
[0072]
[0073]
[0074]
[0075] Based on the preliminary screening results, 61 representative exposure biomarkers with high detection rates were finally identified from the pooled urine samples, and a comprehensive organic pollutant exposome panel was constructed.
[0076] This exposome panel covers the following eight classes of compounds and their corresponding biomarkers: 14 volatile organic compound metabolites (AAMA, DHBMA, CYMA, 3-HPMA, IPMA3, BPMA, 1,2-DB, MU, BMA, PGA, MA, 2-MHA, 3-MHA, and 4-MHA), 11 polycyclic aromatic hydrocarbon hydroxyl metabolites (1-OHNap, 2-OHNap, 2-OHFlu, 3-OHFlu, 1-OHPhe, 2-OHPhe, 3-OHPhe, 4-OHPhe, 9-OHPhe, 1-OHPyr, and 2-NapCA), 8 chlorinated flame retardants and their metabolites (4-CCT, 4-Mono-CP, 2,5-Di-CP, 3,4-Di-CP, 3,5-Di-CP, 2,4,5-Tri-CP, 2,3,4,6-Tetra-CP and TCS), 2 brominated flame retardant metabolites (4-Mono-BP and 2,4,6-Tri-BP), 3 bisphenols and perfluorinated / polyfluoroalkyl substances (BPS, TFA and TFMS), 5 parabens (HACe, MP, EP, PP and BP), 7 organophosphate metabolites or related compounds (DMTP, DEP, TEP, bis-BP, DtP, DtEP and D2EHPA), and 11 phthalate monoester metabolites (DMP, MBP, MiBP, MECPP, MEHHP, MECPTP, DPrP, DBP, DiPeP, DPeP and IPPeP).
[0077] Simultaneously, 10 urinary biomarkers reflecting the body's oxidative stress level were measured in parallel, covering the following categories: nucleic acid oxidative damage markers (8-OHG and 8-OHdG), cholesterol oxidation derivatives (GCA, CA, GCDCA, CDCA), advanced glycation end product precursors (CEL and CML), and other categories (D-Neop and 4-OHNOPHAX).
[0078] (3) Untargeted metabolomics analysis was performed on plasma samples from all 510 participants to systematically reveal endogenous metabolic perturbations associated with ischemic stroke. All omics analyses were performed on the UPLC-Orbitrap-HRMS platform, as detailed below:
[0079] Metabolomics sample pretreatment: Take 150 μL of plasma, add 450 μL of methanol pre-cooled to -80 °C, vortex for 1 minute, and incubate overnight at -80 °C to precipitate proteins. The next day, centrifuge the sample at 14000 rpm for 60 minutes, and transfer 300 μL of the supernatant to a sample vial, storing at -80 °C. Before injection, centrifuge the sample again at 13000 rpm for 10 minutes, take 200 μL of the supernatant, add 50 μL of isotope-labeled internal standard solution and 30 μL of 1 mg / mL ascorbic acid solution (to prevent analyte oxidation), mix well, and then inject.
[0080] Metabolomics chromatographic and mass spectrometric conditions: A Hypersil GOLD AQ C18 column (2.1 × 100 mm, 1.9 μm) was used at a column temperature of 35°C. Mobile phase A was water (negative ion mode) or water containing 0.1% formic acid (positive ion mode); mobile phase B was methanol. Gradient elution (flow rate 0.25 mL / min): 0–2 min, 2% B; 2–8 min, 2% → 98% B; 8.01–11 min, 98% B (flow rate increased to 0.40 mL / min); 11.01–14 min, 2% B (flow rate 0.25 mL / min). Electrospray ionization was used, and data were acquired in both positive and negative ion modes. The mass spectrometry parameters were as follows: spray voltage, 3.5 kV in positive ion mode and -3.2 kV in negative ion mode; ion transfer tube temperature, 320°C; vaporizer temperature, 350°C; sheath gas, auxiliary gas, and sweep gas flow rates, 45, 8, and 1 arbitrary units, respectively. In the global settings, the expected peak width was 10 seconds, and EASY-IC™ was enabled for real-time internal mass calibration. Data acquisition combined full scan with data-dependent MS/MS. The full scan resolution was 240,000, and the scan range was 70–1000 m / z. To reduce background interference, the 100 ions with the highest response in the procedural blank sample were excluded from fragmentation. In data-dependent acquisition, the 5 most intense precursor ions in each full scan were automatically selected for fragmentation. During fragmentation, the Orbitrap resolution was 15,000, and stepped normalized collision energies of 20%, 50%, and 80% were applied. In addition, to improve identification reliability, a segmented precursor ion scanning strategy was used to optimize the acquisition of fragment ion information. Data acquired in this mode were used exclusively for compound identification.
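The precursor selection logic of the data-dependent acquisition described above (most intense ions first, background ions from the blank skipped) can be sketched as follows. This only illustrates the selection rule and is not instrument control code; the function name `select_precursors` and the 10 ppm matching tolerance for the exclusion list are assumptions, since the text does not state how excluded ions are matched.

```python
def select_precursors(scan, exclusion_mz, top_n=5, tol_ppm=10.0, mz_range=(70.0, 1000.0)):
    """Pick the top_n most intense ions of a full scan that are not on the exclusion list.

    scan: list of (m/z, intensity) pairs from one full scan.
    exclusion_mz: m/z values of background ions found in the blank sample.
    """

    def is_excluded(mz):
        # An ion is excluded if it lies within tol_ppm of any listed background ion.
        return any(abs(mz - ex) / ex * 1e6 <= tol_ppm for ex in exclusion_mz)

    low, high = mz_range
    candidates = [
        (mz, intensity)
        for mz, intensity in scan
        if low <= mz <= high and not is_excluded(mz)
    ]
    candidates.sort(key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in candidates[:top_n]]


# Hypothetical full scan: (m/z, intensity).
example_scan = [
    (100.0, 10), (200.0, 50), (300.0, 40), (400.0, 30),
    (500.0, 20), (600.0, 60), (700.0, 5), (1200.0, 99),
]
selected = select_precursors(example_scan, exclusion_mz=[600.0])
```

In this example the ion at m/z 600.0 is skipped because it is on the exclusion list, and the ion at m/z 1200.0 is skipped because it falls outside the 70–1000 scan range.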
[0081] Metabolomics data processing: All raw data were processed automatically using Compound Discoverer 3.3 SP2 software. The main workflow and key parameters were as follows: ① Peak detection and alignment: The mass tolerance window for extracted ion chromatograms was set to 10 ppm. Chromatographic peaks had to meet a minimum peak height of 10000 and a signal-to-noise ratio ≥5. Retention time alignment was performed using the ChromAlign algorithm, with the chromatogram of a quality control sample in the sequence as the reference; ② Feature filtering and missing value imputation: Only compound features with a peak rating ≥4 that were detected in ≥50% of the samples were retained. Missing values were imputed based on the maximum sample-to-blank signal ratio (5:1) and a centroid signal-to-noise threshold (1.5). Restrictive gap filling was used, and the retention time tolerance was strictly constrained to avoid incorrect peak grouping; ③ Data correction and normalization: A random forest-based systematic error correction algorithm (200 trees) was used for batch effect correction based on the quality control samples. The maximum allowable relative standard deviations of features in the quality control samples before and after correction were 30% and 25%, respectively. Subsequently, the chromatographic peak areas of each sample were normalized using the "constant median" algorithm; ④ Metabolite annotation: Compounds were identified by matching the experimentally obtained accurate mass and MS/MS fragment information against ChemSpider, mzVault, mzCloud, Mass Lists, and in-house databases. Each compound was required to have an effective matching rate of at least 30% across different samples. Identification confidence was divided into four levels according to published criteria (J Hazard Mater. 2024;480:135807).
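Two of the processing rules above, feature filtering and median normalization, can be sketched in a few lines. This is a simplified illustration and does not reproduce Compound Discoverer's internal algorithms: the function names are hypothetical, undetected features are represented as `None`, and "constant median" is assumed here to mean scaling each sample so that its median peak area equals the median of all sample medians.

```python
from statistics import mean, median, stdev


def filter_features(areas, qc_areas, min_detect=0.5, max_qc_rsd=25.0):
    """Keep features detected in >= min_detect of samples with QC RSD <= max_qc_rsd (%).

    areas: {feature: [peak area or None for each study sample]}
    qc_areas: {feature: [peak area in each quality control injection]}
    """
    kept = {}
    for feature, values in areas.items():
        detection_rate = sum(v is not None for v in values) / len(values)
        qc = qc_areas[feature]
        qc_rsd = 100.0 * stdev(qc) / mean(qc)
        if detection_rate >= min_detect and qc_rsd <= max_qc_rsd:
            kept[feature] = values
    return kept


def constant_median_normalize(sample_areas):
    """Scale every sample so that its median peak area equals a common target.

    sample_areas: {sample: [peak areas of all retained features]}
    """
    medians = {sample: median(values) for sample, values in sample_areas.items()}
    target = median(medians.values())  # assumption: target = median of sample medians
    return {
        sample: [area * target / medians[sample] for area in values]
        for sample, values in sample_areas.items()
    }
```

After normalization every sample has the same median peak area, so differences in overall signal intensity between injections no longer dominate the comparison between groups.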
[0082] Quality Control and Assurance: To ensure the reliability and reproducibility of metabolomics data and minimize non-biological variation during analysis, this study implemented a systematic quality assurance and control scheme, as follows: ① Background Interference Control: Process blank samples (containing only extraction solvent) were interspersed in each batch of analysis to monitor and subtract background signals from experimental materials, reagents, and the environment; ② Process Stability Monitoring: Quality control samples prepared by mixing equal volumes of plasma from all participants were periodically injected into the sample sequence to assess the systematic stability and reproducibility of sample pretreatment and instrument analysis; ③ Analysis Sequence Optimization: The sample injection sequence was arranged using a completely randomized principle, and samples from different study areas and quality control samples were evenly interspersed in the sequence to eliminate instrument response drift caused by time; ④ Batch Effect Correction: A systematic error correction algorithm based on random forest was applied to model and correct systematic biases introduced by factors such as analysis date and batch number. After processing by this algorithm, the quality control samples showed significantly tighter clustering in the multidimensional statistical model, confirming its effective correction of batch-related variation in metabolomics data. The above comprehensive quality control strategies jointly ensured the high quality and comparability of the obtained omics data, laying a reliable foundation for subsequent statistical analysis and biological interpretation.
[0083] (4) All statistical analyses in this invention were performed using R (version 4.3.1) and Python (version 3.9.0). For continuous variables that conformed to a normal distribution, Student's t-test was used for between-group comparisons; for those that did not, the Mann-Whitney U test was used. For categorical variables, the chi-square test was used for between-group comparisons. All tests were two-tailed, and a p-value < 0.05 was considered statistically significant. To control for potential confounding factors, the following models were adjusted for these covariates: age, body mass index, gender, geographic region, education level, smoking status, alcohol consumption status, and family history of cardiovascular and cerebrovascular diseases.
[0084] The specific analysis framework and steps are executed sequentially as follows:
[0085] ① Association analysis between the exposome and disease risk: First, single-pollutant logistic regression models were used to preliminarily screen urinary organic pollutants and oxidative damage biomarkers associated with the risk of ischemic stroke. To assess the overall effect of mixed pollutant exposure and address collinearity, quantile g-computation and weighted quantile sum regression models were further applied. In addition, a Bayesian kernel machine regression model was used to capture potential nonlinear exposure-response relationships and interactions between components of the mixture. Conditional posterior inclusion probabilities of each pollutant were calculated from 15,000 Markov chain Monte Carlo iterations to quantify their relative importance in inducing disease.
[0086] The association of urinary organic pollutants and oxidative damage biomarkers with the risk of ischemic stroke was analyzed. The results showed significant differences in the exposure profiles between the case group and the control group (Table 4). Levels of multiple compounds were significantly elevated in the case group and positively correlated with disease risk (Figure 2). These included two oxidative damage biomarkers (8-OHG and D-Neop), three volatile organic compound metabolites (CYMA, 3-HPMA, and MA), five polycyclic aromatic hydrocarbon metabolites (2-OHNap, 2-OHFlu, 3-OHFlu, 1-OHPyr, and 2-NapCA), one brominated flame retardant (2,4,6-Tri-BP), two per- and polyfluoroalkyl substances (TFA and TFMS), one paraben (EP), four organophosphates (bis-BP, DtP, DtEP, and D2EHPA), and five phthalates (MECPP, MEHHP, MECPTP, DBP, and DiPeP / DPeP / IPPeP). Among these, organophosphates and phthalates exhibited higher disease risk ratios.
[0087]
[0088]
[0089] Given the significant correlations among pollutants, this invention employs multi-pollutant exposure models to address the collinearity problem. Bayesian kernel machine regression analysis revealed a significant nonlinear positive association between the overall exposure level of the mixture and the NIHSS score when the overall exposure level was at or below the 65th percentile (p<0.01) (Figure 3). Ten pollutants with a conditional posterior inclusion probability greater than 0.85 were identified as key contributing components of the mixture effect (Figure 4). Weighted quantile sum regression further showed that for every unit increase in the mixture exposure index, the stroke scale score increased by an average of 0.517 (95% CI: 0.362, 0.672). Analysis of the weighted quantile sum weights identified dibutyl phthalate, ethylparaben, trifluoroacetic acid, and mono-2-ethylhexyl phthalate as the major drivers of the mixture effect (Figure 5). It is noteworthy that phthalates and parabens together contributed 65.2% of the total mixture effect, suggesting that they are the dominant components in this exposure mixture that increase the risk of ischemic stroke.
[0090] ② Risk impact factor assessment: Using a random forest model, we comprehensively assessed the relative contributions of demographic characteristics, clinical physiological indicators, and organic pollutant exposure to the risk of ischemic stroke.
[0091] The relative importance of each factor was assessed using a random forest model (Figure 6). Variable importance analysis showed that fasting blood glucose was the most influential predictor. Besides blood glucose, the overall impact of organic pollutant exposure exceeded that of most demographic and traditional physiological risk factors, highlighting the important role of environmental exposure in disease development. Among the top eight pollutants associated with ischemic stroke risk, four were phthalates, consistent with the findings above. Given that phthalates and parabens are known metabolic disruptors, further analysis of plasma metabolomic changes is crucial for elucidating their pathogenic mechanisms.
[0092] ③ Screening of disease-related metabolites: Partial least squares discriminant analysis was used to screen metabolites significantly associated with ischemic stroke status based on plasma metabolomics data. The screening criteria were: fold change between cases and controls >1.1 or <0.9, p-value after false discovery rate correction <0.05, and variable importance in projection (VIP) value >1.0.
[0093] The untargeted metabolomics analysis (Figure 7) showed good repeatability and stability, and the tight clustering of the quality control samples in the principal component analysis score plot confirmed the reliability of the data quality.
[0094] Metabolomics analysis detected 3554 metabolite features; after annotation, 942 were identified in positive ion mode (Figure 8A) and 2612 in negative ion mode (Figure 8B).
[0095] Orthogonal partial least squares discriminant analysis showed that the ischemic stroke group and the healthy control group were clearly separated in their plasma metabolomic profiles, with no overlap between the sample points of the two groups (Figure 9). Model validation results showed that the model was robust and not overfitted (metabolome: R²Y = 0.387, Q² = -0.44). Differential analysis revealed that, compared with the control group, 433 metabolites were significantly upregulated and 529 were significantly downregulated in the case group (Figure 10).
[0096] By comparison with the Human Metabolome Database (HMDB) and the LIPID MAPS (Lipid Metabolites and Pathways Strategy) database, 160 metabolites were annotated in the metabolome. By category, the classes containing the most differential metabolites were, in descending order: organic heterocyclic compounds (36), organic acids and their derivatives (27), and lipids and lipid-like molecules (26) (Figure 11). Subsequent analyses included only metabolites with an annotation confidence level of 1 or 2.
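The three screening criteria can be expressed as a simple filter. The sketch below assumes that the false discovery rate correction is the Benjamini-Hochberg procedure (the text does not name the method) and that fold change, raw p-value, and VIP have already been computed for each metabolite; the record layout and function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotonically non-decreasing with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def screen_metabolites(records):
    """Keep metabolites meeting all three criteria from the text:
    fold change > 1.1 or < 0.9, FDR-adjusted p < 0.05, VIP > 1.0.

    records: list of dicts with keys 'name', 'fold_change', 'p_value', 'vip'.
    """
    q_values = benjamini_hochberg([r["p_value"] for r in records])
    return [
        r["name"]
        for r, q in zip(records, q_values)
        if (r["fold_change"] > 1.1 or r["fold_change"] < 0.9)
        and q < 0.05
        and r["vip"] > 1.0
    ]
```

A metabolite must pass all three thresholds, so a feature with a large fold change but a low VIP, or a small raw p-value that does not survive correction, is excluded.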
[0097] ④ Quantification of the contribution of the exposome to metabolic phenotype and disease severity: Based on redundancy analysis, and using the rdacca.hp R package for hierarchical partitioning and variation partitioning, the independent and joint contributions of the exposome to plasma metabolic profile variation and NIHSS were quantified. The model was iterated 100 times to obtain a stable mean estimate of explained variance.
[0098] Variance decomposition showed that organic pollutants and oxidative stress markers together explained 15.2% of the total variation in the plasma metabolome (Figure 12). Among them, key pollutants such as 2,4,6-tribromophenol (contribution 1.08%), mono-2-ethyl-5-carboxypentyl phthalate (1.06%), di(2-ethylhexyl) phosphate (0.99%), and mono-2-ethylhexyl phthalate (0.99%) made notable contributions to metabolome variation.
[0099] Partial least squares discriminant analysis further revealed that as the exposure levels of the aforementioned key pollutants increased, the metabolic profile exhibited a dose-dependent gradient separation trend (Figure 13). Notably, plasticizers (mono-2-ethylhexyl phthalate, mono-2-ethyl-5-carboxypentyl phthalate, and di(2-ethylhexyl) phosphate) played a prominent role in this model, suggesting that plasticizer exposure is a major driver of metabolic disorders in stroke patients. Meanwhile, the direct explanatory power of pollutants for the severity of ischemic stroke was only 10.9%, lower than their effect on metabolomic variability. These results suggest that organic pollutants exacerbate the severity of ischemic stroke primarily through indirect interference with metabolic homeostasis, rather than through direct effects. Therefore, elucidating the mediating pathway from pollutants through metabolic disorders to disease progression is of key scientific significance.
[0100] ⑤ Metabolome-wide association analysis and mediation effect test: Weighted quantile sum regression and multiple linear regression models were used to systematically analyze the associations of metabolomic features with the exposome and with stroke severity. To explore the potential mediating role of metabolites in the "exposure-disease severity" relationship, a structural equation model was constructed. The latent variable for the metabolome was constructed in two steps: first, key features were screened from the high-dimensional data using least absolute shrinkage and selection operator (LASSO) regression, with the λ value selected within one standard error of the minimum cross-validation error, ultimately identifying 21 metabolites; second, these metabolites were used as manifest variables to construct the latent variable.
[0101] The metabolome-wide association analysis showed that the exposome was significantly associated with 70 of the 160 annotated metabolites (standardized β coefficient range: -0.370 to 0.533) (Figure 14), and 157 metabolites were significantly associated with NIHSS scores. Venn analysis identified 68 metabolites that were significantly associated with both the exposome and NIHSS scores (Figure 15); the main categories were organic heterocyclic compounds, lipids and lipid-like molecules, and benzenoids. Correlation analysis showed that 27 metabolites were consistently positively correlated with both, and 37 were consistently negatively correlated (Figure 16).
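If the λ selection described above refers to the common one-standard-error rule (an assumption; the wording in the text is ambiguous), the choice can be sketched as follows. The function name and the inputs (a grid of λ values with their mean cross-validation errors and standard errors) are hypothetical.

```python
def lambda_one_se(lambdas, mean_errors, std_errors):
    """Return the largest lambda whose mean cross-validation error lies
    within one standard error of the minimum mean error."""
    best = min(range(len(lambdas)), key=lambda i: mean_errors[i])
    threshold = mean_errors[best] + std_errors[best]
    candidates = [
        lam for lam, err in zip(lambdas, mean_errors) if err <= threshold
    ]
    return max(candidates)
```

Choosing the largest admissible λ gives a sparser model than the error-minimizing λ, which is consistent with reducing a high-dimensional metabolome to a short list of manifest variables.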
[0102] Pathway enrichment analysis of the metabolites that were significantly associated with both exposure and disease revealed that thiamine metabolism, phenylacetate metabolism, purine metabolism, alanine metabolism, and butyrate metabolism were the top five enriched metabolic pathways, suggesting that these pathways may play a key role in pollutant-mediated stroke pathology.
[0103] Based on the above associations, structural equation modeling was further used to quantify the mediating effect of the metabolome. The model results showed that organic pollutant exposure had a significant positive overall effect on stroke severity. The metabolome exhibited a significant mediating role, with an estimated indirect effect of β = 0.404 (Figure 17); the mediating effect accounted for 70.3% of the total effect. These results indicate that the main pathway by which exposure to organic pollutants exacerbates the severity of ischemic stroke is through disruption of plasma metabolomic homeostasis, providing important evidence for understanding the biological mechanisms by which environmental exposure leads to cerebrovascular diseases.
[0104] ⑥ Construction and Evaluation of Disease Risk Prediction Model: After testing different machine learning algorithms, the Support Vector Machine (SVM) algorithm was found to have the best predictive performance. Therefore, a risk prediction model for ischemic stroke was constructed based on SVM, integrating metabolomics and exposome data. The specific process is as follows:
[0105] Data partitioning for training, testing, and validation sets: 40% of the samples from the healthy control group and the patient group in prefecture-level city B were randomly assigned to the independent validation set, which did not participate in model training. The remaining samples (prefecture-level cities A and B) constituted the training set, which was further divided into a model training subset and an internal testing subset in a 4:1 ratio.
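The partitioning scheme can be sketched in a few lines. The sketch assumes that the 40% hold-out is drawn separately from the cases and the controls of city B and that the 4:1 split is a simple random split; the sample record layout and the function name are hypothetical.

```python
import random


def partition(samples, seed=0):
    """Split samples into (train, internal_test, validation).

    samples: list of dicts with keys 'id', 'city' ('A' or 'B') and
    'label' (0 = healthy control, 1 = patient).
    """
    rng = random.Random(seed)
    validation, remaining = [], []
    # Hold out 40% of city B, drawn separately from controls and patients.
    for label in (0, 1):
        city_b = [s for s in samples if s["city"] == "B" and s["label"] == label]
        rng.shuffle(city_b)
        n_val = round(0.4 * len(city_b))
        validation.extend(city_b[:n_val])
        remaining.extend(city_b[n_val:])
    # All city A samples stay in the training pool.
    remaining.extend(s for s in samples if s["city"] == "A")
    rng.shuffle(remaining)
    # 4:1 split of the pool into training and internal testing subsets.
    n_test = len(remaining) // 5
    return remaining[n_test:], remaining[:n_test], validation
```

Because the validation samples are removed before any other step, they cannot leak into feature selection, scaling, or hyperparameter tuning.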
[0106] Data preprocessing: Standardize all features using the StandardScaler function from the Scikit-learn library.
[0107] Feature selection and model optimization: First, least absolute shrinkage and selection operator (LASSO) regression is used to screen key predictor variables. The support vector machine model uses a radial basis function kernel, and to handle class imbalance, the class_weight parameter is set to 'balanced'. Key hyperparameters (regularization parameter C and kernel coefficient gamma) are optimized using a grid search combined with five-fold cross-validation, with accuracy as the evaluation metric.
[0108] Model performance evaluation: A five-fold cross-validation framework was used to evaluate model performance and guard against overfitting. The final model performance was reported as the mean area under the receiver operating characteristic curve (AUC) and the mean accuracy (ACC) across the five cross-validation folds. The incremental contribution of environmental exposure information to risk prediction was assessed by comparing the predictive power of the model with and without pollutant indicators.
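The preprocessing, feature selection, and tuning steps can be combined in a single scikit-learn pipeline, as in the minimal sketch below. The data are synthetic stand-ins generated with make_classification, and the LASSO penalty (alpha=0.02) and the C and gamma grids are illustrative assumptions; the text does not report the values actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the biomarker matrix (not the patent's data).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=6,
    weights=[0.6, 0.4], random_state=0,
)
# 4:1 split into a training subset and an internal testing subset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),                    # standardize all features
    ("lasso", SelectFromModel(Lasso(alpha=0.02))),  # keep features with non-zero LASSO coefficients
    ("svm", SVC(kernel="rbf", class_weight="balanced")),
])

# Hypothetical grid; five-fold cross-validation with accuracy as the metric.
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
```

Placing the scaler and the LASSO selector inside the pipeline means they are refitted on each cross-validation training fold, so the held-out fold does not influence scaling or feature selection.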
[0109] Feature importance evaluation: First, an SVM model is trained on the training set, and a baseline performance score is obtained on the original validation set. Then, for each feature, its values are randomly shuffled in the validation set (while other features and labels remain unchanged). The trained model is used to make predictions, and a new performance score is calculated. The single-step importance score for each feature is obtained by comparing the degree of performance degradation before and after the shuffling. To reduce the influence of randomness, the above shuffling and evaluation process is repeated 10 times for each feature, and the average degradation is taken as the final importance score for that feature. A higher importance score indicates a greater contribution of the feature to the model's predictions.
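The shuffling procedure described above is model-agnostic and can be sketched without any library. In this sketch the model is any callable that maps a feature row to a predicted label, and accuracy is assumed as the performance score; both function names are hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(1 for row, label in zip(X, y) if model(row) == label) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature is shuffled in turn.

    model: callable taking one feature row (a list) and returning a label.
    X: list of feature rows (lists); y: list of true labels.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j; other features and labels are unchanged.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature that the model ignores receives an importance of exactly zero, because shuffling it cannot change any prediction. For a fitted scikit-learn estimator, sklearn.inspection.permutation_importance with n_repeats=10 performs the same procedure.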
[0110] Ischemic stroke prediction models were constructed, and their performance was evaluated on a test set and an independent validation set (test set accuracy is denoted as ACC1, and validation set accuracy is denoted as ACC2). The specific results are as follows:
[0111] First, 18 organic pollutant exposure biomarkers with a detection rate >40% were used as pollutant indicators for predicting ischemic stroke (Table 5), and 23 high-confidence metabolites closely associated with the organic pollutant exposome and confirmed by MS/MS spectrum matching were used as metabolite indicators for predicting ischemic stroke (Table 6).
[0112] When paired exposome (Table 5) and metabolomics (Table 6) data (N=507) were used for training and prediction (Figure 18), the model built on metabolomics data alone (Table 6) showed the highest AUC (AUC=0.990, ACC1=0.945, ACC2=0.845), while the model built solely on exposome data (Table 5) had an AUC of 0.793, ACC1 of 0.777, and ACC2 of 0.736. Notably, combining exposome and metabolomics data (Tables 5 and 6) slightly improved the accuracy of the model on both the test and validation sets (AUC=0.986, ACC1=0.950, ACC2=0.882), suggesting that the environmental pollutant exposome contributes to enhancing the prediction of ischemic stroke risk.
[0113] Feature importance analysis (Figure 19) further showed that, in the combined exposome-metabolome model, the top ten features contributing to predictive performance were, in descending order: HMDB0012275 (phenylethylamine), HMDB0000687 (L-(+)-leucine), HMDB0002250 (docosacarnitine), HMDB0001539 (N,N-dimethylarginine), HMDB0000251 (taurine), HMDB0002117 ((9Z)-9-octadecenoamide), HMDB0030964 (linolenic acid), 3-OHFlu (3-hydroxyfluorene), HMDB0001847 (caffeine), and MEHHP (mono(2-ethyl-5-hydroxyhexyl) phthalate). These biomarkers play a crucial role in improving the model's predictive ability.
[0114]
[0115]
[0116] This invention, through the systematic integration of exposomics and metabolomics data, provides, for the first time, quantitative evidence of the broad impact of organic pollutants on ischemic stroke. The study found that environmental pollutant exposure, especially exposure to plasticizers, is a major contributing factor beyond traditional demographic and clinical risk factors. Quantitative analysis showed that pollutant exposure explains 15.2% of metabolomic variation and 10.9% of the variation in ischemic stroke severity. Further mediation analysis confirmed that this effect is primarily mediated by pollutant-induced changes in the plasma metabolome. Based on the metabolome-wide association analysis, a series of key metabolites linking pollutant exposure to stroke severity were also identified, providing clues to the related pathological pathways. Crucially, incorporating pollutant exposure data into the risk model improved the predictive accuracy for ischemic stroke, highlighting the value of exposomics in early disease warning.
[0117] The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above embodiments. Any changes, modifications, substitutions, combinations, or simplifications made without departing from the spirit and principle of the present invention shall be considered equivalent substitutions and shall be included within the protection scope of the present invention.
Claims
1. A method for constructing an ischemic stroke risk assessment model, characterized in that it comprises the following steps: S1: obtaining multiple biomarkers in urine and plasma from a case group and a healthy control group; the biomarkers are biomarkers that integrate environmental exposomics and metabolomics, including phenethylamine, L-(+)-leucine, dodecanoic acid, N,N-dimethylarginine, taurine, (9Z)-9-octadecenoamide, linolenic acid, 3-hydroxyfluorene, caffeine and mono(2-ethyl-5-hydroxyhexyl) phthalate; S2: establishing training and validation sample sets based on the various biomarkers; S3: training machine learning models using the training sample sets corresponding to the various biomarkers to obtain various primary diagnostic models; wherein, during training, the concentrations of the biomarkers are used as the input features of the machine learning model, and the disease status of the subjects from whom the biomarkers were collected is used as the label; the disease status includes diseased and non-diseased; S4: for the various primary diagnostic models, verifying the accuracy using the validation sample sets corresponding to the various biomarkers; S5: selecting the primary diagnostic model with the highest accuracy as the final exposure risk prediction model.
2. The construction method of claim 1, wherein the machine learning model is a random forest model, a support vector machine, or a neural network model.
3. An apparatus for an ischemic stroke risk assessment model, characterized in that it comprises: an acquisition module, used to acquire multiple biomarkers in urine and plasma; the biomarkers are biomarkers that integrate environmental exposomics and metabolomics, including phenethylamine, L-(+)-leucine, dodecanoic acid, N,N-dimethylarginine, taurine, (9Z)-9-octadecenoamide, linolenic acid, 3-hydroxyfluorene, caffeine and mono(2-ethyl-5-hydroxyhexyl) phthalate; a sample construction module, used to build training and validation sample sets based on the various biomarkers; a training module, used to train machine learning models using the training sample sets corresponding to the various biomarkers to obtain various primary diagnostic models, wherein, during training, the concentrations of the biomarkers are used as the input features of the machine learning model, and the disease status of the subjects from whom the biomarkers were collected is used as the label, the disease status including diseased and non-diseased; a validation module, used to verify, for the various primary diagnostic models, the accuracy using the validation sample sets corresponding to the various biomarkers; and a selection module, used to select the primary diagnostic model with the highest accuracy as the final exposure risk prediction model.
4. The apparatus of claim 3, wherein the machine learning model is a random forest model, a support vector machine, or a neural network model.