A multi-dimensional non-invasive combined screening method for polycystic ovary syndrome
By constructing a multi-dimensional non-invasive screening model and a personalized visualization solution, the accuracy and practicality issues of PCOS screening have been resolved, screening efficiency and applicability at the grassroots level have been improved, and personalized management of high-risk groups has been achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- KUNMING MEDICAL UNIVERSITY
- Filing Date
- 2026-03-14
- Publication Date
- 2026-06-12
AI Technical Summary
Existing PCOS screening technologies suffer from limitations such as insufficient screening based on single-dimensional indicators, invasive procedures, high costs, and low applicability to primary healthcare institutions. These limitations result in insufficient screening accuracy and practicality, making it difficult to meet the screening needs of large-scale populations at the grassroots level.
A multi-dimensional, non-invasive joint screening model based on machine learning was constructed. It combines 13 non-invasive indicators, including sociodemographic characteristics, clinical manifestations of hyperandrogenism, anthropometric indicators, past medical history and family medical history, and lifestyle behaviors. The XGBoost model was used for screening, and the interpretability analysis was performed using the SHAP method to form a personalized and visualized screening plan.
It significantly improves the sensitivity and Youden index of PCOS screening, reduces the rate of missed diagnoses, simplifies the operation process, reduces costs, and is suitable for large-scale screening in primary healthcare institutions, enabling personalized risk prediction and management.
[Figure CN122201726A_ABST: abstract figure]
Abstract
Description
Technical Field
[0001] This invention relates to the field of medical data processing technology, specifically to a multidimensional non-invasive combined screening method for polycystic ovary syndrome.
Background Technology
[0002] Polycystic ovary syndrome (PCOS) is a major cause of reproductive health problems and secondary infertility in women worldwide. It is also a risk factor for metabolic and endocrine disorders. The disease is clinically heterogeneous, which often leads to late detection and delayed diagnosis and seriously affects the effectiveness of interventions and prognosis.
[0003] Currently, in clinical practice, PCOS screening relies mainly on clinical symptom assessment, anthropometric measurement, and biochemical and genetic testing, all of which have significant limitations:
1. Clinical symptom assessment mainly covers symptoms such as menstrual cycle irregularity, hirsutism, and acne. Although it is convenient to perform, it is highly subjective and has low diagnostic accuracy for patients with atypical symptoms.
2. Anthropometric indicators mainly include body mass index (BMI), waist circumference, waist-to-hip ratio, and waist-to-height ratio. This method is convenient and economical, but traditional single anthropometric indicators have low sensitivity for non-obese PCOS, PCOS combined with insulin resistance, and other disease phenotypes. This raises the rate of missed cases in early screening and leaves screening incomplete and of limited practical use.
3. Biochemical tests require venous blood collection, which is an invasive procedure with high testing costs. Patient compliance is poor, making it difficult to meet the needs of early screening for large-scale populations at the grassroots level.
4. Although PCOS gene screening tools developed abroad have high specificity and sensitivity, they have a long detection cycle, a high cost per test, a complex screening process, and a need for professional equipment. Because of limited equipment and staffing, primary healthcare institutions find it difficult to carry out such screening, so its applicability to primary and clinical settings is low.
[0004] In the current technology, most early screening studies for PCOS focus on single-dimensional indicators, and screening studies based on a combination of multiple non-invasive indicators are lacking. Moreover, existing studies have neither integrated core influencing dimensions such as lifestyle behaviors and family medical history nor achieved personalized screening through interpretable models. This results in insufficient screening accuracy and practicality, which seriously limits the effectiveness of screening in primary healthcare institutions and the prevention and control of the disease.
Summary of the Invention
[0005] This invention addresses the shortcomings of existing PCOS screening technologies. It aims to improve the comprehensiveness, efficiency, and practical applicability of early screening. To do so, it selects non-invasive features with high predictive value for PCOS, constructs a multi-dimensional joint screening model based on machine learning, and combines the model with interpretability analysis methods. Taking the clinical heterogeneity of PCOS into account, this achieves personalized early screening and visualized management of high-risk populations.
[0006] The technical solution adopted in this invention is as follows. In a first aspect, the present invention provides a multidimensional non-invasive combined screening method for polycystic ovary syndrome, comprising the following steps:
S1. Data Acquisition and Screening Indicator Pool Construction: A hybrid research method including literature review and expert consultation is adopted, combined with statistical analysis methods such as univariate analysis, correlation analysis and collinearity analysis, to select multidimensional non-invasive characteristic variables and construct a PCOS screening indicator pool.
S2. Construction of a multi-dimensional non-invasive joint screening model: Preset machine learning algorithms are used to build and comprehensively evaluate models, and the optimal risk model suitable for early PCOS screening is selected. At the same time, its screening efficacy is compared with that of traditional single indicators.
S3. Construction of Personalized Visualized Disease Screening Program: The SHAP method is used to perform a visual interpretability analysis of the predictive variable set of the optimal risk model, and the importance ranking of the PCOS risk prediction feature values is completed to form a visualized disease screening program.
[0007] Preferably, in step S1, evidence from clinical guidelines, research papers, clinical practices, and actual screening work is collected and analyzed using the hybrid research method to form a screening method information collection questionnaire, and information is collected through face-to-face surveys to establish an information database.
[0008] Preferably, in step S1, SPSS 25.0 software is used to establish a database and enter survey information to complete data cleaning, verification, description and analysis; the statistical analysis methods include univariate analysis, correlation analysis and collinearity analysis, and the variable selection of feature values is completed through two rounds of expert consultation.
[0009] Preferably, the multidimensional non-invasive characteristic variables in step S1 include five non-invasive dimensions: sociodemographic characteristics, clinical manifestations of hyperandrogenism, anthropometry indicators, past medical history and family medical history, and lifestyle behaviors. The screening indicator pool constructed based on the characteristic variables contains 13 non-invasive screening indicators.
[0010] Preferably, the preset machine learning algorithm in step S2 includes Logistic Regression Model, XGBoost Model, DT Model, RF Model, SVM Model, and NB Model.
[0011] Preferably, the model building process in step S2 includes model modeling and fitting, analysis of the screening and prediction capabilities of individual models, priority ranking of screening models, risk assessment and interpretability analysis of the optimal model; wherein, the grid search method combined with ten-fold cross-validation is used to optimize the parameters of all models, and the screening and prediction capabilities of individual models are evaluated by confusion matrix, ROC curve analysis and its 95% CI, while the screening efficacy is compared and analyzed with that of traditional single indicators.
[0012] Preferably, in step S2, the accuracy, sensitivity, specificity, positive predictive value, F1 score, AUC score, and Youden index of each model are comprehensively analyzed, and the performance differences between models are compared using the Delong method, so that the XGBoost model is selected as the optimal risk model for early PCOS screening.
[0013] Preferably, in step S3, the importance of the feature values of the optimal risk model is ranked using SHAP feature importance bar chart and feature contribution distribution summary chart, and the applicability of the optimal risk model and PCOS screening case analysis are completed using SHAP waterfall chart, and the feature interpretation analysis of the optimal risk model is comprehensively performed.
[0014] Preferably, the PCOS risk predictor variables obtained by the SHAP method are, in descending order of importance: hirsutism, acne, lifestyle, waist-to-height ratio, personal medical history, place of residence, age at menarche, body fat percentage, paternal PCOS-related disease status, body mass index, waist-to-hip ratio, and maternal PCOS-related disease status. A positive SHAP value indicates that the corresponding feature has a positive effect on the PCOS prediction result, while a negative value indicates that the corresponding feature has a negative effect on the PCOS prediction result.
[0015] In a second aspect, the present invention also provides an XGBoost model constructed by the screening method described in the first aspect. The model is constructed based on 13 non-invasive indicators from five non-invasive dimensions: sociodemographic characteristics, clinical manifestations of hyperandrogenism, anthropometry indicators, past medical history and family medical history, and lifestyle behaviors. This enables non-invasive early screening of PCOS and personalized, visualized management of high-risk groups.
Compared with the prior art, the beneficial effects of the present invention are:
[0016] 1. A multi-dimensional, non-invasive combined screening method for PCOS was established. A screening system was constructed comprising 13 indicators across five dimensions (sociodemographic characteristics, clinical manifestations of hyperandrogenism, anthropometry, past medical history and family history, and lifestyle behaviors). Validated by a machine learning model, this method improved sensitivity by over 35% and the Youden index by over 28% compared with the traditional single BMI indicator. This effectively improved the efficiency of PCOS screening and reduced the false negative rate. Furthermore, this screening method eliminates the need for invasive procedures such as venous blood collection, and a single screening session takes ≤5 minutes. Primary healthcare workers can perform the screening after simple training, making it highly practical.
[0017] 2. A multi-dimensional non-invasive joint screening model for PCOS was constructed. Through machine learning feature value analysis and multi-model comprehensive evaluation, the XGBoost model was determined to be the screening model with the best performance. This model can be used as a measurement tool for early PCOS screening, improving the operability of large-scale early screening of PCOS and its application in primary care and clinical settings. In the test set, 90 subjects identified by this model as not having PCOS were also clinically diagnosed as non-PCOS, while 44 subjects identified as having PCOS were also clinically diagnosed with PCOS. The early screening capability is significantly better than that of the other models.
3. A visual management method for PCOS screening was established. The SHAP method was used to intuitively explain the importance and contribution of screening variables in the model. Personalized PCOS risk predictor variables and their importance ranking were completed. The importance of variables, in descending order, is as follows: hirsutism, acne, lifestyle behaviors, WHtR (waist-to-height ratio), personal medical history, place of residence, age at menarche, BRI (body roundness index), paternal PCOS-related disease status, BMI (body mass index), WHR (waist-to-hip ratio), and maternal PCOS-related disease status. At the same time, case screening analysis can be completed through SHAP waterfall plots, realizing personalized and visual management of high-risk groups for PCOS and improving personalized risk prediction capabilities.
Attached Figure Description
[0018] Figure 1 is the confusion matrix diagram for the six machine learning models on the test set;
Figure 2 is a ranking chart of SHAP values and feature importance based on the XGBoost screening model;
Figure 3 is a summary plot of SHAP values and feature contributions for the XGBoost-based screening model;
Figure 4 is a waterfall chart for personalized screening based on SHAP analysis;
Figure 5 is a module architecture diagram of the PCOS screening method.
Detailed Implementation
[0019] The present invention will be further described in detail below with reference to specific implementation steps. This embodiment is only used to explain the present invention and is not intended to limit the scope of protection of the present invention.
[0020] As shown in Figure 5, this embodiment provides a multidimensional non-invasive joint screening method for polycystic ovary syndrome, including three parts: data acquisition and screening indicator pool construction, multidimensional non-invasive joint screening model construction, and personalized visualization disease screening program construction. The specific technical solution is as follows:
1. Data Acquisition and Screening Indicator Pool Construction
1) Employing a mixed research approach, evidence from clinical guidelines, research papers, clinical practice, and actual screening work is collected and analyzed through literature review and expert consultation to comprehensively understand the influencing factors of PCOS and existing screening technologies;
2) Based on existing evidence, a screening method information collection questionnaire is developed. Information is collected through face-to-face surveys. A database is established using SPSS 25.0 software and the data are entered. Data cleaning, verification, description and analysis are completed;
3) Through univariate analysis, correlation analysis, collinearity analysis, and two rounds of expert consultation, characteristic variables in five non-invasive dimensions are selected: sociodemographic characteristics (such as place of residence and age at menarche), clinical manifestations of hyperandrogenism (such as hirsutism and acne), anthropometric indicators (such as BMI and WHtR), past medical history and family medical history (such as personal past illnesses and parents' PCOS-related diseases), and lifestyle behaviors. The two rounds of expert consultation focus on screening indicators for effectiveness and determining indicator risk weights. The result is a multi-dimensional, non-invasive, multi-indicator joint PCOS screening indicator pool.
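The anthropometric indicators named above are simple ratios of routine body measurements. The following is a minimal Python sketch of how BMI, WHtR and WHR could be derived from a questionnaire record. The function names and the sample subject are illustrative assumptions and are not taken from the embodiment.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kg divided by squared height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def whtr(waist_cm: float, height_cm: float) -> float:
    """Waist-to-height ratio (both measurements in the same unit)."""
    return waist_cm / height_cm


def whr(waist_cm: float, hip_cm: float) -> float:
    """Waist-to-hip ratio (both measurements in the same unit)."""
    return waist_cm / hip_cm


# Hypothetical subject, for illustration only.
subject = {"weight_kg": 60.0, "height_cm": 160.0, "waist_cm": 80.0, "hip_cm": 96.0}

indices = {
    "BMI": bmi(subject["weight_kg"], subject["height_cm"]),
    "WHtR": whtr(subject["waist_cm"], subject["height_cm"]),
    "WHR": whr(subject["waist_cm"], subject["hip_cm"]),
}
print(indices)
```

All three values need only a scale and a tape measure, which is consistent with the non-invasive requirement of the indicator pool.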
[0021] 2. Construct a multi-dimensional, non-invasive joint screening model
1) Pre-defined machine learning algorithms are used to perform disease prediction feature value analysis and risk model construction on internal and external datasets. The pre-defined machine learning algorithms include Logistic Regression Model, XGBoost Model, DT (Decision Tree) Model, RF (Random Forest) Model, SVM (Support Vector Machine) Model, and NB (Naive Bayes) Model. The internal dataset consists of clinical PCOS-related samples from a hospital (obtainable publicly or through data transactions), while the external dataset consists of samples of women of childbearing age collected from other medical institutions. Both datasets include only women of childbearing age (15-45 years) and exclude research subjects with thyroid dysfunction, ovarian tumors, or other endocrine and reproductive system diseases, to ensure the representativeness and validity of the datasets.
[0022] 2) Complete model modeling and fitting for each algorithm, analyze the screening and prediction capabilities of individual models, prioritize screening models, conduct risk assessment and interpretability analysis of the optimal model; 3) Using sensitivity and Youden index as core evaluation indicators, combined with accuracy, specificity, positive predictive value, F1 value, and AUC value, a risk model suitable for early PCOS screening is comprehensively analyzed. Among them, sensitivity is prioritized to ensure the control of the missed diagnosis rate of early screening, and Youden index comprehensively reflects the screening effectiveness of the model.
[0023] 3. Development of Personalized Visualized Disease Screening Solutions
Considering the clinical heterogeneity of PCOS, the SHAP (SHapley Additive exPlanations) method is used to perform interpretability visualization analysis of the predictor variable set of the optimal screening model, completing the importance ranking of PCOS risk predictor variables and forming a visualized disease screening plan. This plan is presented as a comprehensive report including SHAP feature importance bar charts, feature contribution distribution summary plots, case SHAP waterfall plots, and risk level determination results, assisting clinicians and primary care medical staff in quickly completing screening assessments. Specifically, SHAP feature importance bar charts and feature contribution distribution summary plots are used to complete the risk assessment and applicability analysis of the optimal model, while SHAP waterfall plots are used to complete the interpretability analysis of the optimal model and case screening analysis.
[0024] The specific implementation steps are as follows:
Step 1: Acquire multidimensional data and establish a screening indicator pool
1. Employing a mixed research approach, literature review and expert consultation are used to collect and analyze potential evidence from clinical guidelines, research papers, clinical practice, and actual screening work. The aim is to comprehensively understand the influencing factors of PCOS and existing screening technologies from theoretical, clinical, epidemiological, and maternal and child health perspectives.
2. By analyzing existing evidence, a questionnaire for information collection on screening methods was developed, and relevant information on the study subjects was collected through face-to-face surveys. The inclusion criterion was women of childbearing age (15-45 years). Exclusion criteria, assessed from personal and family medical history, were endocrine and metabolic disorders secondary to conditions such as congenital adrenal hyperplasia, Cushing's syndrome, androgen-secreting tumors, functional hypothalamic amenorrhea, hyperprolactinemia, and premature ovarian insufficiency, to ensure the representativeness of the survey sample.
[0025] 3. SPSS 25.0 software was used to establish a database and input the collected information. Data cleaning, verification, description and analysis were completed. Data with more than 10% missing records were not included in the analysis. Missing values were handled using multiple imputation. Outliers were re-verified and then identified and removed using the Z-score method (|Z|>3) to ensure data quality.
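The cleaning rules in this step can be sketched in a few lines of Python. The sketch below is not the SPSS workflow of the embodiment. It assumes the 10% rule is applied per record (a subject is dropped when more than 10% of her fields are missing), uses the sample standard deviation for the Z-score, and omits multiple imputation. All record contents and function names are hypothetical.

```python
from statistics import mean, stdev


def missing_fraction(record: dict) -> float:
    """Share of fields in one record that are missing (None)."""
    return sum(value is None for value in record.values()) / len(record)


def drop_sparse_records(records: list, max_missing: float = 0.10) -> list:
    """Keep records whose missing share does not exceed max_missing (assumed per-record rule)."""
    return [r for r in records if missing_fraction(r) <= max_missing]


def remove_outliers(values: list, threshold: float = 3.0) -> list:
    """Drop values whose absolute Z-score exceeds the threshold (|Z| > 3 in the text)."""
    mu, sd = mean(values), stdev(values)
    return [v for v in values if abs((v - mu) / sd) <= threshold]


# Hypothetical records with 13 fields each, for illustration only.
complete = {f"x{i}": 1.0 for i in range(13)}
one_missing = {**complete, "x0": None}              # 1/13 missing: kept
two_missing = {**complete, "x0": None, "x1": None}  # 2/13 missing: dropped
kept = drop_sparse_records([complete, one_missing, two_missing])

# Hypothetical BMI column with one implausible entry.
bmi_values = [20.0, 21.0, 22.0, 23.0, 24.0] * 4
bmi_values[-1] = 60.0
cleaned = remove_outliers(bmi_values)
print(len(kept), len(cleaned))
```

In practice the flagged outliers would be re-verified against the source questionnaire before removal, as the step describes.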
[0026] 4. Using statistical analysis methods such as univariate analysis, correlation analysis, and collinearity analysis, characteristic variables are selected that meet the criteria of P < 0.05 and a collinearity tolerance > 0.1 or, equivalently, a variance inflation factor (VIF) < 10. These characteristic variables cover sociodemographic characteristics, clinical manifestations of hyperandrogenism, anthropometry indicators, past medical history and family medical history, and lifestyle behaviors. Based on these characteristic variables, a multi-dimensional, non-invasive, multi-indicator combined PCOS screening indicator pool is constructed.
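The two collinearity thresholds are linked by tolerance = 1 - R² and VIF = 1 / tolerance, where R² comes from regressing one predictor on the others. The Python sketch below covers only the two-predictor case, where R² equals the squared Pearson correlation. The general case needs a multiple regression per predictor. The data and function names are hypothetical.

```python
from math import sqrt


def pearson_r(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def tolerance_and_vif(x: list, y: list) -> tuple:
    """Two-predictor case: tolerance = 1 - r^2 and VIF = 1 / tolerance."""
    tolerance = 1.0 - pearson_r(x, y) ** 2
    return tolerance, 1.0 / tolerance


def passes_collinearity_check(tolerance: float, vif: float) -> bool:
    """Thresholds from the text: tolerance > 0.1, equivalently VIF < 10."""
    return tolerance > 0.1 and vif < 10.0


# Hypothetical values of two candidate indicators, for illustration only.
indicator_a = [1.0, 2.0, 3.0, 4.0, 5.0]
indicator_b = [2.0, 1.0, 4.0, 3.0, 5.0]

tolerance, vif = tolerance_and_vif(indicator_a, indicator_b)
print(tolerance, vif, passes_collinearity_check(tolerance, vif))
```

Because VIF is the reciprocal of tolerance, a variable that passes one threshold passes the other.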
[0027] Step 2: Disease Screening Model Construction
1. Establish a dataset based on the information collected from the screening indicator pool, and perform a balance test on the internal and external datasets to ensure the validity of the dataset;
2. Using R 4.5.2, pre-defined machine learning algorithms are employed to perform disease prediction feature value analysis and risk model construction on internal and external datasets. These pre-defined machine learning algorithms include Logistic Regression, XGBoost, DT (Decision Tree), RF (Random Forest), SVM (Support Vector Machine), and NB (Naive Bayes). Specific operations include:
1) Model building and fitting: The above six machine learning algorithms were used to build models. Grid search combined with 10-fold cross-validation was used to tune the parameters of all models and complete the model fitting. The parameter tuning range for the XGBoost model was: learning rate 0.01-0.1, tree depth 3-10, minimum number of splits 2-10, and minimum number of leaves 1-5, to ensure good model fitting.
[0028] 2) Analysis of the screening and prediction capabilities of individual models: The screening and prediction capabilities of individual models are comprehensively evaluated through the confusion matrix and ROC curve analysis with its 95% CI;
3) Screening model selection: By comprehensively analyzing the accuracy, sensitivity, specificity, positive predictive value, F1 score, AUC score, and Youden index of each model, and using the DeLong method to compare the performance differences between models, the XGBoost model with the best performance was selected as the risk model for early PCOS screening.
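The tuning procedure (grid search scored by 10-fold cross-validation) can be illustrated without any machine learning library. In the Python sketch below a one-feature k-nearest-neighbour classifier stands in for XGBoost, and its single hyperparameter n_neighbors stands in for the learning rate and tree depth ranges listed above. The data are synthetic, and every name is an assumption made for illustration.

```python
import itertools
import random


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> list:
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train_x, train_y, x, n_neighbors):
    """Stand-in model: majority vote among the n_neighbors closest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > n_neighbors else 0


def cv_accuracy(xs, ys, params, folds):
    """Mean held-out accuracy over all folds for one parameter combination."""
    scores = []
    for fold in folds:
        held_out = set(fold)
        train_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        train_y = [ys[i] for i in range(len(ys)) if i not in held_out]
        correct = sum(knn_predict(train_x, train_y, xs[i], **params) == ys[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)


def grid_search(xs, ys, grid, folds):
    """Try every combination in the grid and keep the best cross-validated score."""
    best_params, best_score = None, -1.0
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = cv_accuracy(xs, ys, params, folds)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Synthetic one-feature data: 50 negatives and 50 positives (illustration only).
rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)] + [rng.gauss(2.0, 1.0) for _ in range(50)]
ys = [0] * 50 + [1] * 50

folds = k_fold_indices(len(xs), k=10)
grid = {"n_neighbors": [1, 3, 5, 7]}  # odd values avoid tied votes
best_params, best_score = grid_search(xs, ys, grid, folds)
print(best_params, best_score)
```

With a real gradient boosting library the structure stays the same: only the stand-in model and the grid dictionary change.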
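The AUC used in this evaluation equals the probability that a randomly chosen PCOS case receives a higher score than a randomly chosen non-case. The Python sketch below computes it by pairwise comparison and attaches a percentile bootstrap 95% interval. The bootstrap is a swapped-in assumption: the text does not state how the 95% CI was obtained, and the DeLong comparison is not implemented here. Labels and scores are hypothetical.

```python
import random


def roc_auc(labels: list, scores: list) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, not from the source)."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # a resample needs both classes for the AUC to be defined
        aucs.append(roc_auc(sample_labels, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical labels (1 = PCOS) and model scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
ci_low, ci_high = bootstrap_auc_ci(labels, scores)
print(auc, ci_low, ci_high)
```

Three of the four positive-negative pairs are ordered correctly in this toy example, so the point estimate is 0.75.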
[0029] Step 3: Building a Personalized Visual Disease Screening Solution
Considering the clinical heterogeneity of PCOS, the interpretability visualization analysis of the predictor variable set of the selected XGBoost screening model was performed using the SHAP (SHapley Additive exPlanations) method. This resulted in the ranking of the importance of PCOS risk predictor variables, leading to a visualized disease screening plan, specifically including:
1. Risk Assessment and Applicability Analysis of the Optimal Model: Based on the applicability requirements of the model in specific cases, SHAP feature importance bar charts and feature contribution distribution summary plots are used to visualize and interpret the features of the XGBoost model, clarifying the importance and contribution of each predictor variable. The comparison of the six models' capabilities in actual screening is shown in Figure 1. In the test set, 90 subjects were identified as not having PCOS by the XGBoost model and were also clinically diagnosed as not having PCOS (true negatives); 8 subjects were identified as not having PCOS by the XGBoost model but were clinically diagnosed as having PCOS (false negatives); 42 subjects were identified as having PCOS by the XGBoost model but were clinically diagnosed as not having PCOS (false positives); and 44 subjects were identified as having PCOS by the XGBoost model and were clinically diagnosed as having PCOS (true positives). XGBoost demonstrated the best early screening capability among the six models.
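The four counts reported for the XGBoost model determine all of the threshold-based evaluation indicators used in this method. The following Python sketch derives them from the counts using the standard definitions. The function and variable names are illustrative and do not come from the embodiment.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard screening metrics derived from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)   # share of true PCOS cases that are detected
    specificity = tn / (tn + fp)   # share of non-PCOS subjects correctly cleared
    ppv = tp / (tp + fp)           # positive predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "youden": sensitivity + specificity - 1,
    }


# Test-set counts reported for the XGBoost model.
metrics = screening_metrics(tp=44, fp=42, tn=90, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.728
# sensitivity: 0.846
# specificity: 0.682
# ppv: 0.512
# f1: 0.638
# youden: 0.528
```

The high sensitivity relative to specificity reflects the priority given to controlling the missed diagnosis rate in early screening.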
[0030] 2. Optimal Model Interpretability and Case Analysis: The SHAP waterfall plot is used to perform personalized analysis and prediction of empirical cases through visual screening. The baseline value is determined based on the mean SHAP value of healthy population samples. When the SHAP value of a case is higher than the baseline value, it is judged as a suspected PCOS patient, realizing personalized screening and management of high-risk groups for PCOS.
[0031] The importance ranking and contribution of the predictor variables in the XGBoost PCOS screening model are shown in Figures 2-3.
[0032] Figure 2 clearly shows the importance of the 12 predictive factors (hirsutism, acne, lifestyle, WHtR, personal medical history, place of residence, age at menarche, BRI, father's PCOS-related disease status, BMI, WHR, and mother's PCOS-related disease status) in the screening prediction of this disease.
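A SHAP feature importance bar chart conventionally ranks features by the mean absolute SHAP value across samples. The Python sketch below reproduces that ranking for a small table of per-sample SHAP values. The three features shown and all numbers are hypothetical and are not results from the embodiment.

```python
def mean_abs_shap(shap_rows: list) -> dict:
    """Mean absolute SHAP value per feature across all samples."""
    features = shap_rows[0].keys()
    return {f: sum(abs(row[f]) for row in shap_rows) / len(shap_rows) for f in features}


def rank_features(shap_rows: list) -> list:
    """Feature names sorted from most to least important."""
    importance = mean_abs_shap(shap_rows)
    return sorted(importance, key=importance.get, reverse=True)


# Hypothetical per-sample SHAP values for three features (illustration only).
shap_rows = [
    {"hirsutism": 0.9, "acne": -0.4, "BMI": 0.1},
    {"hirsutism": -0.7, "acne": 0.5, "BMI": -0.2},
    {"hirsutism": 0.8, "acne": 0.3, "BMI": 0.1},
    {"hirsutism": -0.6, "acne": -0.4, "BMI": 0.0},
]

importance = mean_abs_shap(shap_rows)
ranking = rank_features(shap_rows)
print(ranking)
```

Taking absolute values matters: a feature that pushes some subjects toward PCOS and others away from it would otherwise average out to near zero.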
[0033] Figure 3 clearly shows the positive and negative relationships between the feature contributions and the SHAP values. A positive SHAP value indicates that the feature has a positive effect on the prediction result, while a negative value indicates that the feature has a negative effect on the prediction result. Figure 4 is a visual analysis for personalized management of a screening subject in practical application. This case is predicted to be a suspected PCOS patient: the sample's SHAP value of 0.645 is greater than the baseline value of -0.964, indicating that the patient may have PCOS.
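A SHAP waterfall plot relies on the additivity property of SHAP: the model output for one case equals the baseline value plus the sum of that case's feature contributions. The Python sketch below applies this to the case described above. The baseline (-0.964) and the case output (0.645) are taken from the text, while the individual feature contributions are hypothetical values chosen only so that they sum to the reported difference.

```python
BASELINE = -0.964  # baseline value stated in the text

# Hypothetical per-feature SHAP contributions for one case (illustration only).
contributions = {
    "hirsutism": 0.800,
    "acne": 0.450,
    "lifestyle behaviors": 0.300,
    "WHtR": 0.159,
    "age at menarche": -0.100,
}


def case_output(baseline: float, contributions: dict) -> float:
    """SHAP additivity: output = baseline + sum of feature contributions."""
    return baseline + sum(contributions.values())


def is_suspected_pcos(output: float, baseline: float) -> bool:
    """Decision rule from the text: an output above the baseline flags a suspected case."""
    return output > baseline


output = case_output(BASELINE, contributions)
print(round(output, 3), is_suspected_pcos(output, BASELINE))

# Waterfall order: largest absolute contribution first.
for name, value in sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>20}: {value:+.3f}")
```

Reading the bars from the baseline upward shows which indicators moved this subject's score, which is what supports individual follow-up of a flagged case.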
[0034] In this embodiment of the invention, all data processing and model building are completed based on non-invasive indicators, without the need for invasive testing and complex gene screening procedures. It is easy to operate, cost-controllable, and suitable for large-scale PCOS early screening in primary healthcare institutions.
[0035] The above description is merely a specific embodiment of the present invention, enabling those skilled in the art to understand or implement the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features of the invention herein.
Claims
1. A multidimensional, non-invasive combined screening method for polycystic ovary syndrome, characterized in that it includes the following steps:
S1. Data Acquisition and Screening Indicator Pool Construction: A hybrid research method including literature review and expert consultation is adopted, combined with statistical analysis methods such as univariate analysis, correlation analysis and collinearity analysis, to select multidimensional non-invasive characteristic variables and construct a PCOS screening indicator pool.
S2. Construction of a multi-dimensional non-invasive joint screening model: The optimal risk model suitable for early PCOS screening is modeled and comprehensively evaluated using a pre-set machine learning algorithm. At the same time, the screening efficacy is compared and analyzed with that of traditional single indicators.
S3. Construction of Personalized Visualized Disease Screening Program: The SHAP method is used to perform a visual analysis of the interpretability of the predictive variable set of the optimal risk model, and the importance ranking of the PCOS risk prediction feature values is completed to form a visualized disease screening program.
2. The screening method according to claim 1, characterized in that in step S1, evidence from clinical guidelines, research papers, clinical practice, and actual screening work is collected and analyzed using the hybrid research method to form a screening method information collection questionnaire and collect information through face-to-face surveys to establish an information database.
3. The screening method according to claim 1, characterized in that in step S1, SPSS 25.0 software is used to establish a database and enter survey information to complete data cleaning, verification, description and analysis; the statistical analysis methods include univariate analysis, correlation analysis and collinearity analysis, and the variable selection of feature values is completed through two rounds of expert consultation.
4. The screening method according to claim 1, characterized in that the multidimensional non-invasive characteristic variables mentioned in step S1 include five non-invasive dimensions: sociodemographic characteristics, clinical manifestations of hyperandrogenism, anthropometry indicators, past medical history and family medical history, and lifestyle behaviors. The screening indicator pool constructed based on the characteristic variables contains 13 non-invasive screening indicators.
5. The screening method according to claim 1, characterized in that the preset machine learning algorithms mentioned in step S2 include Logistic Regression Model, XGBoost Model, DT Model, RF Model, SVM Model, and NB Model.
6. The screening method according to claim 1, characterized in that the model building process in step S2 includes model modeling and fitting, analysis of the screening and prediction capabilities of individual models, priority ranking of screening models, risk assessment and interpretability analysis of the optimal model; wherein the grid search method combined with ten-fold cross-validation is used to optimize the parameters of all models, and the screening and prediction capabilities of individual models are evaluated by confusion matrix, ROC curve analysis and its 95% CI. At the same time, the screening efficacy is compared and analyzed with that of traditional single indicators.
7. The screening method according to claim 1, characterized in that in step S2, the accuracy, sensitivity, specificity, positive predictive value, F1 score, AUC score and Youden index of each model are comprehensively analyzed, and the DeLong method is used to compare the performance differences between models. The XGBoost model is selected as the optimal risk model for early screening of PCOS.
8. The screening method according to claim 1, characterized in that in step S3, the SHAP feature importance bar chart and feature contribution distribution summary chart are used to complete the feature value importance ranking of the optimal risk model. The SHAP waterfall chart is used to complete the applicability of the optimal risk model and the case analysis of PCOS screening. The feature interpretation analysis of the optimal risk model is then carried out comprehensively.
9. The screening method according to claim 8, characterized in that the importance of PCOS risk predictors obtained by the SHAP method, in descending order, is hirsutism, acne, lifestyle behavior, waist-to-height ratio, personal medical history, place of residence, age at menarche, body fat percentage, paternal PCOS-related disease status, body mass index, waist-to-hip ratio, and maternal PCOS-related disease status. A positive SHAP value indicates that the corresponding feature has a positive effect on the PCOS prediction result, while a negative value indicates that the corresponding feature has a negative effect on the PCOS prediction result.
10. An XGBoost model constructed using the screening method described in any one of claims 1-9, characterized in that the model is constructed based on 13 non-invasive indicators from five non-invasive dimensions: sociodemographic characteristics, clinical manifestations of hyperandrogenism, anthropometry indicators, past medical history and family medical history, and lifestyle behaviors. It enables non-invasive early screening of PCOS and personalized, visualized management of high-risk groups.