A machine learning model construction method for predicting genetic breast cancer risk using breast ultrasound images
By constructing a machine learning model based on breast ultrasound images, the problems of high cost and long time consumption in gene testing have been solved, enabling rapid and non-invasive risk assessment of hereditary breast cancer. This model is applicable to low- and middle-income areas, improves the accessibility and timeliness of risk assessment, and provides important clinical decision-making basis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA THREE GORGES UNIV
- Filing Date
- 2026-02-28
- Publication Date
- 2026-06-19
AI Technical Summary
Existing gene testing technologies are expensive, time-consuming, and poorly accessible, failing to provide rapid decision-making support for the risk of hereditary breast cancer. Furthermore, imaging diagnostics are highly complex and cannot fully cover multi-gene panel testing.
A machine learning model was constructed to predict the risk of hereditary breast cancer by preprocessing breast ultrasound images, segmenting regions, extracting and selecting features, and combining radiomics scores with clinical parameters. The model was trained and validated using multiple machine learning classifiers.
It provides rapid, non-invasive, and economical risk assessment for hereditary breast cancer, suitable for low- and middle-income areas, reducing reliance on expensive equipment, improving the accessibility and timeliness of risk assessment, and providing important reference for clinical decision-making.
Figure CN122244503A_ABST
Description
Technical Field
[0001] This invention belongs to the field of medical image analysis and artificial intelligence diagnosis, and in particular relates to a method for constructing a machine learning model to predict the risk of hereditary breast cancer using breast ultrasound images.
Background Technology
[0002] Currently, genetic risk assessment for breast cancer relies primarily on multi-gene panel testing to identify pathogenic / likely pathogenic variants in seven major high-penetrance genes (BRCA1, BRCA2, TP53, PTEN, CDH1, PALB2, and STK11), thereby guiding clinical decisions such as prophylactic mastectomy. However, traditional genetic testing is costly, has long turnaround times, and is unevenly accessible worldwide. In low- and middle-income countries in particular, often only BRCA1 / 2 testing is available, which fails to cover all seven major high-penetrance genes and makes it difficult to implement clinical guidelines uniformly. Furthermore, some tumors carrying pathogenic variants may present with benign-appearing features on imaging, increasing diagnostic complexity.
[0003] In recent years, radiomics and machine learning technologies have been introduced into medical image analysis to assist in disease diagnosis and classification by extracting quantitative features from images. Existing studies have attempted to predict BRCA gene status using breast ultrasound and MRI images, but most have focused only on BRCA1 / 2 and have not incorporated multi-gene panel analysis. Furthermore, while MRI-based methods have shown some effectiveness, their expensive equipment and limited availability restrict their clinical application. Therefore, it is necessary to design a machine learning model construction method for predicting the risk of hereditary breast cancer using breast ultrasound images to address these issues.
Summary of the Invention
[0004] The technical problem to be solved by this invention is to provide a machine learning model construction method for predicting the risk of hereditary breast cancer using breast ultrasound images. It aims to solve the problems of existing gene testing technologies: high cost, long turnaround time, poor accessibility, and the inability to quickly provide a basis for decision-making. The method is non-invasive, efficient, and economical, and can provide a fast and reliable auxiliary reference for clinical preventive surgery decisions and individualized risk management.
[0005] To achieve the above technical effects, the present invention adopts the following technical solution: a method for constructing a machine learning model to predict the risk of hereditary breast cancer using breast ultrasound images, including the following steps:
S1: Obtain standardized pre-treatment breast ultrasound images of patients with pathologically confirmed breast cancer, and preprocess the images with noise reduction and intensity normalization.
S2: Perform region segmentation on the preprocessed images to delineate the tumor core region and the peritumoral region as the regions of interest.
S3: Use radiomics tools to extract high-throughput quantitative features from the two regions of interest (the tumor core region and the peritumoral region), including intensity features, texture features, shape features, and deep learning features.
S4: Screen key features through a three-step feature selection process. First, use the Mann-Whitney U test combined with multiple comparison correction to screen for significant features. Second, remove redundant features by Spearman correlation analysis. Finally, apply least absolute shrinkage and selection operator (LASSO) regression to further refine the features and construct the radiomics score.
S5: Integrate the constructed radiomics score with clinical parameters and use multiple machine learning classifiers to build a predictive model that determines whether a patient carries a high-risk pathogenic gene mutation for breast cancer.
S6: Divide the sample data into training and testing sets, then train, validate, and evaluate the model.
[0006] Preferably, in step S1, patients diagnosed with breast cancer by multi-gene panel testing are retrospectively included. Multi-gene panel testing includes BRCA1, BRCA2, TP53, PTEN, CDH1, PALB2, and STK11. The screening criteria are: having standard pre-treatment breast ultrasound grayscale images, complete multi-gene testing results, and postoperative pathological data.
[0007] Preferably, in step S1, the image preprocessing uses VanceAI Denoise software for noise reduction and intensity normalization.
[0008] Preferably, in step S2, 3D slicer software is used to accurately segment each lesion and define two key regions of interest: the tumor region, which covers the solid core of the tumor, including posterior acoustic shadowing or enhancement; and the peritumoral region, which is defined as the tissue area within 0.5-1 cm outside the tumor boundary.
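As a rough illustration of how a peritumoral region could be derived from a tumor mask, the sketch below grows a band of fixed physical width around a binary mask using a Euclidean distance transform. This is an assumption for illustration only: the source describes segmentation in 3D Slicer and does not state how the band is generated. The function name `peritumoral_ring`, the 5 mm margin (the lower end of the stated 0.5-1 cm range), and the 1 mm pixel spacing are all hypothetical.

```python
import numpy as np
from scipy import ndimage


def peritumoral_ring(tumor_mask, pixel_spacing_mm, margin_mm=5.0):
    """Boolean mask of tissue lying within margin_mm outside the tumor boundary.

    Hypothetical helper: the margin and spacing are illustrative assumptions.
    """
    tumor = np.asarray(tumor_mask, dtype=bool)
    # Distance (in mm) from every non-tumor pixel to the nearest tumor pixel.
    dist_mm = ndimage.distance_transform_edt(~tumor, sampling=pixel_spacing_mm)
    return (~tumor) & (dist_mm <= margin_mm)


# Toy example: a 5 x 5 pixel square "tumor" in a 41 x 41 image with 1 mm pixels.
mask = np.zeros((41, 41), dtype=bool)
mask[18:23, 18:23] = True
ring = peritumoral_ring(mask, pixel_spacing_mm=(1.0, 1.0), margin_mm=5.0)

# Tumor plus surrounding band, as would feed a combined tumor + peritumoral model.
combined = mask | ring
```

Working in millimetres through the `sampling` argument keeps the band width independent of image resolution, which matters when images come from more than one ultrasound system.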
[0009] Preferably, in step S3, quantitative features are extracted in batches from two regions of interest (ROIs) using the Pyradiomics extension package in Python; the feature set covers intensity statistics, texture features, morphological features, and higher-order features derived through deep learning, including gray-level co-occurrence matrix features.
[0010] Preferably, in step S4, the three-step feature selection process and the radiomics score calculation are established as follows:
S401: Screen the high-throughput features extracted from the tumor and peritumoral regions for univariate significance. Use the Mann-Whitney U test to assess feature differences between the pathogenic-variant-positive and pathogenic-variant-negative groups, apply the Benjamini-Hochberg method for multiple comparison correction, and retain features with a corrected p-value less than 0.050.
S402: Perform redundancy analysis on the screened features. Calculate the Spearman correlation coefficient between all pairs of features; when the absolute value of the coefficient between any two features is greater than 0.9, remove the one with the higher average correlation with the other features and retain the more independent feature.
S403: Apply LASSO regression to further reduce the dimensionality of the retained features and determine the coefficients, selecting the optimal regularization parameter through 10-fold cross-validation. This yields a subset of core features with non-zero regression coefficients. The radiomics score is calculated as the following linear combination:
$$\text{Rad-Score} = \sum_{i=1}^{n} \beta_i X_i$$
where $X_i$ is the $i$-th selected radiomics feature, $\beta_i$ is the non-zero coefficient of that feature obtained through LASSO regression, and $n$ is the total number of features ultimately selected.
S404: For a model that contains only the tumor region, the Rad-Score is denoted Rad-Score1; for a model that contains both the tumor and the peritumoral region, the Rad-Score is denoted Rad-Score2.
S405: Use the calculated Rad-Score and clinical variables as input features to construct the final pathogenic gene variant prediction model.
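The univariate screening in step S401 can be sketched as follows. This is a minimal example under stated assumptions, not the applicants' implementation: the helper names `benjamini_hochberg` and `univariate_screen` are hypothetical, the Benjamini-Hochberg adjustment is written out by hand, and the feature matrix is synthetic (88 samples split 50 / 38 to mirror the cohort size in Example 2, with three artificially shifted features).

```python
import numpy as np
from scipy.stats import mannwhitneyu


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out


def univariate_screen(X, y, alpha=0.05):
    """Column indices whose BH-adjusted Mann-Whitney U p-value is below alpha."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(bool)
    pvals = np.array([
        mannwhitneyu(X[y, j], X[~y, j], alternative="two-sided").pvalue
        for j in range(X.shape[1])
    ])
    return np.flatnonzero(benjamini_hochberg(pvals) < alpha)


# Synthetic data (assumption): 20 features, the first three differ between groups.
rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 38)
X = rng.normal(size=(88, 20))
X[y == 1, :3] += 2.0
selected = univariate_screen(X, y)
```

The adjustment controls the false discovery rate across all features tested at once, which is why it is applied to the full vector of p-values and not feature by feature.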
[0011] Preferably, in step S5, Rad-Score and clinical features are used as prediction variables, genetic status is used as classification label, and multiple machine learning classifiers are used to model the two Rad-Scores respectively; the dataset is randomly divided into training set and independent test set according to a fixed ratio, and hyperparameters are tuned using ten-fold cross-validation on the training set.
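A hedged sketch of the split-then-tune procedure in step S5 is given below, using scikit-learn. Several choices are assumptions made for illustration: the 7:3 ratio is taken from Example 2, K-nearest neighbors stands in for the "multiple classifiers", the candidate values of `n_neighbors` are arbitrary, and the two input columns are synthetic placeholders for a Rad-Score and one clinical variable. The function name `tune_knn` is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def tune_knn(features, labels, test_size=0.3, seed=0):
    """Hold out a test set, then tune k by 10-fold CV on the training set only."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    # Scaling lives inside the pipeline so each CV fold is scaled independently.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
    grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]}
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    search = GridSearchCV(pipe, grid, cv=cv, scoring="roc_auc").fit(X_tr, y_tr)
    return search, X_te, y_te


# Synthetic stand-ins (assumption) for [Rad-Score, clinical variable].
rng = np.random.default_rng(0)
labels = np.array([1] * 50 + [0] * 38)
features = rng.normal(size=(88, 2)) + labels[:, None] * 1.5

search, X_te, y_te = tune_knn(features, labels)
test_auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
```

Keeping the test set out of the grid search is the point of the design: the cross-validated score guides hyperparameter choice, and the held-out AUC is computed once afterwards.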
[0012] Preferably, in step S6, the performance of various machine learning classifier models built based on different feature sets is evaluated on a reserved independent test set, including a comprehensive evaluation of the classification performance and clinical application potential of the models through ROC curves, accuracy, sensitivity, specificity and predicted value indicators.
[0013] Preferably, a computer device includes a memory and a processor, which are communicatively connected to each other. The memory stores computer instructions, and the processor executes the computer instructions to perform the aforementioned method for constructing a machine learning model for predicting the risk of hereditary breast cancer using breast ultrasound images.
[0014] Preferably, a computer-readable storage medium stores computer instructions for causing a computer to execute the aforementioned method for constructing a machine learning model to predict the risk of hereditary breast cancer using breast ultrasound images.
[0015] The beneficial effects of this invention are as follows: 1. Based on widely available and low-cost ultrasound imaging, this invention can serve as an effective auxiliary or alternative tool to alleviate the uneven distribution of genetic testing services due to geographical and economic conditions, enabling more patients to obtain preliminary genetic risk assessments.
[0016] 2. By analyzing readily available preoperative ultrasound images, this invention can provide predictive results within hours, overcoming the limitations of traditional gene testing which requires a waiting period of several weeks. This helps to quickly identify high-risk patients before surgery or in the early stages of treatment planning, avoiding secondary surgery due to delayed results.
[0017] 3. As a non-invasive imaging analysis method, this invention can reduce the reliance on expensive polygenic panel tests, and is especially suitable for low- and middle-income areas or as a large-scale primary screening tool, thereby optimizing the allocation of medical resources and reducing the economic burden on patients and the medical insurance system.
[0018] 4. This invention is completely non-invasive, requiring only routine ultrasound examination images, avoiding additional blood draws or tissue sampling, and has high patient acceptance, providing a brand-new risk assessment window for patients who are unwilling or unable to undergo gene testing immediately.
[0019] 5. By integrating the microscopic texture information of the tumor and the peritumoral area, the model can provide objective quantitative indicators that surpass visual interpretation, providing important imaging evidence and auxiliary reference for clinicians when formulating personalized management plans such as preventive surgery and enhanced monitoring.
[0020] 6. This invention correlates radiomics features with genetic variation information, representing a concrete practice in the field of radiogenomics. It helps to promote the integration of multi-dimensional data such as imaging, genes, and pathology, laying a methodological foundation for building a more comprehensive and accurate model for breast cancer diagnosis and prognosis prediction. This prediction method based on ultrasound imaging and machine learning is expected to play an important role in improving the accessibility, timeliness, and cost-effectiveness of breast cancer genetic risk assessment, assisting clinicians in making earlier and more personalized intervention decisions, and has significant clinical translational potential and social benefits.
Attached Figure Description
[0021] Figure 1 is a flowchart of the method for predicting pathogenic variants in breast cancer based on ultrasound imaging and machine learning according to the present invention.
Figure 2 is a flowchart of the population screening process of the present invention.
Figure 3 is a schematic diagram of the ultrasound image and segmentation of a BRCA1-positive patient according to the present invention.
Figure 4 shows the LASSO regression feature selection process for the tumor region and the peritumoral region in the present invention.
Figure 5 shows the receiver operating characteristic (ROC) curves of the random forest and K-nearest neighbor classifiers for predicting pathogenic variants based on the tumor region-only model in the present invention.
Detailed Implementation
[0022] Example 1: A method for constructing a machine learning model to predict the risk of hereditary breast cancer using breast ultrasound images includes the following steps:
S1: Obtain standardized pre-treatment breast ultrasound images of patients with pathologically confirmed breast cancer, and preprocess the images with noise reduction and intensity normalization.
S2: Perform region segmentation on the preprocessed images to delineate the tumor core region and the peritumoral region as the regions of interest.
S3: Use radiomics tools to extract high-throughput quantitative features from the two regions of interest (the tumor core region and the peritumoral region), including intensity features, texture features, shape features, and deep learning features.
S4: Screen key features through a three-step feature selection process. First, use the Mann-Whitney U test combined with multiple comparison correction to screen for significant features. Second, remove redundant features by Spearman correlation analysis. Finally, apply least absolute shrinkage and selection operator (LASSO) regression to further refine the features and construct the radiomics score.
S5: Integrate the constructed radiomics score with clinical parameters and use multiple machine learning classifiers to build a predictive model that determines whether a patient carries a high-risk pathogenic gene mutation for breast cancer.
S6: Divide the sample data into training and testing sets, then train, validate, and evaluate the model.
[0023] Preferably, in step S1, patients diagnosed with breast cancer by multi-gene panel testing are retrospectively included. Multi-gene panel testing includes BRCA1, BRCA2, TP53, PTEN, CDH1, PALB2, and STK11. The screening criteria are: having standard pre-treatment breast ultrasound grayscale images, complete multi-gene testing results, and postoperative pathological data.
[0024] Preferably, in step S1, the image preprocessing uses VanceAI Denoise software for noise reduction and intensity normalization.
[0025] Preferably, in step S2, 3D slicer software is used to accurately segment each lesion and define two key regions of interest: the tumor region, which covers the solid core of the tumor, including posterior acoustic shadowing or enhancement; and the peritumoral region, which is defined as the tissue area within 0.5-1 cm outside the tumor boundary.
[0026] Preferably, in step S3, quantitative features are extracted in batches from two regions of interest (ROIs) using the Pyradiomics extension package in Python; the feature set covers intensity statistics, texture features, morphological features, and higher-order features derived through deep learning, including gray-level co-occurrence matrix features.
[0027] Preferably, in step S4, the three-step feature selection process and the radiomics score calculation are established as follows:
S401: Screen the high-throughput features extracted from the tumor and peritumoral regions for univariate significance. Use the Mann-Whitney U test to assess feature differences between the pathogenic-variant-positive and pathogenic-variant-negative groups, apply the Benjamini-Hochberg method for multiple comparison correction, and retain features with a corrected p-value less than 0.050.
S402: Perform redundancy analysis on the screened features. Calculate the Spearman correlation coefficient between all pairs of features; when the absolute value of the coefficient between any two features is greater than 0.9, remove the one with the higher average correlation with the other features and retain the more independent feature.
S403: Apply LASSO regression to further reduce the dimensionality of the retained features and determine the coefficients, selecting the optimal regularization parameter through 10-fold cross-validation. This yields a subset of core features with non-zero regression coefficients. The radiomics score is calculated as the following linear combination:
$$\text{Rad-Score} = \sum_{i=1}^{n} \beta_i X_i$$
where $X_i$ is the $i$-th selected radiomics feature, $\beta_i$ is the non-zero coefficient of that feature obtained through LASSO regression, and $n$ is the total number of features ultimately selected.
S404: For a model that contains only the tumor region, the Rad-Score is denoted Rad-Score1; for a model that contains both the tumor and the peritumoral region, the Rad-Score is denoted Rad-Score2.
S405: Use the calculated Rad-Score and clinical variables as input features to construct the final pathogenic gene variant prediction model.
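The redundancy rule in step S402 leaves some room for interpretation, so the sketch below shows one plausible reading: repeatedly take the most correlated remaining pair and, if its absolute Spearman coefficient exceeds 0.9, drop whichever member has the higher mean absolute correlation with the other remaining features. The function name `drop_redundant`, the greedy ordering, and the toy data are assumptions; Spearman correlation is computed as the Pearson correlation of column-wise ranks.

```python
import numpy as np
from scipy.stats import rankdata


def drop_redundant(X, threshold=0.9):
    """Return indices of columns kept after greedy Spearman redundancy removal."""
    ranks = rankdata(np.asarray(X, dtype=float), axis=0)
    # Spearman correlation equals the Pearson correlation of the ranks.
    rho = np.abs(np.atleast_2d(np.corrcoef(ranks, rowvar=False)))
    np.fill_diagonal(rho, 0.0)
    keep = list(range(rho.shape[0]))
    while len(keep) > 1:
        sub = rho[np.ix_(keep, keep)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        if sub[i, j] <= threshold:
            break
        # Drop the member of the pair that is, on average, more correlated
        # with the remaining features.
        drop = i if sub[i].mean() >= sub[j].mean() else j
        keep.pop(drop)
    return keep


# Toy data (assumption): column 1 is a monotone transform of column 0,
# column 2 is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = np.column_stack([a, a ** 3, rng.normal(size=100)])
kept = drop_redundant(X)
```

Because Spearman correlation depends only on ranks, the cubic transform in the toy data is detected as fully redundant even though the relationship is non-linear.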
[0028] Preferably, in step S5, Rad-Score and clinical features are used as prediction variables, genetic status is used as classification label, and multiple machine learning classifiers are used to model the two Rad-Scores respectively; the dataset is randomly divided into training set and independent test set according to a fixed ratio, and hyperparameters are tuned using ten-fold cross-validation on the training set.
[0029] Preferably, in step S6, the performance of various machine learning classifier models built based on different feature sets is evaluated on a reserved independent test set, including a comprehensive evaluation of the classification performance and clinical application potential of the models through ROC curves, accuracy, sensitivity, specificity and predicted value indicators.
[0030] Preferably, a computer device includes a memory and a processor, which are communicatively connected to each other. The memory stores computer instructions, and the processor executes the computer instructions to perform the aforementioned method for constructing a machine learning model for predicting the risk of hereditary breast cancer using breast ultrasound images.
[0031] Preferably, a computer-readable storage medium stores computer instructions for causing a computer to execute the aforementioned method for constructing a machine learning model to predict the risk of hereditary breast cancer using breast ultrasound images.
[0032] Example 2: As shown in Figure 1, this embodiment provides a method for predicting pathogenic variants in breast cancer based on ultrasound imaging and machine learning, including:
S1, Study Design and Patient Screening: This retrospective study, approved by the ethics committee (approval number: 131 / 2024), included 240 patients with pathologically confirmed breast cancer. All patients underwent breast ultrasound examination and multi-gene panel testing before treatment, covering seven major high-penetrance genes: BRCA1, BRCA2, TP53, PTEN, CDH1, PALB2, and STK11. Ultimately, 88 tumor samples meeting the inclusion criteria were included in the analysis, of which 50 were in the pathogenic / likely pathogenic (P / LP) variant group and 38 were in the non-pathogenic group. The study population screening flowchart is shown in Figure 2.
[0033] S2, Ultrasound Image Acquisition and Region Segmentation: Grayscale ultrasound images were acquired using Mindray DC-80B and Samsung RS80A ultrasound systems with an L4-16A linear probe (6.6~13.0 MHz). Two regions of interest (ROIs) were manually segmented by an experienced breast radiologist: the tumor core region, and the peritumoral region comprising the tumor and its surrounding 0.5–1 cm of tissue.
[0034] The segmentation was completed using 3D Slicer software and reviewed and corrected by a senior physician. An ultrasound image and segmentation diagram of a BRCA1-positive patient are shown in Figure 3: the ultrasound image shows a hypoechoic, irregularly shaped mass with indistinct, lobulated margins. The red area represents the manually segmented tumor region.
[0035] S3, Radiomics Feature Extraction and Preprocessing: After all images were denoised and intensity normalized using VanceAI Denoise software, 306 radiomics features were extracted from the two ROIs using the Python extension package Pyradiomics (version 3.1.0), including intensity, texture, morphological features, and features generated based on deep learning.
[0036] S4, Feature Selection and Model Building: A three-step feature selection strategy was adopted:
1. Univariate analysis: The Mann-Whitney U test was used to screen for features that differed significantly between the P / LP group and the non-P / LP group (adjusted p < 0.05).
2. Correlation analysis: Highly redundant features with Spearman correlation coefficients > 0.9 or < -0.9 were removed.
3. LASSO regression: Features were further screened using LASSO logistic regression with 10-fold cross-validation.
Finally, 7 features were selected from the tumor region and 5 features from the tumor + peritumoral region to construct the radiomics score (Rad-Score).
[0037] The LASSO regression feature selection process for the tumor region and the peritumoral region is shown in Figure 4. The upper subplots show parameter tuning by 10-fold cross-validation, with the left panel corresponding to the tumor region and the right panel to the peritumoral region: the binomial deviance of the cross-validated LASSO regression model is plotted with the regularization parameter as the independent variable.
[0038] The 7 and 5 radiomics features with non-zero LASSO coefficients, selected from the tumor region (left panel) and the peritumoral region (right panel) data respectively, are listed with their coefficients in Table 1 below. These features and coefficients are used to construct the final Rad-Score.
[0039] Table 1: Final radiomics features used to calculate Rad-Score1 and Rad-Score2 predictors;
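To make the link between the LASSO coefficients and the Rad-Score concrete, the sketch below fits an L1-penalized logistic regression with 10-fold cross-validation and forms the score as the coefficient-weighted sum of the features with non-zero coefficients. It is an illustration under assumptions, not the applicants' pipeline: the source does not name the software used for LASSO, scikit-learn's `LogisticRegressionCV` is swapped in here, features are standardized first, the data are synthetic, and `fit_rad_score` is a hypothetical name. Keyword arguments may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler


def fit_rad_score(X, y, random_state=0):
    """Select features by L1-penalized logistic regression and build a Rad-Score."""
    Xs = StandardScaler().fit_transform(X)
    model = LogisticRegressionCV(
        Cs=20,                    # grid of inverse regularization strengths
        cv=10,                    # 10-fold cross-validation
        penalty="l1",
        solver="liblinear",
        scoring="neg_log_loss",   # log loss is proportional to binomial deviance
        random_state=random_state,
    ).fit(Xs, y)
    coef = model.coef_.ravel()
    selected = np.flatnonzero(coef)
    # Linear combination of the selected features weighted by their coefficients;
    # the intercept is left out, matching a plain weighted sum.
    rad_score = Xs[:, selected] @ coef[selected]
    return selected, coef[selected], rad_score


# Synthetic data (assumption): 12 candidate features, two of them informative.
rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 38)
X = rng.normal(size=(88, 12))
X[y == 1, 0] += 2.0
X[y == 1, 1] -= 2.0

selected, coefs, rad_score = fit_rad_score(X, y)
```

Running the function once on tumor-only features and once on tumor + peritumoral features would give two separate scores, analogous to Rad-Score1 and Rad-Score2.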
[0040] S5, Machine Learning Model Training and Evaluation: The dataset was divided into training and testing sets in a 7:3 ratio. Multiple classifiers, including Random Forest, K-Nearest Neighbors, and Support Vector Machine, were used, with Rad-Score and Ki67% proliferation index as predictors, to construct a pathogenic variant prediction model. Model performance was evaluated using ROC curves, accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). ROC curves for the Random Forest and K-Nearest Neighbors classifiers based on tumor region and combined peritumoral region features are shown in Figure 5.
[0041] The performance metrics of the tumor region model Rad-Score1 and the tumor + peritumoral region model Rad-Score2 are shown in Tables 2 and 3, respectively.
[0042] Table 2: Predictive performance for pathogenic variants based on Ki67% and Rad-Score1;
[0043] Table 3: Predictive performance for pathogenic variants based on Ki67% and Rad-Score2;
[0044] S6, Model Performance Summary: As shown in Tables 2 and 3 above, the optimal model is the K-nearest neighbor classifier based on tumor and peritumoral region features, with an average AUC of 0.930, accuracy of 80.9%, sensitivity of 91.6%, specificity of 66.6%, PPV of 78.5%, and NPV of 85.7%. These results indicate that radiomics models incorporating peritumoral region features have high potential in predicting pathogenic variants in breast cancer and can serve as an auxiliary or alternative tool to genetic testing, particularly suitable for clinical scenarios with limited resources or long testing wait times.
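The relationship between the reported metrics can be checked with a short worked example. The source does not give the test-set confusion matrix; the counts below (11 true positives, 1 false negative, 6 true negatives, 3 false positives) are an inference, chosen because they reproduce the reported percentages when each value is truncated to one decimal place. Other counts might fit equally well, so this should be read as a consistency check and not as the actual data. The function name `diagnostic_metrics` is hypothetical.

```python
import math


def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic metrics from the four confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Inferred counts (assumption, not stated in the source).
metrics = diagnostic_metrics(tp=11, fn=1, tn=6, fp=3)

# Percentages truncated to one decimal place, for comparison with the text.
as_percent = {name: math.floor(value * 1000) / 10 for name, value in metrics.items()}
```

Under these counts the specificity rests on only nine negative cases, which is worth keeping in mind when comparing the 66.6% specificity with the much higher sensitivity.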
Claims
1. A method for constructing a machine learning model to predict the risk of hereditary breast cancer using breast ultrasound images, characterized in that it includes the following steps:
S1: Obtain standardized pre-treatment breast ultrasound images of patients with pathologically confirmed breast cancer, and preprocess the images with noise reduction and intensity normalization.
S2: Perform region segmentation on the preprocessed images to delineate the tumor core region and the peritumoral region as the regions of interest.
S3: Use radiomics tools to extract high-throughput quantitative features from the two regions of interest (the tumor core region and the peritumoral region), including intensity features, texture features, shape features, and deep learning features.
S4: Screen key features through a three-step feature selection process. First, use the Mann-Whitney U test combined with multiple comparison correction to screen for significant features. Second, remove redundant features by Spearman correlation analysis. Finally, apply least absolute shrinkage and selection operator (LASSO) regression to further refine the features and construct the radiomics score.
S5: Integrate the constructed radiomics score with clinical parameters and use multiple machine learning classifiers to build a predictive model that determines whether a patient carries a high-risk pathogenic gene mutation for breast cancer.
S6: Divide the sample data into training and testing sets, then train, validate, and evaluate the model.
2. The method for constructing a machine learning model for predicting the risk of hereditary breast cancer using breast ultrasound images according to claim 1, characterized in that, In step S1, patients diagnosed with breast cancer by multi-gene panel testing were retrospectively included. Multi-gene panel testing included BRCA1, BRCA2, TP53, PTEN, CDH1, PALB2, and STK11. The screening criteria were: having standard pre-treatment breast ultrasound grayscale images, complete multi-gene testing results, and postoperative pathological data.
3. The method for constructing a machine learning model for predicting the risk of hereditary breast cancer using breast ultrasound images according to claim 2, characterized in that, In step S1, image preprocessing uses VanceAI Denoise software for noise reduction and intensity normalization.
4. The method for constructing a machine learning model for predicting the risk of hereditary breast cancer using breast ultrasound images according to claim 1, characterized in that, In step S2, 3D slicer software is used to precisely segment each lesion and define two key regions of interest: the tumor region, which covers the solid core of the tumor, including posterior acoustic shadowing or enhancement; and the peritumoral region, which is defined as the tissue area within 0.5-1 cm outside the tumor boundary.
5. The method for constructing a machine learning model for predicting the risk of hereditary breast cancer using breast ultrasound images according to claim 1, characterized in that, In step S3, quantitative features are extracted in batches from two regions of interest (ROIs) using the Pyradiomics extension package in Python. The feature set covers intensity statistics, texture features, morphological features, and higher-order features derived from deep learning. Morphological features include gray-level co-occurrence matrix features.
6. The method for constructing a machine learning model for predicting the risk of hereditary breast cancer using breast ultrasound images according to claim 1, characterized in that, in step S4, the three-step feature selection process and the radiomics score calculation are established as follows:
S401: Screen the high-throughput features extracted from the tumor and peritumoral regions for univariate significance. Use the Mann-Whitney U test to assess feature differences between the pathogenic-variant-positive and pathogenic-variant-negative groups, apply the Benjamini-Hochberg method for multiple comparison correction, and retain features with a corrected p-value less than 0.050.
S402: Perform redundancy analysis on the screened features. Calculate the Spearman correlation coefficient between all pairs of features; when the absolute value of the coefficient between any two features is greater than 0.9, remove the one with the higher average correlation with the other features and retain the more independent feature.
S403: Apply LASSO regression to further reduce the dimensionality of the retained features and determine the coefficients, selecting the optimal regularization parameter through ten-fold cross-validation. This yields a subset of core features with non-zero regression coefficients. The radiomics score is calculated as the following linear combination:
$$\text{Rad-Score} = \sum_{i=1}^{n} \beta_i X_i$$
where $X_i$ is the $i$-th selected radiomics feature, $\beta_i$ is the non-zero coefficient of that feature obtained through LASSO regression, and $n$ is the total number of features ultimately selected.
S404: For a model that contains only the tumor region, the Rad-Score is denoted Rad-Score1; for a model that contains both the tumor and the peritumoral region, the Rad-Score is denoted Rad-Score2.
S405: Use the calculated Rad-Score and clinical variables as input features to construct the final pathogenic gene variant prediction model.
7. The method for constructing a machine learning model for predicting the risk of hereditary breast cancer using breast ultrasound images according to claim 6, characterized in that, In step S5, Rad-Score and clinical features are used as prediction variables, and genetic status is used as classification label. Multiple machine learning classifiers are used to model the two Rad-Scores respectively. The dataset is randomly divided into training set and independent test set according to a fixed ratio. Ten-fold cross-validation is used on the training set for hyperparameter tuning.
8. The method for constructing a machine learning model for predicting the risk of hereditary breast cancer using breast ultrasound images according to claim 7, characterized in that, In step S6, the performance of various machine learning classifier models built on different feature sets is evaluated on a reserved independent test set. This includes a comprehensive evaluation of the model's classification performance and clinical application potential through ROC curves, accuracy, sensitivity, specificity, and predicted value metrics.
9. A computer device, characterized in that, It includes a memory and a processor, which are interconnected and communicate with each other. The memory stores computer instructions, and the processor executes the computer instructions to perform the machine learning model construction method for predicting the risk of hereditary breast cancer using breast ultrasound images as described in any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer instructions for causing a computer to perform a method for constructing a machine learning model for predicting the risk of hereditary breast cancer using breast ultrasound images, as described in any one of claims 1 to 8.