An artificial intelligence-based early screening system for pancreatic cancer

By establishing a database and using logistic regression algorithms to identify and eliminate false data, the accuracy and efficiency of early pancreatic cancer screening have been improved, solving the problem of unclear data affecting screening results.

CN122290972A · Pending · Publication Date: 2026-06-26 · NANTONG TUMOR HOSPITAL

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: NANTONG TUMOR HOSPITAL
Filing Date: 2026-04-16
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing pancreatic cancer screening systems suffer from poor accuracy and reliability due to the uncertainty of data authenticity, failing to accurately reflect the patient's actual condition and affecting early diagnosis.

Method used

A database containing characteristics of pancreatic cancer is established. Text information from patient medical records and examination reports is extracted through an identification unit. The image processing module improves data clarity, the feature extraction module extracts key features, and the predictive analysis module uses a logistic regression algorithm to remove false data and outputs screening results.

Benefits of technology

This improves the accuracy and efficiency of early pancreatic cancer screening, ensures data quality, reduces misdiagnosis, and enables the rational allocation of medical resources.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to an artificial intelligence-based early pancreatic cancer screening system, belonging to the field of early pancreatic cancer screening technology. To address the problem of existing screening systems having scattered and incomplete data collection, leading to inconsistent data and affecting screening results, this invention establishes a database containing pancreatic cancer disease characteristics. The database encompasses multi-dimensional feature information, presenting corresponding pancreatic cancer characteristic data according to different stages of disease development. This AI-based early pancreatic cancer screening system extracts key features through a feature extraction module, greatly reducing data redundancy and significantly improving data processing efficiency. The predictive analysis module uses data analysis and algorithms to perform authenticity screening on the extracted raw data and then performs early pancreatic cancer screening on the authentic data, avoiding misjudgments caused by invalid data interference and improving the quality and accuracy of early pancreatic cancer screening data.

Description

TECHNICAL FIELD

[0001] The application relates to the technical field of early screening for pancreatic cancer, in particular to an early pancreatic cancer screening system based on artificial intelligence.

BACKGROUND

[0002] Pancreatic cancer is a malignant tumor originating from pancreatic duct epithelial and acinar cells, which is insidious and highly invasive. As an important digestive and endocrine organ of the human body, the pancreas has both exocrine and endocrine functions. When the pancreatic cells undergo genetic mutation and abnormal proliferation to form a tumor, it is pancreatic cancer. The early symptoms are atypical, such as abdominal pain, indigestion, loss of appetite, etc., which are easily confused with common digestive system diseases, which is an important reason why most patients are diagnosed in the middle and late stages. Patients in the middle and late stages may have symptoms such as jaundice, sudden weight loss, and increased back pain.

[0003] In recent years, the incidence of pancreatic cancer has shown an increasing trend worldwide. Because its early symptoms are insidious, most patients are diagnosed in the middle and late stages, missing the best treatment opportunity. Patients often seek treatment at different hospitals, so their clinical test data, medical image data and medical record text are extremely scattered. Barriers between hospital information systems make data sharing difficult, which in turn makes it difficult for the screening system to collect all the required data. At the same time, errors may occur during data transmission and entry, or false components may be introduced by human factors. As a result, the data obtained by the screening system is a mixture of true and false entries, which interferes with the normal operation of the screening system, fails to accurately reflect the actual situation of the patient, and greatly affects the accuracy and reliability of the screening results.

[0004] To solve the above problems, an early pancreatic cancer screening system based on artificial intelligence is provided.

SUMMARY

[0005] The purpose of the present application is to provide an early pancreatic cancer screening system based on artificial intelligence, which solves the problem that the data obtained by the screening system is a mixture of true and false entries, which interferes with the normal operation of the screening system, fails to accurately reflect the actual situation of the patient, and greatly affects the accuracy and reliability of the screening results.

[0006] To achieve the above purpose, the present application provides the following technical scheme: an early pancreatic cancer screening system based on artificial intelligence, comprising:

[0007] Database: a database containing pancreatic cancer disease characteristics is established, which includes multi-dimensional feature information and presents corresponding pancreatic cancer characteristic data according to different stages of disease development;

[0008] Recognition unit: used for recognizing text information and feature extraction in patient medical records and examination report documents;

[0009] Interactive Unit: Used to provide users with an intuitive and convenient operating interface, provide detailed data analysis results and charts, and inform users of the risk of pancreatic cancer.

[0010] Preferably, the feature information in the database includes CA19-9 index, high signal on MRI DWI, decreased ADC value, abdominal pain symptoms, pancreatic nodule morphology, and characteristic data of main pancreatic duct diameter.

[0011] Preferably, the recognition unit includes a data acquisition module, an image processing module, and a feature extraction module;

[0012] Data acquisition module: By scanning and recognizing patients' paper medical records and examination reports, it provides basic data support for assessing the risk factors of pancreatic cancer and for subsequent screening.

[0013] Image processing module: Used to process pancreatic images acquired by medical imaging equipment, remove interference noise, improve image clarity, highlight the differences between pancreatic tissue and surrounding structures, and make the pancreatic outline and internal texture clearer and more distinguishable, with the aim of improving the accuracy and effect of subsequent processing;

[0014] Feature extraction module: Used to extract features related to pancreatic cancer from preprocessed medical images. These features include morphological features, texture features, density and signal features, and vascular features of pancreatic cancer, transforming complex medical image data into a feature representation with quantifiable indicators.

[0015] Preferably, the identification unit further includes a prediction analysis module and a result output module;

[0016] Predictive analysis module: Using data and algorithms, it identifies the input feature data and information from the data acquisition module, eliminates false data, and further focuses on the screening of pancreatic cancer patients for the retained data, and comprehensively analyzes and judges the probability that the patient is in the early stage of pancreatic cancer;

[0017] Results output module: The predictive analysis module generates pancreatic cancer screening results, and the identification unit outputs relevant screening information.

[0018] Preferably, the predictive analysis module screens early pancreatic cancer-related data using a logistic regression algorithm, and identifies and marks abnormal data points by setting a Z-score threshold and a Euclidean distance threshold.

[0019] Preferably, the steps of the logistic regression algorithm are as follows:

[0020] S1: Retrieve the dataset from the database and perform data preprocessing, including cleaning, transformation, and integration.

[0021] S2: Use Z-score to perform preliminary identification of outlier data points in the preprocessed data, determine whether the data point is a potential outlier data point, and use the set Euclidean distance threshold to perform a second judgment on the potential outlier data points initially identified by Z-score, determine the outlier data points, and remove the outlier data points from the dataset.

[0022] S3: Divide the processed data into training and test sets according to a certain ratio, build a logistic regression model based on the training set data, and determine the likelihood of the patient being in the early stage of pancreatic cancer based on the probability value output by the model.

[0023] Preferably, the Z-score refers to the Z-score of each data point in the dataset, calculated according to the formula $Z_i = (x_i - \mu)/\sigma$, where $x_i$ is the original value of the i-th data point in the dataset, $\mu$ is the mean of the dataset, and $\sigma$ is the standard deviation of the dataset. By setting a Z-score threshold, when the absolute value of the Z-score of a data point is greater than the Z-score threshold, the data point is determined to be a potential outlier for further analysis and processing.

[0024] Preferably, the Euclidean distance threshold is used to perform a secondary judgment on potential outlier data points initially identified by Z-score, calculating the Euclidean distance between the potential outlier data point and the normal data points in the dataset. When the Euclidean distance is greater than the set Euclidean distance threshold, the data point is finally determined to be an outlier data point. The Euclidean distance calculation method is as follows:

[0025] Suppose that each data point in the dataset has n feature dimensions, the feature vector of the potential outlier data point P is $P = (p_1, p_2, \ldots, p_n)$, and the feature vector of the normal data point Q is $Q = (q_1, q_2, \ldots, q_n)$. The Euclidean distance $d(P, Q)$ between potential outlier data point P and normal data point Q is calculated using the following formula:

[0026] $$d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

[0027] When the Euclidean distance $d(P, Q)$ exceeds the set Euclidean distance threshold, the potential outlier data point P is finally identified as an outlier data point.

[0028] Preferably, the basic form of the logistic regression model is as follows:

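[0029] $$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}}$$

[0030] where $P(Y = 1 \mid X)$ represents the probability that a patient is in the early stage of pancreatic cancer given the characteristic variable X, $\beta_0$ is the intercept term, and $\beta_1, \beta_2, \ldots, \beta_n$ are the regression coefficients of the characteristic variables, which are estimated using the maximum likelihood estimation method.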

[0031] Preferably, in the prediction analysis module, a trained logistic regression model is used to predict the test set data to obtain the predicted probability value of each sample. When the predicted probability value is greater than the threshold, the patient is judged to have a high probability of being in the early stage of pancreatic cancer; otherwise, it is judged to be low. The prediction analysis module makes the judgment and identification.

[0032] Compared with existing technologies, the beneficial effects of this invention are: this artificial intelligence-based early pancreatic cancer screening system can identify and eliminate existing false data, effectively preventing such false data from entering the analysis process, ensuring data quality, and improving the efficiency of early pancreatic cancer screening. The specific details are as follows:

[0033] 1. An interactive unit is provided, which mainly includes an acquisition module, an image processing module, a feature extraction module, a predictive analysis module, and a result output module. The acquisition and image processing modules perform noise reduction and enhancement operations on the acquired raw data, effectively improving the clarity and completeness of the data. The feature extraction module extracts key features, greatly reducing data redundancy and significantly improving data processing efficiency. The predictive analysis module uses data analysis and algorithms to screen the extracted raw data for authenticity, eliminating invalid data, and performing early pancreatic cancer screening on the real data, avoiding misjudgments caused by interference from invalid data, and improving the quality of early pancreatic cancer screening data and the accuracy of screening results.

DESCRIPTION OF THE ATTACHED FIGURES

[0034] Figure 1 is a schematic diagram of the overall modules of the present invention;

[0035] Figure 2 is the first topology diagram of the present invention;

[0036] Figure 3 is the second topology diagram of the present invention;

[0037] Figure 4 is the first flowchart of the pancreatic cancer data screening model of the present invention;

[0038] Figure 5 is the second flowchart of the pancreatic cancer data screening model of the present invention.

DETAILED IMPLEMENTATION

[0039] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0040] Referring to Figure 1, the present invention provides a technical solution: an artificial intelligence-based early pancreatic cancer screening system, comprising:

[0041] Database: Establish a database containing the characteristics of pancreatic cancer. The database includes multi-dimensional feature information and presents corresponding pancreatic cancer feature data according to different stages of disease development.

[0042] Recognition Unit: Used for recognizing text information and extracting features from patient medical records and examination reports;

[0043] Interactive Unit: Used to provide users with an intuitive and convenient operating interface, provide detailed data analysis results and charts, and inform users of the risk of pancreatic cancer.

[0044] The database contains features including CA19-9 index, high signal intensity on MRI DWI, decreased ADC value, abdominal pain symptoms, pancreatic nodule morphology, and main pancreatic duct diameter.
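As an illustration of how one database entry holding the features listed above might be represented, the following is a minimal sketch. The class name, field names, units, and the reduction of nodule morphology to a single boolean are all assumptions made for the example; the patent text does not specify a schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PancreaticFeatureRecord:
    """One patient record (hypothetical schema, not taken from the patent)."""
    ca19_9: float                 # CA19-9 index; unit assumed to be U/mL
    dwi_high_signal: bool         # high signal on MRI DWI
    adc_decreased: bool           # decreased ADC value
    abdominal_pain: bool          # abdominal pain symptom present
    nodule_irregular: bool        # simplified stand-in for nodule morphology
    main_duct_diameter_mm: float  # main pancreatic duct diameter; mm assumed

    def to_vector(self) -> List[float]:
        """Flatten the record into a numeric feature vector."""
        return [
            self.ca19_9,
            float(self.dwi_high_signal),
            float(self.adc_decreased),
            float(self.abdominal_pain),
            float(self.nodule_irregular),
            self.main_duct_diameter_mm,
        ]


# Hypothetical values, for illustration only.
record = PancreaticFeatureRecord(
    ca19_9=45.0,
    dwi_high_signal=True,
    adc_decreased=True,
    abdominal_pain=False,
    nodule_irregular=True,
    main_duct_diameter_mm=4.2,
)
vector = record.to_vector()
```

A flat numeric vector of this kind is what distance-based outlier checks and a logistic regression model would consume.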

[0045] The identification unit includes:

[0046] Data acquisition module: By scanning and recognizing patients' paper medical records and examination reports, it provides basic data support for assessing the risk factors of pancreatic cancer and for subsequent screening.

[0047] Image processing module: Used to clean and process the data collected by the data acquisition module. First, the data is cleaned to remove outliers and missing values to ensure data integrity and consistency. Then, the data is processed, including transformation and normalization operations, to facilitate subsequent data analysis and decision-making.

[0048] Feature extraction module: Used to extract features related to pancreatic cancer from preprocessed medical images. These features include morphological features, texture features, density and signal features, and vascular features of pancreatic cancer, transforming complex medical image data into a feature representation with quantifiable indicators.

[0049] The identification unit also includes:

[0050] Predictive analysis module: Using data and algorithms, it identifies the input feature data and information from the data acquisition module, eliminates false data, and further focuses the screening of pancreatic cancer patients on the retained data. It comprehensively analyzes and judges the probability that the patient is in the early stage of pancreatic cancer. This can identify and eliminate false data, avoid misjudgment due to erroneous data, and improve the rational allocation of medical resources.

[0051] Results Output Module: The pancreatic cancer screening results generated by the predictive analysis module, the identification unit outputs relevant screening information, and presents it to the user through the interactive unit. The user can use this screening information to initially plan the next examination or treatment suggestions. The whole process is convenient and efficient, greatly improving the user's understanding and application of pancreatic cancer screening results.

[0052] Specifically, a database containing various data types is established to record clinical laboratory data, supporting the system's early pancreatic cancer screening. The data acquisition module collects information from multiple data sources, including text information from medical records and image data generated by imaging equipment such as CT and ultrasound, and then analyzes it. Next, the image processing module cleans and segments the acquired data, extracting pancreatic cancer-related features. First, the data is cleaned by identifying and correcting missing data to ensure accuracy and reliability. Then, the data is processed, including transformation and normalization, which eliminates data reading obstacles caused by format differences and makes the data suitable for subsequent analysis and prediction. Next, the feature extraction module extracts key features related to pancreatic cancer from the preprocessed data. The predictive analysis module then uses data and algorithms to perform outlier detection on the extracted features. In predictive analysis, it uses two indicators, the Z-score threshold and the Euclidean distance threshold, to identify abnormal feature values that deviate from the normal data distribution range, recognize potential false data, and screen out false feature information that is unreasonable or contradictory. This ensures that the data entering the subsequent analysis process is authentic and reliable, improving the accuracy of the early pancreatic cancer screening model. Next, a logistic regression model is used to calculate, from the screened data without abnormalities, the probability that the sample belongs to the early stage of pancreatic cancer. When the predicted probability value is greater than the threshold, the likelihood of the patient having early pancreatic cancer is considered high; otherwise, it is considered low. The results output module provides decision suggestions based on the prediction results, and the results are presented through the interactive unit.
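The cleaning and normalization steps described above could look roughly like the sketch below. The choice of median imputation for missing values and min-max scaling for normalization is an assumption; the text only says that missing data is corrected and that the data is transformed and normalized. The function names and sample values are hypothetical.

```python
from statistics import median


def impute_missing(column):
    """Replace None entries with the median of the observed values (assumed strategy)."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in column]


def min_max_normalize(column):
    """Rescale values to the range [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical main pancreatic duct diameters in mm, with one missing entry.
duct_mm = [2.0, None, 3.0, 6.0, 4.0]
cleaned = impute_missing(duct_mm)
normalized = min_max_normalize(cleaned)
```

Putting every feature on a comparable scale matters for the later Euclidean distance check, which would otherwise be dominated by whichever feature has the largest numeric range.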

[0053] Another technical solution provided by the present invention: the predictive analysis module screens early pancreatic cancer-related data using a logistic regression algorithm, and identifies and marks abnormal data points in the data by setting a Z-score threshold and a Euclidean distance threshold.

[0054] The steps of the logistic regression algorithm are as follows:

[0055] S1: Retrieve the dataset from the database and perform data preprocessing, including cleaning, transformation, and integration.

[0056] S2: Use Z-score to perform preliminary identification of outlier data points in the preprocessed data, determine whether the data point is a potential outlier data point, and use the set Euclidean distance threshold to perform a second judgment on the potential outlier data points initially identified by Z-score, determine the outlier data points, and remove the outlier data points from the dataset.

[0057] S3: Divide the processed data into training and test sets according to a certain ratio, build a logistic regression model based on the training set data, and determine the likelihood of the patient being in the early stage of pancreatic cancer based on the probability value output by the model.
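Step S3 divides the cleaned data into training and test sets "according to a certain ratio" without naming the ratio. The sketch below assumes an 80/20 split and a seeded shuffle for reproducibility; the function name, the ratio, and the toy data are all hypothetical.

```python
import random


def train_test_split(records, labels, test_ratio=0.2, seed=42):
    """Shuffle indices with a fixed seed and split records and labels together."""
    if len(records) != len(labels):
        raise ValueError("records and labels must have the same length")
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [records[i] for i in train_idx]
    x_test = [records[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data: ten single-feature records with alternating labels.
records = [[float(i)] for i in range(10)]
labels = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split(records, labels, test_ratio=0.2)
```

Because S2 removes outliers before S3, the split here is assumed to operate on the already filtered dataset, so neither the training set nor the test set contains records rejected as false data.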

[0058] Z-score refers to the score of each data point in the dataset, calculated according to the formula $Z_i = (x_i - \mu)/\sigma$, where $x_i$ is the original value of the i-th data point in the dataset, $\mu$ is the mean of the dataset, and $\sigma$ is the standard deviation of the dataset. By setting a Z-score threshold, when the absolute value of the Z-score of a data point is greater than the Z-score threshold, the data point is determined to be a potential outlier for further analysis and processing.
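A minimal sketch of this first-stage check, computing the Z-score as (x_i − μ)/σ and flagging values whose absolute Z-score exceeds the threshold. The threshold of 3.0, the use of the population standard deviation, and the sample readings are assumptions for illustration; the patent does not fix any of them.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x_i - mu) / sigma for every value, using the population std. dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def potential_outliers(values, z_threshold=3.0):
    """Indices of values whose absolute Z-score is greater than the threshold."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > z_threshold]


# Hypothetical CA19-9 readings; the last one mimics a data-entry error.
ca19_9 = [22.0, 30.0, 18.0, 27.0, 35.0, 25.0, 31.0, 20.0, 28.0, 24.0, 2600.0]
flagged = potential_outliers(ca19_9, z_threshold=3.0)
```

One practical caveat: with the population standard deviation, no absolute Z-score in a sample of n values can exceed the square root of n − 1, so a threshold of 3 can never fire on ten or fewer values. The threshold therefore has to be chosen with the sample size in mind.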

[0059] The Euclidean distance threshold is used to perform a secondary assessment on potential outlier data points initially identified by Z-score. The Euclidean distance between the potential outlier and the normal data points in the dataset is calculated. If this Euclidean distance exceeds a set threshold, the data point is ultimately determined to be an outlier. The Euclidean distance calculation method is as follows:

[0060] Suppose that each data point in the dataset has n feature dimensions, the feature vector of the potential outlier data point P is $P = (p_1, p_2, \ldots, p_n)$, and the feature vector of the normal data point Q is $Q = (q_1, q_2, \ldots, q_n)$. The Euclidean distance $d(P, Q)$ between potential outlier data point P and normal data point Q is calculated using the following formula:

[0061] $$d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

[0062] When the Euclidean distance $d(P, Q)$ exceeds the set Euclidean distance threshold, the potential outlier data point P is finally identified as an outlier data point.
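The second-stage check can be sketched as follows. The text does not say which normal data point Q a candidate P is compared against, so this example assumes the nearest normal point: a candidate is confirmed as an outlier only if even its closest normal neighbour is farther away than the threshold. The function names, the threshold, and the sample points are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def confirm_outliers(points, candidate_idx, dist_threshold):
    """Confirm candidates whose distance to the nearest normal point exceeds the threshold.

    Assumption: "normal" points are all points not flagged in the first stage.
    """
    candidates = set(candidate_idx)
    normal = [p for i, p in enumerate(points) if i not in candidates]
    if not normal:
        return []
    confirmed = []
    for i in candidate_idx:
        nearest = min(euclidean(points[i], q) for q in normal)
        if nearest > dist_threshold:
            confirmed.append(i)
    return confirmed


# Hypothetical, already normalized two-feature points; indices 4 and 5 were
# flagged as potential outliers by a first-stage check.
points = [(0.1, 0.2), (0.0, -0.1), (-0.2, 0.1), (0.3, 0.0), (2.0, 0.1), (6.0, 5.0)]
confirmed = confirm_outliers(points, candidate_idx=[4, 5], dist_threshold=3.0)
```

Here the point at index 4 is unusual in one feature but still lies close to the main cluster, so it is retained, while the point at index 5 is far from every normal point and is removed. Other readings of the text, such as distance to the centroid of the normal points, would be equally defensible.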

[0063] The basic form of the logistic regression model is:

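[0064] $$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}}$$

[0065] where $P(Y = 1 \mid X)$ represents the probability that a patient is in the early stage of pancreatic cancer given the characteristic variable X, $\beta_0$ is the intercept term, and $\beta_1, \beta_2, \ldots, \beta_n$ are the regression coefficients of the characteristic variables, which are estimated using the maximum likelihood estimation method.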
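Maximum likelihood estimation of the intercept and regression coefficients has no closed form, so it is usually carried out iteratively. The sketch below uses plain gradient ascent on the log-likelihood, which is one common choice and an assumption here; the patent does not name an optimizer. The learning rate, iteration count, and toy data are hypothetical.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + e^(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Estimate [beta_0, beta_1, ..., beta_n] by gradient ascent on the log-likelihood."""
    n_features = len(X[0])
    beta = [0.0] * (n_features + 1)  # beta[0] is the intercept
    m = len(X)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], xi)))
            err = yi - p  # derivative of the log-likelihood w.r.t. the linear term
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b + lr * g / m for b, g in zip(beta, grad)]
    return beta


# Toy, non-separable data with a single standardized feature.
X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
beta = fit_logistic(X, y)
p_high = sigmoid(beta[0] + beta[1] * 2.0)
p_low = sigmoid(beta[0] + beta[1] * -2.0)
```

The toy labels overlap on purpose: on perfectly separable data the maximum likelihood coefficients grow without bound, which is one reason production implementations typically add regularization.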

[0066] In the predictive analysis module, a trained logistic regression model is used to predict the test set data and obtain the predicted probability value for each sample. When the predicted probability value is greater than the threshold, the patient is considered to have a high probability of being in the early stage of pancreatic cancer; otherwise, it is considered to have a low probability. The predictive analysis module is used to make the judgment and identification.
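The decision rule above reduces to a single comparison, but the choice of threshold changes how many true cases are caught. The sketch below applies the rule and, as an addition not described in the patent, reports sensitivity and specificity so that the effect of the threshold can be inspected. All probabilities, labels, and threshold values are hypothetical.

```python
def risk_label(probability, threshold=0.5):
    """Return "high" when the predicted probability is greater than the threshold."""
    return "high" if probability > threshold else "low"


def sensitivity_specificity(probabilities, labels, threshold):
    """Compute (sensitivity, specificity) of the thresholded predictions."""
    tp = fn = fp = tn = 0
    for p, y in zip(probabilities, labels):
        predicted_positive = p > threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical test-set probabilities and true labels (1 = early-stage case).
probs = [0.05, 0.20, 0.35, 0.55, 0.62, 0.80, 0.91]
truth = [0, 0, 1, 0, 1, 1, 1]
sens_50, spec_50 = sensitivity_specificity(probs, truth, threshold=0.5)
sens_30, spec_30 = sensitivity_specificity(probs, truth, threshold=0.3)
```

On this toy data, lowering the threshold from 0.5 to 0.3 catches the one case that was previously missed without adding false positives. On real data the trade-off is rarely this favourable, and for a screening tool the threshold would normally be set with the cost of a missed case in mind.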

[0067] Specifically, the Z-score, also known as the standard score or standardized value, is a widely used concept and method in statistics. It is used to measure the relative position of a data point to the mean of a dataset. When collecting and organizing large amounts of screening data, it can quickly locate data points with excessively large absolute Z-score values, allowing for further verification and processing. It can help identify outliers and erroneous data in the screening data. By using the Z-score to initially screen the data, we can find individual data points that may be abnormal. Then, we can calculate the Euclidean distance between these data points and other data points to see if it exceeds a threshold, further confirming its abnormality. If the Euclidean distance is also large, it can more strongly indicate the abnormality of the data point and improve the accuracy of anomaly detection.

[0068] The logistic regression model combines a linear regression with the logistic function for classification, and predicts pancreatic cancer by converting various characteristic data related to pancreatic cancer into a disease probability. These characteristic data include tumor marker levels, morphological indicators of the pancreas in imaging examinations, patient age, and family history. The data is converted into a format acceptable to the model and input as independent variables into a pre-trained logistic regression model. The model calculates a linear combination value based on the input feature data, and then substitutes this value into the logistic function to obtain the probability that the patient has pancreatic cancer.
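The two-step calculation described above, a linear combination followed by the logistic function, can be sketched as follows. The intercept and coefficients are invented for illustration and carry no clinical meaning; the feature names are also hypothetical, and the continuous features are assumed to be standardized beforehand.

```python
import math

# Hypothetical pre-trained parameters; NOT derived from any real data.
INTERCEPT = -4.0
COEFFICIENTS = {
    "ca19_9_z": 1.2,          # standardized tumor marker level
    "duct_diameter_z": 0.8,   # standardized imaging morphology indicator
    "age_z": 0.4,             # standardized patient age
    "family_history": 0.9,    # 1.0 if family history is present, else 0.0
}


def predict_probability(features):
    """Linear combination of the features, passed through the logistic function."""
    linear = INTERCEPT + sum(
        COEFFICIENTS[name] * features[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical patient with elevated marker and duct diameter.
patient = {"ca19_9_z": 2.0, "duct_diameter_z": 1.5, "age_z": 0.5, "family_history": 1.0}
probability = predict_probability(patient)

# Baseline patient with average values and no family history.
baseline = predict_probability(
    {"ca19_9_z": 0.0, "duct_diameter_z": 0.0, "age_z": 0.0, "family_history": 0.0}
)
```

For the first patient the linear combination is −4.0 + 2.4 + 1.2 + 0.2 + 0.9 = 0.7, which the logistic function maps to a probability of about 0.67; the baseline patient, whose linear term is just the intercept, comes out at about 0.02.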

[0069] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.

[0070] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. An artificial intelligence-based early pancreatic cancer screening system, characterized in that it comprises: Database: Establish a database containing the characteristics of pancreatic cancer. The database includes multi-dimensional feature information and presents corresponding pancreatic cancer feature data according to different stages of disease development. Recognition Unit: Used for recognizing text information and extracting features from patient medical records and examination reports; Interactive Unit: Used to provide users with an intuitive and convenient operating interface, provide detailed data analysis results and charts, and inform users of the risk of pancreatic cancer.

2. The artificial intelligence-based early pancreatic cancer screening system according to claim 1, characterized in that: The database contains features including CA19-9 index, high signal intensity on MRI DWI, decreased ADC value, abdominal pain symptoms, pancreatic nodule morphology, and characteristic data of main pancreatic duct diameter.

3. The artificial intelligence-based early pancreatic cancer screening system according to claim 1, characterized in that: The identification unit includes a data acquisition module, an image processing module, and a feature extraction module; Data acquisition module: By scanning and recognizing patients' paper medical records and examination reports, it provides basic data support for assessing the risk factors of pancreatic cancer and for subsequent screening. Image processing module: Used to process pancreatic images acquired by medical imaging equipment, remove interference noise, improve image clarity, highlight the differences between pancreatic tissue and surrounding structures, and make the pancreatic outline and internal texture clearer and more distinguishable, with the aim of improving the accuracy and effect of subsequent processing; Feature extraction module: Used to extract features related to pancreatic cancer from preprocessed medical images. These features include morphological features, texture features, density and signal features, and vascular features of pancreatic cancer, transforming complex medical image data into a feature representation with quantifiable indicators.

4. The artificial intelligence-based early pancreatic cancer screening system according to claim 3, characterized in that: The identification unit also includes a prediction analysis module and a result output module; Predictive analysis module: Using data and algorithms, it identifies the input feature data and information from the data acquisition module, eliminates false data, and further focuses on the screening of pancreatic cancer patients for the retained data, and comprehensively analyzes and judges the probability that the patient is in the early stage of pancreatic cancer; Results output module: The pancreatic cancer screening results generated by the predictive analysis module, and the relevant information of the screening output by the identification unit.

5. The artificial intelligence-based early pancreatic cancer screening system according to claim 1, characterized in that: The predictive analysis module uses a logistic regression algorithm to screen data related to early pancreatic cancer, and identifies and marks abnormal data points by setting Z-score thresholds and Euclidean distance thresholds.

6. The artificial intelligence-based early pancreatic cancer screening system according to claim 5, characterized in that: The steps of the logistic regression algorithm are as follows: S1: Retrieve the dataset from the database and perform data preprocessing, including cleaning, transformation, and integration. S2: Use Z-score to perform preliminary identification of outlier data points in the preprocessed data, determine whether the data point is a potential outlier data point, and use the set Euclidean distance threshold to perform a second judgment on the potential outlier data points initially identified by Z-score, determine the outlier data points, and remove the outlier data points from the dataset. S3: Divide the processed data into training and test sets according to a certain ratio, build a logistic regression model based on the training set data, and determine the likelihood of the patient being in the early stage of pancreatic cancer based on the probability value output by the model.

7. The artificial intelligence-based early pancreatic cancer screening system according to claim 6, characterized in that: The Z-score refers to the Z-score calculated for each data point in the dataset according to the formula $Z_i = (x_i - \mu)/\sigma$, where $x_i$ is the original value of the i-th data point in the dataset, $\mu$ is the mean of the dataset, and $\sigma$ is the standard deviation of the dataset. By setting a Z-score threshold, when the absolute value of the Z-score of a data point is greater than the Z-score threshold, the data point is determined to be a potential outlier for further analysis and processing.

8. The artificial intelligence-based early pancreatic cancer screening system according to claim 6, characterized in that: The Euclidean distance threshold is used to perform a secondary assessment on potential outlier data points initially identified by Z-score. The Euclidean distance between the potential outlier data point and the normal data points in the dataset is calculated. When this Euclidean distance exceeds the set threshold, the data point is ultimately determined to be an outlier. The Euclidean distance calculation method is as follows: Suppose that each data point in the dataset has n feature dimensions, the feature vector of the potential outlier data point P is $P = (p_1, p_2, \ldots, p_n)$, and the feature vector of the normal data point Q is $Q = (q_1, q_2, \ldots, q_n)$. The Euclidean distance $d(P, Q)$ between potential outlier data point P and normal data point Q is calculated using the following formula: $d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$. When the Euclidean distance $d(P, Q)$ exceeds the set Euclidean distance threshold, the potential outlier data point P is finally identified as an outlier data point.

9. The artificial intelligence-based early pancreatic cancer screening system according to claim 6, characterized in that: The basic form of the logistic regression model is as follows: $P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}}$, where $P(Y = 1 \mid X)$ represents the probability that a patient is in the early stage of pancreatic cancer given the characteristic variable X, $\beta_0$ is the intercept term, and $\beta_1, \beta_2, \ldots, \beta_n$ are the regression coefficients of the characteristic variables, which are estimated using the maximum likelihood estimation method.

10. The artificial intelligence-based early pancreatic cancer screening system according to claim 9, characterized in that: In the predictive analysis module, a trained logistic regression model is used to predict the test set data to obtain the predicted probability value of each sample. When the predicted probability value is greater than the threshold, the patient is judged to have a high probability of being in the early stage of pancreatic cancer; otherwise, it is judged to be low. The predictive analysis module makes the judgment and identification.