A polyp detection model training method based on deep learning
A base model is trained on colonoscopy report images, pseudo-labeled samples are extracted from unlabeled videos with a time window filter, and an image-video-joint polyp detection network is constructed using an asymmetric mixing algorithm to fuse image and video data. This approach bridges the domain gap between colonoscopy report images and real-time videos, improves the accuracy and reliability of polyp detection, and reduces annotation costs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- AFFILIATED HUSN HOSPITAL OF FUDAN UNIV
- Filing Date
- 2024-07-24
- Publication Date
- 2026-06-23
Figure CN118966308B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a medical image processing method, and more particularly to a method for training a colorectal polyp detection model based on an artificial intelligence deep learning algorithm, belonging to the field of colonoscopy polyp detection technology.
Background Technology
[0002] Colorectal cancer (CRC) is a collective term for tumors that occur in the colon and rectum. Worldwide, it is currently the third most common cancer and the second most common in terms of mortality; in China, it ranks fourth in incidence and fifth in mortality. Colonoscopy is the core technology and gold standard for early diagnosis and screening of colorectal cancer. High-quality colonoscopy can effectively reduce the incidence and mortality of colorectal cancer. Before CRC threatens health, it manifests as colon polyps, which can be detected and treated early through colonoscopy. Therefore, accurate detection of polyps is crucial.
[0003] Currently, polyp detection relies primarily on the endoscopist's vision and experience. However, because colon polyps vary widely in morphology, distinguishing them from normal structures is extremely difficult, and even experienced endoscopists inevitably miss some polyps. Although computer-aided detection (CAD) systems exist to improve detection rates, existing CAD systems still have room for improvement in accuracy and efficiency.
[0004] The typical process of detecting colon polyps can be summarized in four stages. Stage 1: The endoscopist scans the colon for polyps while withdrawing the endoscope. Stage 2: A polyp appears in the field of view. Stage 3: The endoscopist notices the potential polyp and stops withdrawing the endoscope. Stage 4: The endoscopist adjusts the field of view for careful observation and takes photographs for the report. The endoscopist's correct judgment in Stage 3 is crucial and is a prerequisite for observation and recording in Stage 4. Therefore, the stage in clinical practice where computer-aided diagnosis of colon polyps is most needed is Stage 2, when the polyp has just appeared. In other words, CAD needs to run on the real-time video stream while the colonoscopy is being performed.
[0005] Recently, deep learning has been widely applied in colon polyp CAD due to its superior performance. Training polyp detection models directly using colonoscopy videos is an intuitive solution. However, annotating polyp regions in colonoscopy videos is a laborious task for professional endoscopists. Since video recording is not a routine procedure for colonoscopies, collecting sufficient data (especially positive samples) is also very time-consuming. These two difficulties in clinical practice limit the size of video datasets, and a lack of diversity in the dataset can lead to overfitting of deep learning models. In conclusion, manually annotating polyp regions in large-scale video datasets is both time-consuming and expensive, limiting the development of deep learning technology.
[0006] One compromise is to train the target model using labeled colonoscopy report images and then perform inference on real-time video. Since these images can be extracted from medical report databases, researchers can easily obtain a much larger dataset than they could by recording videos from scratch. However, training on images and then performing inference on video raises problems of its own, including the domain gap between the two and a lack of positive video samples.
Summary of the Invention
[0007] The purpose of this invention is to reduce the cost of establishing a training dataset for polyp detection models, and to address the domain gap between colonoscopy report images and real-time videos, thereby improving the accuracy and reliability of polyp detection and its dataset establishment, while reducing annotation costs.
[0008] To achieve the above objectives, the technical solution of the present invention provides a method for training a polyp detection model based on deep learning, characterized by comprising the following steps:
[0009] Step 1: Train the basic polyp detection model using source domain report images;
[0010] Step 2: Use the trained base polyp detection model to extract pseudo-label samples from the unlabeled target domain video. The pseudo-label samples include positive pseudo-label samples and hard negative samples. The positive pseudo-label samples contain polyps under different shooting angles and lighting conditions, while the hard negative samples contain various misleading noises.
[0011] Step 3: Retrain the basic polyp detection model using source domain report images mixed with hard negative samples, and then fine-tune the basic polyp detection model using positive pseudo-label samples.
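For illustration only, the order of the three steps above can be sketched as a small Python routine. The function name `run_three_step_training` and the four callables passed to it (`train`, `extract`, `retrain`, `fine_tune`) are hypothetical placeholders introduced for this sketch; the source does not prescribe any particular detector, framework, or interface.

```python
from typing import Any, Callable, List, Tuple


def run_three_step_training(
    train: Callable[[List[Any]], Any],
    extract: Callable[[Any, List[Any]], Tuple[List[Any], List[Any]]],
    retrain: Callable[[Any, List[Any], List[Any]], Any],
    fine_tune: Callable[[Any, List[Any]], Any],
    source_images: List[Any],
    target_video: List[Any],
) -> Any:
    """Run Steps 1 to 3 in order; all callables are hypothetical stand-ins."""
    # Step 1: train the basic model on labeled source domain report images.
    base_model = train(source_images)
    # Step 2: extract positive pseudo-label samples and hard negative samples
    # from the unlabeled target domain video using the basic model.
    positives, hard_negatives = extract(base_model, target_video)
    # Step 3 (first half): retrain using source images mixed with hard negatives.
    retrained = retrain(base_model, source_images, hard_negatives)
    # Step 3 (second half): fine-tune using the positive pseudo-label samples.
    return fine_tune(retrained, positives)
```

Passing the stages in as callables keeps the sketch independent of any detector implementation; only the ordering of the stages is taken from the text.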
[0012] Preferably, in step 2, a time window filter is used to extract the positive pseudo-label sample and the hard negative sample. During extraction, the positive pseudo-label sample or the hard negative sample is determined by whether there are other polyp predictions with high density around a polyp prediction in any frame of the target domain video. If there are other polyp predictions with high density around a polyp prediction, the current frame of the video is the positive pseudo-label sample; otherwise, the current frame of the video is the hard negative sample.
[0013] Preferably, in step 2, the time window filter extracts the positive pseudo-label sample and the hard negative sample using the following steps:
[0014] The inference output $\mathrm{Det}(\{x_i\})$ obtained by passing the target domain video through the basic polyp detection model is represented as:
[0015] $$\mathrm{Det}(\{x_i\}) = \{p_i\} = \{\{(s_{ij}, b_{ij})\}\}$$
[0016] In the formula, $p_i$ is the box-level prediction for the $i$-th frame $x_i$ of the target domain video, and $s_{ij}$ and $b_{ij}$ are respectively the confidence score and the bounding box coordinates of the $j$-th polyp prediction in $p_i$;
[0017] Based on the maximum confidence score $s_i = \max_j(s_{ij})$ of the box-level prediction $p_i$, each frame of the unlabeled training video is assigned a binary classification prediction $c_i$, as shown in the following formula:
[0018] $$c_i = \begin{cases} 1, & s_i \ge T_{cls} \\ 0, & \text{otherwise} \end{cases}$$
[0019] In the formula, the constant $T_{cls}$ is the classification threshold;
[0020] Calculate the surrounding positive density $d_i$ of the $i$-th frame $x_i$ of the unlabeled training video, as shown in the following formula:
[0021] $$d_i = \max_{i-L+1 \le t \le i} \frac{1}{L} \sum_{k=t}^{t+L-1} c_k$$
[0022] In the formula, $t$ is the start time of the time window, and the constant $L$ is the length of the time window;
[0023] High-confidence predictions with high surrounding positive density are extracted as positive pseudo-label samples, as shown in the following formula:
[0024] $$f_{pos}(\{p_i\}) = \{(x_i, y_i) \mid d_i \ge D \wedge s_i \ge T_{bbox}\}$$
[0025] $$y_i = \{b_{ij} \mid s_{ij} \ge T_{bbox}\}$$
[0026] In the formula, $y_i$ is the positive pseudo-label of the $i$-th frame $x_i$ of the unlabeled training video, the constant $D$ is the density threshold of the time window filter, the constant $T_{bbox}$ is the bounding box extraction threshold, $f_{pos}(\{p_i\})$ is the set of positive pseudo-label samples with high-confidence predictions and high surrounding positive density, and $(x_i, y_i)$ represents an image frame and its positive pseudo-label;
[0027] High-confidence predictions with low surrounding positive density are extracted as hard negative samples, as shown in the following formula:
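[0028] $$f_{hardNeg}(\{p_i\}) = \{(x_i, \varnothing) \mid d_i < D \wedge s_i \ge T_{bbox}\}$$
[0029] In the formula, $f_{hardNeg}(\{p_i\})$ is the set of hard negative samples with high-confidence predictions and low surrounding positive density, and $\varnothing$ indicates that the frame does not have a positive pseudo-label.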
[0030] Preferably, in step 3, when mixing the hard negative samples with the source domain report images, an asymmetric mixing algorithm is used to make use of the hard negative samples.
[0031] Preferably, in step 3, the virtual training sample is constructed by mixing the hard negative sample with the source domain report image as follows:
[0032]
[0033] In the formula, the positive term is any positive input frame in the source domain report images, paired with its label; the negative term is a hard negative frame randomly selected from the hard negative pseudo-label samples extracted in step 2; and λ is a random variable that controls the mixing ratio.
[0034] Preferably, an asymmetric discrete distribution is used to limit the influence of hard negative samples, and the mixing intensity and probability are controlled separately, as shown in the following equation:
[0035] $$P(\lambda = \gamma) = \rho, \quad \lambda \in \{0, \gamma\}$$
[0036] In the formula, λ can be 0 or γ, where the probability of λ being γ is ρ. γ and ρ are two constant hyperparameters. Hyperparameter γ is used to control the mixing intensity, and hyperparameter ρ is used to control the mixing probability.
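As a brief worked check of how the two hyperparameters interact, the expected mixing ratio follows directly from the two-point distribution above. The numerical values in the second line are illustrative assumptions only and do not come from the source.

```latex
\mathbb{E}[\lambda] = 0 \cdot (1 - \rho) + \gamma \cdot \rho = \rho\gamma
% Illustrative values (assumed): \gamma = 0.2,\ \rho = 0.5 \Rightarrow \mathbb{E}[\lambda] = 0.1
```

Because γ fixes how strongly a hard negative frame is blended in and ρ fixes how often blending happens at all, two settings with the same product ργ can still behave differently: one mixes rarely but strongly, the other often but weakly.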
[0037] This invention provides an image-video-joint polyp detection network (Evo-Pod Network), which is a training method based on unsupervised domain adaptation. It improves the accuracy and reliability of polyp detection and of dataset construction, while reducing annotation costs.
[0038] Compared with existing technical solutions, the present invention has the following beneficial effects:
[0039] 1) It reduces the cost of building the training dataset for the polyp detection model;
[0040] 2) It proposes the training method of the Evo-Pod Network (image-video-joint polyp detection network), which addresses the domain gap between colonoscopy report images and real-time videos.
Attached Figure Description
[0041] Figure 1 is a flowchart of the solution of the present invention;
[0042] Figure 2 illustrates the time window confirmation algorithm.
Detailed Implementation
[0043] The present invention will be further illustrated below with reference to specific embodiments. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. Furthermore, it should be understood that after reading the teachings of this invention, those skilled in the art can make various alterations or modifications to the invention, and these equivalent forms also fall within the scope defined by the appended claims.
[0044] I) Foundation: The base polyp detection model is trained using colonoscopy report images (hereinafter referred to as “source domain report images”) to learn polyps with morphological diversity.
[0045] II) Extraction: Pseudo-label samples are extracted from unlabeled target domain videos using a basic polyp detection model. This invention argues that two types of samples are valuable for domain adaptation because they typically do not appear in source domain report images: one is positive samples, also known as positive pseudo-label samples, which contain polyps under different shooting angles and lighting conditions; the other is hard negative samples, which contain various misleading noises, such as feces and bubbles.
[0046] Due to the temporal consistency of colonoscopy videos, adjacent frames should contain similar content. If one frame contains a polyp, its adjacent frames are also likely to contain polyps, and vice versa. In other words, if a polyp prediction is surrounded by a high density of other polyp predictions, it is likely a true positive; otherwise, it is a false positive. This invention refers to false positive samples detected using this method as hard negative samples.
[0047] This invention proposes a simple time window filtering (TWF) method to accurately extract positive pseudo-label samples and hard negative samples. The specific strategy is as follows:
[0048] The inference output $\mathrm{Det}(\{x_i\})$ obtained by passing the unlabeled training video (i.e., the "unlabeled target domain video" mentioned above) through the basic polyp detection model is represented as:
[0049] $$\mathrm{Det}(\{x_i\}) = \{p_i\} = \{\{(s_{ij}, b_{ij})\}\}$$
[0050] In the formula, $p_i$ is the box-level prediction for the $i$-th frame $x_i$ of the unlabeled training video, and $s_{ij}$ and $b_{ij}$ are respectively the confidence score and the bounding box coordinates of the $j$-th polyp prediction in $p_i$.
[0051] Based on the maximum confidence score $s_i = \max_j(s_{ij})$ of the box-level prediction $p_i$, each frame of the unlabeled training video is assigned a binary classification prediction $c_i$ (positive is 1, negative is 0), as shown in the following formula:
[0052] $$c_i = \begin{cases} 1, & s_i \ge T_{cls} \\ 0, & \text{otherwise} \end{cases}$$
[0053] In the formula, the constant $T_{cls}$ is the classification threshold.
[0054] Calculate the surrounding positive density $d_i$ of the $i$-th frame $x_i$ of the unlabeled training video:
[0055] $$d_i = \max_{i-L+1 \le t \le i} \frac{1}{L} \sum_{k=t}^{t+L-1} c_k$$
[0056] In the formula, $t$ is the start time of the time window, and $L$ is the length of the time window. This invention defines $d_i$ as the maximum density over the set of sliding time windows $[t, t+L-1]$ that contain frame $i$, instead of using a single fixed time window.
[0057] High-confidence predictions with high surrounding positive density are extracted as positive pseudo-label samples, as shown in the following formula:
[0058] $$f_{pos}(\{p_i\}) = \{(x_i, y_i) \mid d_i \ge D \wedge s_i \ge T_{bbox}\}$$
[0059] $$y_i = \{b_{ij} \mid s_{ij} \ge T_{bbox}\}$$
[0060] In the formula, $y_i$ is the positive pseudo-label of the $i$-th frame $x_i$ of the unlabeled training video, the constant $D$ is the density threshold of the time window filter, the constant $T_{bbox}$ is the bounding box extraction threshold, $f_{pos}(\{p_i\})$ is the set of positive pseudo-label samples with high-confidence predictions and high surrounding positive density, and $(x_i, y_i)$ represents an image frame and its positive pseudo-label.
[0061] High-confidence predictions with low surrounding positive density are extracted as false positive samples, i.e., as hard negative samples, as shown in the following formula:
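[0062] $$f_{hardNeg}(\{p_i\}) = \{(x_i, \varnothing) \mid d_i < D \wedge s_i \ge T_{bbox}\}$$
[0063] In the formula, $f_{hardNeg}(\{p_i\})$ is the set of hard negative samples with high-confidence predictions and low surrounding positive density, and $\varnothing$ indicates that the frame does not have a positive pseudo-label.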
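The time window filtering strategy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `frame_scores`, `positive_density`, and `time_window_filter`, the `(x1, y1, x2, y2)` box format, the score of 0 for frames without predictions, the treatment of frames outside the video as negative, and the hard negative rule (density below D with maximum score at least T_bbox) are all this sketch's reading of the description. The density is computed directly, without the faster algorithm of Figure 2.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format
FramePred = List[Tuple[float, Box]]      # box-level prediction p_i: (s_ij, b_ij) pairs


def frame_scores(preds: List[FramePred]) -> List[float]:
    """s_i = max_j s_ij; frames without any prediction get 0.0 (an assumption)."""
    return [max((s for s, _ in p), default=0.0) for p in preds]


def positive_density(c: List[int], i: int, L: int) -> float:
    """d_i: maximum density over all length-L windows [t, t+L-1] containing frame i.

    Frames outside the video are counted as negative (an assumption).
    """
    n = len(c)
    best = 0.0
    for t in range(i - L + 1, i + 1):
        total = sum(c[k] for k in range(t, t + L) if 0 <= k < n)
        best = max(best, total / L)
    return best


def time_window_filter(
    preds: List[FramePred], L: int, D: float, t_cls: float, t_bbox: float
) -> Tuple[List[Tuple[int, List[Box]]], List[int]]:
    """Return (positives, hard_negatives).

    positives: (frame index, pseudo-label boxes y_i) for d_i >= D and s_i >= T_bbox.
    hard_negatives: frame indices with s_i >= T_bbox but d_i < D (assumed rule).
    """
    s = frame_scores(preds)
    c = [1 if si >= t_cls else 0 for si in s]  # binary prediction c_i
    positives: List[Tuple[int, List[Box]]] = []
    hard_negatives: List[int] = []
    for i, p in enumerate(preds):
        if s[i] < t_bbox:
            continue  # not a high-confidence frame; neither kind of sample
        if positive_density(c, i, L) >= D:
            positives.append((i, [b for sc, b in p if sc >= t_bbox]))
        else:
            hard_negatives.append(i)
    return positives, hard_negatives
```

For example, with L = 3, D = 0.6, T_cls = 0.5, and T_bbox = 0.7, a run of three consecutive confident frames is kept as positive pseudo-label samples, whereas a single confident frame surrounded by low-score frames has a density of 1/3 and is treated as a hard negative.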
[0064] To support larger time window lengths $L$, this embodiment of the invention does not calculate $d_i$ directly. Instead, it uses the algorithm shown in Figure 2 to determine whether $d_i \ge D$, which reduces the complexity from $O(NL^2)$ to $O(N)$, where $N$ is the number of video frames. The algorithm includes the following steps:
[0065] Step 1, Initialization: Set all flags $\{v_i\}$ to 0, and initialize the summation variable $S$ and the position variable $P$.
[0066] Step 2, Traverse video frames: Loop over the frames $i$ from frame 1 to frame $N$, and perform the following steps:
[0067] Step 201, Accumulate the predicted value: Add the predicted value $c_i$ of the current frame to the summation variable $S$.
[0068] Step 202, Sliding window: If the current frame index $i$ is greater than the time window length $L$, subtract the predicted value $c_{i-L}$ of frame $i-L$ from $S$ to maintain the length of the sliding window.
[0069] Step 203, Density calculation and verification: If the density $S/L$ within the current window is greater than or equal to the threshold $D$, then:
[0070] Update positive sample flags: Set the flags $v_j$ to 1 for all frames $j$ in the range from $\max(P+1, i-L+1)$ to $i$.
[0071] Update position variable: Set the position variable $P$ to the current frame index $i$.
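The steps above can be sketched in Python as shown below. The function name `density_flags` is hypothetical, indices are 0-based rather than 1-based as in the text, and the initial values of S and P (0 and "no frame flagged yet") are assumptions, since the text only says that they are initialized.

```python
from typing import List


def density_flags(c: List[int], L: int, D: float) -> List[int]:
    """Return flags v where v[i] = 1 if some length-L window containing frame i
    has density >= D, using a single pass over the frames.

    c: binary per-frame predictions c_i; L: window length; D: density threshold.
    """
    n = len(c)
    v = [0] * n   # Step 1: all flags start at 0
    S = 0         # running sum of c over the current window (assumed to start at 0)
    P = -1        # index of the last frame already flagged (assumed: none yet)
    for i in range(n):                # Step 2: traverse the frames
        S += c[i]                     # Step 201: accumulate the current prediction
        if i >= L:                    # Step 202: drop the frame that left the window
            S -= c[i - L]
        if S / L >= D:                # Step 203: the window ending at i is dense enough
            # Flag only frames not flagged before, so each frame is written once.
            for j in range(max(P + 1, i - L + 1), i + 1):
                v[j] = 1
            P = i
    return v
```

Because the flagging loop always starts after position P, each frame is flagged at most once, which is what keeps the total work linear in the number of frames regardless of L.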
[0072] III) Application: Use the pseudo-labeled samples obtained in step II) to improve polyp detection performance.
[0073] This invention first retrains the basic polyp detection model using source domain report images mixed with hard negative samples to reduce false positives caused by various noises in real-time video. Then, the basic polyp detection model is fine-tuned using positive pseudo-label samples to adapt to various shooting conditions in real-time video. The specific strategy is as follows:
[0074] This invention utilizes the positive pseudo-label samples and hard negative pseudo-label samples extracted in step II) to adapt the basic polyp detection model to the target domain.
[0075] Since negative samples cannot be directly used to train the basic polyp detection model, this invention proposes an asymmetric mixing algorithm to make use of the hard negative samples. It retrains the basic model by mixing source domain report images with hard negative samples to reduce false positives. Specifically, this invention randomly adds hard negative samples to positive samples at the pixel level (after resizing them to the same size), using the background noise in the hard negative samples to augment the positive training data. Formally, the virtual training samples are constructed in this invention as follows:
[0076]
[0077] In the formula, the positive term is any positive input frame in the training dataset, paired with its label; the negative term is a hard negative frame randomly selected from the hard negative pseudo-label samples extracted in step II); and λ is a random variable that controls the mixing ratio.
[0078] This invention employs an asymmetric discrete distribution to limit the influence of hard negative samples and controls the mixing intensity and probability separately, as shown in the following equation:
[0079] $$P(\lambda = \gamma) = \rho, \quad \lambda \in \{0, \gamma\}$$
[0080] In the formula, λ can be 0 or γ, where the probability of λ being γ is ρ. γ and ρ are two constant hyperparameters that allow us to adjust the influence of hard negative samples, where hyperparameter γ controls the mixing intensity and hyperparameter ρ controls the mixing probability.
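The asymmetric mixing described above can be sketched in Python as follows. The mixing formula itself is not reproduced in this text, so the convex pixel-wise blend `(1 - lam) * x + lam * x_neg` with an unchanged label is an assumption of this sketch, as are the function names and the representation of an image as a flat list of pixel values already resized to a common size.

```python
import random
from typing import Any, List, Sequence, Tuple


def sample_lambda(gamma: float, rho: float, rng: random.Random) -> float:
    """Draw lambda from the two-point distribution: gamma with probability rho, else 0."""
    return gamma if rng.random() < rho else 0.0


def asymmetric_mix(
    positive: Sequence[float], hard_negative: Sequence[float], lam: float
) -> List[float]:
    """Blend pixels as (1 - lam) * x + lam * x_neg (ASSUMED form of the mixing)."""
    if len(positive) != len(hard_negative):
        raise ValueError("images must be resized to the same size before mixing")
    return [(1.0 - lam) * p + lam * q for p, q in zip(positive, hard_negative)]


def make_virtual_sample(
    positive: Sequence[float],
    label: Any,
    hard_negatives: Sequence[Sequence[float]],
    gamma: float,
    rho: float,
    rng: random.Random,
) -> Tuple[List[float], Any]:
    """Build one virtual training sample; the label of the positive frame is kept."""
    negative = rng.choice(hard_negatives)  # random hard negative frame
    lam = sample_lambda(gamma, rho, rng)   # 0 leaves the positive frame unchanged
    return asymmetric_mix(positive, negative, lam), label
```

The asymmetry in this sketch is that only the image is blended while the label always comes from the positive frame, and that λ never exceeds γ, so the hard negative frame acts as background noise rather than as a training target.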
[0081] Then, the present invention fine-tunes the basic polyp detection model using positive pseudo-label samples to adapt to various shooting conditions of real-time video.
Claims
1. A method for training a polyp detection model based on deep learning, characterized in that it includes the following steps:
Step 1: Train the basic polyp detection model using source domain report images;
Step 2: Use the trained basic polyp detection model to extract pseudo-label samples from the unlabeled target domain video. The pseudo-label samples include positive pseudo-label samples and hard negative samples. The positive pseudo-label samples contain polyps under different shooting angles and lighting conditions, while the hard negative samples contain various misleading noises. A time window filter is used to extract the positive pseudo-label samples and the hard negative samples. During extraction, whether a frame of the target domain video is a positive pseudo-label sample or a hard negative sample is determined by judging whether a polyp prediction in that frame is surrounded by a high density of other polyp predictions: if it is, the current frame of the video is a positive pseudo-label sample; otherwise, the current frame of the video is a hard negative sample. The time window filter extracts the positive pseudo-label samples and the hard negative samples using the following steps:
The inference output $\mathrm{Det}(\{x_i\})$ obtained from the target domain video using the basic polyp detection model is represented as:
$$\mathrm{Det}(\{x_i\}) = \{p_i\} = \{\{(s_{ij}, b_{ij})\}\}$$
In the formula, $p_i$ is the box-level prediction for the $i$-th frame $x_i$ of the target domain video, and $s_{ij}$ and $b_{ij}$ are respectively the confidence score and the bounding box coordinates of the $j$-th polyp prediction in $p_i$;
Based on the maximum confidence score $s_i = \max_j(s_{ij})$ of the box-level prediction $p_i$, assign a binary classification prediction $c_i$ to each frame of the unlabeled training video, as shown in the following formula:
$$c_i = \begin{cases} 1, & s_i \ge T_{cls} \\ 0, & \text{otherwise} \end{cases}$$
In the formula, the constant $T_{cls}$ is the classification threshold;
Calculate the surrounding positive density $d_i$ of the $i$-th frame $x_i$ of the unlabeled training video, as shown in the following formula:
$$d_i = \max_{i-L+1 \le t \le i} \frac{1}{L} \sum_{k=t}^{t+L-1} c_k$$
In the formula, $t$ is the start time of the time window, and the constant $L$ is the length of the time window;
High-confidence predictions with high surrounding positive density are extracted as positive pseudo-label samples, as shown in the following formula:
$$f_{pos}(\{p_i\}) = \{(x_i, y_i) \mid d_i \ge D \wedge s_i \ge T_{bbox}\}, \quad y_i = \{b_{ij} \mid s_{ij} \ge T_{bbox}\}$$
In the formula, $y_i$ is the positive pseudo-label of the $i$-th frame $x_i$ of the unlabeled training video, the constant $D$ is the density threshold of the time window filter, the constant $T_{bbox}$ is the bounding box extraction threshold, and $f_{pos}(\{p_i\})$ is the set of positive pseudo-label samples with high-confidence predictions and high surrounding positive density;
High-confidence predictions with low surrounding positive density are extracted as hard negative samples, as shown in the following formula:
$$f_{hardNeg}(\{p_i\}) = \{(x_i, \varnothing) \mid d_i < D \wedge s_i \ge T_{bbox}\}$$
In the formula, $f_{hardNeg}(\{p_i\})$ is the set of hard negative samples with high-confidence predictions and low surrounding positive density, and $\varnothing$ indicates that the frame does not contain any positive pseudo-labels;
Step 3: Retrain the basic polyp detection model using source domain report images mixed with hard negative samples, and then fine-tune the basic polyp detection model using positive pseudo-label samples.
2. The method for training a polyp detection model based on deep learning as described in claim 1, characterized in that, in step 3, when the hard negative sample is mixed with the source domain report image, an asymmetric mixing algorithm is used to make use of the hard negative sample.
3. The method for training a polyp detection model based on deep learning as described in claim 2, characterized in that, in step 3, the virtual training samples are constructed by mixing the hard negative samples with the source domain report images, in which the positive term is any positive input frame in the source domain report images, paired with its label; the negative term is a hard negative frame randomly selected from the hard negative samples extracted in step 2; and λ is a random variable used to control the mixing ratio.
4. The method for training a polyp detection model based on deep learning as described in claim 3, characterized in that an asymmetric discrete distribution is used to limit the influence of hard negative samples, and the mixing intensity and probability are controlled separately, as shown in the following equation:
$$P(\lambda = \gamma) = \rho, \quad \lambda \in \{0, \gamma\}$$
In the formula, $\lambda$ can be 0 or $\gamma$, where the probability of $\lambda$ being $\gamma$ is $\rho$; $\gamma$ and $\rho$ are two constant hyperparameters, where $\gamma$ is used to control the mixing intensity and $\rho$ is used to control the mixing probability.