Gastrointestinal endoscopic lesion classification model training method based on small sample learning
By constructing a multimodal time-series graph of gastrointestinal endoscopy and dynamic cross-modal attention fusion, the problems of insufficient information utilization and overfitting due to few-sample learning in gastrointestinal endoscopy image classification are solved, and more efficient lesion classification is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING SUOTU TECH CO LTD
- Filing Date
- 2026-04-08
- Publication Date
- 2026-06-26
AI Technical Summary
Existing methods for classifying gastrointestinal endoscopic images draw on limited information: the correlation between modalities and the dynamic changes of lesions in time and space are not fully utilized, and the models are prone to overfitting and generalize poorly in small-sample learning scenarios.
By acquiring white light endoscopy, NBI, and confocal images, a modal temporal graph is constructed, 3D spatiotemporal alignment is performed, a dynamic cross-modal attention weight graph is generated, and a meta-learning classification model is trained by combining cross-modal shared features and temporal dynamic features. The attention weights are then visualized using a heatmap.
It improves the model's utilization and comprehensiveness of lesion information, enhances its generalization ability in small samples, and enables more accurate classification of gastrointestinal endoscopic images.
Figure CN122023944B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer-aided diagnostic technology, and more specifically to a training method for a gastrointestinal endoscopic lesion classification model based on few-sample learning.
Background Technology
[0002] Gastrointestinal endoscopy is a crucial method for diagnosing gastrointestinal diseases. However, the interpretation of endoscopic images highly depends on the experience and expertise of physicians, and discrepancies in interpretation can exist between different doctors. To improve the accuracy, efficiency, and consistency of diagnosis, computer-aided diagnostic systems based on deep learning have emerged. Early deep learning models, such as convolutional neural networks, achieved significant success in the classification of gastrointestinal endoscopic images. These models typically require large amounts of labeled data for training to learn complex image features. However, in the field of gastrointestinal endoscopy, obtaining large-scale, high-quality, finely labeled datasets presents numerous challenges.
[0003] Few-shot learning is a technique developed to address the problem of data scarcity. It enables models to quickly learn and identify new categories with only a small number of labeled samples, thereby developing more practical, easier-to-deploy, and less data-dependent computer-aided diagnostic systems, ultimately helping doctors improve diagnostic efficiency and accuracy.
[0004] However, the above process still has the following drawbacks:
[0005] Firstly, existing methods may only use single-modality gastrointestinal endoscopy images for training, such as using only white light endoscopy images. However, the information contained in a single-modality image is limited and may not be able to fully and accurately reflect the characteristics of the lesion.
[0006] Secondly, existing methods may ignore the dynamic changes of lesions in time and space, and may only focus on features within a single modality during feature extraction, without considering the correlation between different modalities and the dynamic changes of lesions in the time dimension.
[0007] Third, in small-sample learning scenarios, traditional classification models are prone to overfitting due to the limited number of training samples, resulting in poor generalization ability of the model on new samples.
Summary of the Invention
[0008] To overcome the aforementioned deficiencies in the prior art, this invention provides a training method for a gastrointestinal endoscopic lesion classification model based on few-sample learning, in order to solve the problems existing in the background art.
[0009] This invention provides the following technical solution: a training method for a gastrointestinal endoscopic lesion classification model based on few-sample learning, comprising:
[0010] S1: Obtain white light endoscopy images, NBI images, and confocal images of the digestive tract by locating the lesion area, label the lesion area in the white light endoscopy image, map the labeling information with the NBI image and confocal image, and construct a modal time series diagram;
[0011] S2: Based on the constructed modal time series diagram, the annotation information of different modalities is unified into the same coordinate system to form three-dimensional spatiotemporally aligned multimodal data, and then the aligned multimodal data is classified according to the lesion state;
[0012] S3: Perform dynamic cross-modal attention fusion on the classified multimodal data to generate a dynamic cross-modal attention weight map;
[0013] S4: Compare the fused dynamic cross-modal attention weight map with tasks across different modalities to obtain cross-modal shared features. Combine the fused dynamic cross-modal attention weight map with the temporal information in the modal temporal map to compare tasks across different spatiotemporal periods to obtain temporal dynamic features.
[0014] S5: Combine cross-modal shared features with temporal dynamic features to form cross-modal temporal consistency features, and combine them with cross-modal attention weight maps to train the meta-learning classification model;
[0015] S6: Visualize attention weights using heatmaps;
[0016] S7: Export the trained meta-learning classification model and deploy it to a real-world application platform to classify new gastrointestinal endoscopic images.
[0017] Preferably, the step of constructing the modal time series diagram includes:
[0018] S101: The lesion area in the digestive tract is initially located using white light endoscopy, and white light endoscopy images of the lesion area are acquired. Subsequently, during the same examination, the system is switched to NBI endoscopy mode to acquire NBI images of the same lesion area. Finally, a confocal laser microendoscopy probe is used to acquire cell-level resolution confocal images of the area.
[0019] S102: Perform image preprocessing on the acquired white light endoscopy images, NBI images, and cell-level resolution confocal images;
[0020] S103: Mark the lesion area in the white light endoscopy image. The marking information is automatically mapped to the NBI image and confocal image through image registration technology to achieve spatial alignment of the lesion area between multimodal images. At the same time, modality-specific feature data is extracted from the marked area and combined with the examination timestamp to construct a spatiotemporally aligned modal time series diagram.
[0021] Preferably, the formation of the three-dimensional spatiotemporally aligned multimodal data includes:
[0022] Based on the modal time sequence diagram, the annotation information of different modes is transformed from their original coordinate system to a unified coordinate system;
[0023] The structural similarity between the image containing the transformed labeled region and the image containing the corresponding labeled region in the reference modal image is calculated to verify spatial alignment. If the structural similarity is greater than the preset similarity threshold, the spatial alignment verification is deemed successful.
[0024] After confirming that the spatial alignment verification is passed, the time difference between different modal acquisitions is compensated to ensure time axis synchronization, and the inter-frame difference of the aligned video is calculated. If the inter-frame difference is less than the preset inter-frame difference threshold, the time alignment verification is confirmed to be passed.
[0025] The spatially and temporally aligned modal data and their corresponding annotation information are integrated to form three-dimensional spatiotemporally aligned multimodal data. Then, a deep learning model is used to automatically classify the multimodal data according to the lesion state.
[0026] Preferably, the classification of the multimodal data is achieved by dividing the aligned different modal data into data subsets according to the lesion state, thereby forming a multimodal subset dataset.
[0027] Preferably, the step of generating a dynamic cross-modal attention weight map includes:
[0028] Construct a network structure that dynamically calculates attention weights between different modalities, which contains multiple branches, each branch processes data from one modality, and realizes information interaction between different modalities through an interaction layer;
[0029] The classified multimodal data are input into the corresponding branches of the dynamic cross-modal attention network. In the interaction layer, the similarity between different modalities is dynamically calculated to obtain the attention weights between different modalities. Based on the attention weights between different modalities, the data of different modalities are fused. At the same time, during the attention fusion process, the attention weights between different modalities at each position are recorded to generate a dynamic cross-modal attention weight map.
[0030] Preferably, the step of comparing the fused dynamic cross-modal attention weight map with tasks across different modalities to obtain cross-modal shared features includes:
[0031] By extracting attention weight features of different modalities from the dynamic cross-modal attention weight map, calculating the attention weight distribution, comparing the attention weight distributions of different modalities on the same task, calculating the difference values between different modalities, identifying information that is jointly concerned by multiple modalities based on the difference values, and using it as cross-modal shared features.
[0032] Preferably, the step of combining the fused dynamic cross-modal attention weight map with the temporal information in the modality temporal graph to compare different spatiotemporal tasks and obtain temporal dynamic features includes:
[0033] Temporal features are extracted by performing temporal convolution on the modal temporal graph. Then, the extracted temporal features are weighted and aggregated using element-wise multiplication and a dynamic cross-modal attention weight map as weights. The weighted aggregated features are then compared and analyzed with features from different spatiotemporal tasks. By comparing the similarity differences, features with significant discriminative power are selected as the temporal dynamic features.
[0034] Preferably, the training of the meta-learning classification model includes:
[0035] The extracted cross-modal shared features and temporal dynamic features are concatenated to obtain a joint feature vector. Principal component analysis is then used to reduce the dimensionality of the joint feature vector. After concatenation and dimensionality reduction, a feature vector that captures both cross-modal shared information and temporal dynamic change information is obtained, which serves as the cross-modal temporal consistency feature.
[0036] The meta-learning classification model is trained using cross-modal temporal consistency features;
[0037] Export the trained meta-learning classification model into a deployable format and deploy it on different platforms.
[0038] Preferably, the attention visualization includes:
[0039] The attention weights are visualized using heatmaps, which map the attention weights onto the pixels of the image. Areas with higher weight values are represented by brighter colors, and the areas that the model focuses on in the image are shown.
[0040] Preferably, the deployment of the meta-learning classification model includes:
[0041] The trained meta-learning classification model is exported in a deployable format and deployed on different platforms. The deployed model is then applied to actual gastrointestinal endoscopic lesion classification tasks to classify new endoscopic images.
[0042] The technical effects and advantages of this invention are as follows:
[0043] (1) By locating the lesion area, white light endoscopy images, NBI images and confocal images are obtained, and the annotation information is mapped to construct a modal time series diagram. This integration of multimodal data makes full use of the advantages of different modal images, provides the model with richer lesion information, and improves the utilization rate of data.
[0044] (2) By unifying the annotation information of different modalities into the same coordinate system to form three-dimensional spatiotemporally aligned multimodal data, the information of lesions in time and space can be considered at the same time, and the characteristics and development laws of lesions can be understood more comprehensively, which further improves the efficiency of data utilization. By comparing tasks between different modalities through dynamic cross-modal attention weight map, cross-modal shared features can be obtained. Attention weights can be dynamically allocated according to the correlation between different modal data to highlight key information, thereby extracting cross-modal shared features, making the model's understanding of lesions more comprehensive. By combining the fused dynamic cross-modal attention weight map with the temporal information in the modal temporal map, the temporal dynamic features can be obtained by comparing different spatiotemporal tasks. The model can capture the dynamic changes of lesions in the time dimension and extract temporal dynamic features with discriminativeness, which further enriches the feature representation.
[0045] (3) By combining cross-modal shared features with temporal dynamic features to form cross-modal temporal consistency features, and combining them with cross-modal attention weight maps, the meta-learning classification model can be trained, enabling the model to quickly adapt to new tasks. By learning general learning strategies on small sample data, the model's generalization ability on new samples can be improved.
Attached Figure Description
[0046] Figure 1 is a diagram illustrating the method steps of the present invention.
[0047] Figure 2 is a system structure block diagram of the present invention.
Detailed Implementation
[0048] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings. In addition, the forms of the various structures described in the following embodiments are merely illustrative. The method for training a gastrointestinal endoscopic lesion classification model based on small sample learning involved in the present invention is not limited to the structures described in the following embodiments. All other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0049] As shown in Figure 1, this embodiment provides a training method for a gastrointestinal endoscopic lesion classification model based on few-shot learning, including:
[0050] S1: Obtain white light endoscopy images, NBI images, and confocal images of the digestive tract by locating the lesion area, label the lesion area in the white light endoscopy image, and map the labeling information with the NBI image and confocal image to construct a modal time series diagram.
[0051] In this embodiment, the step of constructing the modal time series diagram includes:
[0052] S101: The lesion area in the digestive tract is initially located using white light endoscopy, and white light endoscopy images of the lesion area are acquired. Subsequently, during the same examination, the system is switched to NBI endoscopy mode to acquire NBI images of the same lesion area. Finally, a confocal laser microendoscopy probe is used to acquire cell-level resolution confocal images of the area.
[0053] S102: Perform image preprocessing on the acquired white light endoscopy images, NBI images, and cell-level resolution confocal images;
[0054] S103: Mark the lesion area in the white light endoscopy image. The marking information is automatically mapped to the NBI image and confocal image through image registration technology to achieve spatial alignment of the lesion area between multimodal images. At the same time, modality-specific feature data is extracted from the marked area and combined with the examination timestamp to construct a spatiotemporally aligned modal time series diagram.
[0055] Specifically, an object detection model or segmentation model is applied to the white light endoscopy image to generate a bounding box or segmentation mask for the lesion area. Based on the bounding box or mask, the lens angle is automatically adjusted to center the lesion area in the field of view. Then, the system is simultaneously switched to NBI and confocal modes to acquire high-contrast vascular images and cellular microstructure images of the same lesion area. A feature point matching algorithm is used to verify that the spatial overlap of the three modal images is greater than the overlap threshold. If it is, the acquired images are used to construct the modal time series map; otherwise, the position of the endoscope is adjusted and the images of the lesion area are acquired again. After the lesion area is annotated in the white light endoscopy image, a non-rigid image registration technique automatically maps the annotation information to the NBI and confocal images, taking into account the flexible deformation of digestive tract tissues (such as gastric and colonic peristalsis), to achieve spatial alignment of the lesion area between multimodal images. The area and texture features of the annotated area are calculated, vascular density is extracted, and cell nuclear area and nucleocytoplasmic ratio are quantified. The features are then arranged according to the examination timestamp to generate a time series curve, resulting in time series data of multimodal features.
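As a minimal sketch of the last step above (arranging per-modality features by examination timestamp), the following Python fragment may help. The class names, the modality labels "WLE" and "NBI", and the feature keys are hypothetical and are not taken from the invention; the feature values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModalObservation:
    timestamp: float   # examination timestamp, assumed to be in seconds
    modality: str      # hypothetical labels such as "WLE", "NBI", "CLE"
    features: dict     # features measured in the annotated lesion region


@dataclass
class ModalTimeSeries:
    observations: list = field(default_factory=list)

    def add(self, obs):
        # Keep observations ordered by timestamp after every insertion.
        self.observations.append(obs)
        self.observations.sort(key=lambda o: o.timestamp)

    def curve(self, modality, feature_name):
        # Return (timestamp, value) pairs for one feature of one modality.
        return [(o.timestamp, o.features[feature_name])
                for o in self.observations
                if o.modality == modality and feature_name in o.features]


series = ModalTimeSeries()
series.add(ModalObservation(12.0, "NBI", {"vessel_density": 0.31}))
series.add(ModalObservation(5.0, "WLE", {"area": 140.0}))
series.add(ModalObservation(20.0, "NBI", {"vessel_density": 0.36}))
nbi_curve = series.curve("NBI", "vessel_density")
```

Here `nbi_curve` holds the vascular-density time series curve for the NBI modality in timestamp order.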
[0056] S2: Based on the constructed modal time series diagram, the annotation information of different modalities is unified into the same coordinate system to form three-dimensional spatiotemporally aligned multimodal data, and then the aligned multimodal data is classified according to the lesion state.
[0057] In this embodiment, the formation of the three-dimensional spatiotemporally aligned multimodal data includes:
[0058] Based on the modal time sequence diagram, the annotation information of different modes is transformed from their original coordinate system to a unified coordinate system;
[0059] The structural similarity between the image containing the transformed labeled region and the image containing the corresponding labeled region in the reference modal image is calculated to verify spatial alignment. If the structural similarity is greater than the preset similarity threshold, the spatial alignment verification is deemed successful.
[0060] After confirming that the spatial alignment verification is passed, the time difference between different modal acquisitions is compensated to ensure time axis synchronization, and the inter-frame difference of the aligned video is calculated. If the inter-frame difference is less than the preset inter-frame difference threshold, the time alignment verification is confirmed to be passed.
[0061] The spatially and temporally aligned modal data and their corresponding annotation information are integrated to form three-dimensional spatiotemporally aligned multimodal data. Then, a deep learning model is used to automatically classify the multimodal data according to the lesion state.
[0062] The classification of multimodal data involves dividing the aligned different modal data into subsets according to the lesion state, thereby forming a multimodal subset dataset.
[0063] It's important to clarify that a unified coordinate system needs to be established. Typically, the coordinate system of one of the modes or a virtual world coordinate system is chosen as the target coordinate system. Using pre-calibrated parameters that describe the transformation relationships between different mode coordinate systems, the annotation information for each mode is transformed from its original coordinate system to this unified coordinate system. For example, a point $P_A$ in the coordinate system of mode A has the representation $P_U = R_A P_A + T_A$ in the unified coordinate system, where $R_A$ represents the rotation relationship between the original coordinate system A and the unified coordinate system, and $T_A$ indicates the offset of the origin of the original coordinate system A within the unified coordinate system.
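The coordinate transformation described above can be sketched as follows for the two-dimensional case. This is only an illustration under stated assumptions: the function names, the 90-degree rotation, and the offset values are hypothetical, and a real system would use its own pre-calibrated parameters (possibly in three dimensions).

```python
import math


def rotation_2d(theta):
    # 2x2 rotation matrix for an angle in radians.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def to_unified(point, rotation, offset):
    # Rotate the point, then shift it by the origin offset.
    x, y = point
    return (rotation[0][0] * x + rotation[0][1] * y + offset[0],
            rotation[1][0] * x + rotation[1][1] * y + offset[1])


# Hypothetical calibration of mode A relative to the unified system.
rotation_a = rotation_2d(math.pi / 2)
offset_a = (10.0, 5.0)

# Corners of an annotated bounding box in mode A coordinates.
bbox_corners_a = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
bbox_corners_u = [to_unified(p, rotation_a, offset_a) for p in bbox_corners_a]
```

Applying the same rotation and offset to every annotation point of a modality places all annotations of that modality in the unified coordinate system.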
[0064] The verification of spatial alignment involves selecting one or more key labeled regions and calculating the similarity between the corresponding image / point cloud region and the corresponding region in the reference modal image / point cloud after transformation to a unified coordinate system. The specific steps include: extracting the content of the corresponding labeled regions from the transformed modal image / point cloud and the reference modal image / point cloud respectively; calculating the SSIM value of these two regions, where the SSIM value ranges from [-1, 1], and the closer it is to 1, the more similar the two regions are in terms of brightness, contrast, and structure; setting a preset similarity threshold, and if the calculated SSIM value is greater than the threshold, the spatial alignment verification is considered to be successful, that is, the different modalities are basically aligned in space.
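A simplified sketch of the SSIM-based check above is given below. It computes a single global SSIM value over the whole region rather than the usual sliding-window average, uses the commonly cited constants 0.01 and 0.03, and assumes 8-bit grayscale patches; the threshold of 0.8 and all names are hypothetical.

```python
def ssim_global(region_a, region_b, dynamic_range=255.0):
    # Single-window SSIM over two equally sized grayscale regions.
    a = [float(v) for row in region_a for v in row]
    b = [float(v) for row in region_b for v in row]
    if len(a) != len(b) or not a:
        raise ValueError("regions must be non-empty and the same size")
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2   # stabilizes the contrast/structure term
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def spatially_aligned(region_a, region_b, threshold=0.8):
    # Alignment passes when SSIM exceeds the (hypothetical) preset threshold.
    return ssim_global(region_a, region_b) > threshold


reference_patch = [[0, 255], [255, 0]]
transformed_patch = [[0, 250], [250, 5]]
inverted_patch = [[255, 0], [0, 255]]
score = ssim_global(reference_patch, transformed_patch)
```

A nearly identical patch scores close to 1 and passes, while the inverted patch yields a negative score and fails the check.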
[0065] Time alignment requires compensating for the acquisition time difference between different modalities. For video stream data, interpolation or selecting the frame closest to the target timestamp may be necessary. The goal is to align the data of different modalities to the same time point or time window on the timeline. After alignment, the specific steps for time alignment verification include: selecting the aligned video sequence, calculating the inter-frame difference between two consecutive frames, setting a preset inter-frame difference threshold, and if the calculated inter-frame difference is less than the preset inter-frame difference threshold, the time alignment verification is considered to be successful, meaning that the data of different modalities are basically synchronized in time.
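The two operations above (selecting the frame closest to a target timestamp, then checking the inter-frame difference) could be sketched as follows. The mean absolute pixel difference is used here as one possible inter-frame difference measure; that choice, the threshold value, and the function names are assumptions.

```python
def nearest_frame_index(timestamps, target):
    # Index of the frame whose timestamp is closest to the target time.
    return min(range(len(timestamps)), key=lambda i: abs(timestamps[i] - target))


def mean_abs_frame_difference(frame_a, frame_b):
    # Mean absolute pixel difference between two equally sized frames.
    flat_a = [v for row in frame_a for v in row]
    flat_b = [v for row in frame_b for v in row]
    return sum(abs(x - y) for x, y in zip(flat_a, flat_b)) / len(flat_a)


def temporally_aligned(frames, threshold):
    # Verification passes when every consecutive pair differs by less
    # than the preset inter-frame difference threshold.
    diffs = [mean_abs_frame_difference(frames[i], frames[i + 1])
             for i in range(len(frames) - 1)]
    return all(d < threshold for d in diffs), diffs


wle_times = [0.00, 0.04, 0.08]
nbi_times = [0.01, 0.05, 0.09, 0.13]
matched = [nearest_frame_index(nbi_times, t) for t in wle_times]

aligned_frames = [
    [[10, 10], [10, 10]],
    [[11, 10], [10, 10]],
    [[12, 10], [10, 11]],
]
passed, diffs = temporally_aligned(aligned_frames, threshold=1.0)
```

Interpolation between neighbouring frames, which the text also mentions, is omitted from this sketch.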
[0066] The classification steps for multimodal data include:
[0067] S201: Extracting features related to disease state from three-dimensional spatiotemporally aligned multimodal data;
[0068] S202: Label the extracted feature data based on the known lesion status information, and divide the labeled dataset into training set, validation set and test set;
[0069] S203: Select a suitable classification model based on the characteristics of the data and the requirements of the classification task. Train the selected classification model using the training set, adjust the model parameters to minimize the classification error, and validate the model using the validation set during the training process. Adjust the model's hyperparameters based on the validation results to prevent overfitting or underfitting. Evaluate the trained model using the test set and calculate the model's classification accuracy, recall, F1 score, and other metrics to evaluate the model's generalization ability and classification performance.
[0070] S204: Input the extracted features related to the lesion state into the trained classification model for inference, and output the classification results. Based on the classification results of the model, classify the three-dimensional spatiotemporally aligned multimodal data according to the lesion state to obtain data subsets of different lesion states.
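Steps S203 and S204 can be illustrated with the short sketch below, which computes recall and F1 for one class and then groups samples into subsets by predicted lesion state. The threshold rule `toy_predict`, the state labels, and the sample values are hypothetical stand-ins for a trained classification model and real data.

```python
from collections import defaultdict


def recall_and_f1(y_true, y_pred, positive):
    # Recall and F1 score for a single class treated as positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1


def split_by_lesion_state(samples, predict):
    # Group aligned multimodal samples by the predicted lesion state.
    subsets = defaultdict(list)
    for sample in samples:
        subsets[predict(sample)].append(sample)
    return dict(subsets)


def toy_predict(sample):
    # Hypothetical rule standing in for the trained classifier.
    return "state_b" if sample["vessel_density"] > 0.5 else "state_a"


samples = [
    {"id": 1, "vessel_density": 0.2},
    {"id": 2, "vessel_density": 0.7},
    {"id": 3, "vessel_density": 0.9},
]
subsets = split_by_lesion_state(samples, toy_predict)
recall, f1 = recall_and_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"], positive="a")
```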
[0071] S3: Perform dynamic cross-modal attention fusion on the classified multimodal data to generate a dynamic cross-modal attention weight map.
[0072] In this embodiment, the step of generating a dynamic cross-modal attention weight map includes:
[0073] Construct a network structure that dynamically calculates attention weights between different modalities, which contains multiple branches, each branch processes data from one modality, and realizes information interaction between different modalities through an interaction layer;
[0074] The classified multimodal data are input into the corresponding branches of the dynamic cross-modal attention network. In the interaction layer, the similarity between different modalities is dynamically calculated to obtain the attention weights between different modalities. Based on the attention weights between different modalities, the data of different modalities are fused. At the same time, during the attention fusion process, the attention weights between different modalities at each position are recorded to generate a dynamic cross-modal attention weight map.
[0075] It should be noted that the dynamic cross-modal attention network contains multiple branches, each processing one modality of data, and the network structure is divided into modality branch layers and interaction layers, specifically including the following parts:
[0076] Modal branching layer: each branch $m$ includes a feature extraction module that maps the input modal data $X_m$ to a feature representation $F_m \in \mathbb{R}^{H \times W \times C}$, where $\mathbb{R}^{H \times W \times C}$ denotes the space of spatial feature maps, $H \times W$ indicates the spatial dimensions, and $C$ represents the number of channels;
[0077] Interaction layer: Dynamically calculates the similarity between modalities through an attention mechanism to generate an attention weight map;
[0078] The steps for calculating dynamic cross-modal attention weights include feature alignment, similarity calculation, and dynamic weight generation, specifically including the following parts:
[0079] Feature alignment: the features of different modalities are aligned to a uniform dimension via linear projection: $Q_m = F_m W_Q$, $K_n = F_n W_K$, $V_n = F_n W_V$, where $W_Q$, $W_K$, and $W_V$ represent the learnable parameter matrices, $Q_m$ indicates the query, $K_n$ indicates the key, and $V_n$ represents the value;
[0080] Similarity calculation: calculate the attention weight $A_{m \to n}$ of mode $m$ to mode $n$: $A_{m \to n} = \mathrm{softmax}\left( Q_m K_n^{T} / \sqrt{d} \right)$, where $d$ represents the feature dimension and $T$ denotes the transpose;
[0081] Dynamic weight generation: for modality $m$, its fused feature $\tilde{F}_m$ is: $\tilde{F}_m = \sum_{n \neq m} A_{m \to n} \odot V_n$, where $\odot$ represents element-wise multiplication;
[0082] For dynamic cross-modal attention weight map generation, the attention weights $A_{m \to n}(i, j)$ at each spatial location $(i, j)$ are recorded during the attention fusion process to generate a weight map $A$.
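The interaction layer can be approximated by standard scaled dot-product attention between two modalities, sketched below in plain Python. This is an assumption-laden illustration: features are flattened to one row per spatial position, identity matrices stand in for the learnable projections, and the fusion uses the usual matrix product of weights and values, which may differ from the element-wise form used by the invention.

```python
import math


def matmul(a, b):
    # Plain-list matrix product of a (r x k) and b (k x c).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)                      # subtract the max for stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(feat_m, feat_n, w_q, w_k, w_v):
    # Modality m supplies queries; modality n supplies keys and values.
    q = matmul(feat_m, w_q)
    k = matmul(feat_n, w_k)
    v = matmul(feat_n, w_v)
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    fused = matmul(weights, v)           # attention-weighted fusion
    return weights, fused


identity = [[1.0, 0.0], [0.0, 1.0]]      # stand-in for learned projections
wle_features = [[1.0, 0.0], [0.0, 1.0]]  # two positions, two channels
nbi_features = [[1.0, 0.0], [0.0, 1.0]]
attention, fused = cross_modal_attention(
    wle_features, nbi_features, identity, identity, identity)
```

Each row of `attention` holds the weights that one spatial position of the first modality assigns to the positions of the second; recording these rows per position gives a weight map of the kind described above.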
[0083] S4: Compare the fused dynamic cross-modal attention weight map with tasks across different modalities to obtain cross-modal shared features. Combine the fused dynamic cross-modal attention weight map with the temporal information in the modal temporal map to compare tasks across different spatiotemporal periods to obtain temporal dynamic features.
[0084] In this embodiment, comparing the fused dynamic cross-modal attention weight map with tasks across different modalities to obtain cross-modal shared features includes:
[0085] By extracting attention weight features of different modalities from the dynamic cross-modal attention weight map, calculating the attention weight distribution, comparing the attention weight distributions of different modalities on the same task, calculating the difference values between different modalities, identifying information that is jointly concerned by multiple modalities based on the difference values, and using it as a cross-modal shared feature;
[0086] The process of combining the fused dynamic cross-modal attention weight map with the temporal information in the modality temporal map to compare different spatiotemporal tasks yields temporal dynamic features, including:
[0087] Temporal features are extracted by performing temporal convolution on the modal temporal graph. Then, the extracted temporal features are weighted and aggregated using an element-wise multiplication method and a dynamic cross-modal attention weight map as weights. The weighted aggregated features are compared and analyzed with features from different spatiotemporal tasks. By comparing the similarity differences, features with significant discriminative power are selected as discriminative dynamic temporal features.
[0088] It should be noted that, for example, we assume the model processes two modalities, and the attention weight maps of the two modalities are denoted $A_1$ and $A_2$, each of size $H \times W$. For each modality's attention weight map, calculate its global distribution:
[0089] Attention distribution of modality 1: $P_1$;
[0090] Attention distribution of modality 2: $P_2$;
[0091] Cosine similarity is used to measure the distribution difference. The cosine similarity is calculated as $\cos(P_1, P_2) = \frac{P_1 \cdot P_2}{\lVert P_1 \rVert \, \lVert P_2 \rVert}$, and the distribution difference value is then calculated from the cosine similarity as $D = 1 - \cos(P_1, P_2)$;
[0092] If the difference value $D$ is less than the preset difference threshold $\tau$, the attention weights of the region are considered to be jointly attended to by multiple modalities. The specific steps are as follows:
[0093] For each spatial location $(i, j)$, calculate the local difference $D(i, j)$;
[0094] If $D(i, j) < \tau$, then $(i, j)$ is a shared region of interest;
[0095] The mean or maximum attention weight of the shared regions is extracted as the cross-modal shared feature $F_{\text{shared}}$.
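One possible reading of the shared-feature steps above is sketched below. Several details are assumptions rather than statements of the invention: each map is normalized by its sum to form a distribution, the global difference is one minus the cosine similarity, the local difference is the absolute difference of the two weights, a separate local threshold is used, and the shared feature averages the two modalities' weights over the shared locations.

```python
import math


def normalize(weight_map):
    # Turn a weight map into a distribution that sums to 1.
    total = sum(v for row in weight_map for v in row)
    return [[v / total for v in row] for row in weight_map]


def cosine_similarity(map_a, map_b):
    a = [v for row in map_a for v in row]
    b = [v for row in map_b for v in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def shared_feature(map_1, map_2, global_threshold, local_threshold, reduce="mean"):
    # Returns (global difference, shared locations, shared feature or None).
    difference = 1.0 - cosine_similarity(normalize(map_1), normalize(map_2))
    if difference >= global_threshold:
        return difference, [], None      # modalities attend too differently
    shared = [(i, j)
              for i, row in enumerate(map_1)
              for j, _ in enumerate(row)
              if abs(map_1[i][j] - map_2[i][j]) < local_threshold]
    values = [(map_1[i][j] + map_2[i][j]) / 2 for i, j in shared]
    if not values:
        return difference, shared, None
    feature = max(values) if reduce == "max" else sum(values) / len(values)
    return difference, shared, feature


attn_1 = [[0.9, 0.1], [0.2, 0.8]]
attn_2 = [[0.8, 0.2], [0.6, 0.7]]
difference, shared, feature = shared_feature(
    attn_1, attn_2, global_threshold=0.2, local_threshold=0.15)
```

In this toy case three of the four locations are attended to similarly by both modalities, and their averaged weights form the shared feature.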
[0096] The generation of temporal dynamic features includes:
[0097] Temporal features are extracted from the modal time series graph $G$ via temporal convolution: $F_t = \mathrm{Conv}_t(G; Z)$, where $Z$ represents the convolution kernel;
[0098] Convert the dynamic cross-modal attention map $A$ into temporal weights $W_t$;
[0099] The temporal features $F_t$ are combined with the weights $W_t$ through element-wise multiplication: $F_w = F_t \odot W_t$;
[0100] Pooling the weighted features yields the time-series dynamic features: $F_d = \mathrm{Pool}(F_w)$;
[0101] Compare $F_d$ with the features of different spatiotemporal tasks by calculating the similarity between $F_d$ and the template features of each task. If the similarity differs significantly from that of other tasks, $F_d$ is used as a time-series dynamic feature.
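The convolution, weighting, and pooling steps above might look as follows for a single scalar feature tracked over time. The sketch assumes a valid-mode one-dimensional convolution, one attention map per output time step, temporal weights taken as the normalized mean of each map, and sum pooling; none of these specifics are confirmed by the source, and the comparison against task templates is omitted.

```python
def temporal_conv(series, kernel):
    # Valid-mode 1D correlation along the time axis.
    k = len(kernel)
    return [sum(series[t + i] * kernel[i] for i in range(k))
            for t in range(len(series) - k + 1)]


def temporal_weights(attention_maps):
    # One weight per time step: the mean of each map, normalized to sum to 1.
    means = [sum(v for row in m for v in row) / sum(len(row) for row in m)
             for m in attention_maps]
    total = sum(means)
    return [m / total for m in means]


def temporal_dynamic_feature(series, kernel, attention_maps):
    features = temporal_conv(series, kernel)
    if len(attention_maps) != len(features):
        raise ValueError("need one attention map per convolved time step")
    weights = temporal_weights(attention_maps)
    weighted = [f * w for f, w in zip(features, weights)]  # element-wise product
    return sum(weighted)                                    # sum pooling


vessel_density = [0.30, 0.32, 0.40, 0.55]   # hypothetical feature over time
difference_kernel = [-1.0, 1.0]             # responds to change between steps
maps = [[[0.2]], [[0.3]], [[0.5]]]          # one 1x1 attention map per step
dynamic_feature = temporal_dynamic_feature(vessel_density, difference_kernel, maps)
```

With the difference kernel, the resulting value grows when the feature changes quickly at time steps that receive high attention.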
[0102] S5: Combine cross-modal shared features with temporal dynamic features to form cross-modal temporal consistency features, and combine them with cross-modal attention weight maps to train the meta-learning classification model.
[0103] In this embodiment, the training of the meta-learning classification model includes:
[0104] The extracted cross-modal shared features and temporal dynamic features are concatenated to obtain a joint feature vector. Principal component analysis is then used to reduce the dimensionality of the joint feature vector. After concatenation and dimensionality reduction, a feature vector that captures both cross-modal shared information and temporal dynamic change information is obtained, which serves as the cross-modal temporal consistency feature.
[0105] The meta-learning classification model is trained using cross-modal temporal consistency features;
[0106] Export the trained meta-learning classification model into a deployable format and deploy it on different platforms.
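The concatenation and principal component analysis step described in this embodiment can be sketched in plain Python as follows. Power iteration with deflation is used here as a simple stand-in for a library PCA routine, and the feature values and function names are hypothetical.

```python
import math


def center(rows):
    # Subtract the per-column mean from every row.
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows]


def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def top_component(cov, iterations=200):
    # Power iteration for the dominant eigenvector of a symmetric matrix.
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pca_project(rows, n_components):
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        v = top_component(cov)
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                         for a in range(d))
        components.append(v)
        # Deflate so the next pass finds the next principal direction.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]


shared = [[0.58], [0.61], [0.40], [0.35]]                       # hypothetical
temporal = [[0.10, 0.9], [0.12, 1.0], [0.30, 0.2], [0.33, 0.1]]  # hypothetical
joint = [s + t for s, t in zip(shared, temporal)]   # concatenation per sample
reduced = pca_project(joint, n_components=2)
```

Each row of `reduced` is a lower-dimensional feature vector for one sample, playing the role of the cross-modal temporal consistency feature.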
[0107] It's important to clarify the use of a few-shot learning meta-training dataset. This dataset typically contains multiple "tasks," each with a small number of labeled samples for training and some unlabeled or poorly labeled samples for evaluating the model's generalization ability on that task. In each meta-training task, extracted "cross-modal temporal consistency features" are fed into the meta-learning classification model. The model learns how to quickly adapt to new tasks using these features and accurately classify them on the query set. The core of meta-learning lies in optimizing the model's initial parameters, enabling it to quickly adapt and achieve good performance with only a small number of gradient updates when encountering a few new samples. The training process is repeated to optimize the model's fundamental parameters, giving it powerful rapid learning and generalization capabilities. After meta-training is complete, the performance of the trained meta-learning model is evaluated using independent validation or test sets to ensure it achieves the expected classification accuracy in few-shot scenarios. The best-performing model version is selected based on the evaluation results.
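To make the notion of a meta-training "task" concrete, the sketch below samples one N-way K-shot episode, split into a support set for adaptation and a query set for evaluation. The labels, feature tuples, and function name are hypothetical, and the gradient-based adaptation itself is not shown.

```python
import random


def sample_episode(dataset, n_way, k_shot, n_query, rng):
    # dataset maps a class label to a list of feature vectors.
    eligible = [label for label, items in dataset.items()
                if len(items) >= k_shot + n_query]
    classes = rng.sample(sorted(eligible), n_way)
    support, query = [], []
    for label in classes:
        picked = rng.sample(dataset[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]   # used to adapt
        query += [(x, label) for x in picked[k_shot:]]     # used to evaluate
    return support, query


# Hypothetical consistency features, five samples for each of three classes.
dataset = {
    "class_a": [(0.1 * i, 1.0) for i in range(5)],
    "class_b": [(0.1 * i, 2.0) for i in range(5)],
    "class_c": [(0.1 * i, 3.0) for i in range(5)],
}
rng = random.Random(0)
support, query = sample_episode(dataset, n_way=2, k_shot=2, n_query=2, rng=rng)
```

Repeating this sampling many times yields the stream of tasks over which the initial parameters of the meta-learning model would be optimized.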
[0108] S6: Visualize attention weights using heatmaps.
[0109] In this embodiment, the attention visualization includes:
[0110] The attention weights are visualized using heatmaps, which map the attention weights onto the pixels of the image. Areas with higher weight values are rendered in brighter colors, showing which areas of the image the model focuses on.
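A minimal sketch of this mapping is given below. It assumes the attention weights form a coarse 2-D grid and the image is a list of rows of RGB tuples. The color ramp, the blending factor `alpha`, and the function names are illustrative choices not specified by this embodiment.

```python
def normalize(weights):
    """Rescale a 2-D grid of attention weights to the range [0, 1]."""
    flat = [w for row in weights for w in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    return [[(w - lo) / span if span else 0.0 for w in row] for row in weights]


def upsample_nearest(grid, height, width):
    """Stretch a coarse grid to image size so that every pixel gets a weight."""
    gh, gw = len(grid), len(grid[0])
    return [[grid[r * gh // height][c * gw // width] for c in range(width)]
            for r in range(height)]


def heat_color(t):
    """Map t in [0, 1] to RGB: dark blue for low weights, bright yellow for high."""
    return (int(255 * t), int(255 * t), int(127 * (1 - t)))


def overlay(image, heat, alpha=0.5):
    """Blend the heatmap colors over an RGB image (rows of (r, g, b) tuples)."""
    return [[tuple(round((1 - alpha) * p + alpha * h)
                   for p, h in zip(pixel, heat_color(t)))
             for pixel, t in zip(img_row, heat_row)]
            for img_row, heat_row in zip(image, heat)]


# Hypothetical 2x2 attention grid and a uniform gray 4x4 image.
attention = [[0.1, 0.2], [0.3, 0.9]]
image = [[(100, 100, 100)] * 4 for _ in range(4)]
heat = upsample_nearest(normalize(attention), 4, 4)
blended = overlay(image, heat)
```

The pixel with the largest weight (bottom right in this example) ends up brighter than the pixel with the smallest weight (top left).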
[0111] S7: Export the trained meta-learning classification model and deploy it to a real-world application platform to classify new gastrointestinal endoscopic images.
[0112] In this embodiment, the deployment of the meta-learning classification model includes:
[0113] The trained meta-learning classification model is exported in a deployable format and deployed on different platforms. The deployed model is then applied to actual gastrointestinal endoscopic lesion classification tasks to classify new endoscopic images.
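The export-and-deploy step can be sketched as a round trip through a portable file. This embodiment does not name a deployable format, so JSON is used here purely as a stand-in. The linear classifier, the class names, and the file name are hypothetical.

```python
import json
import os
import tempfile


def export_model(weights, class_names, path):
    """Write parameters and label names to a portable JSON file."""
    payload = {"format_version": 1, "class_names": class_names, "weights": weights}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)


def load_model(path):
    """Read an exported model back on the target platform."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def classify(model, features):
    """Score each class with a linear layer and return the best label."""
    scores = [sum(w * f for w, f in zip(row, features)) for row in model["weights"]]
    return model["class_names"][scores.index(max(scores))]


# Hypothetical two-class linear classifier over three features.
weights = [[0.8, -0.2, 0.1], [-0.5, 0.9, 0.0]]
path = os.path.join(tempfile.mkdtemp(), "lesion_classifier.json")
export_model(weights, ["class_a", "class_b"], path)
deployed = load_model(path)
label = classify(deployed, [1.0, 0.2, 0.5])
```

Storing the label names alongside the parameters keeps the exported file self-describing, which helps when the same model is loaded on different platforms.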
[0114] As shown in Figure 2, this embodiment provides a system implementing the training method for a gastrointestinal endoscopic lesion classification model based on few-shot learning. The system includes a modal temporal graph construction module, a three-dimensional spatiotemporal alignment module, an attention weight graph generation module, a feature extraction module, a meta-learning training module, a visualization module, and a model deployment module. The modal temporal graph construction module is connected to the three-dimensional spatiotemporal alignment module, which is connected to the attention weight graph generation module. The attention weight graph generation module is connected to the feature extraction module, which is connected to the meta-learning training module. The attention weight graph generation module is connected to the visualization module, and the meta-learning training module is connected to the model deployment module.
[0115] The modal temporal graph construction module locates the lesion area to acquire white light endoscopy images, NBI images, and confocal images of the digestive tract, marks the lesion area in the white light endoscopy image, and maps the marking information to the NBI image and the confocal image to construct a modal temporal graph.
[0116] Based on the constructed modal temporal graph, the three-dimensional spatiotemporal alignment module unifies the annotation information of different modalities into the same coordinate system to form three-dimensional spatiotemporally aligned multimodal data, and then classifies the aligned multimodal data according to the lesion state.
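The alignment checks recited in claim 3 (a structural similarity test against a preset threshold, followed by an inter-frame difference test against a preset threshold) can be sketched as follows. The sketch assumes grayscale images given as lists of rows and computes structural similarity over the whole image as a single window, which is a simplification of the usual windowed computation. The threshold values and function names are hypothetical.

```python
def _flatten(img):
    return [float(v) for row in img for v in row]


def _mean(values):
    return sum(values) / len(values)


def global_ssim(img_a, img_b, dynamic_range=255.0):
    """Structural similarity computed over the whole image as one window."""
    x, y = _flatten(img_a), _flatten(img_b)
    mx, my = _mean(x), _mean(y)
    var_x = _mean([(a - mx) * (a - mx) for a in x])
    var_y = _mean([(b - my) * (b - my) for b in y])
    cov = _mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    c1 = (0.01 * dynamic_range) ** 2   # standard SSIM stabilizing constants
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (var_x + var_y + c2))


def mean_interframe_difference(frames):
    """Average absolute pixel difference between consecutive frames."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        p, c = _flatten(prev), _flatten(cur)
        diffs.append(_mean([abs(a - b) for a, b in zip(p, c)]))
    return _mean(diffs)


def alignment_verified(reference, transformed, frames,
                       ssim_threshold=0.9, diff_threshold=5.0):
    """Spatial check first (SSIM above threshold), then temporal check
    (inter-frame difference below threshold). Thresholds are placeholders."""
    if global_ssim(reference, transformed) <= ssim_threshold:
        return False
    return mean_interframe_difference(frames) < diff_threshold


# Hypothetical 2x2 reference region and a short aligned frame sequence.
reference = [[10, 20], [30, 40]]
inverted = [[40, 30], [20, 10]]
frames = [[[10, 10]], [[12, 10]], [[12, 14]]]
ok = alignment_verified(reference, reference, frames)
rejected = alignment_verified(reference, inverted, frames)
```

Running the spatial check before the temporal check mirrors the order in which claim 3 states the two verifications.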
[0117] The attention weight map generation module performs dynamic cross-modal attention fusion on the classified multimodal data to generate a dynamic cross-modal attention weight map.
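A minimal sketch of dynamic cross-modal attention fusion is shown below. It assumes each modality branch has already produced one feature vector per position. The scaled dot-product similarity, the softmax over modalities, and the modality keys are assumptions made for illustration, since this embodiment does not fix a particular similarity measure.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Convert similarity scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(modalities):
    """Fuse per-position features from several modality branches.

    `modalities` maps a modality name to a list of feature vectors, one per
    position. Returns the fused features and the per-position weight map.
    """
    names = list(modalities)
    n_pos = len(modalities[names[0]])
    fused, weight_map = [], []
    for p in range(n_pos):
        feats = [modalities[name][p] for name in names]
        dim = len(feats[0])
        scores = []
        for i, fi in enumerate(feats):
            others = [fj for j, fj in enumerate(feats) if j != i]
            # Interaction layer: average scaled similarity to the other modalities.
            scores.append(sum(_dot(fi, fj) for fj in others)
                          / (len(others) * math.sqrt(dim)))
        weights = softmax(scores)
        fused.append([sum(w * f[k] for w, f in zip(weights, feats))
                      for k in range(dim)])
        weight_map.append(dict(zip(names, weights)))  # recorded at each position
    return fused, weight_map


# Hypothetical branch outputs for two positions.
modalities = {
    "white_light": [[1.0, 0.0], [0.5, 0.5]],
    "nbi":         [[1.0, 0.0], [0.5, 0.5]],
    "confocal":    [[0.0, 1.0], [0.5, 0.5]],
}
fused, weight_map = fuse(modalities)
```

At the first position the white light and NBI features agree with each other and therefore receive larger weights than the confocal feature. At the second position all three modalities agree and the weights are equal.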
[0118] The feature extraction module compares the fused dynamic cross-modal attention weight map across tasks of different modalities to obtain cross-modal shared features. It also combines the fused dynamic cross-modal attention weight map with the temporal information in the modal temporal graph and compares tasks across different spatiotemporal periods to obtain temporal dynamic features.
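The two feature types can be sketched under simplifying assumptions. Shared features are approximated as the positions where two modalities' attention weights differ by less than a tolerance. Temporal features are obtained by a 1-D temporal convolution followed by element-wise multiplication with attention weights and aggregation over time, as recited in claim 1. The kernel, the tolerance, and the function names are hypothetical.

```python
def shared_positions(weights_a, weights_b, tolerance=0.1):
    """Positions where two modalities' attention weights differ by less than
    `tolerance`, taken here as the jointly attended (shared) information."""
    return [i for i, (a, b) in enumerate(zip(weights_a, weights_b))
            if abs(a - b) < tolerance]


def temporal_conv(series, kernel):
    """Valid 1-D convolution along the time axis, applied per feature channel.

    `series` is a list of time steps, each a list of channel values. As in
    most deep-learning code, the kernel is applied without flipping.
    """
    k = len(kernel)
    channels = len(series[0])
    return [[sum(kernel[i] * series[t + i][c] for i in range(k))
             for c in range(channels)]
            for t in range(len(series) - k + 1)]


def weighted_aggregate(features, attention):
    """Element-wise multiply features by attention weights, then sum over time."""
    weighted = [[f * a for f, a in zip(f_row, a_row)]
                for f_row, a_row in zip(features, attention)]
    return [sum(column) for column in zip(*weighted)]


# Hypothetical attention distributions of two modalities over three positions.
shared = shared_positions([0.5, 0.3, 0.2], [0.45, 0.05, 0.5])

# Hypothetical series: 4 time steps x 2 channels, with a difference kernel.
series = [[1.0, 10.0], [2.0, 20.0], [4.0, 40.0], [7.0, 70.0]]
changes = temporal_conv(series, kernel=[-1.0, 1.0])
attention = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
temporal_feature = weighted_aggregate(changes, attention)
```

The final selection of features with significant discriminative power, based on similarity differences across spatiotemporal tasks, is not shown here because it depends on how those tasks are defined.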
[0119] The meta-learning training module combines cross-modal shared features with temporal dynamic features to form cross-modal temporal consistency features, and combines them with cross-modal attention weight maps to train the meta-learning classification model.
[0120] The visualization module visualizes the attention weights using a heatmap.
[0121] The model deployment module exports the trained meta-learning classification model and deploys it to the actual application platform to classify new gastrointestinal endoscopic images.
[0122] In conclusion, the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
[0123] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A training method for a gastrointestinal endoscopic lesion classification model based on few-shot learning, characterized in that it comprises:
S1: Obtain white light endoscopy images, NBI images, and confocal images of the digestive tract by locating the lesion area, label the lesion area in the white light endoscopy image, map the labeling information to the NBI image and confocal image, and construct a modal temporal graph;
S2: Based on the constructed modal temporal graph, unify the annotation information of different modalities into the same coordinate system to form three-dimensional spatiotemporally aligned multimodal data, and then classify the aligned multimodal data according to the lesion state;
S3: Perform dynamic cross-modal attention fusion on the classified multimodal data to generate a dynamic cross-modal attention weight map;
S4: Compare the fused dynamic cross-modal attention weight map across tasks of different modalities to obtain cross-modal shared features, and combine the fused dynamic cross-modal attention weight map with the temporal information in the modal temporal graph to compare tasks across different spatiotemporal periods and obtain temporal dynamic features;
S5: Combine the cross-modal shared features with the temporal dynamic features to form cross-modal temporal consistency features, and combine them with the cross-modal attention weight map to train the meta-learning classification model;
S6: Visualize the attention weights using heatmaps;
S7: Export the trained meta-learning classification model and deploy it to a real-world application platform to classify new gastrointestinal endoscopic images.
The steps for generating the dynamic cross-modal attention weight map include: constructing a dynamic cross-modal attention network, that is, a network structure that dynamically calculates attention weights between different modalities, which contains multiple branches, each branch processing data from one modality, and which realizes information interaction between different modalities through an interaction layer; inputting the classified multimodal data into the corresponding branches of the dynamic cross-modal attention network; dynamically calculating, in the interaction layer, the similarity between different modalities to obtain the attention weights between different modalities; fusing the data of different modalities based on these attention weights; and, during the attention fusion process, recording the attention weights between different modalities at each position to generate the dynamic cross-modal attention weight map.
The step of comparing the fused dynamic cross-modal attention weight map across different modalities to obtain cross-modal shared features includes: extracting the attention weight features of different modalities from the dynamic cross-modal attention weight map, calculating the attention weight distribution, comparing the attention weight distributions of different modalities on the same task, calculating the difference values between different modalities, identifying the information that multiple modalities jointly attend to based on the difference values, and using it as the cross-modal shared feature.
The step of combining the fused dynamic cross-modal attention weight map with the temporal information in the modal temporal graph to compare tasks across different spatiotemporal periods and obtain temporal dynamic features includes: extracting temporal features by performing temporal convolution on the modal temporal graph; weighting and aggregating the extracted temporal features by element-wise multiplication, using the dynamic cross-modal attention weight map as the weights; comparing and analyzing the weighted aggregated features with the features of different spatiotemporal tasks; and, by comparing the similarity differences, selecting the features with significant discriminative power as the temporal dynamic features.
2. The method for training a gastrointestinal endoscopic lesion classification model based on few-shot learning according to claim 1, characterized in that the steps for constructing the modal temporal graph include:
S101: The lesion area in the digestive tract is initially located using white light endoscopy, and white light endoscopy images of the lesion area are acquired. Subsequently, during the same examination, the system is switched to NBI endoscopy mode to acquire NBI images of the same lesion area. Finally, a confocal laser endomicroscopy probe is used to acquire cell-level resolution confocal images of the area;
S102: Perform image preprocessing on the acquired white light endoscopy images, NBI images, and cell-level resolution confocal images;
S103: Mark the lesion area in the white light endoscopy image. The marking information is automatically mapped to the NBI image and confocal image through image registration technology to achieve spatial alignment of the lesion area between multimodal images. At the same time, modality-specific feature data is extracted from the marked area and combined with the examination timestamp to construct a spatiotemporally aligned modal temporal graph.
3. The method for training a gastrointestinal endoscopic lesion classification model based on few-shot learning according to claim 2, characterized in that the formation of the three-dimensional spatiotemporally aligned multimodal data includes: based on the modal temporal graph, the annotation information of different modalities is transformed from its original coordinate system to a unified coordinate system; the structural similarity between the image containing the transformed labeled region and the image containing the corresponding labeled region in the reference modal image is calculated to verify spatial alignment, and if the structural similarity is greater than the preset similarity threshold, the spatial alignment verification is deemed successful; after the spatial alignment verification has passed, the time difference between the acquisitions of different modalities is compensated to ensure time axis synchronization, and the inter-frame difference of the aligned video is calculated, and if the inter-frame difference is less than the preset inter-frame difference threshold, the temporal alignment verification is deemed successful; the spatially and temporally aligned modal data and their corresponding annotation information are integrated to form the three-dimensional spatiotemporally aligned multimodal data; and a deep learning model is then used to automatically classify the multimodal data according to the lesion state.
4. The method for training a gastrointestinal endoscopic lesion classification model based on few-shot learning according to claim 3, characterized in that the classification of the multimodal data involves dividing the aligned data of different modalities into subsets according to the lesion state, thereby forming a multimodal subset dataset.
5. The method for training a gastrointestinal endoscopic lesion classification model based on few-shot learning according to claim 1, characterized in that the training of the meta-learning classification model includes: the extracted cross-modal shared features and temporal dynamic features are concatenated to obtain a joint feature vector; principal component analysis is then used to reduce the dimensionality of the joint feature vector, so that after concatenation and dimensionality reduction a feature vector that captures both cross-modal shared information and temporal dynamic change information is obtained, which serves as the cross-modal temporal consistency feature; the meta-learning classification model is trained using the cross-modal temporal consistency features; and the trained meta-learning classification model is exported into a deployable format and deployed on different platforms.
6. The method for training a gastrointestinal endoscopic lesion classification model based on few-shot learning according to claim 1, characterized in that the attention weight visualization includes: the attention weights are visualized using heatmaps, which map the attention weights onto the pixels of the image; areas with higher weight values are rendered in brighter colors, showing which areas of the image the model focuses on.
7. The method for training a gastrointestinal endoscopic lesion classification model based on few-shot learning according to claim 1, characterized in that the deployment of the meta-learning classification model includes: the trained meta-learning classification model is exported in a deployable format and deployed on different platforms; the deployed model is then applied to actual gastrointestinal endoscopic lesion classification tasks to classify new endoscopic images.