A method for confocal endoscopic image processing and related apparatus

By combining the Inception V3 network and SPP structure with attention mechanisms and adaptive non-maximum suppression, valid images are screened from confocal endoscopic images, which addresses the low diagnostic efficiency caused by invalid images and improves the accuracy of medical diagnosis.

CN117173747B · Active · Publication Date: 2026-06-30 · BIOPSEE (SUZHOU) MEDICAL TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BIOPSEE (SUZHOU) MEDICAL TECH CO LTD
Filing Date: 2023-10-12
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Confocal endoscopy produces a large number of invalid images during medical examinations, causing physicians to spend too much time reviewing the examination and affecting diagnostic efficiency.

Method used

Image features are extracted using the Inception V3 network and SPP structure. Valid images are then selected by combining spatial domain, channel domain, and hybrid domain attention mechanisms with adaptive non-maximum suppression and a target classifier.

Benefits of technology

It improves the accuracy of image recognition, reduces redundant areas, helps doctors analyze lesions and abnormalities more accurately, and improves the efficiency of medical diagnosis.



Abstract

This application discloses a method and related equipment for confocal endoscope image processing, relating to the field of image processing. The method includes: inputting the image to be processed into an Inception V3 network and an SPP structure to obtain a first feature, a second feature, and a third feature; extracting a first spatial domain attention map, a first channel domain attention map, and a first mixed domain attention map corresponding to the first feature, the second feature, and the third feature, respectively; performing non-maximum suppression operation on the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map to obtain a second spatial domain attention map, a second channel domain attention map, and a second mixed domain attention map; performing a fusion operation on the second spatial domain attention map, the second channel domain attention map, and the second mixed domain attention map to obtain fused features; and performing recognition based on the fused features and a target classifier to obtain a valid image.

Description

Technical Field

[0001] This specification relates to the field of image processing, and more specifically to a method and related equipment for confocal endoscope image processing.

Background Technology

[0002] Confocal endoscopy is a medical device that, through the channels of endoscopes such as gastroscopes and colonoscopes, can be inserted into the human body to obtain local histological images, enabling precise diagnosis of minute lesions, gastrointestinal diseases, and early gastrointestinal cancers. According to publicly available data, confocal endoscopes can achieve a frame rate of up to 18 frames per second. Based on this, a 10-minute clinical examination using confocal endoscopy will generate 10,800 frames of images. If the physician reviews the images at a rate of 0.5 seconds per frame, it would take 1.5 hours to review all the images from the examination, which is an extremely time-consuming operation.
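The workload figures in [0002] can be checked with a short calculation. This is only a worked-arithmetic sketch; the function and constant names are illustrative and do not come from the application.

```python
# Figures quoted in paragraph [0002]; names are illustrative only.
FRAME_RATE_FPS = 18             # frames per second
EXAM_MINUTES = 10               # length of one clinical examination
REVIEW_SECONDS_PER_FRAME = 0.5  # physician review speed


def review_workload(fps, exam_minutes, seconds_per_frame):
    """Return (total frames, review time in hours) for one examination."""
    frames = fps * exam_minutes * 60
    hours = frames * seconds_per_frame / 3600
    return frames, hours


frames, hours = review_workload(FRAME_RATE_FPS, EXAM_MINUTES, REVIEW_SECONDS_PER_FRAME)
print(frames, hours)  # 10800 1.5
```

If roughly half of the frames are invalid, as [0003] reports, filtering them out would cut the review time to about 0.75 hours under the same assumptions.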

[0003] In practice, because of its high magnification and small field of view, confocal imaging produces many invalid images that are too dark, too bright, blurry, or contain motion artifacts. These images cannot provide diagnostic information. Public data shows that approximately half of the images do not contain diagnostic information. If these invalid images can be filtered out, only the remaining valid images need to be provided to physicians for review, which can greatly reduce physicians' workload and improve the efficiency of medical institutions.

Summary of the Invention

[0004] The summary section introduces a series of simplified concepts, which will be further explained in detail in the detailed description section. This summary section is not intended to limit the key and essential technical features of the claimed technical solution, nor is it intended to determine the scope of protection of the claimed technical solution.

[0005] In a first aspect, this application proposes a method for processing confocal endoscopic images, the method comprising:

[0006] The image to be processed is input into the Inception V3 network and SPP structure to obtain the first feature, the second feature, and the third feature;

[0007] Extract the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map corresponding to the first feature, the second feature, and the third feature, respectively;

[0008] A non-maximum suppression operation is performed on the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map to obtain a second spatial domain attention map, a second channel domain attention map, and a second mixed domain attention map.

[0009] A fusion operation is performed on the aforementioned second spatial domain attention map, the aforementioned second channel domain attention map, and the aforementioned second hybrid domain attention map to obtain fusion features;

[0010] Recognition is performed based on the fused features and a target classifier to obtain a valid image;

[0011] wherein extracting the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map corresponding to the first feature, the second feature, and the third feature, respectively, includes:

[0012] Perform a dimensionality reduction convolution operation on the first feature to obtain the first spatial domain attention map.

[0013] Feature extraction is performed on the second feature to obtain the first channel domain attention map.

[0014] Feature extraction is performed on the third feature to obtain the first mixed domain attention map.

[0015] In one implementation, the above-described inputting the image to be processed into the Inception V3 network and SPP structure to obtain a first feature, a second feature, and a third feature includes:

[0016] The image to be processed is input into the Inception V3 network to extract the first feature and the second feature at different levels.

[0017] The information obtained after the Inception V3 network is input into the SPP structure to obtain the third feature.

[0018] In one embodiment, the first feature is the feature output after passing through three Inception A modules in the InceptionV3 network, and the second feature is the feature output after the Inception C module in the InceptionV3 network.

[0019] In one implementation, the above-mentioned feature extraction of the second feature to obtain the first channel domain attention map includes:

[0020] The SE module is used to extract features from the second feature to obtain the attention map of the first channel domain.

[0021] In one implementation, the above-mentioned feature extraction of the third feature to obtain the first mixed-domain attention map includes:

[0022] The SGNet hybrid domain attention network is used to extract features from the third feature to obtain the first hybrid domain attention map.

[0023] In one embodiment, performing a fusion operation on the second spatial domain attention map, the second channel domain attention map, and the second hybrid domain attention map to obtain fused features includes:

[0024] Perform a dot product fusion operation on the first feature and the second spatial domain attention map to obtain the first dot product feature;

[0025] Perform a dot product fusion operation on the above-mentioned second feature and the above-mentioned second channel domain attention map to obtain the second dot product feature;

[0026] Perform a dot product fusion operation on the third feature and the second mixed domain attention map to obtain the third dot product feature;

[0027] Global average pooling and dimensionality reduction concatenation operations are performed on the first, second, and third dot product features to obtain the fused features.

[0028] In a second aspect, this application also proposes a confocal endoscope image processing device, comprising:

[0029] The first acquisition unit is used to input the image to be processed into the Inception V3 network and SPP structure to obtain the first feature, the second feature and the third feature;

[0030] The extraction unit is used to extract the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map corresponding to the first feature, the second feature, and the third feature, respectively.

[0031] The second acquisition unit is used to perform a non-maximum suppression operation on the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map to obtain a second spatial domain attention map, a second channel domain attention map, and a second mixed domain attention map.

[0032] The fusion unit is used to perform a fusion operation on the second spatial domain attention map, the second channel domain attention map, and the second hybrid domain attention map to obtain fusion features.

[0033] The recognition unit is used to perform recognition based on the above-mentioned fused features and target classifier to obtain a valid image;

[0034] The extraction unit is further configured to:

[0035] Perform a dimensionality reduction convolution operation on the first feature to obtain the first spatial domain attention map;

[0036] Feature extraction is performed on the second feature to obtain the first channel domain attention map;

[0037] Feature extraction is performed on the third feature to obtain the first hybrid domain attention map.

[0038] In a third aspect, this application also proposes an electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program stored in the memory to implement the steps of the confocal endoscopic image processing method described in any implementation of the first aspect.

[0039] In a fourth aspect, this application also proposes a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the confocal endoscope image processing method described in any implementation of the first aspect.

[0040] In summary, the confocal endoscope image processing method of this application includes: inputting the image to be processed into an Inception V3 network and an SPP structure to obtain a first feature, a second feature, and a third feature; extracting a first spatial domain attention map, a first channel domain attention map, and a first mixed domain attention map corresponding to the first feature, the second feature, and the third feature, respectively; performing non-maximum suppression on the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map to obtain a second spatial domain attention map, a second channel domain attention map, and a second mixed domain attention map; performing a fusion operation on the second spatial domain attention map, the second channel domain attention map, and the second mixed domain attention map to obtain fused features; and performing recognition based on the fused features and a target classifier to obtain a valid image. The confocal endoscope image processing method proposed in this application, by using feature extraction at different levels in the Inception V3 network and combining it with the SPP structure, can comprehensively integrate multi-scale feature information to better capture details and contextual information in the image. The introduction of spatial domain, channel domain, and mixed domain attention mechanisms during feature extraction allows the model to better focus on important image regions and channels, thereby improving feature quality. Adaptive nonmaximal suppression reduces redundant bounding boxes, and selecting the most relevant parts from multiple attention maps helps improve object detection accuracy. By fusing different attention maps and features, more powerful and information-rich feature representations can be generated, which is beneficial for subsequent image recognition tasks. 
Using the fused features and a trained target classifier for image recognition improves recognition accuracy and filters out the valid images. In the field of medical imaging, this method helps doctors analyze lesions and abnormalities more accurately, thereby improving the accuracy of medical diagnosis.

[0041] Other advantages, objectives, and features of the confocal endoscopic image processing method proposed in this application will be partly apparent from the following description and partly understood by those skilled in the art through study and practice of this application.

Attached Figure Description

[0042] Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit this specification. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings:

[0043] Figure 1 is a schematic flowchart of a confocal endoscope image processing method provided in an embodiment of this application;

[0044] Figure 2 is a schematic diagram illustrating the principle of feature extraction provided in an embodiment of this application;

[0045] Figure 3 is a schematic diagram illustrating the principle of hybrid domain attention extraction provided in an embodiment of this application;

[0046] Figure 4 is a schematic diagram illustrating the principle of the non-maximum suppression operation provided in an embodiment of this application;

[0047] Figure 5 is a schematic diagram illustrating the principle of feature fusion provided in an embodiment of this application;

[0048] Figure 6 is a schematic structural diagram of a confocal endoscope image processing device provided in an embodiment of this application;

[0049] Figure 7 is a schematic structural diagram of an electronic device for confocal endoscope image processing provided in an embodiment of this application.

Detailed Implementation

[0050] The terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in a sequence other than that illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus. The technical solutions of the embodiments of this application will now be clearly and completely described in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them.

[0051] Referring to Figure 1, which is a schematic flowchart of a confocal endoscopic image processing method provided in an embodiment of this application, the method specifically includes:

[0052] S110. Input the image to be processed into the Inception V3 network and SPP structure to obtain the first feature, the second feature and the third feature;

[0053] For example, the image to be processed is input into Inception V3 (a classic convolutional network proposed by Google in 2015). This feature extraction network is first pre-trained on the ImageNet dataset (one of the most commonly used datasets for image classification, detection, and localization tasks in deep learning, containing over 14 million images across more than 20,000 categories) and then directly transferred to this network for feature extraction. Since the original network only uses the deepest feature layers of the feature extraction network, it ignores the detailed information in the shallower feature layers. Furthermore, this application incorporates an SPP module to increase the model's receptive field, enabling the separation of significant contextual features.

[0054] The first, second, and third features are the output features of different levels of the Inception V3 network and SPP structure.

[0055] S120. Extract the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map corresponding to the first feature, the second feature, and the third feature, respectively.

[0056] For example, the first, second, and third features extracted in the above steps are further processed to obtain a first spatial domain attention map, a first channel domain attention map, and a first mixed domain attention map. The spatial attention map can use convolution operations, where convolution kernels are used to capture the relationships between different locations to produce a spatial domain attention map of the same size as the feature map. The spatial domain attention map can represent the importance of different locations in the image.

[0057] Channel-domain attention maps are typically computed by performing operations along the channel dimension of the feature map. This can be achieved using global pooling and fully connected layers. Global pooling aggregates the feature values of each channel into a single number, and then the channel weights are learned through fully connected layers to generate a channel-domain attention map with the same number of channels, where the weight of each channel represents its importance.

[0058] The hybrid domain attention map is obtained by multiplying the spatial domain attention map and the channel domain attention map. Hybrid domain attention emphasizes features that are considered important in both spatial and channel dimensions, thereby generating a hybrid domain attention map with the same size as the feature map, which takes into account the importance of both spatial and channel dimensions.
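The multiplication described in [0058] can be illustrated with array broadcasting. This is a minimal sketch, assuming a single-channel spatial map of shape (H, W) and a channel weight vector of shape (C,); the function name and the example values are hypothetical and are not taken from the application.

```python
import numpy as np


def hybrid_attention(spatial_map, channel_weights):
    """Combine a (H, W) spatial map and a (C,) channel vector into a (C, H, W) map."""
    spatial_map = np.asarray(spatial_map, dtype=float)
    channel_weights = np.asarray(channel_weights, dtype=float)
    # Broadcasting: (C, 1, 1) * (1, H, W) -> (C, H, W)
    return channel_weights[:, None, None] * spatial_map[None, :, :]


# Hypothetical attention values, for illustration only.
spatial = np.array([[0.1, 0.9], [0.5, 0.2]])
channel = np.array([1.0, 0.5, 0.0])
mixed = hybrid_attention(spatial, channel)
print(mixed.shape)  # (3, 2, 2)
```

A position is emphasized in the result only when both its spatial weight and its channel weight are large, which matches the stated intent of the hybrid domain map.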

[0059] S130. Perform a non-maximum suppression operation on the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map to obtain a second spatial domain attention map, a second channel domain attention map, and a second mixed domain attention map.

[0060] For example, although multi-scale, multi-attention models can focus on discriminative local regions of an image, the attention maps extracted from feature layers at different scales tend to focus on the same region, leading to redundancy and wasted computing power. Therefore, this application proposes an Adaptive Non-Maximum Suppression (ANSS) method. A schematic diagram of the ANSS principle is shown in Figure 4. A region box generator selects regions with higher response values from the generated attention maps as region boxes. ANSS is then applied to these region boxes to reduce redundant regions and merge highly correlated regions. Compared with filtering the generated attention maps by directly setting a threshold, ANSS better preserves attention maps containing discriminative local regions.
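The text does not spell out the adaptive rule used by ANSS, so the sketch below shows only ordinary greedy non-maximum suppression over region boxes, the baseline that such a method builds on. The box format `(x1, y1, x2, y2)`, the fixed `iou_threshold`, and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that are kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Hypothetical region boxes and response scores.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # [0, 2]
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box is disjoint and survives. An adaptive variant would presumably vary the threshold or merge overlapping boxes rather than discard them, but that behaviour is not specified in this text.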

[0061] S140. Perform a fusion operation on the second spatial domain attention map, the second channel domain attention map, and the second hybrid domain attention map to obtain fusion features;

[0062] For example, the fused features are calculated by multiplying or weighting the second spatial domain attention map, the second channel domain attention map, and the second hybrid domain attention map with the first feature, the second feature, and the third feature, thereby generating a fused feature map.

[0063] S150. Based on the above-mentioned fusion features and target classifier, identification is performed to obtain a valid image.

[0064] For example, classification and recognition are performed based on the fused features and the target classifier to filter out the valid images. The target classifier is a classifier that has been trained for a preset number of iterations on a set of fused features and achieves a preset recognition accuracy.

[0065] In summary, the confocal endoscopy image processing method proposed in this application, by using feature extraction at different levels in the Inception V3 network and combining it with the SPP structure, can comprehensively capture multi-scale feature information to better capture details and contextual information in images. The introduction of spatial, channel, and hybrid domain attention mechanisms during feature extraction allows the model to better focus on important image regions and channels, thereby improving feature quality. Adaptive nonmaximal suppression reduces redundant region boxes, selecting the most relevant parts from multiple attention maps, which helps improve the accuracy of target detection. By fusing different attention maps and features, more powerful and information-rich feature representations can be generated, which is helpful for subsequent image recognition tasks. Using fused features and a trained target classifier for image recognition can improve the accuracy of image recognition and filter out truly effective images. In the field of medical imaging, this method, by extracting effective images, can help doctors analyze lesions and abnormalities more accurately, thereby improving the accuracy of medical diagnosis.

[0066] In one implementation, inputting the image to be processed into the Inception V3 network and the SPP structure to obtain a first feature, a second feature, and a third feature includes:

[0067] The image to be processed is input into the Inception V3 network to extract the first feature and the second feature at different levels.

[0068] The information obtained after the Inception V3 network is input into the SPP structure to obtain the third feature.

[0069] For example, the image to be processed is first input into the Inception V3 network. The Inception V3 network is a deep convolutional neural network that, after pre-training on the large-scale ImageNet dataset, can be used for image processing tasks. The input image undergoes a series of operations such as convolution, pooling, and activation layers to gradually extract the image's feature information. The Inception V3 network contains multiple layers, each outputting feature representations at different levels. Selecting two different levels of features as the first and second features can be achieved by accessing the feature maps of the corresponding levels within the network. The first and second features represent abstract information from different levels, from low-level texture to high-level semantic information.

[0070] The output of the Inception V3 network is passed to the SPP structure. The SPP (spatial pyramid pooling) structure is a feature enhancement structure that pools its input at several scales to enlarge the receptive field. It takes the output of the Inception V3 network as input and produces the third feature, which emphasizes the most significant contextual information in the image.

[0071] The first, second, and third features represent image information from different levels and perspectives, which helps improve the performance of image processing and recognition tasks. They can fully utilize the advantages of different network structures and levels in deep learning to obtain richer feature representations.

[0072] In one embodiment, the first feature is the feature output after passing through three Inception A modules in the InceptionV3 network, and the second feature is the feature output after the Inception C module in the InceptionV3 network.

[0073] For example, Figure 2 is a schematic diagram illustrating the principle of feature extraction in an embodiment of this application. The output of features_mixed_5d, obtained after passing through three Inception A modules in InceptionV3, is extracted as the first feature F1. The output of features_mixed_6e, obtained after passing through one Inception B module and four Inception C modules, is then extracted as the second feature F2.

[0074] The SPP (Spatial Pyramid Pooling) structure is added after the last feature layer (features_mixed_6e) of the feature extraction network. After three BasicConv2d convolutional layers (conv×3 in Figure 2) are applied to the last feature layer, the result is processed using max pooling at four different scales, with pooling kernel sizes of 13x13, 9x9, 5x5, and 1x1 (1x1 means no processing). SPP can greatly increase the receptive field and separate the most significant contextual features, so the output of SPP is used as the third feature F3.

[0075] It should be noted that, in Figure 2, the InceptionA module refers to Mixed_5_b, Mixed_5_c, and Mixed_5_d, each of which is a structural block encapsulated in the InceptionV3 network. The InceptionA module mainly contains 7 convolutional layers and one pooling layer. The InceptionB module refers to Mixed_6_a, and the InceptionC module refers to Mixed_6_b, Mixed_6_c, Mixed_6_d, and Mixed_6_e, all of which are structural blocks encapsulated in the InceptionV3 network. The SPP structure in the figure is the network structure that follows the output of InceptionV3, and conv×3 in the figure denotes three BasicConv2d convolutional layers.
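The multi-scale pooling in [0074] can be sketched in NumPy. The kernel sizes 13, 9, 5, and 1 come from the text; the stride of 1, the padding that preserves height and width, and the concatenation of the four branches along the channel axis are assumptions borrowed from common SPP blocks, and the function names are hypothetical.

```python
import numpy as np


def max_pool_same(x, k):
    """Stride-1 max pooling over a (C, H, W) array; odd k, output keeps H and W."""
    x = np.asarray(x, dtype=float)
    if k == 1:
        return x.copy()  # a 1x1 kernel leaves the input unchanged
    p = k // 2
    # Pad with -inf so that padded cells never win the max.
    padded = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k), axis=(1, 2))
    return windows.max(axis=(-2, -1))


def spp(x, kernel_sizes=(13, 9, 5, 1)):
    """Concatenate the pooled branches along the channel axis (assumed fusion)."""
    return np.concatenate([max_pool_same(x, k) for k in kernel_sizes], axis=0)


# Toy 2-channel, 17x17 feature map standing in for the conv×3 output.
x = np.arange(2 * 17 * 17, dtype=float).reshape(2, 17, 17)
out = spp(x)
print(out.shape)  # (8, 17, 17)
```

Larger kernels let each output position see a wider neighbourhood, which is how the structure enlarges the receptive field without changing the spatial size of the map.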

[0076] In one implementation, the extraction of a first spatial domain attention map, a first channel domain attention map, and a first mixed domain attention map corresponding to the first feature, the second feature, and the third feature, respectively, includes:

[0077] Perform a dimensionality reduction convolution operation on the first feature to obtain the first spatial domain attention map.

[0078] Feature extraction is performed on the second feature to obtain the first channel domain attention map.

[0079] Feature extraction is performed on the third feature to obtain the first mixed domain attention map.

[0080] For example, the first feature is a high-dimensional feature representation that contains abstract information from different levels. To obtain the first spatial domain attention map, this high-dimensional feature can be processed using a convolution operation. The convolution operation scans the entire feature map with an appropriately sized kernel and generates a spatial domain attention map of the same size as the feature map. This operation helps capture the relationships between different locations, thus highlighting the importance of different regions in the image.

[0081] The second feature is typically a high-dimensional feature representation containing rich channel information. To obtain the first channel domain attention map, specific feature extraction methods can be used, such as global pooling and fully connected layers. First, global pooling aggregates the feature values of each channel into a single number. Then, a fully connected layer learns the weight of each channel, thereby generating a channel-domain attention map with the same number of channels as the feature map, where the weight of each channel represents its importance in the task.

[0082] The third feature is the output of the SPP structure and aggregates context at several pooling scales. To obtain the first hybrid domain attention map, attention is computed over both the spatial and channel dimensions of this feature, so that the resulting map reflects the importance of both. In summary, the three feature layers F1, F2, and F3 obtained in step S110 are used to extract the first spatial domain attention map A1, the first channel domain attention map A2, and the first hybrid domain attention map A3, respectively.

[0083] In one implementation, the spatial domain attention map A1 is obtained by directly performing a dimensionality reduction convolution operation on the feature layer F1 with a kernel size of 1.
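A convolution with kernel size 1 mixes channels independently at every pixel, so it can be written as a single tensor contraction. The sketch below assumes a feature of shape (C, H, W) reduced to M attention channels; the weight values, the choice M = 1, and the function name are hypothetical, and a real implementation would use learned weights.

```python
import numpy as np


def conv1x1(feature, weights, bias=None):
    """1x1 convolution: feature (C, H, W), weights (M, C) -> output (M, H, W)."""
    feature = np.asarray(feature, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Each output pixel is a weighted sum of the C input channels at that pixel.
    out = np.einsum("mc,chw->mhw", weights, feature)
    if bias is not None:
        out = out + np.asarray(bias, dtype=float)[:, None, None]
    return out


# Toy F1 with 4 channels reduced to a single spatial attention channel.
F1 = np.ones((4, 3, 3))
W = np.full((1, 4), 0.25)  # placeholder weights, not learned values
A1 = conv1x1(F1, W)
print(A1.shape)  # (1, 3, 3)
```

The output keeps the spatial size of F1 while reducing the channel count, which is the sense in which the operation is a dimensionality reduction.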

[0084] In one implementation, the above-mentioned feature extraction of the second feature to obtain the first channel domain attention map includes:

[0085] The SE module is used to extract features from the second feature to obtain the attention map of the first channel domain.

[0086] For example, the Squeeze-and-Excitation Module (SE) is an attention mechanism commonly used in Convolutional Neural Networks (CNNs) to enhance the network's attention to specific feature channels, thereby improving model performance. Using the SE module to extract channel attention A2 from the F2 feature layer, the SE module first performs a Squeeze operation on the feature map obtained from convolution to obtain global features at the channel level. Then, it performs an Excitation operation on the global features to learn the relationships between different channels and obtain the weights of different channels. Finally, it multiplies the result by the original feature map to obtain the final feature. Essentially, the SE module performs attention or gating operations along the channel dimension. This attention mechanism allows the model to focus more on the channel features with the most information, while suppressing less important channel features.
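The Squeeze and Excitation steps in [0086] can be sketched as follows. This assumes the usual SE layout of global average pooling, a reducing fully connected layer with ReLU, and an expanding fully connected layer with sigmoid; biases are omitted, the weights are random placeholders rather than trained values, and all names are illustrative.

```python
import numpy as np


def se_channel_attention(feature, w1, w2):
    """feature (C, H, W), w1 (C//r, C), w2 (C, C//r) -> (weights (C,), reweighted feature)."""
    squeezed = feature.mean(axis=(1, 2))             # Squeeze: global average pooling
    hidden = np.maximum(w1 @ squeezed, 0.0)          # Excitation, first FC layer + ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # second FC layer + sigmoid
    return weights, feature * weights[:, None, None] # rescale each channel


rng = np.random.default_rng(0)
F2 = rng.random((8, 4, 4))           # toy second feature with 8 channels
w1 = rng.standard_normal((2, 8))     # reduction ratio r = 4 (assumed)
w2 = rng.standard_normal((8, 2))
A2, F2_weighted = se_channel_attention(F2, w1, w2)
print(A2.shape, F2_weighted.shape)  # (8,) (8, 4, 4)
```

Each entry of `A2` lies between 0 and 1 and acts as a gate on the corresponding channel, so informative channels are passed through while less useful ones are attenuated.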

[0087] In one implementation, the above-mentioned feature extraction of the third feature to obtain the first mixed-domain attention map includes:

[0088] The SGNet hybrid domain attention network is used to extract features from the third feature to obtain the first hybrid domain attention map.

[0089] For example, the principle of feature extraction with the SGNet hybrid domain attention network is shown in Figure 3. First, the image feature map F3 is sent to the channel attention branch to generate the channel domain attention map. The channel attention branch contains a global average pooling (GAP) layer and two fully connected (FC) layers with different activation functions. Element-wise multiplication is then used to merge the output of the channel attention branch with the feature map F3.

[0090] The resulting channel-weighted map has the same dimensions as the feature map. A convolution operation, similar to that in FG-CNN, is then performed on it to obtain the final hybrid domain attention map, which can be expressed by the following formula:

[0091] (1)
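Formula (1) is not reproduced in this text. One reading consistent with [0089] and [0090] is sketched below; every symbol other than F3 and A3 is an assumption rather than the application's own notation, with $W_1$, $W_2$ standing for the two FC layers, $\sigma_1$, $\sigma_2$ for their unspecified activation functions, and $\odot$ for channel-wise multiplication.

```latex
\begin{aligned}
w &= \sigma_2\bigl(W_2\,\sigma_1\bigl(W_1\,\mathrm{GAP}(F_3)\bigr)\bigr), \\
\tilde{F}_3 &= w \odot F_3, \\
A_3 &= \mathrm{Conv}\bigl(\tilde{F}_3\bigr).
\end{aligned}
```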

[0092] In one embodiment, performing a fusion operation on the second spatial domain attention map, the second channel domain attention map, and the second hybrid domain attention map to obtain fused features includes:

[0093] Perform a dot product fusion operation on the first feature and the second spatial domain attention map to obtain the first dot product feature;

[0094] Perform a dot product fusion operation on the above-mentioned second feature and the above-mentioned second channel domain attention map to obtain the second dot product feature;

[0095] Perform a dot product fusion operation on the third feature and the second mixed domain attention map to obtain the third dot product feature;

[0096] Global average pooling and dimensionality reduction concatenation operations are performed on the first, second, and third dot product features to obtain the fused features.

[0097] For example, the attention map focuses on the defective parts, allowing the model to pay more attention to discriminative regions, which effectively enhances the data. This application uses a bilinear average pooling (BAP) feature fusion part to fuse the attention layer and the feature layer, strengthening effective features that help distinguish samples, suppressing ineffective features, and improving the model's performance. The BAP network structure is shown in Figure 5. First, the feature map and the attention map are multiplied and fused. Then, the resulting features are reduced in dimensionality using global average pooling (GAP) and concatenated to form the final fused features, which are input into the classification layer for classification.
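The multiply, pool, and concatenate sequence can be sketched in NumPy as follows. This is a minimal illustration, not the BAP network of Figure 5: the function name and the attention-map shapes (a single-channel spatial map, a per-channel weight vector, and a full-size mixed map) are assumptions chosen so that each map broadcasts against its feature map.

```python
import numpy as np

def bap_fuse(features, attentions):
    """Hypothetical helper. features[k]: (C_k, H_k, W_k);
    attentions[k]: any shape that broadcasts against features[k]."""
    pooled = []
    for feature, attention in zip(features, attentions):
        part = feature * attention                  # element-wise fusion
        pooled.append(part.mean(axis=(1, 2)))       # GAP -> shape (C_k,)
    return np.concatenate(pooled)                   # shape (C_1 + C_2 + C_3,)

# Toy tensors standing in for the three features and their attention maps.
f1, a1 = np.ones((4, 6, 6)), np.full((1, 6, 6), 0.5)                     # spatial
f2, a2 = 2.0 * np.ones((3, 3, 3)), np.arange(3.0).reshape(3, 1, 1)       # channel
f3, a3 = np.ones((2, 2, 2)), np.ones((2, 2, 2))                          # mixed
fused = bap_fuse([f1, f2, f3], [a1, a2, a3])
```

Because each branch is pooled to a vector before concatenation, the three features may have different spatial sizes, and `fused` has one entry per channel across all branches.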

[0098] After the attention layer and the feature layer are obtained, each attention layer points to a specific part of the object in the image. To enhance the features, this application therefore multiplies the attention layer and the feature layer element by element to obtain the part feature map.

[0099] (2)

[0100] The obtained feature maps are reduced in dimensionality using GAP or Global Max Pooling (GMP) to obtain a set of one-dimensional tensors, which are then used to extract discriminative local features.

[0101] (3)

[0102] The tensors obtained from each set of features are concatenated to obtain the fused feature matrix that is input to the final linear classification layer.

[0103] (4)
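One possible reading of the three steps in [0098] to [0102] can be written in LaTeX as follows. The symbols are assumptions introduced for illustration and may differ from the notation of the original formulas: F_k and A_k denote the k-th feature and its second attention map, P_k the part feature map, g(·) the GAP or GMP operation, and P the fused feature.

```latex
\begin{align}
P_k &= A_k \odot F_k, \qquad k = 1, 2, 3 \tag{2} \\
p_k &= g(P_k) \tag{3} \\
P   &= \left[\, p_1 ;\; p_2 ;\; p_3 \,\right] \tag{4}
\end{align}
```

Here \odot denotes element-wise multiplication and the brackets denote concatenation of the pooled vectors.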

[0104] Referring to Figure 6, one embodiment of the confocal endoscope image processing apparatus in this application may include:

[0105] The first acquisition unit 21 is used to input the image to be processed into the Inception V3 network and SPP structure to obtain the first feature, the second feature and the third feature;

[0106] Extraction unit 22 is used to extract the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map corresponding to the first feature, the second feature, and the third feature, respectively;

[0107] The second acquisition unit 23 is used to perform a non-maximum suppression operation on the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map to obtain a second spatial domain attention map, a second channel domain attention map, and a second mixed domain attention map.

[0108] Fusion unit 24 is used to perform a fusion operation on the second spatial domain attention map, the second channel domain attention map, and the second hybrid domain attention map to obtain fusion features;

[0109] The recognition unit 25 is used to perform recognition based on the above-mentioned fused features and target classifier to obtain a valid image.
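The data flow between units 21 to 25 can be sketched as a simple composition of callables. This is an illustrative structure only: the class name, field names, and the toy stand-in functions are hypothetical, and each real unit would wrap the corresponding network or operation described in the method embodiments.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class EndoscopeImagePipeline:
    """Hypothetical container; one field per unit of the apparatus."""
    acquire: Callable[[Any], Sequence[Any]]                 # unit 21: image -> three features
    extract: Callable[[Sequence[Any]], Sequence[Any]]       # unit 22: features -> first attention maps
    suppress: Callable[[Sequence[Any]], Sequence[Any]]      # unit 23: first maps -> second maps
    fuse: Callable[[Sequence[Any], Sequence[Any]], Any]     # unit 24: features + second maps -> fused features
    recognize: Callable[[Any], bool]                        # unit 25: fused features -> valid or not

    def is_valid(self, image: Any) -> bool:
        features = self.acquire(image)
        first_maps = self.extract(features)
        second_maps = self.suppress(first_maps)
        fused = self.fuse(features, second_maps)
        return self.recognize(fused)

# Toy scalar stand-ins, used only to show how the units chain together.
pipeline = EndoscopeImagePipeline(
    acquire=lambda img: (img, img * 2, img * 3),
    extract=lambda feats: tuple(f + 1 for f in feats),
    suppress=lambda maps: tuple(max(m, 0) for m in maps),
    fuse=lambda feats, maps: sum(f * m for f, m in zip(feats, maps)),
    recognize=lambda fused: fused > 10,
)
result = pipeline.is_valid(2)
```

Keeping each unit behind its own callable mirrors the logical division of units in the apparatus and lets any single stage be replaced without touching the others.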

[0110] As shown in Figure 7, this application embodiment also provides an electronic device 300, including a memory 310, a processor 320, and a computer program 311 stored in the memory 310 and executable on the processor 320. When the processor 320 executes the computer program 311, it implements the steps of any of the above-described methods for confocal endoscope image processing.

[0111] Since the electronic device described in this embodiment is the device used to implement a confocal endoscope image processing apparatus in the embodiments of this application, those skilled in the art can understand the specific implementation method and various variations of the electronic device in this embodiment based on the method described in the embodiments of this application. Therefore, how the electronic device implements the method in the embodiments of this application will not be described in detail here. Any device used by those skilled in the art to implement the method in the embodiments of this application falls within the scope of protection of this application.

[0112] In practical implementation, when the computer program 311 is executed by the processor, it can implement any of the implementations in the embodiments corresponding to Figure 1.

[0113] It should be noted that the descriptions of each embodiment in the above embodiments have different focuses. For parts that are not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.

[0114] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0115] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each process and/or block of the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0116] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0117] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0118] This application also provides a computer program product, which includes computer software instructions. When the computer software instructions are executed on a processing device, the processing device performs the confocal endoscope image processing flow in the corresponding embodiment.

[0119] The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the flows or functions according to the embodiments of this application are produced. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)).

[0120] Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the systems, devices, and units described above are the same as the corresponding processes in the foregoing method embodiments and are not repeated here.

[0121] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, or indirect coupling or communication connection between apparatuses or units, and may be electrical, mechanical, or other forms.

[0122] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0123] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0124] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0125] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit it. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A method for processing confocal endoscopic images, characterized in that it comprises: inputting an image to be processed into an Inception V3 network and an SPP structure to obtain a first feature, a second feature, and a third feature; extracting a first spatial domain attention map, a first channel domain attention map, and a first mixed domain attention map corresponding to the first feature, the second feature, and the third feature, respectively, including: performing a dimensionality reduction convolution operation on the first feature to obtain the first spatial domain attention map; performing feature extraction on the second feature to obtain the first channel domain attention map, including: using an SE module to extract features from the second feature to obtain the first channel domain attention map; and performing feature extraction on the third feature to obtain the first mixed domain attention map, including: using an SGNet hybrid domain attention network to extract features from the third feature to obtain the first mixed domain attention map; performing a non-maximum suppression operation on the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map to obtain a second spatial domain attention map, a second channel domain attention map, and a second mixed domain attention map; performing a fusion operation on the second spatial domain attention map, the second channel domain attention map, and the second mixed domain attention map to obtain fused features; and performing recognition based on the fused features and a target classifier to obtain a valid image.

2. The confocal endoscopic image processing method according to claim 1, characterized in that inputting the image to be processed into the Inception V3 network and the SPP structure to obtain the first feature, the second feature, and the third feature includes: inputting the image to be processed into the Inception V3 network to extract the first feature and the second feature at different levels; and inputting the information obtained after the Inception V3 network into the SPP structure to obtain the third feature.

3. The confocal endoscopic image processing method according to claim 1, characterized in that the first feature is the feature output after passing through three Inception A modules in the Inception V3 network, and the second feature is the feature output after the Inception C module in the Inception V3 network.

4. The confocal endoscopic image processing method according to claim 1, characterized in that performing the fusion operation on the second spatial domain attention map, the second channel domain attention map, and the second mixed domain attention map to obtain the fused features includes: performing a dot product fusion operation on the first feature and the second spatial domain attention map to obtain a first dot product feature; performing a dot product fusion operation on the second feature and the second channel domain attention map to obtain a second dot product feature; performing a dot product fusion operation on the third feature and the second mixed domain attention map to obtain a third dot product feature; and performing global average pooling and dimensionality reduction concatenation operations on the first dot product feature, the second dot product feature, and the third dot product feature to obtain the fused features.

5. A confocal endoscope image processing device, characterized in that it comprises: a first acquisition unit, used to input an image to be processed into an Inception V3 network and an SPP structure to obtain a first feature, a second feature, and a third feature; an extraction unit, used to extract a first spatial domain attention map, a first channel domain attention map, and a first mixed domain attention map corresponding to the first feature, the second feature, and the third feature, respectively; a second acquisition unit, used to perform a non-maximum suppression operation on the first spatial domain attention map, the first channel domain attention map, and the first mixed domain attention map to obtain a second spatial domain attention map, a second channel domain attention map, and a second mixed domain attention map; a fusion unit, used to perform a fusion operation on the second spatial domain attention map, the second channel domain attention map, and the second mixed domain attention map to obtain fused features; and a recognition unit, used to perform recognition based on the fused features and a target classifier to obtain a valid image; wherein the extraction unit is further used to: perform a dimensionality reduction convolution operation on the first feature to obtain the first spatial domain attention map; perform feature extraction on the second feature to obtain the first channel domain attention map, including: using an SE module to extract features from the second feature to obtain the first channel domain attention map; and perform feature extraction on the third feature to obtain the first mixed domain attention map, including: using an SGNet hybrid domain attention network to extract features from the third feature to obtain the first mixed domain attention map.

6. An electronic device comprising a memory and a processor, characterized in that the processor, when executing a computer program stored in the memory, implements the confocal endoscopic image processing method according to any one of claims 1-4.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the confocal endoscopic image processing method according to any one of claims 1-4.