OCT image classification method
A dual-branch OCT image classification method that enhances features with multiple attention units addresses the low classification accuracy caused in existing technologies by transmitting pathological features and basic structural features through the same channels, enabling precise localization and accurate classification of microhemorrhages and drusen.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- 江苏富翰医疗产业发展有限公司
- Filing Date
- 2025-07-18
- Publication Date
- 2026-06-26
AI Technical Summary
In existing OCT image classification methods, the mixed transmission of pathological features and basic structural features within the same channel leads to the loss of retinal internal limiting membrane reflection features during channel compression. Channel attention mechanisms weaken the response of local lesions when calculating weights globally. Convolutional kernels with fixed receptive fields are unable to adapt to the size changes of drusen. Single-type attention mechanisms cause frequency domain feature confusion when processing images where low-frequency features of exudate and high-frequency features of neovascularization coexist, resulting in low classification accuracy.
An OCT image classification method with a dual-branch structure is adopted. The first branch is used to extract features, and the second branch enhances features through multiple attention units, including channel attention module, filter attention module, spatial attention module and convolution kernel attention module, which respectively strengthen the key frequency band response, spatial position weight and convolution kernel parameter dynamic adjustment, forming a synergistic enhancement of pathological features and structural features.
The method improves the classification accuracy of OCT images, accurately locates tiny hemorrhages in the macular fovea, adapts to drusen of different sizes, reduces frequency domain feature confusion, and improves the identification of lesion areas.
Figure CN120807476B_ABST
Description
Technical Field
[0001] This application relates to the field of computer vision technology, and in particular to an OCT image classification method.
Background Technology
[0002] The diagnosis of diabetic retinopathy relies on optical coherence tomography (OCT) to acquire cross-sectional images of the retina. OCT equipment uses near-infrared interferometry to generate micron-resolution images of biological tissue structures. Clinically, the goal is to identify microaneurysms and neovascularization simultaneously, which requires algorithms capable of both spatial localization and feature discrimination of lesions, while also handling the morphological differences among drusen of varying sizes.
[0003] Convolutional neural network schemes employ a single-branch structure to perform end-to-end classification tasks. Among them, improved models based on the ResNet50 architecture utilize multi-layer convolutional kernels to extract multi-scale features. Some schemes embed channel attention modules in the last two layers to optimize feature selection, or add global spatial pooling operations before the fully connected layer to enhance position awareness. Other methods apply Fourier filtering in the preprocessing stage to suppress frequency domain noise interference.
[0004] The single-branch structure forces pathological features and basic structural features to be mixed and transmitted within the same channel, resulting in the loss of retinal internal limiting membrane reflection features during channel compression; the channel attention mechanism weakens the response of local lesions when calculating weights globally, making it impossible to accurately locate tiny hemorrhages in the fovea of the macula; the convolution kernel with a fixed receptive field has difficulty adapting to the size changes of drusen across a wide range; the single-type attention mechanism causes frequency domain feature confusion when processing images where low-frequency features of exudate and high-frequency features of neovascularization coexist. In summary, this leads to low accuracy in OCT image classification.
Summary of the Invention
[0005] This application provides an OCT image classification method to solve the problem of low accuracy in OCT image classification.
[0006] This application provides an OCT image classification method, including:
[0007] Acquire an image to be classified and a classification network, wherein the image to be classified is an OCT image, and the classification network includes a processing layer; the processing layer includes a first branch and at least two second branches.
[0008] The image to be classified is input into the first branch to output an optimized feature map, and the first branch is used to extract features;
[0009] The optimized feature map is input into the second branch, and after being processed by the multi-attention unit of the second branch, a classification result is output. The second branch is used to enhance the features.
[0010] In some feasible embodiments, the first branch includes a feature splitting unit and a first feature optimization unit, the first feature optimization unit including a first processing unit, a second processing unit and a third processing unit; the convolution kernel size of the first processing unit and the third processing unit is a first size, and the convolution kernel size of the second processing unit is a second size;
[0011] The step of inputting the image to be classified into the first branch to output an optimized feature map includes:
[0012] The image to be classified is input into the feature splitting unit to perform feature channel splitting processing and output the first sub-feature and the second sub-feature.
[0013] The first sub-feature is sequentially input into the first processing unit, the second processing unit, and the third processing unit, so that the first processing unit performs the first convolution processing, the second processing unit performs the second convolution processing, and the third processing unit performs the third convolution processing, and the first optimized sub-feature is output.
[0014] The second sub-feature is input into the fourth convolution kernel processing unit to perform the fourth convolution processing and output the second optimized sub-feature.
[0015] The first optimized sub-feature and the second optimized sub-feature are input into the activation function unit to perform feature fusion activation processing and output the optimized feature map.
[0016] In some feasible embodiments, the second branch includes a feature recombination unit, a channel segmentation unit, a multi-attention unit, and a channel splicing unit;
[0017] The step of inputting the optimized feature map into the second branch further includes:
[0018] The optimized feature map is input into the feature recombination unit to perform feature channel recombination processing and output the main channel feature and auxiliary channel feature.
[0019] The main channel features are input into the channel segmentation unit to perform multi-channel segmentation processing and output four sets of feature sub-channels.
[0020] The feature sub-channels are respectively input into the multi-attention unit to perform multi-dimensional feature enhancement processing and output optimized sub-channel features;
[0021] The four sets of optimized sub-channel features are input into the channel splicing unit to perform feature splicing processing and output optimized main channel features.
[0022] The optimized main channel features and the auxiliary channel features are input into the feature fusion unit to perform channel fusion processing and output a classification feature map.
[0023] In some feasible embodiments, the multi-attention unit includes a channel attention module, a filter attention module, a spatial attention module, and a convolutional kernel attention module;
[0024] The optimized sub-channel features include channel-weighted features, frequency domain optimized features, spatial weighted features, and dynamic convolution features;
[0025] The method further includes:
[0026] The feature sub-channels are input into the channel attention module to perform channel dimension weight calculation and output channel weighted features.
[0027] The feature sub-channel is input into the filtering attention module to perform frequency domain feature optimization processing and output frequency domain optimized features.
[0028] The feature sub-channels are input into the spatial attention module to perform spatial location weight calculation and output spatial weighted features.
[0029] The feature sub-channels are input into the convolution kernel attention module to perform dynamic adjustment of the convolution kernel parameters and output dynamic convolution features.
[0030] In some feasible embodiments, the multi-attention unit further includes a second feature optimization unit, which is connected to the channel attention module, the filter attention module, the spatial attention module, and the convolutional kernel attention module, respectively.
[0031] The second feature optimization unit includes a fourth processing unit, a fifth processing unit, and a sixth processing unit; the kernel size of the fourth and sixth processing units is a first size, and the kernel size of the fifth processing unit is a second size;
[0032] The optimized sub-channel features include a third optimized sub-feature, a fourth optimized sub-feature, a fifth optimized sub-feature, and a sixth optimized sub-feature;
[0033] The step of inputting the four sets of optimized sub-channel features into the channel stitching unit to perform feature stitching processing and output optimized main channel features includes:
[0034] The channel weighted features are sequentially input into the fourth processing unit, the fifth processing unit, and the sixth processing unit to output the third optimized sub-feature;
[0035] The frequency domain optimization features are sequentially input into the fourth processing unit, the fifth processing unit, and the sixth processing unit to output the fourth optimization sub-feature;
[0036] The spatial weighted features are sequentially input into the fourth processing unit, the fifth processing unit, and the sixth processing unit to output the fifth optimized sub-feature;
[0037] The dynamic convolutional features are sequentially input into the fourth, fifth, and sixth processing units to output the sixth optimized sub-feature;
[0038] The third, fourth, fifth, and sixth optimized sub-features are input into the channel splicing unit to output optimized main channel features, wherein the third, fourth, fifth, and sixth optimized sub-features are features after transformation processing.
[0039] In some feasible embodiments, the processing layer includes a first feature processing stage, a second feature processing stage, a third feature processing stage, and a fourth feature processing stage;
[0040] The step of outputting the classification result includes:
[0041] The image to be classified is input into the first feature processing stage to perform preliminary feature extraction processing and output the first stage feature map.
[0042] The first-stage feature map is input into the second feature processing stage to perform intermediate feature extraction processing and output the second-stage feature map.
[0043] The second-stage feature map is input into the third feature processing stage to perform deep feature extraction processing and output the third-stage feature map.
[0044] The third-stage feature map is input into the fourth feature processing stage to perform high-level feature integration processing and output the classification result.
[0045] The first feature processing stage includes two second branches, the second feature processing stage includes three second branches, the third feature processing stage includes five second branches, and the fourth feature processing stage includes two second branches.
[0046] In some feasible embodiments, the classification network further includes a global feature aggregation unit, a linear mapping unit, and a probability transformation unit;
[0047] The step of inputting the third-stage feature map into the fourth feature processing stage to perform high-level feature integration processing and output classification results includes:
[0048] The third-stage feature map is input into the fourth feature processing stage to perform multi-feature fusion processing and output fused features.
[0049] The fused features are input into the global feature aggregation unit to perform spatial dimension compression processing and output an aggregated feature vector.
[0050] The aggregated feature vector is input into the linear mapping unit to perform feature dimension transformation processing and output the initial classification result.
[0051] The initial classification result is input into the probability transformation unit to output the category probability distribution, which represents the classification result.
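The head described in [0048] to [0051] can be sketched in plain Python as follows. This is only an illustrative sketch: the application does not name the operations, so global average pooling for the global feature aggregation unit and softmax for the probability transformation unit are assumptions (both are the usual choices in ResNet50-style networks), and the function names, weights, and toy feature map are hypothetical.

```python
import math

def global_average_pool(feature_map):
    """Compress a [C][H][W] nested list to a length-C vector (assumed pooling)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]

def linear_map(vector, weights, bias):
    """Map the aggregated vector to one score per category."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

def softmax(scores):
    """Turn scores into a probability distribution (assumed transformation)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy fused feature: 2 channels of 2x2, and a hypothetical 2-category head.
fused = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
vector = global_average_pool(fused)
scores = linear_map(vector, weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
probabilities = softmax(scores)
```

The spatial dimensions disappear after pooling, so the linear mapping only needs one weight row per category regardless of the input resolution.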
[0052] In some feasible embodiments, the classification network further includes an input layer, which includes a seventh processing unit and a spatial compression unit;
[0053] Before inputting the image to be classified into the first branch to output an optimized feature map, the process further includes:
[0054] The image to be classified is input into the seventh processing unit to perform initial convolution processing and output an initial feature map.
[0055] The initial feature map is input into the spatial compression unit to perform spatial dimension reduction processing and output a size-reduced feature map.
[0056] The size-reduced feature map is input into the first branch to perform feature optimization processing and output an optimized feature map.
[0057] In some feasible embodiments, after the classification result is output, the method further includes:
[0058] The weight values of the balance factor are determined by the sample category identifier;
[0059] Calculate the sample attention weight value based on the classification results and focus factors;
[0060] The adjustment coefficient is generated by multiplying the weight value of the balancing factor by the weight value of the sample attention.
[0061] The final loss value is output by multiplying the base loss value by the adjustment coefficient.
[0062] In some feasible embodiments, determining the balance factor weight value through sample category identification includes:
[0063] If the sample category is identified as the target category, the weight value of the balance factor is determined to be the first weight value;
[0064] If the sample category is identified as a non-target category, the weight value of the balance factor is determined to be the second weight value;
[0065] Wherein, the sum of the first weight value and the second weight value is 1.
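The loss adjustment in [0058] to [0065] matches the general shape of a binary focal loss, and a minimal sketch under that reading is given below. Several details are assumptions rather than statements from the application: the base loss is taken to be cross-entropy, the sample attention weight is taken to be (1 - p_t) raised to the focus factor, and the default values `alpha=0.25` and `gamma=2.0` are common illustrative defaults, not values given here.

```python
import math

def adjusted_loss(p, is_target, alpha=0.25, gamma=2.0):
    """Focal-style loss for one sample.

    p is the predicted probability of the target category, strictly between 0 and 1.
    alpha (first weight value) and gamma (focus factor) are assumed defaults.
    """
    # Probability the network assigned to the sample's true category.
    p_t = p if is_target else 1.0 - p
    # Balance factor weight: first weight for target samples, second weight
    # (1 - alpha) for non-target samples, so the two weights sum to 1.
    balance_weight = alpha if is_target else 1.0 - alpha
    # Sample attention weight: small for confident, correct predictions.
    attention_weight = (1.0 - p_t) ** gamma
    # Adjustment coefficient is the product of the two weights.
    adjustment = balance_weight * attention_weight
    # Assumed base loss: cross-entropy on the true category.
    base_loss = -math.log(p_t)
    return base_loss * adjustment

easy = adjusted_loss(0.9, is_target=True)   # confident and correct
hard = adjusted_loss(0.1, is_target=True)   # confident and wrong
```

With the focus factor set to 0 the attention weight becomes 1, and the result reduces to a class-weighted base loss.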
[0066] As can be seen from the above technical solutions, this application provides an OCT image classification method. The method includes acquiring an image to be classified and a classification network. The image to be classified is an OCT image, and the classification network includes a processing layer. The processing layer includes a first branch and at least two second branches. The image to be classified is input into the first branch to output an optimized feature map, and the first branch is used to extract features. The optimized feature map is input into the second branch to be processed by the multi-attention unit of the second branch and then output a classification result, and the second branch is used to enhance features. The method separates the feature extraction and enhancement paths through a dual-branch approach. The first branch maintains the original channels as base features; the second branch divides the remaining channels into four groups of processing sub-channels to form a synergistic enhancement of pathological features and structural features, thereby solving the problem of low accuracy in OCT image classification.
Attached Figure Description
[0067] To more clearly illustrate the technical solution of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0068] Figure 1 is a schematic flowchart of the OCT image classification method provided in the embodiments of this application;
[0069] Figure 2 is a schematic diagram of the classification network structure provided in the embodiments of this application;
[0070] Figure 3 is a schematic diagram of the first branch structure provided in an embodiment of this application;
[0071] Figure 4 is a schematic diagram of the second branch structure provided in an embodiment of this application;
[0072] Figure 5 is a schematic diagram of the training loss and training accuracy of the original ResNet50 classification network, provided in an embodiment of this application;
[0073] Figure 6 is a schematic diagram of the training loss and training accuracy of the original ResNet50 classification network adjusted by FocalLoss, provided in an embodiment of this application;
[0074] Figure 7 is a schematic diagram of the training loss and training accuracy of the classification network of this application after adjustment by FocalLoss, provided in an embodiment of this application.
Detailed Implementation
[0075] The embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described below do not represent all embodiments consistent with this application. They are merely examples of systems and methods consistent with some aspects of this application as detailed in the claims.
[0076] As shown in Figure 1, this application provides an OCT image classification method, including:
[0077] S110: Obtain the image to be classified and the classification network.
[0078] The image to be classified refers to a cross-sectional image of the fundus structure acquired by an optical coherence tomography (OCT) device, i.e., an OCT image. More specifically, in this embodiment, the image to be classified refers to a cross-sectional image of the macular region acquired by an OCT device, including typical pathological features of age-related macular degeneration (AMD) such as drusen, geographic atrophy, and choroidal neovascularization. The image presents the layered structure of the retina at a resolution of 512×512 pixels.
[0079] The classification network employs a deep convolutional neural network architecture, consisting of multiple cascaded processing layers. It receives image data and outputs the probability distribution of pathological categories. The main body of the network uses a hierarchical feature processing framework, with the initial layer capturing basic texture features and the deeper layers extracting abstract semantic features.
[0080] The classification network provided in this embodiment is a neural network architecture based on an improvement of ResNet50. As shown in Figure 2, the classification network includes an input layer and a processing layer. The processing layer includes multiple feature processing stages, and each feature processing stage includes a first branch and at least two second branches.
[0081] The first branch is used to extract features, and the second branch is used to enhance features. The first branch extracts the strong reflective features of drusen, and the second branch forms an optimized feature map of lesion enhancement.
[0082] S120: Input the image to be classified into the first branch to output an optimized feature map.
[0083] After the image to be classified is input into the classification network, it passes through the input layer and enters the first branch of the feature processing stage, where the first branch performs lesion feature extraction.
[0084] For example, after obtaining macular OCT images of AMD patients, the images are input into the classification network processing layer, and the first branch performs lesion feature extraction.
[0085] As shown in Figure 3, in some embodiments, the first branch includes two paths: a first path for feature segmentation and a second path for feature optimization. For example, the first path extracts the strong reflectivity features of drusen, while the second path maintains the integrity of the outer retinal structure.
[0086] Specifically, the first branch includes a feature splitting unit and a first feature optimization unit. The feature splitting unit is a channel segmentation module that divides the input feature map equally along the channel dimension into two independent sub-feature groups through tensor slicing operations.
[0087] The first feature optimization unit includes a first processing unit, a second processing unit, and a third processing unit. The convolutional kernel size of the first and third processing units is a first size, and the convolutional kernel size of the second processing unit is a second size. The first size is 1×1, and the second size is 3×3. The first processing unit uses a 1×1 convolutional kernel to perform feature channel dimension compression. The second processing unit uses a 3×3 convolutional kernel to perform feature space structure enhancement. The third processing unit uses a 1×1 convolutional kernel to perform feature channel dimension restoration. The three units are concatenated sequentially to form a feature refinement chain, maintaining the same input and output space dimensions.
[0088] The image to be classified is input into the feature splitting unit to perform feature channel splitting processing and output the first sub-feature and the second sub-feature. After the image to be classified is input into the feature splitting unit, the channel dimension is split according to a preset ratio.
[0089] Taking a 512-channel input as an example: the first 256 channels output as the first sub-feature, and the last 256 channels output as the second sub-feature. Both features maintain the same spatial size.
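The equal channel split in this example can be sketched as a simple slicing operation. The sketch below represents a feature map as a plain list of channels, which is an illustrative simplification, and the function name `split_channels` is hypothetical.

```python
def split_channels(feature_map):
    """Split a feature map (a list of channels) into two equal halves."""
    num_channels = len(feature_map)
    if num_channels % 2 != 0:
        raise ValueError("an equal split needs an even number of channels")
    half = num_channels // 2
    # Slicing along the channel axis leaves each channel's spatial grid untouched.
    return feature_map[:half], feature_map[half:]

# 512 channels, each a tiny 2x2 grid filled with its own channel index.
feature_map = [[[float(c)] * 2 for _ in range(2)] for c in range(512)]
first_sub_feature, second_sub_feature = split_channels(feature_map)
```

Both halves keep the 2×2 spatial size of the input, which mirrors the statement that the two sub-features maintain the same spatial size.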
[0090] The first sub-feature is sequentially input into the first processing unit, the second processing unit, and the third processing unit, so that the first processing unit performs the first convolution process, the second processing unit performs the second convolution process, and the third processing unit performs the third convolution process, and the first optimized sub-feature is output.
[0091] The first sub-feature is input into the first processing unit to perform channel dimensionality reduction processing. A 1×1 convolutional kernel compresses 512 channels to 256 channels. The resulting feature is input into the second processing unit to perform spatial relationship modeling. A 3×3 convolutional kernel enhances the local correlation of features while keeping the number of channels unchanged. The processing result is input into the third processing unit to perform channel expansion processing. A 1×1 convolutional kernel restores 256 channels to 512 channels, generating the first optimized sub-feature.
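As a rough illustration of the 1×1, 3×3, 1×1 chain, the sketch below lists the three units with the channel counts quoted in this paragraph and counts their convolution weights. Bias terms and normalization layers are ignored, which is an assumption, and the helper names are hypothetical.

```python
def conv_weights(in_channels, out_channels, kernel):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * in_channels * out_channels

def refinement_chain(channels, mid_channels):
    """Return (in_channels, out_channels, kernel) for the three processing units."""
    return [
        (channels, mid_channels, 1),      # first unit: channel compression
        (mid_channels, mid_channels, 3),  # second unit: spatial modeling
        (mid_channels, channels, 1),      # third unit: channel restoration
    ]

layers = refinement_chain(channels=512, mid_channels=256)
chain_total = sum(conv_weights(*layer) for layer in layers)
# For comparison only: a single 3x3 convolution at the full channel width.
single_3x3_total = conv_weights(512, 512, 3)
```

Because the 3×3 convolution runs at the compressed width, the chain uses fewer weights than one full-width 3×3 convolution while still returning the original channel count.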
[0092] The second sub-feature is input into the fourth convolution kernel processing unit to perform the fourth convolution processing and output the second optimized sub-feature. The fourth convolution kernel processing unit is a 1×1 size convolution kernel that performs feature information preservation processing, performs a linear transformation on the input feature and preserves the original feature distribution characteristics, and outputs a feature map with the same number of input channels.
[0093] The first optimized sub-feature and the second optimized sub-feature are input into the activation function unit to perform feature fusion activation processing and output the optimized feature map.
[0094] The activation function unit consists of an adder and a nonlinear activation function. The adder performs a two-path feature tensor element-wise superposition operation, and the output is input to the ReLU nonlinear activation function to generate an optimized feature map with zero-value suppression. The first and second optimized sub-features undergo element-wise superposition in the adder of the activation function unit. The superposition result is input to the ReLU activation function, which sets negative values in the feature matrix to zero and retains positive feature responses, ultimately outputting an optimized feature map with nonlinear expression characteristics.
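The adder followed by ReLU can be sketched directly on small nested lists. The function name and the toy values are hypothetical; the two inputs are assumed to have identical [channel][row][column] shapes.

```python
def fuse_and_activate(first, second):
    """Element-wise sum of two equally shaped [C][H][W] lists, followed by ReLU."""
    return [
        [
            # ReLU: negative sums become zero, positive sums pass through.
            [max(0.0, a + b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(channel_a, channel_b)
        ]
        for channel_a, channel_b in zip(first, second)
    ]

first_optimized = [[[1.0, -2.0], [0.5, -0.5]]]
second_optimized = [[[-3.0, 1.0], [0.5, 0.25]]]
optimized_feature_map = fuse_and_activate(first_optimized, second_optimized)
```

Only the position whose sum is positive (0.5 + 0.5) survives; the other three sums are negative and are suppressed to zero.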
[0095] S130: Input the optimized feature map into the second branch, and after processing by the multi-attention unit of the second branch, output the classification result.
[0096] As shown in Figure 4, the second branch includes a feature reorganization unit and a multi-attention unit. The feature reorganization unit splits the optimized feature map into main channels and auxiliary channels, with the main channel further divided into four groups of feature sub-channels. Each group of sub-channels is input into the channel attention module, filtering attention module, spatial attention module, and convolutional kernel attention module for parallel processing. The outputs of each module are then processed by the feature fusion unit to reconstruct the channel dimensions, generating a discriminative classification feature map.
[0097] The purpose of the second branch channel segmentation is to retain half of the original channels as baseline features to ensure that basic information is not lost, while the other half of the channels are dedicated to learning various attention features to form a complementary effect. Channel attention enhances the sensitivity to lesions in different layers of the retina, spatial attention accurately locates macular abnormalities, filtering attention enhances the frequency components specific to lesions, including drusen and geographic atrophy, and convolutional kernel attention can adjust the receptive field and shape because lesions in fundus OCT imaging vary in size from person to person.
[0098] The multi-attention unit includes a channel attention module, a filtering attention module, a spatial attention module, and a convolutional kernel attention module; the channel attention module calculates the channel weight matrix to enhance the response of key frequency bands; the filtering attention module optimizes the frequency feature distribution; the spatial attention module generates a region weight map; and the convolutional kernel attention module dynamically adjusts the convolution parameters.
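To make the idea of a channel weight concrete, the sketch below shows a deliberately simplified, parameter-free channel gate: each channel is scaled by the sigmoid of its own global mean. This is a stand-in chosen for illustration, not the channel attention module of this application, whose internal layers are not specified here; the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_gate(sub_channels):
    """Scale each [H][W] channel by the sigmoid of its own global mean."""
    gated = []
    for channel in sub_channels:
        mean = sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        weight = sigmoid(mean)  # one scalar weight per channel, between 0 and 1
        gated.append([[weight * value for value in row] for row in channel])
    return gated

strong = [[2.0, 2.0], [2.0, 2.0]]
weak = [[-2.0, -2.0], [-2.0, -2.0]]
gated_strong, gated_weak = channel_gate([strong, weak])
```

A channel with a high average response keeps most of its magnitude, while a channel with a low average response is attenuated; a learned module would replace the fixed sigmoid-of-mean rule with trainable layers.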
[0099] For typical pathological features of age-related macular degeneration (AMD), channel attention enhances high-frequency reflected signals from lesions such as drusen deposits; spatial attention focuses on subretinal fluid accumulation areas; filtering attention suppresses RPE layer fragmentation artifacts; and convolutional kernel attention adapts to the irregular morphology of neovascularization. The multi-mechanism outputs are fused to generate a pathological feature map.
[0100] The method captures micron-sized drusen and atrophic areas simultaneously through the first branch, and processes drusen deposits, leakage areas, and vascular morphology separately through the four types of attention in the second branch. The auxiliary channel preserves the photoreceptor layer intact, avoiding misjudgment of ellipsoid zone disruption. The output features can characterize CNV leakage and atrophic lesions.
[0101] In some embodiments of the improved ResNet50 classification network of this application, the classification network further includes an input layer, which includes a seventh processing unit and a spatial compression unit. The seventh processing unit performs initial convolution processing, extracts basic features from the input image using a preset convolution kernel, sets the kernel size to cover the retinal feature scale, extracts the layered structure information of the OCT image through multi-channel filtering, and maintains the original input size in the spatial resolution of the output feature map, while expanding the number of channels to a preset dimension.
[0102] The spatial compression unit performs spatial dimension reduction processing through max pooling, reduces spatial resolution through local region feature sampling operations, and adopts a fixed step size downsampling algorithm to compress spatial dimension while maintaining the integrity of feature distribution. The height and width of the output feature map are reduced simultaneously, while the channel dimension remains unchanged.
[0103] The image to be classified is input into the seventh processing unit to perform initial convolution processing and output an initial feature map; the initial feature map is input into the spatial compression unit to perform spatial dimension reduction processing and output a size-reduced feature map; the size-reduced feature map is input into the first branch to perform feature optimization processing and output an optimized feature map.
[0104] The image to be classified has a size of 224×224×3. After processing by the seventh processing unit, the generated initial feature map is 112×112×64. Then, through max pooling, a reduced feature map of size 56×56×64 is obtained.
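The sizes in this example follow from the standard convolution and pooling output-size formula. The sketch below reproduces them under an assumption: the kernel, stride, and padding values are those of the standard ResNet50 stem (7×7 convolution with stride 2 and padding 3, then 3×3 max pooling with stride 2 and padding 1), which this application does not state explicitly.

```python
def out_size(size, kernel, stride, padding):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def input_layer_shapes(height, width, out_channels=64):
    """Shapes after the seventh processing unit and the spatial compression unit."""
    # Assumed stem convolution: 7x7 kernel, stride 2, padding 3.
    conv_h, conv_w = out_size(height, 7, 2, 3), out_size(width, 7, 2, 3)
    # Assumed max pooling: 3x3 window, stride 2, padding 1; channels unchanged.
    pool_h, pool_w = out_size(conv_h, 3, 2, 1), out_size(conv_w, 3, 2, 1)
    return (conv_h, conv_w, out_channels), (pool_h, pool_w, out_channels)

initial_shape, reduced_shape = input_layer_shapes(224, 224)
```

Each of the two steps halves the height and width, and only the convolution changes the channel count.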
[0105] In some embodiments, the processing layer includes a first feature processing stage, a second feature processing stage, a third feature processing stage, and a fourth feature processing stage. Each of the multiple processing stages includes a first branch and a second branch. The output of the last second branch after the fourth feature processing stage is the classification result.
[0106] The size-reduced feature map obtained from the input layer is fed into the first feature processing stage, where it is first processed by the first branch.
[0107] The first feature processing stage includes one first branch and two second branches, the second feature processing stage includes one first branch and three second branches, the third feature processing stage includes one first branch and five second branches, and the fourth feature processing stage includes one first branch and two second branches.
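The stage layout above can be written down as a small configuration table. The sketch below is illustrative only; the class and field names are hypothetical. Counting one first branch plus the second branches in each stage gives 3, 4, 6, and 3 blocks, which is the same per-stage block count as the standard ResNet50.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    name: str
    first_branches: int
    second_branches: int

    @property
    def blocks(self):
        """Total number of branches chained in this stage."""
        return self.first_branches + self.second_branches

STAGES = [
    StageConfig("first feature processing stage", 1, 2),
    StageConfig("second feature processing stage", 1, 3),
    StageConfig("third feature processing stage", 1, 5),
    StageConfig("fourth feature processing stage", 1, 2),
]

block_layout = [stage.blocks for stage in STAGES]
total_blocks = sum(block_layout)
```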
[0108] In the first feature processing stage, preliminary feature extraction is performed. After the first branch in the first feature processing stage, the size of the output feature map is 56×56×256. The feature map then passes through the two second branches in sequence: the output of the first second branch is input to the second second branch. The feature map output by the second branches has the same size as the output of the first branch.
[0109] In some embodiments, the second branch includes a feature reconstruction unit, a channel segmentation unit, a multi-attention unit, and a channel stitching unit. The feature reconstruction unit performs channel-dimensional feature reconstruction, splitting the input feature map into a main channel and an auxiliary channel according to a preset ratio. In OCT image processing, the main channel carries lesion feature information, while the auxiliary channel retains tissue structure features. After splitting, the features of the two channels maintain their original spatial dimensions.
[0110] The channel segmentation unit equally divides the main channel features into four independent feature sub-channels along the channel dimension. Each sub-channel is allocated a fixed proportion of channel resources and receives feature information from the same spatial location. The four feature sub-channels process feature information from different dimensions in parallel.
[0111] The channel splicing unit performs the sub-channel feature integration operation, splicing the four sets of optimized sub-channel features sequentially along the channel dimension. The splicing process reconstructs the complete number of channels according to the input order, and the output feature map has the same spatial resolution as the input.
[0112] The feature fusion unit splices the optimized main channel features and the auxiliary channel features along the channel dimension. By increasing the channel dimension, it integrates the two types of feature information, and the number of channels in the output feature map is equal to twice the number of channels in the original input feature map.
[0113] In some embodiments, the optimized feature map is input into the feature recombination unit to perform feature channel recombination processing and output main channel features and auxiliary channel features.
[0114] The main channel features are input into the channel segmentation unit to perform multi-channel segmentation processing and output four sets of feature sub-channels.
[0115] The feature sub-channels are respectively input into the multi-attention unit to perform multi-dimensional feature enhancement processing and output optimized sub-channel features.
[0116] The four sets of optimized sub-channel features are input into the channel splicing unit to perform feature splicing processing and output optimized main channel features.
[0117] The optimized main channel features and the auxiliary channel features are input into the feature fusion unit to perform channel fusion processing and output a classification feature map.
[0118] After the optimized feature map is input into the feature recombination unit, a segmentation operation is performed according to a preset channel ratio. For example, 70% of the channels constitute the main channel, and 30% of the channels constitute the auxiliary channel. The main channel features are input into the channel segmentation unit and divided into four groups of feature sub-channels.
[0119] Each feature sub-channel is input into the multi-attention unit for independent processing. The first group undergoes channel dimension correlation optimization; the second group undergoes frequency domain feature enhancement; the third group has its spatial location weights strengthened; and the fourth group is adapted to local feature morphology. Each group's processing maintains the original spatial resolution.
[0120] The four sets of optimized sub-channel features are input into the channel splicing unit, and the main channel is reassembled according to the original segmentation order. The reassembled optimized main channel features and auxiliary channel features are input into the feature fusion unit, and spliced along the channel dimension to form a complete classification feature map.
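The split, segment, and re-splice bookkeeping described above can be sketched on plain channel indices. This is a minimal illustration, not the applicant's implementation: the function names (`recombine`, `segment`, `splice`) are hypothetical, the attention step is replaced by an identity placeholder, and rounding the main channel count down to a multiple of four is an assumption made so that the main channel divides evenly.

```python
def recombine(channels, main_percent=70, groups=4):
    """Split a channel list into main and auxiliary parts by a preset ratio."""
    n_main = len(channels) * main_percent // 100
    # Assumption: round down so the main part divides evenly into `groups`.
    n_main -= n_main % groups
    return channels[:n_main], channels[n_main:]


def segment(main, groups=4):
    """Divide the main channels equally into `groups` consecutive sub-channels."""
    size = len(main) // groups
    return [main[i * size:(i + 1) * size] for i in range(groups)]


def splice(parts):
    """Concatenate sub-channels back together in their original order."""
    return [c for part in parts for c in part]


channels = list(range(256))           # channel indices of a 256-channel map
main, aux = recombine(channels)       # 70% / 30% split (after rounding)
subs = segment(main)                  # four sub-channel groups
optimized = [list(s) for s in subs]   # identity placeholder for attention
fused = splice(optimized) + aux       # splice main, then append auxiliary
```

With 256 input channels the sketch yields 176 main channels (four groups of 44) and 80 auxiliary channels, and splicing in the original order restores the input channel order.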
[0121] When OCT images of AMD lesions are input, the feature recombination unit assigns macular edema features to the main channel and preserves the retinal layered structure in the auxiliary channel. After channel segmentation, the first set of sub-channels focuses on lesion areas such as drusen and geographic atrophy, and the lesion discrimination features are enhanced by the multi-attention unit; the auxiliary channel completely preserves the information on the outer retinal layer rupture. The final fused feature map includes both lesion enhancement features and tissue structure features.
[0122] The multi-attention unit includes a channel attention module, a filter attention module, a spatial attention module, and a convolutional kernel attention module.
[0123] The channel attention module performs channel dimension weight calculation and generates a channel importance coefficient matrix through global feature analysis. The module includes a channel compression layer and a weight generation layer. The compression layer calculates the feature statistics of the spatial dimension, and the weight generation layer assigns weighting coefficients to each channel accordingly. The output feature map scales the response intensity of each channel according to its weight value.
[0124] The feature sub-channels are input into the channel attention module to perform channel dimension weight calculation processing and output channel weighted features. After the feature sub-channels are input into the channel attention module, the global feature pooling layer calculates the feature mean of the spatial dimension and generates channel statistical vectors. The weight generation layer maps the statistical vectors into channel weight sequences. Each channel of the original feature map is multiplied by the corresponding weight coefficient and outputs channel weighted features to enhance the feature response of the discriminative channel.
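The pooling, weight generation, and scaling steps described above can be sketched as follows. This is a minimal illustration on nested lists: the function name is hypothetical, and using a plain sigmoid of the channel mean as the weight generation layer is an assumption, since the text does not specify how the statistics are mapped to weights.

```python
import math


def channel_attention(fmap):
    """fmap: list of C channels, each an H x W nested list of floats."""
    # Global feature pooling: mean over the spatial dimensions per channel.
    stats = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    # Weight generation (assumption): sigmoid maps each statistic into (0, 1).
    weights = [1.0 / (1.0 + math.exp(-s)) for s in stats]
    # Scale every value of a channel by that channel's weight.
    out = [[[v * w for v in row] for row in ch] for ch, w in zip(fmap, weights)]
    return out, weights


fmap = [
    [[0.0, 0.0], [0.0, 0.0]],  # weakly responding channel
    [[2.0, 2.0], [2.0, 2.0]],  # strongly responding channel
]
weighted, weights = channel_attention(fmap)
```

In this toy input the second channel receives the larger weight, and the spatial size of each channel is unchanged.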
[0125] The filter attention module performs frequency domain feature optimization processing, enhancing key frequency components through frequency domain transformation and bandpass filtering. The module includes a Fourier transform layer, a frequency response matrix, and an inverse Fourier transform layer. The frequency response matrix dynamically adjusts the passband range based on the spectral properties of the features, and the output retains the key frequency domain patterns.
[0126] The feature sub-channel is input into the filter attention module to perform frequency domain feature optimization processing and output frequency domain optimized features. Specifically, the Fourier transform layer transforms the spatial features to the frequency domain and generates a complex spectrum. The frequency response matrix adjusts the spectral amplitude according to preset rules, suppressing noise frequency bands and amplifying lesion feature frequency bands. The inverse Fourier transform layer reconstructs the spatial features and outputs a frequency domain optimized feature map.
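The transform, reweight, and inverse-transform sequence can be sketched in one dimension. This is an illustration only: it works on a single row of a feature map rather than a full two-dimensional map, uses a direct discrete Fourier transform for brevity, and takes the frequency response as a fixed list of per-bin gains, whereas the text describes a response that is adjusted dynamically. All names are hypothetical.

```python
import cmath


def dft(x):
    """Discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def filter_attention(signal, response):
    """Reweight each frequency bin by `response`, then return to the spatial domain.

    `response` should be symmetric (response[k] == response[n - k]) so that
    the reconstructed signal stays real; only the real part is returned.
    """
    spectrum = dft(signal)
    shaped = [s * g for s, g in zip(spectrum, response)]
    return [v.real for v in idft(shaped)]


row = [1.0, 2.0, 3.0, 4.0]
unchanged = filter_attention(row, [1.0, 1.0, 1.0, 1.0])  # all-pass response
dc_only = filter_attention(row, [1.0, 0.0, 0.0, 0.0])    # keep only the mean
```

An all-pass response reproduces the input, while keeping only the zero-frequency bin flattens the row to its mean, which shows how the gains select which frequency bands survive.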
[0127] The spatial attention module performs spatial location weight calculations, generating a two-dimensional attention map of the same size as the input feature map. Local region statistical features are extracted through a spatial feature pooling layer, and then a weight mapping layer generates location-related importance coefficients. The feature value of each spatial location is multiplied by a corresponding weight coefficient to enhance the response in key regions.
[0128] The feature sub-channels are input into the spatial attention module to perform spatial position weight calculation processing and output spatial weighted features. The feature map is processed by the spatial feature pooling layer to extract local maximum features and generate dimensionality-reduced spatial features. The weight mapping layer expands the dimensionality-reduced features into a two-dimensional weight matrix of the original size. The input feature position values are multiplied by the corresponding matrix coefficients to generate a spatial weighted feature map.
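A minimal sketch of the spatial weighting step follows. It simplifies the description: the pooling here takes the maximum across channels at each location, and the weight mapping layer is assumed to be a sigmoid applied directly at the original resolution, so no explicit reduction and expansion step is modeled. The function name is hypothetical.

```python
import math


def spatial_attention(fmap):
    """fmap: list of C channels, each an H x W nested list of floats."""
    n_ch, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    # Pooling (assumption): maximum response across channels at each location.
    pooled = [[max(fmap[c][i][j] for c in range(n_ch)) for j in range(width)]
              for i in range(height)]
    # Weight mapping (assumption): sigmoid gives one weight per spatial location.
    weights = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in pooled]
    # Multiply every channel by the shared two-dimensional weight map.
    out = [[[fmap[c][i][j] * weights[i][j] for j in range(width)]
            for i in range(height)] for c in range(n_ch)]
    return out, weights


fmap = [
    [[0.0, 4.0]],  # channel 0: strong response at the second location
    [[0.0, 1.0]],  # channel 1
]
weighted, attention_map = spatial_attention(fmap)
```

The attention map has the same height and width as the input, and the location with the strong response keeps almost all of its value while the flat location is halved.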
[0129] The convolutional kernel attention module performs dynamic adjustment of the convolutional kernel parameters, generating optimized convolutional kernel parameters based on the characteristics of the input features. This includes a kernel weight generation network that learns the local structural patterns of the input features and outputs an adaptive convolutional kernel weight matrix. The output feature map includes the result of the dynamic kernel convolution operation.
[0130] The feature sub-channels are input into the convolution kernel attention module to perform dynamic adjustment of the convolution kernel parameters and output dynamic convolution features. The kernel weight generation network of the convolution kernel attention module analyzes the local structural patterns of the input features and, through learning, outputs convolution kernel weight parameters. These parameters are convolved with the input feature map to output a dynamic convolution feature map that adapts to the input features. The operation maintains the original spatial resolution.
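One common way to realize an input-dependent kernel is to mix a small bank of candidate kernels with coefficients computed from the input. The sketch below takes that approach as an assumption; the text does not state that the kernel weight generation network works this way. The network itself is stood in for by a hypothetical callable `logits_fn`, and the convolution is the zero-padded, unflipped form used in deep learning.

```python
import math


def conv2d_same(img, kernel):
    """Zero-padded 2D convolution (cross-correlation) that keeps the input size."""
    height, width, r = len(img), len(img[0]), len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < height and 0 <= x < width:
                        acc += img[y][x] * kernel[di + r][dj + r]
            out[i][j] = acc
    return out


def dynamic_conv(img, bank, logits_fn):
    """Mix the kernels in `bank` with softmax weights derived from the input."""
    logits = logits_fn(img)  # stand-in for the kernel weight generation network
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    mix = [e / sum(exps) for e in exps]
    size = len(bank[0])
    kernel = [[sum(a * k[i][j] for a, k in zip(mix, bank)) for j in range(size)]
              for i in range(size)]
    return conv2d_same(img, kernel), mix


identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
box_blur = [[1.0 / 9.0] * 3 for _ in range(3)]
img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

sharp, sharp_mix = dynamic_conv(img, [identity, box_blur], lambda m: [50.0, -50.0])
blend, blend_mix = dynamic_conv(img, [identity, box_blur], lambda m: [0.0, 0.0])
```

When the mixing weights strongly favor the identity kernel the output matches the input, and equal weights blend the two kernels evenly; in both cases the spatial size is preserved.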
[0131] Channel attention processing addresses the response imbalance problem in different lesion areas, ensuring that the features of tiny bleeding points are not drowned out by background noise; filter attention processing enhances the ability to extract frequency domain features, improving the recognition of lesions such as drusen and geographic atrophy; spatial attention processing breaks through the receptive field limitations of traditional convolution, accurately locating focal lesion areas; and convolution kernel attention processing dynamically adjusts kernel parameters to adapt to the differences in morphological features of tortuous and deformed blood vessels.
[0132] In the analysis of macular hole lesions, spatial attention locates the edge region of the hole, channel attention enhances the response to full-thickness defect features, convolution kernel attention adapts to the disruption morphology of the ellipsoid zone, and filter attention improves the frequency domain signal-to-noise ratio for the disappearance of the central foveal reflection, which can significantly improve the analysis accuracy of layered structural anomalies.
[0133] In some embodiments, the multi-attention unit further includes a second feature optimization unit, which includes a fourth processing unit, a fifth processing unit, and a sixth processing unit, which are connected sequentially.
[0134] The convolutional kernels of the fourth and sixth processing units have the first size, i.e., 1×1. The fourth processing unit performs feature dimensionality reduction and integration processing, compressing the channel dimension of the input features through the convolutional kernel while preserving the key feature response patterns. The number of channels in the output feature map is reduced to 1 / N of the input (N is a preset compression coefficient).
[0135] The fifth processing unit has a convolution kernel size of the second size, namely 3×3. The 3×3 convolution kernel performs feature space correlation enhancement processing. By expanding the receptive field of the convolution kernel, it captures the local structural correlation of features and optimizes the feature space distribution pattern. The feature map space size and number of channels remain unchanged before and after processing.
[0136] The sixth processing unit performs feature channel dimension restoration processing, which expands the compressed number of channels to the original dimension through convolution kernels, reconstructs the integrity of the feature channels, and outputs a feature map with a number of channels equal to the original value before input to the fourth processing unit.
[0137] In this embodiment, each attention module is connected to a second feature optimization unit, meaning that the outputs of each attention module are processed by the second feature optimization unit, specifically by the fourth, fifth, and sixth processing units. The different attention modules will be described below.
[0138] For the channel attention module, the channel weighted features are sequentially input into the fourth, fifth, and sixth processing units to output the third optimized sub-feature. Specifically, a 1×1 convolutional kernel performs feature channel dimensionality reduction, compressing the 256 channels to 128 channels. The dimensionality-reduced features are input into the fifth processing unit. A 3×3 convolutional kernel performs spatial feature structure optimization to enhance the local correlation of the lesion area. The processing result is input into the sixth processing unit. The 1×1 convolutional kernel restores the channels to 256 channels and outputs the third optimized sub-feature.
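The channel arithmetic of the 1×1, 3×3, 1×1 sequence can be checked with a short calculation. This sketch counts convolution weights only (biases and normalization layers are ignored, which is an assumption), with a compression coefficient N = 2 as in the 256 to 128 example above; the function names are hypothetical.

```python
def conv_params(c_in, c_out, kernel_size):
    """Number of weights in a convolution layer, ignoring biases."""
    return c_in * c_out * kernel_size * kernel_size


def bottleneck_params(channels, n):
    """Weights in a 1x1 reduce, 3x3 process, 1x1 restore sequence."""
    reduced = channels // n
    return (conv_params(channels, reduced, 1)    # fourth processing unit
            + conv_params(reduced, reduced, 3)   # fifth processing unit
            + conv_params(reduced, channels, 1)) # sixth processing unit


bottleneck = bottleneck_params(256, 2)  # 256 -> 128 -> 128 -> 256
plain_3x3 = conv_params(256, 256, 3)    # a single 3x3 layer at full width
```

Under these assumptions the three-unit sequence uses 212,992 weights, compared with 589,824 for a single 3×3 convolution applied to all 256 channels.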
[0139] Similar to the channel attention module, the outputs of the filter attention module, the spatial attention module, and the convolutional kernel attention module are each sequentially input into the fourth processing unit, the fifth processing unit, and the sixth processing unit to obtain the fourth optimized sub-feature, the fifth optimized sub-feature, and the sixth optimized sub-feature, respectively, which will not be elaborated here.
[0140] The four sets of optimized sub-features are input into the channel splicing unit and spliced along the channel dimension in a preset order: the third optimized sub-feature (256 channels), the fourth optimized sub-feature (256 channels), the fifth optimized sub-feature (256 channels), and the sixth optimized sub-feature (256 channels) are spliced into an optimized main channel feature of 1024 channels, and the optimized main channel feature optimizes the nerve fiber layer reflection features.
[0141] The optimized main channel features are then passed to the second second branch of the first feature processing stage. The processing of the second second branch is the same as that of the first second branch, except that its input is the output of the first second branch. It is used to enhance the continuity recognition of the RPE layer and outputs the first-stage feature map, the size of which is 56×56×256.
[0142] The first-stage feature map is then input into the second feature processing stage to perform intermediate feature extraction, outputting a second-stage feature map. The internal processing of the first and second branches in the second feature processing stage is the same as that in the first feature processing stage. Further, the first branch in the second feature processing stage performs 2× downsampling, the first second branch extracts the clustered distribution pattern of drusen, the second second branch enhances the boundaries of early atrophic areas, and the third second branch detects subretinal fluid-filled dark areas. The second-stage feature map has a size of 28×28×512: the channels are expanded to 512 dimensions, and the height and width are each halved.
[0143] The second-stage feature map is then input into the third feature processing stage to perform deep feature extraction, outputting a third-stage feature map. The internal processing of the first and second branches in the third feature processing stage is the same as that in the first feature processing stage. Further, the first branch in the third feature processing stage uses residual connection entry points; the first and second second branches collaboratively analyze the morphology of choroidal neovascularization (CNV); the third second branch performs three-dimensional reconstruction of pigment epithelial detachment (PED); and the fourth and fifth second branches perform microstructural analysis of the geographic atrophy area. The size of the third-stage feature map is 14×14×1024.
[0144] Finally, the feature map from the third stage is input into the fourth feature processing stage to perform high-level feature integration processing and output the classification result. The internal processing of the first and second branches in the fourth feature processing stage is the same as that in the first feature processing stage. Furthermore, the first branch in the fourth feature processing stage is a multi-scale feature fusion layer, the first second branch performs high-level semantic feature refinement, and the second second branch performs spatial compression. Channel compression and spatial aggregation work together to produce the classification result, and the output size of the last second branch is 7×7×2048.
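The four-stage layout described above can be summarized as a small table and checked for consistency. The stage names and the tuple layout below are hypothetical; the branch counts and feature map sizes are the ones stated in the description.

```python
# (name, number of second branches, output height, output width, output channels)
STAGES = [
    ("stage1", 2, 56, 56, 256),
    ("stage2", 3, 28, 28, 512),
    ("stage3", 5, 14, 14, 1024),
    ("stage4", 2, 7, 7, 2048),
]


def halves_and_doubles(stages):
    """Check that each stage halves height and width and doubles the channels."""
    for prev, cur in zip(stages, stages[1:]):
        _, _, h0, w0, c0 = prev
        _, _, h1, w1, c1 = cur
        if (h1, w1, c1) != (h0 // 2, w0 // 2, c0 * 2):
            return False
    return True


total_second_branches = sum(stage[1] for stage in STAGES)
consistent = halves_and_doubles(STAGES)
```

The stated sizes follow a regular pattern: every stage halves the height and width and doubles the channel count, and the network contains twelve second branches in total.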
[0145] The classification result can be a probability vector, for example, P = [p1, p2, p3], where p1 represents the probability of no macular degeneration, p2 represents the probability of dry AMD (drusen deposition type), and p3 represents the probability of wet AMD (choroidal neovascularization type).
[0146] The classification results can also be visualized as a heatmap, generating a lesion localization map through gradient-weighted class activation mapping (Grad-CAM), with red areas representing lesion areas and blue areas representing non-lesion areas.
[0147] To obtain classification results, in some embodiments, the classification network further includes a global feature aggregation unit, a linear mapping unit, and a probability transformation unit. The global feature aggregation unit performs spatial dimension compression processing, converting the two-dimensional feature map into a one-dimensional feature vector through pooling operations, and employing a global mean pooling algorithm to calculate the mean statistic of the spatial dimension for each channel. The output vector dimension is consistent with the number of channels in the input feature map, achieving spatial information integration. After passing through the global feature aggregation unit, the feature map size is transformed from 7×7×2048 to 1×1×2048.
[0148] The linear mapping unit performs feature dimension transformation processing, realizing high-dimensional feature space mapping through a fully connected neural network layer. The input feature vector is processed by the linear weight matrix, and the output dimension corresponds to the numerical sequence of the preset number of classification categories. The weight matrix parameters are optimized and learned during the training process.
[0149] The probability transformation unit performs probability distribution calculation processing, transforming the numerical sequence into a probability distribution through a normalized exponential function (softmax). For each element of the input numerical sequence, the ratio of its exponential to the sum of the exponentials of all elements is calculated to generate the probability value for each category. The output probability values sum to 1. The linear mapping unit outputs 1000-dimensional classification probabilities.
[0150] The third-stage feature map is input into the fourth feature processing stage to perform multi-feature fusion processing and output fused features. The fused features are then input into the global feature aggregation unit to perform spatial dimension compression processing and output aggregated feature vectors.
[0151] For the input feature map (number of channels C × height H × width W), the average pixel value of each channel is calculated in the height and width dimensions, and the output vector dimension is C×1, forming an aggregated feature vector that represents the overall features.
[0152] The aggregated feature vector is input into the linear mapping unit to perform feature dimension transformation processing and output an initial classification result; the initial classification result is input into the probability transformation unit to output a category probability distribution, which represents the classification result.
[0153] A weight matrix of dimension K×C, where K is the number of classification categories, is multiplied by the input vector (C×1) to generate an initial classification result vector of dimension K×1. Each element of this vector is the raw output score of the classification network for one category. Then, an exponential operation is performed on each element in the vector, which amplifies the advantage of higher scores. The ratio of the exponential value of each element to the sum of the exponential values of all elements is calculated, and the output probability distribution vector satisfies that each element value ∈ [0,1] and the sum is 1.
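The aggregation, linear mapping, and probability transformation steps can be sketched end to end on a toy input. The function names, the two-channel feature map, and the 3×2 weight matrix are illustrative assumptions; the bias term is omitted, and subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math


def global_average_pool(fmap):
    """C x H x W feature map -> length-C vector of per-channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def linear(weight, vec):
    """Multiply a K x C weight matrix by a length-C vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def softmax(scores):
    """Normalized exponential function; the outputs sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


fmap = [
    [[1.0, 3.0], [1.0, 3.0]],  # channel 0, mean 2.0
    [[0.0, 0.0], [0.0, 0.0]],  # channel 1, mean 0.0
]
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # K = 3 categories, C = 2 channels

pooled = global_average_pool(fmap)
scores = linear(weight, pooled)
probs = softmax(scores)
```

Here the pooled vector has one entry per channel, the score vector has one entry per category, and the probabilities sum to 1 with the largest raw score receiving the largest probability.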
[0154] For classification results, existing technologies use traditional cross-entropy loss. However, cross-entropy loss is not optimized in conjunction with the feature enhancement module, and it is easily dominated by simple negative samples in scenarios with extremely imbalanced samples. The formula for cross-entropy loss is shown below:
[0155] $\mathrm{BCE}(p_t) = -\log(p_t)$;
[0156] where $p_t$ is the probability predicted by the classification model for the correct category.
[0157] In some embodiments, the balance factor weight value is determined by the sample category identifier, the sample attention weight value is calculated based on the classification result and the focus factor, the balance factor weight value and the sample attention weight value are multiplied to generate the adjustment coefficient, and the base loss value is multiplied by the adjustment coefficient to output the final loss value.
[0158] The sample category identifier is a discrete coded signal representing the pathological type of the image, obtained through data annotation. In the OCT classification task, the identifier value corresponds to predefined categories such as normal retina and macular edema, serving as benchmark reference data for supervised learning.
[0159] Based on the dynamically assigned weight coefficients according to the sample category identifier, in some embodiments, if the sample category identifier is the target category, the weight value of the balance factor is determined to be a first weight value; if the sample category identifier is a non-target category, the weight value of the balance factor is determined to be a second weight value.
[0160] The sum of the first weight value and the second weight value is 1. The balance factor weight value $\alpha_t$ balances the influence of positive and negative samples and prevents negative samples from contributing too much to the loss: for positive samples $\alpha_t = \alpha$, and for negative samples $\alpha_t = 1 - \alpha$, where $\alpha$ usually lies in [0,1] and represents the weight ratio of positive and negative samples.
[0161] For example, a first weight value ($\alpha$) is assigned when the sample category identifier points to the target pathological category (such as choroidal neovascularization); a second weight value ($1 - \alpha$) is assigned when it points to a non-target category (such as normal macular structure). The two weight values always sum to the constant 1.
[0162] Focus factor $\gamma$: by introducing the focus factor, focal loss adjusts how much the model attends to easy-to-classify versus difficult-to-classify samples. The term $(1 - p_t)^{\gamma}$ is the key part of the formula:
[0163] $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$;
[0164] When the predicted probability $p_t$ is close to 1 (i.e., the sample is easy to classify), $(1 - p_t)^{\gamma}$ is very small, which reduces the sample's contribution to the loss. When the predicted probability $p_t$ approaches 0 (i.e., the sample is difficult to classify), $(1 - p_t)^{\gamma}$ approaches 1, so the loss is left almost undiminished and the model pays relatively more attention to difficult samples.
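The loss described above, a base cross-entropy value multiplied by a balance factor weight and a sample attention weight, can be sketched as a single function. The function name is hypothetical, and the default values α = 0.25 and γ = 2 are common choices from the focal loss literature, not values given in this application.

```python
import math


def focal_loss(p_t, is_target, alpha=0.25, gamma=2.0):
    """Focal loss for one sample.

    p_t: probability the model assigns to the correct category.
    is_target: True if the sample belongs to the target category.
    alpha, gamma: assumed defaults, not taken from the application.
    """
    alpha_t = alpha if is_target else 1.0 - alpha  # balance factor weight
    attention = (1.0 - p_t) ** gamma               # sample attention weight
    base = -math.log(p_t)                          # base cross-entropy loss
    return alpha_t * attention * base              # adjustment coefficient x base


easy = focal_loss(0.9, is_target=True)  # confidently correct sample
hard = focal_loss(0.1, is_target=True)  # poorly classified sample
plain = focal_loss(0.5, is_target=True, alpha=0.5, gamma=0.0)
```

With γ = 0 the attention weight equals 1 and the loss reduces to the cross-entropy scaled by the balance factor; with γ > 0 the easy sample contributes far less than the hard one.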
[0165] As shown in Figure 5, the classification task is performed using the original ResNet50 classification network. The graph on the left (5a) shows the training loss, which measures the deviation between the model's prediction and the true label; the lower the loss value, the better the model fits. In the initial stage (0-25 epochs), the loss drops sharply from 0.4731 to 0.0763, and in the subsequent stages (25-175 epochs), it remains stable at a plateau of 0.3-0.7.
[0166] The chart on the right (5b) represents the training accuracy, which is the proportion of samples correctly predicted by the model. The higher the accuracy, the stronger the classification ability. The accuracy increased from 62% to 93% in the first 25 epochs, and fluctuated between 90% and 97% from 50 to 175 epochs, with a final accuracy of 98.05%.
[0167] As shown in Figure 6, 6a and 6b are similar to 5a and 5b. During initial optimization (0-25 epochs), the loss falls from 0.5 to 0.07 and continues to converge. From 25 to 175 epochs, the loss steadily approaches 0. The accuracy curve reaches 90% at 25 epochs. The learning rate is dynamically adjusted to accelerate convergence, and the final accuracy is 98.54%.
[0168] As shown in Figure 7, 7a and 7b are similar to 5a and 5b. The training loss of the classification network provided in this application converges smoothly throughout the process, with a final accuracy of 98.93%.
[0169] As can be seen, the classification network provided in this application separates structural features from pathological features through a dual-branch architecture, enhances the response to microlesions through channel attention, accurately locates macular lesions through spatial attention, removes frequency domain feature confusion through filtering attention, and adapts to the size range of lesions through convolutional kernel attention, thereby improving classification accuracy.
[0170] Similar parts between the embodiments provided in this application can be referred to mutually. The specific implementation methods provided above are only a few examples under the overall concept of this application and do not constitute a limitation on the scope of protection of this application. For those skilled in the art, any other implementation methods extended from the solution of this application without creative effort shall fall within the scope of protection of this application.
Claims
1. An OCT image classification method, characterized in that it includes: acquiring an image to be classified and a classification network, wherein the image to be classified is an OCT image, the classification network includes a processing layer, and the processing layer includes a first branch and at least two second branches; inputting the image to be classified into the first branch to output an optimized feature map, wherein the first branch is used to extract features; and inputting the optimized feature map into the second branch and, after processing by the multi-attention unit of the second branch, outputting the classification result, wherein the second branch is used to enhance the features and includes a feature recombination unit, a channel segmentation unit, a multi-attention unit, and a channel splicing unit; wherein the step of inputting the optimized feature map into the second branch further includes: inputting the optimized feature map into the feature recombination unit to perform feature channel recombination processing and output main channel features and auxiliary channel features; inputting the main channel features into the channel segmentation unit to perform multi-channel segmentation processing and output four sets of feature sub-channels; inputting the feature sub-channels respectively into the multi-attention unit to perform multi-dimensional feature enhancement processing and output optimized sub-channel features; inputting the four sets of optimized sub-channel features into the channel splicing unit to perform feature splicing processing and output optimized main channel features; and inputting the optimized main channel features and the auxiliary channel features into the feature fusion unit to perform channel fusion processing and output a classification feature map.
2. The OCT image classification method according to claim 1, characterized in that, The first branch includes a feature splitting unit, a first feature optimization unit, and a fourth convolutional kernel processing unit. The first feature optimization unit includes a first processing unit, a second processing unit, and a third processing unit. The convolutional kernel size of the first processing unit and the third processing unit is a first size, and the convolutional kernel size of the second processing unit is a second size. The step of inputting the image to be classified into the first branch to output an optimized feature map includes: The image to be classified is input into the feature splitting unit to perform feature channel splitting processing and output the first sub-feature and the second sub-feature. The first sub-feature is sequentially input into the first processing unit, the second processing unit, and the third processing unit, so that the first processing unit performs the first convolution processing, the second processing unit performs the second convolution processing, and the third processing unit performs the third convolution processing, and the first optimized sub-feature is output. The second sub-feature is input into the fourth convolution kernel processing unit to perform the fourth convolution processing and output the second optimized sub-feature. The first optimized sub-feature and the second optimized sub-feature are input into the activation function unit to perform feature fusion activation processing and output the optimized feature map.
3. The OCT image classification method according to claim 1, characterized in that, The multi-attention unit includes a channel attention module, a filter attention module, a spatial attention module, and a convolutional kernel attention module; The optimized sub-channel features include channel-weighted features, frequency domain optimized features, spatial weighted features, and dynamic convolution features; The method further includes: The feature sub-channels are input into the channel attention module to perform channel dimension weight calculation and output channel weighted features. The feature sub-channel is input into the filtering attention module to perform frequency domain feature optimization processing and output frequency domain optimized features. The feature sub-channels are input into the spatial attention module to perform spatial location weight calculation and output spatial weighted features. The feature sub-channels are input into the convolution kernel attention module to perform dynamic adjustment of the convolution kernel parameters and output dynamic convolution features.
4. The OCT image classification method according to claim 3, characterized in that, The multi-attention unit further includes a second feature optimization unit, which is connected to the channel attention module, the filter attention module, the spatial attention module, and the convolutional kernel attention module, respectively. The second feature optimization unit includes a fourth processing unit, a fifth processing unit, and a sixth processing unit; the kernel size of the fourth and sixth processing units is a first size, and the kernel size of the fifth processing unit is a second size; The optimized sub-channel features include a third optimized sub-feature, a fourth optimized sub-feature, a fifth optimized sub-feature, and a sixth optimized sub-feature; The step of inputting the four sets of optimized sub-channel features into the channel stitching unit to perform feature stitching processing and output optimized main channel features includes: The channel weighted features are sequentially input into the fourth processing unit, the fifth processing unit, and the sixth processing unit to output the third optimized sub-feature; The frequency domain optimization features are sequentially input into the fourth processing unit, the fifth processing unit, and the sixth processing unit to output the fourth optimization sub-feature; The spatial weighted features are sequentially input into the fourth processing unit, the fifth processing unit, and the sixth processing unit to output the fifth optimized sub-feature; The dynamic convolutional features are sequentially input into the fourth, fifth, and sixth processing units to output the sixth optimized sub-feature; The third, fourth, fifth, and sixth optimized sub-features are input into the channel splicing unit to output optimized main channel features, wherein the third, fourth, fifth, and sixth optimized sub-features are features after transformation processing.
5. The OCT image classification method according to claim 1, characterized in that the processing layer includes a first feature processing stage, a second feature processing stage, a third feature processing stage, and a fourth feature processing stage, and outputting the classification result includes: inputting the image to be classified into the first feature processing stage to perform preliminary feature extraction processing and output the first-stage feature map; inputting the first-stage feature map into the second feature processing stage to perform intermediate feature extraction processing and output the second-stage feature map; inputting the second-stage feature map into the third feature processing stage to perform deep feature extraction processing and output the third-stage feature map; and inputting the third-stage feature map into the fourth feature processing stage to perform high-level feature integration processing and output the classification result; wherein the first feature processing stage includes two second branches, the second feature processing stage includes three second branches, the third feature processing stage includes five second branches, and the fourth feature processing stage includes two second branches.
6. The OCT image classification method according to claim 5, characterized in that the classification network further includes a global feature aggregation unit, a linear mapping unit, and a probability transformation unit; the step of inputting the third-stage feature map into the fourth feature processing stage to perform high-level feature integration processing and output the classification result includes: inputting the third-stage feature map into the fourth feature processing stage to perform multi-feature fusion processing and output fused features; inputting the fused features into the global feature aggregation unit to perform spatial dimension compression processing and output an aggregated feature vector; inputting the aggregated feature vector into the linear mapping unit to perform feature dimension transformation processing and output an initial classification result; and inputting the initial classification result into the probability transformation unit to output a category probability distribution, which represents the classification result.
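As a non-limiting illustration of the classification head in claim 6, the following Python sketch compresses the spatial dimensions of the fused features, maps the resulting vector to class scores, and converts the scores to a probability distribution. Realizing the global feature aggregation unit as global average pooling and the probability transformation unit as softmax is an assumption, since the claim does not name the operations; the weights and feature values are made-up numbers.

```python
import math
from typing import List


def global_average_pool(fused: List[List[List[float]]]) -> List[float]:
    """Compress a C x H x W feature to a length-C vector by averaging each channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fused]


def linear_map(vec: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Map a length-C vector to one score per class (weights: classes x C)."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy fused features: 2 channels of size 2 x 2.
fused = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 hypothetical classes
bias = [0.0, 0.0, 0.0]

aggregated = global_average_pool(fused)               # aggregated feature vector
initial_result = linear_map(aggregated, weights, bias)  # initial classification result
probabilities = softmax(initial_result)               # category probability distribution
predicted_class = probabilities.index(max(probabilities))
```

The predicted category is the index with the highest probability; the probabilities sum to 1 by construction.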
7. The OCT image classification method according to claim 1, characterized in that the classification network further includes an input layer, which includes a seventh processing unit and a spatial compression unit; before inputting the image to be classified into the first branch to output an optimized feature map, the method further includes: inputting the image to be classified into the seventh processing unit to perform initial convolution processing and output an initial feature map; inputting the initial feature map into the spatial compression unit to perform spatial dimension reduction processing and output a size-reduced feature map; and inputting the size-reduced feature map into the first branch to perform feature optimization processing and output the optimized feature map.
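As a non-limiting illustration of how the input layer in claim 7 reduces spatial size, the following Python sketch applies the standard output-size formula for convolution and pooling, floor((size + 2 * padding - kernel) / stride) + 1, first to the seventh processing unit and then to the spatial compression unit. The 7 x 7 stride-2 convolution, the 3 x 3 stride-2 pooling, and the 224 x 224 input are assumed values that the claim does not specify; treating the spatial compression unit as a pooling operation is likewise an assumption.

```python
from typing import Tuple


def output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def input_layer_shape(
    height: int,
    width: int,
    conv: Tuple[int, int, int] = (7, 2, 3),  # (kernel, stride, padding), assumed
    pool: Tuple[int, int, int] = (3, 2, 1),  # (kernel, stride, padding), assumed
):
    """Return the spatial sizes of the initial and the size-reduced feature maps."""
    initial = (output_size(height, *conv), output_size(width, *conv))
    reduced = (output_size(initial[0], *pool), output_size(initial[1], *pool))
    return initial, reduced


initial_map, reduced_map = input_layer_shape(224, 224)
print(initial_map, reduced_map)  # (112, 112) (56, 56)
```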
8. The OCT image classification method according to claim 1, characterized in that the step of outputting the classification result includes: determining the weight value of the balance factor according to the sample category identifier; calculating the sample attention weight value based on the classification result and the focus factor; generating the adjustment coefficient by multiplying the weight value of the balance factor by the sample attention weight value; and outputting the final loss value by multiplying the base loss value by the adjustment coefficient.
9. The OCT image classification method according to claim 8, characterized in that the step of determining the weight value of the balance factor according to the sample category identifier includes: if the sample category identifier indicates the target category, determining the weight value of the balance factor to be a first weight value; if the sample category identifier indicates a non-target category, determining the weight value of the balance factor to be a second weight value; wherein the sum of the first weight value and the second weight value is 1.
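As a non-limiting illustration of the loss computation in claims 8 and 9, the following Python sketch follows the recited steps: it selects the balance factor weight from the sample category identifier, computes a sample attention weight, multiplies the two to obtain the adjustment coefficient, and scales the base loss by that coefficient. The claims do not give formulas, so taking the base loss as cross-entropy, -log(p), and the sample attention weight as (1 - p) raised to the focus factor (the form used by focal loss) are assumptions, as are the default values 0.25 and 2.0 and the function names.

```python
import math


def balance_weight(is_target: bool, first_weight: float) -> float:
    """Claim 9: first weight for the target category, 1 - first weight otherwise."""
    if not 0.0 <= first_weight <= 1.0:
        raise ValueError("first_weight must lie in [0, 1]")
    return first_weight if is_target else 1.0 - first_weight


def final_loss(
    p_true: float,               # predicted probability of the sample's true category, in (0, 1]
    is_target: bool,             # sample category identifier
    first_weight: float = 0.25,  # assumed value
    focus_factor: float = 2.0,   # assumed value
) -> float:
    base_loss = -math.log(p_true)                 # assumed base loss: cross-entropy
    attention = (1.0 - p_true) ** focus_factor    # assumed form of the attention weight
    adjustment = balance_weight(is_target, first_weight) * attention
    return base_loss * adjustment


easy = final_loss(0.9, is_target=True)  # confidently correct sample
hard = final_loss(0.5, is_target=True)  # uncertain sample
```

Under these assumptions a confidently classified sample contributes far less to the loss than an uncertain one, and the two balance weights always sum to 1, as claim 9 requires.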