A method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring.

The method enhances small target segmentation by combining improved pyramid pooling and edge branch monitoring, addressing feature disappearance and category imbalance, achieving high accuracy and clear boundaries in a lightweight model.

JP7875569B1 | Active | Publication Date: 2026-06-18 | GUILIN UNIV OF ELECTRONIC TECH

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: GUILIN UNIV OF ELECTRONIC TECH
Filing Date: 2025-12-25
Publication Date: 2026-06-18

IPC: G06T7/00; G06N3/0464; G06T7/11; G06V10/40

AI Tagging

Application Domain

Image analysis; Character and pattern recognition

AI Technical Summary

Technical Problem

Conventional semantic segmentation networks struggle with low accuracy in processing small targets due to feature disappearance, discrepancies in receptive fields, and category imbalance, leading to blurred boundaries and insufficient attention on small targets.

Method used

A method combining improved pyramid pooling and edge branch monitoring, utilizing MobileNet V3 for feature extraction, enhanced feature maps, and a feature calibration enhancement module to optimize feature quality and boundary clarity.

Benefits of technology

Significantly improves small target segmentation accuracy and boundary clarity while maintaining a lightweight model suitable for mobile devices and real-time systems.

Abstract

[Problem] To provide a method for fine segmentation of small targets that combines improved pyramid pooling with edge branch monitoring. [Solution] A dataset is selected, the dataset images are preprocessed, and features are extracted from the preprocessed images by the backbone network MobileNetV3 to generate four sets of feature maps F1 to F4. F1 and F2 are input to the ABG module to generate a feature map F5 containing edge information. F4 is sent to an improved ASPP structure to obtain a feature map F6 containing multi-scale context information. At the same time, F3 and F4 are input to the FCE module for feature fusion to complement spatial details. Finally, F5 and F6 are combined and fused, enhanced by the FCE module, and, in combination with monitoring of the feature maps F3 and F4, upsampled stepwise to the original size to obtain the final segmented image.

Description

Technical Field

[0001] The present invention relates to the technical field of computer vision, and specifically, to a method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring.

Background Art

[0002] Deep learning has been widely applied in computer vision over the past decade. Semantic segmentation is one of the core tasks in computer vision and aims to assign a semantic label to each pixel in an image. With the development of deep learning, fully convolutional networks (FCNs), U-Net, and the DeepLab series (e.g., DeepLabV3+) have become representative architectures in this field. These models capture multi-scale context information through modules such as encoder-decoder structures, feature fusion, and spatial pyramid pooling, significantly improving segmentation accuracy.

[0003] However, the currently mainstream semantic segmentation networks still have significant defects in processing small targets. Small targets (e.g., traffic signs, pedestrians, bicycles, micro-lesions in medical images, etc.) usually occupy a very small proportion in an image, and the accuracy during segmentation is usually very low. The difficulties in their segmentation are mainly due to the following points.

[0004] Problem of feature disappearance: In the downsampling process of the deep network, the detailed information and spatial features of small targets are easily buried or disappear, so that the deep feature map cannot effectively represent small targets.

[0005] Discrepancy in receptive fields: Large receptive field modules (e.g., dilated convolution) designed to recognize large targets bring an excessive smoothing effect to small targets and lack sufficient detailed responses, resulting in blurred boundaries and ultimately the disappearance of the targets.

[0006] Category imbalance: Smaller targets, due to their low pixel occupancy in the image, are more susceptible to the influence of the major categories during model training and receive insufficient attention from the model.

[0007] To address the above issues, several improvements to conventional technologies have been proposed. For example, HRNet preserves spatial detail by maintaining high-resolution feature maps, some methods enhance boundary sensing by introducing attention mechanisms or edge monitoring, and hybrid dilated convolution (HDC) is used to mitigate the "gridding effect" of dilated convolution. Furthermore, Transformer-based models (e.g., SegFormer) have recently been widely applied in semantic segmentation and can effectively model long-range dependencies, but their extremely high parameter counts and computational complexity make them difficult to deploy on mobile devices or in real-time systems.

[0008] This invention uses MobileNetV3 for feature extraction, generates an enhanced feature map F5 from the shallow features using the ABG module, and generates F6 from the deep features using an improved multi-branch ASPP. Furthermore, an FCE module is introduced to enhance the semantic information of small targets, and the small-target information is integrated with high quality during the fusion stage of F5 and F6. This differs from the general-purpose multi-scale fusion structure described in the prior art (e.g., CN116797910B): it is an improved design that combines a small-target semantic enhancement module with a lightweight backbone.

[0009] Although many efforts have been made so far, conventional technologies still suffer from technical bottlenecks in semantic segmentation tasks with small targets, such as difficulty in balancing model complexity and segmentation accuracy, insufficient feature extraction of small targets, and significant loss of boundary details.

[Prior Art Documents]

[Patent Documents]

[0010] [Patent Document 1] Chinese Patent Publication CN116797910B

[Summary of the Invention]

[Problem to be Solved by the Invention]

[0011] The objective of this invention is to provide a method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring, a novel method that keeps the model lightweight while significantly improving the segmentation accuracy and boundary clarity of small targets.

[Means for Solving the Problem]

[0012] To achieve the above objective, the present invention provides a method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring, comprising the following steps:
Step 1: Select the dataset to be segmented, preprocess it, and fix the image size to 512×512.
Step 2: Input the preprocessed dataset into MobileNetV3 and extract four sets of feature maps F1, F2, F3, and F4.
Step 3: Input the feature maps F1 and F2 into the ABG module to generate the feature map F5.
Step 4: Input the feature map F4 into the improved ASPP structure and perform multi-branch fusion to generate the feature map F6.
Step 5: Apply the FCE module to enhance the semantic features of small targets and optimize feature quality.
Step 6: Input the feature maps F5 and F6 into the FCE module for feature fusion, then perform upsampling and cross-layer fusion, and output the segmentation result.

[0013] Optionally, the core principles to be followed during preprocessing in Step 1 are: do not destroy the spatial alignment between the image and the label, do not lose important semantic information, and adapt to the input requirements of the model.

[0014] Optionally, in Step 2, the four feature maps F1, F2, F3, and F4 represent different stages of the feature extraction process of the backbone network MobileNetV3, and all are produced by MobileNetV3 bottleneck units. The size of the four feature maps decreases sequentially while the number of channels increases sequentially: F1 is 1/4 of the original image size, F2 is 1/8, F3 is 1/16, and F4 is 1/32.

[0015] Optionally, Step 3 is executed as follows:
Step 3.1: Match the sizes, that is, adjust the number of channels of F1 and F2 to be equal using 1×1 convolution, and then either downsample F1 with stride 2 or upsample F2.
Step 3.2: Combine the aligned feature maps and enhance the nonlinear representation using the ReLU activation function.
Step 3.3: Compress the number of channels to 1 using 1×1 convolution and generate a spatial attention map by sigmoid activation.
Step 3.4: Upsample the spatial attention map once to align it with F1, and then multiply it element-wise with F1 to generate a feature map F5 containing accurate edge information, which is used for additional monitoring of the network.

[0016] Optionally, the improved ASPP structure includes three parallel branches: a DHDC module branch, an SP branch, and a 1×1 convolutional branch. The DHDC module branch consists of three parallel sets of hybrid dilated convolution (HDC) submodules, each containing three 3×3 dilated convolutions connected in series, with dilation rate combinations of (1,2,3), (2,4,6), and (3,6,9), respectively. The SP branch includes a horizontal branch, which performs adaptive average pooling over the height dimension, and a vertical branch, which performs adaptive average pooling over the width dimension. The 1×1 convolutional branch performs channel compression and feature fusion, balancing computational overhead and feature representation capability.

[0017] Optionally, feature fusion using the FCE module in Step 5 is executed as follows:
Step 5.1: Adjust the number of channels of F3 and F4 using 1×1 convolution to align the channels and ensure compatibility for fusion.
Step 5.2: Introduce a channel attention mechanism and assign adaptive weights to each channel of the feature map.
Step 5.3: Enhance spatial features using a 3×3 depthwise separable convolution.

[0018] Optionally, the further upsampling and cross-layer fusion in Step 6 specifically involves inputting the fused feature map into the decoder and upsampling it progressively: the size of the feature map is gradually expanded using bilinear interpolation, and spatial details are complemented along the way by sequentially fusing it with the F3 and F2 feature maps processed by the FCE module. After the final upsampling restores it to 512×512 pixels, the semantic category probability of each pixel is output using the Softmax activation function.

[Effects of the Invention]

[0019] The present invention provides a method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring. A dataset is selected, the dataset images are preprocessed, and features are extracted from the preprocessed images by the backbone network MobileNetV3 to generate four sets of feature maps F1 to F4. F1 and F2 are input into the ABG module to generate a feature map F5 containing edge information. F4 is sent to the improved ASPP structure to obtain a feature map F6 containing multi-scale context information. At the same time, F3 and F4 are input into the FCE module for feature fusion to complement spatial details. Finally, F5 and F6 are combined and fused, enhanced by the FCE module, and, in combination with the monitoring of the feature maps F3 and F4, gradually upsampled to the original size to obtain the final segmented image. By integrating these components into one segmentation network, the present invention optimizes the entire pipeline of "feature extraction - edge enhancement - multi-scale modeling - feature calibration - accurate decoding".

Brief Description of the Drawings

[0020] To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for the description are briefly introduced below. The drawings described below are merely examples of the embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.
[Figure 1] A flowchart of the method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring according to the present invention.
[Figure 2] A structural schematic diagram of the ABG module according to the present invention.
[Figure 3] A structural comparison diagram of the HDC module and the DHDC module.
[Figure 4] A structural schematic diagram of the FCE module according to the present invention.
[Figure 5] A visual comparison of different models on the CamVid dataset in the specific embodiment of the present invention.

[Modes for Carrying Out the Invention]

[0021] The embodiments of the present invention will be described in detail below, with specific examples of the embodiments shown in the drawings. Throughout the drawings, the same or similar reference numerals indicate the same or similar elements, or elements having the same or similar functions. The embodiments described below with reference to the drawings are illustrative and intended to interpret the present invention, and should not be understood as limiting the present invention.

[0022] The meanings of the abbreviations used in this specification are as follows:

[0023] ABG module: An edge gating monitoring module based on an attention mechanism.

[0024] Improved ASPP structure: Improved atrous spatial pyramid pooling structure.

[0025] FCE module: Feature calibration enhancement module.

[0026] DHDC module branch: Dense hybrid dilated convolution module branch.

[0027] SP branch: Strip pooling module branch.

[0028] The present invention is a method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring, comprising the following steps:
Step 1: Select the dataset to be segmented, preprocess it, and fix the image size to 512×512.
Step 2: Input the preprocessed dataset into MobileNetV3 and extract four sets of feature maps F1, F2, F3, and F4.
Step 3: Input the feature maps F1 and F2 into the ABG module to generate the feature map F5.
Step 4: Input the feature map F4 into the improved ASPP structure and perform multi-branch fusion to generate the feature map F6.
Step 5: Apply the FCE module to enhance the semantic features of small targets and optimize feature quality.
Step 6: Input the feature maps F5 and F6 into the FCE module for feature fusion, then perform upsampling and cross-layer fusion, and output the segmentation result.

[0029] Further explanation is provided below with reference to the specific implementation process. Figure 1 is an execution flowchart of the above method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring.

[0030] Step 1: Determine which datasets to segment and perform preprocessing operations such as scaling, cropping, and rotation. The purpose is to increase data diversity so that the model can handle more situations and overfits less, while also unifying the image size to facilitate model processing. Preprocessing operations are diverse and there are no particularly strict criteria, but the core principles are: do not destroy the spatial alignment between images and labels, do not lose important semantic information, and adapt to the input requirements of the model.

[0031] Image scaling and cropping are both used to adapt images to the model's input. Scaling typically operates within a range of 0.5 to 2.0 (i.e., from reducing to 50% of the original size to expanding to 200%), and the aspect ratio must be strictly maintained during scaling (to avoid target distortion, such as a car being stretched). If scaling is unsuitable for the original image (for example, if the original image is relatively large and a significant amount of information would be lost after reduction), cropping is used to obtain a fixed-size input (the model typically requires a fixed H×W input, and the model of this invention uses a 512×512 input). This invention performs cropping for any image larger than 1024×1024. The purpose of rotation is to improve data diversity and enhance the model's robustness to the rotation angle of the target. In semantic segmentation, the rotation angle should not be excessively large, as exceeding 15° can cause significant distortion of the target or excessive overlap between labels and image edges. Therefore, the commonly used range is [-10°, 10°] (i.e., rotating by up to 10° to the left or right), and the range should not exceed [-15°, 15°] even in extreme cases.
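The parameter ranges in this paragraph can be illustrated with a small sampler. This is only a sketch under stated assumptions: the function name `sample_augmentation`, the dictionary layout, and the random crop placement are hypothetical and not taken from the patent. The one property it is meant to show is that a single set of parameters is drawn and then applied to both the image and its label, which keeps the two spatially aligned (labels would normally be resampled with nearest-neighbour interpolation so that class ids are not blended).

```python
import random

CROP = 512                  # fixed model input size stated in the text
SCALE_RANGE = (0.5, 2.0)    # scaling range stated in the text
MAX_ANGLE = 10.0            # commonly used rotation bound, in degrees


def sample_augmentation(width, height, rng=random):
    """Draw one set of parameters to apply identically to image and label."""
    scale = rng.uniform(*SCALE_RANGE)
    # One factor for both axes keeps the aspect ratio unchanged.
    new_w, new_h = round(width * scale), round(height * scale)
    angle = rng.uniform(-MAX_ANGLE, MAX_ANGLE)
    # Random crop origin; 0 when the scaled image is smaller than the crop
    # (padding would then be needed, which is an assumption of this sketch).
    left = rng.randint(0, max(new_w - CROP, 0))
    top = rng.randint(0, max(new_h - CROP, 0))
    return {"scale": scale, "size": (new_w, new_h), "angle": angle,
            "crop_box": (left, top, left + CROP, top + CROP)}


params = sample_augmentation(2048, 1024, random.Random(0))
```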

[0032] Step 2: Features are extracted from the dataset processed in Step 1 by a backbone network to generate feature maps. The backbone network is typically a convolutional neural network (CNN) that progressively extracts high-dimensional information from images through repeated convolution and downsampling. It expands the original three-channel (R, G, B) image into many channels, each carrying semantic information of a different kind, so that the image is easier for the computer to interpret. The backbone network used in this invention is MobileNetV3, a lightweight and efficient convolutional neural network proposed by Google in 2019, which can effectively extract image features without imposing excessive computational and memory loads. The input to the original MobileNet is 224×224; in this invention it is changed to 512×512, while the downsampling ratios of the extraction process are kept unchanged.

[0033] In Step 2, a feature map is output at each of four extraction stages. These four sets of feature maps represent different stages in the backbone network's feature extraction process and are named F1, F2, F3, and F4. The size of the four output feature maps decreases from stage to stage, while the number of channels increases. All four stages consist of MobileNetV3 bottleneck units. A bottleneck unit first expands the channels using a 1×1 convolution, then downsamples the feature map by 1/2 using a stride-2 depthwise separable convolution, and finally compresses the channels using a 1×1 convolution to complete the transformation of the feature dimensions. The first two stages are the relatively shallow layers of the backbone network, where the image is compressed to 1/4 and 1/8 of its original size; they retain much spatial information and are suitable for additional spatial monitoring. The latter two stages are the relatively deep layers, where the image is compressed to 1/16 and 1/32 of its original size; they contain little low-level spatial information, while the number of channels increases to 64 and 160, carrying very rich high-level semantic information. The model of the present invention processes the feature maps of the different stages separately so that the strengths of each are exploited.

[0034] Characteristics of the four feature maps: The two feature maps, F1 (size 1 / 4 of the original image and relatively few channels) and F2 (size 1 / 8 of the original image and the second fewest channels), focus on low-level spatial information (color, contours), while the two feature maps, F3 (size 1 / 16 of the original image) and F4 (size 1 / 32 of the original image and the most channels), focus on high-level semantic information.
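For the 512×512 input used in this invention, the stated ratios translate into concrete spatial sizes. The short calculation below is illustrative only; the function name `feature_map_sizes` is hypothetical, and the strides 4, 8, 16, and 32 are simply the reciprocals of the ratios given in the text.

```python
def feature_map_sizes(input_size=512, strides=(4, 8, 16, 32)):
    """Spatial side length of F1..F4 for a square input."""
    return {f"F{i}": input_size // s for i, s in enumerate(strides, start=1)}


sizes = feature_map_sizes()
print(sizes)  # {'F1': 128, 'F2': 64, 'F3': 32, 'F4': 16}
```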

[0035] Step 3: Input the lower-level feature maps F1 and F2 generated in Step 2 into the ABG module. The operation process in the ABG module is as follows:

[0036] Size matching: Using 1x1 convolution, the number of channels in F1 and F2 are adjusted to be the same. Then, downsampling of F1 by a stride of 2 is performed, or upsampling is performed on F2. In this invention, downsampling is performed on F1, so that the size of both images is unified to 1 / 8 of the original image.

[0037] Feature fusion and activation: Combine aligned feature maps and enhance nonlinear representations with the ReLU activation function.

[0038] Attention map generation: The number of channels is compressed to 1 using 1x1 convolution, and a spatial attention map is generated by sigmoid activation (emphasizing the weights of edge regions).

[0039] Edge feature output: The attention map is upsampled once to align it with F1 and is then multiplied element-wise with F1 to generate a feature map F5 containing accurate edge information, which is used for additional network monitoring. The structural diagram of the ABG module is shown in Figure 2.
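The four ABG operations above can be traced on a toy example. The following is a minimal single-channel sketch on nested lists, not the patented implementation: the scalar weights `w1`, `w2`, and `w_att` stand in for the learned 1×1 convolutions, "combine" is assumed to mean element-wise addition, the stride-2 downsampling simply keeps every second pixel, and the attention map is upsampled by nearest-neighbour repetition. All of these choices, and the name `abg_gate`, are assumptions made for illustration.

```python
import math


def abg_gate(f1, f2, w1=1.0, w2=1.0, w_att=1.0):
    """Toy single-channel ABG gate on nested lists.

    f1 is 2H x 2W (1/4 scale), f2 is H x W (1/8 scale). The scalar weights
    stand in for the learned 1x1 convolutions and are hypothetical.
    """
    h, w = len(f2), len(f2[0])
    # Size matching: 1x1 "convolutions" plus stride-2 downsampling of F1.
    f1_down = [[w1 * f1[2 * i][2 * j] for j in range(w)] for i in range(h)]
    f2_proj = [[w2 * f2[i][j] for j in range(w)] for i in range(h)]
    # Feature fusion and activation: combine (addition assumed), then ReLU.
    fused = [[max(f1_down[i][j] + f2_proj[i][j], 0.0) for j in range(w)]
             for i in range(h)]
    # Attention map: compress to one channel and squash with a sigmoid.
    att = [[1.0 / (1.0 + math.exp(-w_att * v)) for v in row] for row in fused]
    # Edge feature output: 2x upsampling, then element-wise gating of F1.
    return [[f1[i][j] * att[i // 2][j // 2] for j in range(2 * w)]
            for i in range(2 * h)]


f1 = [[1.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 2.0, 0.0],
      [0.0, 0.0, 0.0, 4.0]]
f2 = [[0.0, 0.0],
      [0.0, 1.0]]
f5 = abg_gate(f1, f2)
```

Because the attention values lie strictly between 0 and 1, F5 keeps the resolution of F1 while damping the responses in regions the gate considers unimportant.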

[0040] Step 4: The deep feature map F4 generated in Step 2 is input into the improved ASPP structure (DS-ASPP), which includes three parallel branches, specifically as follows:

[0041] DHDC module branch: A dense hybrid dilated convolution (DHDC) module is designed to replace the fixed-dilation-rate convolution module in conventional ASPP. Module structure: it is composed of three parallel sets of hybrid dilated convolution (HDC) submodules, each containing three 3×3 dilated convolutions connected in series, with dilation rate combinations of (1,2,3), (2,4,6), and (3,6,9), respectively. The core function of this design is to aggregate multi-scale features through dense residual connections, suppressing the "gridding effect" while maintaining a large receptive field and enhancing the response to small-target features. The structures of HDC and DHDC are shown in Figure 3.
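To make the "large receptive field" of each submodule concrete, the receptive field of three stride-1 3×3 dilated convolutions in series can be computed directly: each layer with dilation d widens the field by (3 − 1)·d pixels. The helper name `stacked_receptive_field` is hypothetical, and the calculation ignores the residual connections, so it is a sketch of the per-submodule coverage only.

```python
def stacked_receptive_field(dilations, kernel=3):
    """Receptive field of stride-1 dilated convolutions applied in series."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d  # each layer adds (k - 1) * dilation pixels
    return rf


for rates in [(1, 2, 3), (2, 4, 6), (3, 6, 9)]:
    print(rates, stacked_receptive_field(rates))
# (1, 2, 3) 13
# (2, 4, 6) 25
# (3, 6, 9) 37
```

With a 512×512 input, F4 measures 16×16, so the two larger submodules already cover the whole feature map while the (1,2,3) submodule stays comparatively local.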

[0042] The Strip Pooling (SP) module branch replaces global average pooling and designs a horizontal-vertical dual-branch strip pooling. Horizontal branch: Performs height-dimension adaptive average pooling on F4 (compressing to a size of 1×W), performs dimensionality reduction using 1×1 convolution, and then restores to the original input size using bilinear interpolation while preserving spatial information in the width direction. Vertical branch: Performs width-dimension adaptive average pooling on F4 (compressing to a size of H×1), performs dimensionality reduction using 1×1 convolution, and then restores to the original input size using bilinear interpolation while preserving spatial information in the height direction. Fusion output: Adds the feature maps of the horizontal and vertical branches to generate a feature map that includes the global context and does not lose spatial detail.
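The two strip-pooling branches can be sketched for a single channel as follows. This is an illustrative simplification: the 1×1 dimensionality-reduction convolutions are omitted, and the function name `strip_pool` is hypothetical. Restoring a 1×W or H×1 strip to H×W by bilinear interpolation just repeats the strip along the pooled axis, so the fusion reduces to a broadcast sum.

```python
def strip_pool(x):
    """Add horizontal and vertical strip-pooled context for one H x W channel."""
    h, w = len(x), len(x[0])
    # Horizontal branch: average over the height dimension -> 1 x W strip.
    col_mean = [sum(x[i][j] for i in range(h)) / h for j in range(w)]
    # Vertical branch: average over the width dimension -> H x 1 strip.
    row_mean = [sum(row) / w for row in x]
    # Expanding each strip back to H x W repeats it along the pooled axis,
    # so the fused output is a simple broadcast sum.
    return [[col_mean[j] + row_mean[i] for j in range(w)] for i in range(h)]


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
print(strip_pool(x))  # [[4.5, 5.5, 6.5], [7.5, 8.5, 9.5]]
```

Unlike global average pooling, which would return the single value 3.5 for this input, every output position still depends on its own row and column.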

[0043] 1x1 Convolutional Branch: This method uses 1x1 convolution to perform channel compression and feature fusion on F4, balancing computational overhead and feature representation capabilities.

[0044] Finally, the output features of the DHDC branch, SP branch, and 1x1 convolutional branch are combined and merged to generate a feature map F6 that includes multiscale contextual information.

[0045] Step 5: Before feature fusion, the FCE module is introduced. Its processing is as follows.

[0046] Channel alignment: The number of channels is adjusted using 1x1 convolution for the F3 and F4 feature maps output from the backbone network, as well as the intermediate feature maps from the decoder, to ensure compatibility for fusion.

[0047] Channel attention weighting: Introduce a channel attention mechanism to assign adaptive weights to each channel of the feature map, suppress background noise, and enhance semantic features of smaller targets.

[0048] The channel attention mechanism is a widely used method for improving model performance in convolutional neural networks. By assigning different weights to the channels of a feature map, it emphasizes the channels that contribute most to the task and suppresses irrelevant or redundant channels. It operates in the following steps. First, the feature map is compressed along the spatial dimensions using global max pooling and global average pooling to obtain two 1×1×C descriptors (where C is the number of channels). Next, the two pooled descriptors are passed through a shared multilayer perceptron (MLP) to obtain two 1×1×C feature maps. The first layer of the MLP has C/r neurons, where r is a reduction coefficient that lowers the number of parameters: the larger the coefficient, the lower the computational load, but also the lower the accuracy. In this invention, r = 1.5 was selected through experimental verification after balancing computational load and accuracy. The activation function is ReLU, and the second layer has C neurons. Finally, the two MLP outputs are added together and activated with the Sigmoid function to obtain the final channel attention weight matrix M.

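[0049] The formula for calculating the channel attention is as follows, where F is the input feature map and σ is the Sigmoid function:

$$M = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$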

[0050] The channel attention weight matrix M can effectively guide the model to focus on the channels with the highest contribution to the task, thereby improving the model's accuracy.
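The channel attention computation described above can be followed on a tiny feature map. The sketch below is an assumption-laden illustration rather than the patented implementation: the second pooling operation is taken to be global max pooling, biases are omitted, and the weight matrices `w1` and `w2` are hypothetical values chosen by hand. With C = 3 channels and r = 1.5, the hidden layer has C/r = 2 neurons.

```python
import math


def channel_attention(feat, w1, w2):
    """feat: list of C channels, each a flat list of pixel values.

    w1 is (C/r) x C and w2 is C x (C/r); both are shared between the
    average-pooled and max-pooled descriptors.
    """
    def mlp(v):
        hidden = [max(sum(w * x for w, x in zip(row, v)), 0.0)  # ReLU
                  for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    avg = [sum(ch) / len(ch) for ch in feat]   # 1 x 1 x C
    mx = [max(ch) for ch in feat]              # 1 x 1 x C (max pooling assumed)
    summed = [a + b for a, b in zip(mlp(avg), mlp(mx))]
    return [1.0 / (1.0 + math.exp(-s)) for s in summed]  # weights M in (0, 1)


# C = 3 channels and r = 1.5 give a hidden layer of C / r = 2 neurons.
feat = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 8.0]]
w1 = [[0.5, 0.0, 0.5], [0.0, 1.0, -0.5]]   # hypothetical weights
w2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical weights
M = channel_attention(feat, w1, w2)
weighted = [[m * v for v in ch] for m, ch in zip(M, feat)]
```

The last line shows how M is applied: every pixel of a channel is scaled by that channel's weight, so informative channels are kept nearly intact and the others are attenuated.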

[0051] Spatial feature enhancement: Use a 3×3 depthwise separable convolution to sharpen spatial detail and emphasize the boundary information of small targets.
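A depthwise separable convolution fits the lightweight goal because it needs far fewer weights than a standard convolution of the same kernel size. The comparison below is a sketch with hypothetical helper names; it uses 160 channels, the count the text gives for the deepest stage, although the channel count actually seen by the FCE module after alignment is not stated and is an assumption here.

```python
def conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dw_separable_params(c_in, c_out, k=3):
    """Depthwise k x k per channel, followed by a 1x1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


c = 160  # channel count the text gives for the deepest stage
print(conv_params(c, c), dw_separable_params(c, c))  # 230400 27040
```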

[0052] The structure of the FCE module in Step 5 is shown in Figure 4.

[0053] Step 6: The F5 (edge features) generated in Step 3 and the F6 (multi-scale semantic features) generated in Step 4 are input to the FCE module for enhancement, yielding a fused feature map. The fused feature map is then input to the decoder and upsampled progressively: its size is gradually increased by bilinear interpolation, and spatial details are complemented along the way by sequentially fusing it with the F3 and F2 feature maps processed by the FCE module. After the final upsampling restores it to 512×512 pixels, the semantic category probability of each pixel is output using the Softmax activation function to obtain the segmentation result.
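The final Softmax step turns per-pixel class scores into probabilities, and the segmentation map takes the most probable class at each pixel. The following is a minimal sketch on nested lists; the function names and the two-pixel, three-class example are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logit_map):
    """logit_map[i][j] is a list of class scores; returns class ids per pixel."""
    return [[max(range(len(px)), key=softmax(px).__getitem__) for px in row]
            for row in logit_map]


logit_map = [[[2.0, 0.5, 0.1], [0.2, 0.3, 3.0]]]
print(predict(logit_map))  # [[0, 2]]
```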

[0054] Furthermore, the present invention will be explained and compared through more specific experiments.

[0055] (1) Experimental environment configuration

Because deep learning models are complex, training the model proposed by this invention requires certain hardware and software conditions. The specific configuration is as follows.

[0056] a. Hardware environment: Computing device: Standalone NVIDIA GeForce RTX 3090 graphics card; Storage and Memory: ≥32GB system memory, ≥1TB solid-state drive (used for storing datasets and model weights).

[0057] b. Software environment: Operating system: Ubuntu 20.04 LTS; Programming languages and frameworks: Python 3.9, PyTorch 2.0, CUDA 12.1; Dependent libraries: OpenCV (image processing), NumPy (numerical computation), Scikit-learn (metric calculation), TensorBoard (training visualization).

[0058] (2) Details of the experimental datasets

The training data used in this invention comprises the following four datasets and is applicable to most everyday small-target segmentation tasks.

[0059] a. Cityscapes: Includes 5000 images of finely annotated city scenes with a resolution of 2048×1024, covering 30 semantic categories (including small targets such as traffic lights, pedestrians, and bicycles), and is divided into a training set of 2975 images, a validation set of 500 images, and a test set of 1525 images.

[0060] b. CamVid: Contains 701 video frame annotation images of driving scenes, with a resolution of 960x720, covering 32 semantic categories (including small targets such as traffic signs and rod-like objects), and emphasizes temporal consistency between frames.

[0061] c. PASCAL VOC 2012: Contains 11,530 images of general-purpose scenes, including 1,464 training images and 1,449 validation images, covering 20 foreground categories (including small targets such as small animals and small objects), and offers high scene diversity.

[0062] d. IDDA: Includes over 8000 RGB images of indoor scenes, covers 50 semantic categories (including small targets such as chairs and displays), has multi-viewpoint and multi-lighting variation characteristics, and is applicable to domain-adaptive verification.

[0063] If there are specific segmentation needs (e.g., segmentation of micro-lesions in the medical field or cell segmentation in the biological field), the model can be trained on other appropriate datasets, or on datasets built from scratch, depending on the requirements of the task.

[0064] (3) Training methods

The model proposed by this invention is trained with the Adam optimizer for 240 epochs with a batch size of 8 and an initial learning rate of 1e-3, using a poly-decay strategy. The poly-decay strategy is a common learning rate adjustment method that is widely applied in training deep learning models. Its core idea is to achieve rapid convergence and stable optimization by dynamically adjusting the learning rate with a polynomial decay formula:

$$\mathrm{lr} = \mathrm{base\_lr} \times \left(1 - \frac{\mathrm{epoch}}{\mathrm{num\_epochs}}\right)^{\mathrm{power}}$$

Here, lr is the current learning rate, base_lr is the initial learning rate, epoch is the current training epoch, num_epochs is the total number of training epochs, and power is a parameter that controls the shape of the decay curve and usually lies between 0.5 and 2.0. Experimental verification has shown that, in this invention, the best performance is obtained with a value of 0.7.
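The schedule can be sketched with the hyperparameters stated above (initial learning rate 1e-3, 240 epochs, power 0.7). The function below assumes the standard polynomial-decay form implied by the variable definitions; the name `poly_lr` is hypothetical.

```python
def poly_lr(epoch, base_lr=1e-3, num_epochs=240, power=0.7):
    """Polynomial decay: base_lr * (1 - epoch / num_epochs) ** power."""
    return base_lr * (1.0 - epoch / num_epochs) ** power


# Learning rate at the start, the midpoint, and the last epoch of training.
for epoch in (0, 120, 239):
    print(epoch, poly_lr(epoch))
```

With a power below 1, the rate stays close to its initial value for most of training and drops sharply only near the end.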

[0065] During training, a multi-task joint loss function is used to monitor and optimize the model, enabling joint training of the main segmentation task and the edge monitoring task. The loss function is designed as follows:

[0066] The main branch loss combines cross-entropy loss (weight α = 0.7) with Dice loss (weight 1 − α = 0.3). The cross-entropy loss drives the predicted class probability of each pixel toward its true label, while the Dice loss maximizes the overlap between the predicted regions and the ground-truth labels. Cross-entropy therefore reduces the probability of predicting background as a target class, which yields a better score on the evaluation metric. The Dice loss, in turn, improves the model's ability to segment small targets once background pixels are rarely predicted as targets. This weighting was obtained experimentally in the present invention. It balances category discrimination ability against the influence of the area proportion of small targets, and the weights can be reallocated according to the task emphasis in special situations (e.g., when the background does not need to be predicted, or when the dataset to be predicted contains no small targets).

[0067] $$L_{main} = \alpha \, L_{CE} + (1 - \alpha) \, L_{Dice}, \qquad \alpha = 0.7$$
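The weighted combination of cross-entropy and Dice loss described in [0066] can be illustrated with NumPy. This is a sketch under stated assumptions, not the patent's implementation: the Dice term is averaged over classes, a small `eps` is added for numerical stability, and the function names are hypothetical.

```python
import numpy as np


def cross_entropy(probs: np.ndarray, labels: np.ndarray, eps: float = 1e-7) -> float:
    """Mean pixel-wise cross-entropy; probs is (C, H, W), labels is (H, W) of ints."""
    h, w = labels.shape
    # Probability assigned to the true class of every pixel.
    picked = probs[labels, np.arange(h)[:, None], np.arange(w)[None, :]]
    return float(-np.log(np.clip(picked, eps, 1.0)).mean())


def dice_loss(probs: np.ndarray, labels: np.ndarray, eps: float = 1e-7) -> float:
    """1 - mean per-class Dice coefficient (class averaging is an assumption)."""
    one_hot = np.eye(probs.shape[0])[labels].transpose(2, 0, 1)  # (C, H, W)
    inter = (probs * one_hot).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + one_hot.sum(axis=(1, 2))
    return float(1.0 - ((2.0 * inter + eps) / (denom + eps)).mean())


def main_loss(probs: np.ndarray, labels: np.ndarray, alpha: float = 0.7) -> float:
    """alpha * CE + (1 - alpha) * Dice, with alpha = 0.7 as stated in the text."""
    return alpha * cross_entropy(probs, labels) + (1.0 - alpha) * dice_loss(probs, labels)


# Toy example: a 4x4 image with a single foreground ("small target") pixel.
labels = np.zeros((4, 4), dtype=int)
labels[1, 2] = 1
uniform = np.full((2, 4, 4), 0.5)               # uninformative prediction
perfect = np.eye(2)[labels].transpose(2, 0, 1)  # exact one-hot prediction
print(main_loss(uniform, labels), main_loss(perfect, labels))
```

In the toy example the single foreground pixel barely affects the cross-entropy mean, but it contributes a full per-class term to the Dice loss, which is the balancing effect the paragraph describes.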

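[0068] Edge branch loss: Binary cross-entropy (BCE) loss is used to supervise the learning of the edge feature map output by the ABG module. The formula is as follows, where N is the total number of pixels in the image, y_i is the ground-truth edge label of pixel i, and p_i is the predicted edge probability of pixel i.

$$L_{edge} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$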

[0069] The total loss is given below, where L_main is the main branch loss, L_edge is the edge branch loss, and λ = 0.3 is the weight of the edge monitoring term; the magnitude of this weight can be adjusted according to the task needs of different scenes.

$$L_{total} = L_{main} + \lambda \, L_{edge}$$
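The edge branch loss and the total loss can be sketched as follows. This assumes the standard mean binary cross-entropy over all N pixels and a total of the form main loss + λ × edge loss; the function names and the clipping constant `eps` are illustrative choices, not taken from the patent.

```python
import numpy as np


def edge_bce(edge_prob: np.ndarray, edge_gt: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over all N pixels of the edge map."""
    p = np.clip(edge_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-(edge_gt * np.log(p) + (1.0 - edge_gt) * np.log(1.0 - p)).mean())


def total_loss(main: float, edge: float, lam: float = 0.3) -> float:
    """Main branch loss plus the edge loss scaled by lam (0.3 in the text)."""
    return main + lam * edge


# Toy edge map: a 4x4 image whose second row is an edge.
edge_gt = np.zeros((4, 4))
edge_gt[1, :] = 1.0
edge_prob = np.full((4, 4), 0.5)  # uninformative edge prediction

edge = edge_bce(edge_prob, edge_gt)
print(edge, total_loss(main=0.4, edge=edge))  # main=0.4 is a made-up value
```

Raising `lam` makes boundary errors count for more relative to the main segmentation loss, which is the adjustment the paragraph leaves to the task.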

[0070] After training begins, the model is trained and optimized following the steps of the proposed technique, ultimately yielding a semantic segmentation framework with relatively high accuracy on small targets. The table below compares the accuracy of the model of the present invention with current mainstream semantic segmentation networks, using the segmentation results on CamVid as an example.

[Table: accuracy comparison with mainstream semantic segmentation networks on CamVid (image JPEG0007875569000007.jpg)]

[0071] This embodiment also compares the segmentation accuracy of different models on small targets. The results are shown in the table below, using the CamVid dataset as an example.

[Table: small-target segmentation accuracy of different models on CamVid (image JPEG0007875569000008.jpg)]

[0072] To more clearly demonstrate the accuracy advantage of the method proposed by the present invention in the two tables above, we visualized several representative examples using the CamVid dataset, and the results are shown in Figure 5.

[0073] In Figure 5, the first column shows the original image, the second column the result of comparison model 1 (APFormer, 2024), the third column the result of comparison model 2 (Sigma, 2025), and the fourth column the result of the method of the present invention. The present invention outperforms the other two models in boundary accuracy and in the localization and modeling of small targets.

[0074] In short, the present invention has the following beneficial effects.

[0075] 1. Significant improvement in the accuracy of small target segmentation: The DHDC module combines multi-scale dilated convolution with dense residual connections, which solves the problem that conventional ASPP models small targets insufficiently. On datasets such as Cityscapes and CamVid, the mIoU (mean intersection over union) of small target categories (traffic signs, pedestrians, bicycles, etc.) is improved by 4% to 6% compared to some current mainstream methods.

[0076] 2. More accurate boundary segmentation: The ABG module enhances edge feature extraction for small targets through edge monitoring and attention mechanisms. Combined with noise suppression by the FCE module, it yields segmentation boundaries that are significantly clearer than those of existing models, and the visualization results show no blurring or breaks in edge details.

[0077] 3. Lightweight and easy-to-deploy model: The design combines a MobileNetV3 backbone network with depthwise separable convolution. The model has only 12.8M parameters and 27.1G FLOPs (floating-point operations), making it lighter than most current mainstream methods and suitable for deployment on mobile devices or in real-time systems.

[0078] 4. Strong resistance to overfitting: Multidimensional data augmentation, edge branch monitoring, and a composite loss function effectively mitigate the overfitting caused by the low pixel occupancy of small targets, resulting in stable generalization performance across different datasets.

[0079] 5. Wide applicability to multiple scenes: The method achieves improved mIoU on multiple datasets, including urban driving (Cityscapes, CamVid), general-purpose scenes (PASCAL VOC 2012), and indoor scenes (IDDA), and can be widely applied in fields such as autonomous driving, medical image analysis, and remote sensing image processing.
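The mIoU metric cited in the effects above can be computed from a confusion matrix. The following is a minimal sketch of the standard definition (per-class intersection over union, averaged over the classes that occur); the function names are hypothetical, and the patent does not state how its evaluation was implemented.

```python
import numpy as np


def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """Rows index the ground-truth class, columns the predicted class."""
    idx = gt.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean of per-class IoU = TP / (TP + FP + FN), skipping absent classes."""
    cm = confusion_matrix(pred, gt, num_classes)
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    valid = union > 0
    return float((inter[valid] / union[valid]).mean())


# Toy 2x2 label maps with two classes; one pixel of class 0 is mislabelled as 1.
gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
print(mean_iou(pred, gt, num_classes=2))
```

Because every class contributes equally to the mean regardless of its pixel count, errors on small target categories lower mIoU far more than they lower plain pixel accuracy.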

[0080] The information disclosed herein represents only preferred embodiments of the present invention and does not limit the scope of the claims. Those skilled in the art can understand and implement all or part of the processes of the above embodiments, and equivalent modifications made based on the claims of the present invention remain within the scope of protection of the present invention.

Claims

1. A method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring, characterized by comprising:
Step 1: selecting and preprocessing the dataset to be segmented, and fixing the image size to 512 × 512;
Step 2: inputting the preprocessed dataset into MobileNetV3 and extracting four sets of feature maps F1, F2, F3, and F4;
Step 3: inputting feature maps F1 and F2 into the ABG module to generate feature map F5;
Step 4: inputting feature map F4 into an improved ASPP structure and performing multi-branch fusion to generate feature map F6;
Step 5: introducing the FCE module to enhance the semantic features of small targets and optimize feature quality; and
Step 6: inputting feature maps F5 and F6 into the FCE module to perform feature fusion, further performing upsampling and cross-layer fusion, and outputting the segmentation result.

2. The method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring according to claim 1, characterized in that the core principles to be followed during preprocessing in step 1 include not destroying the spatial alignment between the image and the label, not losing important semantic information, and adapting to the input requirements of the model.

3. The method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring according to claim 2, characterized in that, in step 2, the four feature maps F1, F2, F3, and F4 represent different stages in the feature extraction process of the backbone network MobileNetV3, each belonging to a bottleneck unit of MobileNetV3; the size of the four feature maps decreases successively while the number of channels increases successively; and the size of F1 is 1/4 of the original image, the size of F2 is 1/8 of the original image, the size of F3 is 1/16 of the original image, and the size of F4 is 1/32 of the original image.

4. The method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring according to claim 3, characterized in that the execution process of step 3 is as follows:
Step 3.1: matching the sizes, that is, matching the number of channels of F1 and F2 using 1 × 1 convolution, and then downsampling F1 with a stride of 2 or upsampling F2;
Step 3.2: combining the aligned feature maps and enhancing the nonlinear representation using the ReLU activation function;
Step 3.3: compressing the number of channels to 1 using 1 × 1 convolution and then generating a spatial attention map by sigmoid activation; and
Step 3.4: upsampling the spatial attention map once to align it with feature map F1, and then multiplying it element-wise with F1 to generate a feature map F5 containing accurate edge information, wherein F5 is used for additional monitoring of the network.

5. The method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring according to claim 4, characterized in that the improved ASPP structure includes three parallel branches: a DHDC module branch, an SP branch, and a 1 × 1 convolution branch; the DHDC module branch consists of three parallel sets of hybrid dilated convolution (HDC) submodules, each set containing three 3 × 3 dilated convolutions connected in series, with dilation rate combinations of (1, 2, 3), (2, 4, 6), and (3, 6, 9), respectively; the SP branch includes a horizontal branch and a vertical branch, the horizontal branch performing adaptive average pooling in the height dimension and the vertical branch performing adaptive average pooling in the width dimension; and the 1 × 1 convolution branch performs channel compression and feature fusion, balancing computational overhead and feature representation capability.

6. The method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring according to claim 5, characterized in that the feature fusion process performed by the FCE module in step 5 is as follows:
Step 5.1: adjusting the number of channels of F3 and F4 using 1 × 1 convolution to ensure compatibility for fusion;
Step 5.2: introducing a channel attention mechanism and assigning adaptive weights to each channel of the feature map; and
Step 5.3: enhancing spatial features using a 3 × 3 depthwise separable convolution.

7. The method for fine segmentation of small targets that combines improved pyramid pooling and edge branch monitoring according to claim 6, characterized in that the further upsampling and cross-layer fusion in step 6 comprises: inputting the fused feature map into a decoder and progressively upsampling it, gradually enlarging the feature map by bilinear interpolation; complementing spatial details by sequentially fusing with the F3 and F2 feature maps processed by the FCE module in the preceding steps; and, after the final upsampling restores the resolution to 512 × 512 pixels, outputting the semantic category probability of each pixel using the Softmax activation function.
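For illustration only, the ABG attention gating recited in steps 3.1 to 3.4 of claim 4 can be sketched with NumPy. Several details here are assumptions not fixed by the claims: F1 is brought to F2's size by plain stride-2 subsampling, the aligned maps are combined by addition, the attention map is upsampled by nearest-neighbour repetition, the final multiplication is element-wise, and all weights are random placeholders. The names `conv1x1` and `abg_gate` are hypothetical.

```python
import numpy as np


def conv1x1(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def abg_gate(f1, f2, w1, w2, w_att):
    # Step 3.1: match channel counts with 1x1 convolutions, then bring F1
    # down to F2's spatial size with stride 2 (plain subsampling, assumed).
    a = conv1x1(f1, w1)[:, ::2, ::2]
    b = conv1x1(f2, w2)
    # Step 3.2: combine the aligned maps (addition, assumed) and apply ReLU.
    fused = np.maximum(a + b, 0.0)
    # Step 3.3: compress to one channel and apply a sigmoid to obtain the
    # spatial attention map, shape (1, H/2, W/2).
    att = 1.0 / (1.0 + np.exp(-conv1x1(fused, w_att)))
    # Step 3.4: upsample the attention map by 2 (nearest neighbour, assumed)
    # and multiply it element-wise with F1 to produce F5.
    att_up = att.repeat(2, axis=1).repeat(2, axis=2)
    return f1 * att_up


# Toy shapes: F1 has twice the spatial size of F2, mirroring the 1/4 and 1/8
# scales in claim 3. Channel counts and weights are placeholders.
rng = np.random.default_rng(0)
f1 = rng.standard_normal((16, 32, 32))
f2 = rng.standard_normal((24, 16, 16))
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((8, 24))
w_att = rng.standard_normal((1, 8))

f5 = abg_gate(f1, f2, w1, w2, w_att)
print(f5.shape)
```

Because the sigmoid output lies strictly between 0 and 1, the gate can only attenuate F1, so F5 keeps F1's resolution while de-emphasising locations the attention map scores low.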