Corn disease detection method based on lightweight transformer
By combining a lightweight SwinTransformer network and a feature pyramid network, the deployment problem of the maize leaf lesion detection model on computing-limited devices was solved, the detection accuracy of small-scale lesions was improved, and efficient maize disease detection was achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JILIN UNIVERSITY
- Filing Date
- 2026-04-14
- Publication Date
- 2026-06-16
AI Technical Summary
In existing technologies, maize leaf lesion detection models have difficulty capturing the long-distance dependencies of small lesions in global feature modeling. They involve huge computational loads and redundant parameters, making them difficult to deploy on devices with limited computing power, such as drones. Furthermore, they are prone to losing small target features in complex farmland backgrounds, resulting in insufficient detection accuracy.
A lightweight SwinTransformer network is used for feature extraction. Combined with a feature pyramid network and a dynamic detection head, the model parameters are optimized through multi-scale feature fusion and a weighted bounding box loss function to generate a high-precision maize disease detection model.
It improves the accuracy of small-scale lesion identification to over 85%, achieving high-precision and rapid detection in corn disease detection, and is suitable for deployment on edge devices.
Figure CN122023784B_ABST
Description
Technical Field
[0001] This invention relates to the fields of smart agriculture and plant protection technology, and in particular to a method for detecting maize diseases based on a lightweight Transformer.
Background Technology
[0002] Corn is an important food crop and feed source in China, and pests and diseases (such as rust, leaf spot, and aphids) during its growth significantly reduce yield and quality. The main technical challenges in existing corn pest and disease control practices are as follows:
[0003] Maize leaf lesions are typically small (area < 32×32 pixels). CNN models are limited in global feature modeling, which makes it difficult for them to capture the long-distance dependencies of small lesions. The standard Vision Transformer has global modeling capability, but it is computationally intensive and has high parameter redundancy, making it difficult to deploy on edge devices with limited computing power, such as drones. In complex farmland environments (lighting variation, occlusion), existing models are prone to losing small-target features, leading to insufficient detection accuracy. The Feature Pyramid Network (FPN) uses standard interpolation upsampling, which struggles to preserve the detail of small targets when fusing multi-scale features and so weakens their feature representation. Standard detection heads also adapt poorly to targets of different scales, making it difficult to specifically strengthen small-target detection. In addition, the standard loss function effectively ignores small bounding boxes during training because of their small pixel contribution, so model optimization favors large targets and small-target detection accuracy is weakened further.
[0004] Therefore, there is an urgent need for a corn disease detection method based on a lightweight Transformer.
Summary of the Invention
[0005] This invention provides a method for detecting corn diseases based on a lightweight Transformer, in order to solve the above-mentioned problems existing in the prior art.
[0006] To achieve the above objectives, the present invention provides the following technical solution:
[0007] A method for detecting maize diseases based on a lightweight Transformer includes:
[0008] S1: Input the corn image into the feature extraction module with SwinTransformer as the backbone network, and generate multi-level feature representations through multi-level feature extraction;
[0009] S2: Based on multi-level feature representation, multi-scale feature fusion is performed through a feature pyramid network to generate fused multi-scale features;
[0010] S3: Based on the fused multi-scale features, attention processing is performed on the feature maps of each resolution level through a dynamic detection head to generate classification scores and bounding box prediction results for each resolution level;
[0011] S4: Based on the classification score and bounding box prediction results, the training loss is calculated using the weighted bounding box loss function. A weight coefficient higher than the standard weight is applied to bounding boxes with an area lower than the preset threshold. The model parameters are updated through backpropagation to generate the trained detection model.
[0012] S5: Input the maize image to be detected into the trained detection model and perform forward propagation. Perform non-maximum suppression processing on the output bounding box prediction results to generate the detection location and category results of maize diseases.
[0013] Furthermore, step S1 includes:
[0014] S11: Divide the input corn image into non-overlapping image blocks. After flattening each image block, map it to the initial number of channels through a linear projection layer to generate the initial feature representation.
[0015] S12: Input the initial feature representation into multiple stacked SwinTransformerBlocks. Each SwinTransformerBlock uses ordinary window attention and shifting window attention alternately for feature extraction. The feature is downsampled step by step through the PatchMerging layer to build a multi-scale feature hierarchy with the number of channels doubling step by step and the spatial resolution halving step by step, generating a multi-level feature representation.
[0016] Furthermore, step S2 includes:
[0017] S21: Receive multi-level feature representation, which includes multiple feature levels with the number of channels doubling and the spatial resolution halving at each level. Through the neck module of the feature pyramid network, channel unification and scale alignment of each level of features are performed to generate a multi-scale feature list with a unified number of channels, including high-resolution features from shallow networks and low-resolution features from deep networks.
[0018] S22: In the top-down path of the feature pyramid network, content-aware upsampling is performed on the high-resolution feature map. An adaptive recombination kernel is generated through kernel prediction. The high-resolution feature map is then weighted and recombined based on the adaptive recombination kernel to generate the upsampled high-resolution feature map. The original upsampling method is kept unchanged for the low-resolution feature map.
[0019] S23: The upsampled high-resolution feature map is fused with the corresponding low-resolution feature map, and multi-scale features are generated by fusion at each level.
[0020] Furthermore, the PatchMerging layer in step S12 also includes multi-scale feature enhancement processing, which includes:
[0021] S121: The PatchMerging layer performs 2×2 window merging and channel concatenation on the input features to generate intermediate features with 4C channels and a spatial resolution of H / 2×W / 2. The intermediate features are then input into the multi-scale feature enhancement module, where C represents the number of channels, H represents the feature map height, and W represents the feature map width.
[0022] S122: The multi-scale feature enhancement module uses three parallel convolutions with kernel sizes of 3×3, 5×5, and 7×7 to extract features from the intermediate feature tensor, capturing feature information within different receptive fields and generating three feature maps.
[0023] S123: The three feature maps are fused through 1×1 convolution to generate a fused feature representation;
[0024] S124: Input the fused feature representation into the squeeze excitation channel attention module, enhance the feature response of small target diseases through channel-level weighting, and generate an enhanced feature representation; reduce the number of channels from 4C to 2C through the dimension reduction mapping layer to generate a part of the enhanced multi-level feature representation, and pass it into the subsequent SwinTransformerBlock.
[0025] Furthermore, in step S12, after the attention calculation, the SwinTransformerBlock also includes cross-scale self-attention processing. Specifically, in the stacked SwinTransformerBlock sequence, cross-scale self-attention processing is applied to every other block. The cross-scale self-attention processing includes:
[0026] S125: Receive the output features calculated by window attention W-MSA or shifted window attention SW-MSA in the current SwinTransformerBlock and input them into the cross-scale self-attention module;
[0027] S126: The cross-scale self-attention module computes attention features of two different window sizes in parallel, capturing contextual information within the local window and across the window respectively, and generating two attention feature maps;
[0028] S127: Input the two attention feature maps into the fusion layer to integrate cross-scale features and generate cross-scale fused features;
[0029] S128: Input the cross-scale fusion features into the squeeze excitation channel attention module for channel-level weighting to generate enhanced cross-scale attention features, which are then passed to the next layer as the output of the current SwinTransformerBlock.
[0030] Furthermore, step S22 includes:
[0031] S221: Perform channel compression convolution on the high-resolution feature map to generate compressed features for kernel prediction;
[0032] S222: Based on compressed features, an adaptive recombination kernel is generated through a convolutional layer. The adaptive recombination kernel is normalized and reshaped into a kernel tensor corresponding to the spatial location of the feature map.
[0033] S223: Perform preliminary upsampling on the high-resolution feature map to generate a preliminary upsampled feature map;
[0034] S224: Based on the normalized adaptive recombination kernel, the initial upsampled feature map is recombined position by position to generate a content-aware upsampled feature map;
[0035] S225: Perform a post-processing convolution operation on the content-aware upsampled feature map to generate an upsampled feature map that matches the size of the corresponding low-resolution feature map from the deep network.
[0036] Furthermore, step S3 includes:
[0037] S31: Receive the fused multi-scale feature list, which contains multiple feature maps of different resolutions, and independently apply a three-level attention chain (scale attention, spatial attention, and task attention) to the feature map of each resolution level in the list.
[0038] S32: Apply a scale attention mechanism to the feature map of each resolution level to dynamically adjust the response weights of features at different resolutions and generate a scale-weighted feature map.
[0039] S33: Based on scale-weighted feature maps, spatial locations are weighted by a 3×3 convolutional spatial attention layer to generate refined spatial feature maps;
[0040] S34: Based on spatially refined feature maps, the spatial dimension is compressed through global average pooling, and then two levels of 1×1 convolutional layers and ReLU activation function are used to generate task-level feature representations.
[0041] S35: Based on task-level feature representation, classification scores and bounding box prediction results for each resolution level are generated using the classification head and regression head, respectively.
[0042] Furthermore, step S4 includes:
[0043] S41: Calculate the area value of each predicted bounding box in the current training batch;
[0044] S42: Compare the area value of the actual labeled bounding box with the preset area threshold. Mark the bounding box with an area value less than the area threshold as a small target bounding box, and mark the actual labeled bounding box with an area value greater than or equal to the area threshold as a standard weighted bounding box.
[0045] S43: Multiply the regression loss of the predicted bounding box that is actually labeled as a small target bounding box by a preset weight coefficient to generate a weighted bounding box regression loss; keep the weight coefficient of the regression loss of the predicted bounding box that is actually labeled as a standard weighted bounding box at 1.0 to generate a standard bounding box regression loss.
[0046] S44: Summing the weighted bounding box regression loss with the standard bounding box regression loss generates the overall weighted bounding box regression loss, which together with the classification loss constitutes the overall training loss;
[0047] S45: Perform backpropagation based on the overall training loss, update the model parameters, and repeat S41-S45 until the model converges, generating the trained detection model.
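Steps S41 to S44 can be illustrated with a minimal Python sketch. It assumes boxes in (x1, y1, x2, y2) form and uses a plain L1 distance as the per-box regression loss, which is an assumption for illustration; the specification does not fix the base regression loss here. The function name `weighted_box_regression_loss` is hypothetical, and the default threshold (32×32 pixels) and weight coefficient (2.0) are the preset values the specification gives.

```python
def weighted_box_regression_loss(pred_boxes, gt_boxes,
                                 area_threshold=32 * 32, small_weight=2.0):
    """Sum per-box regression losses, up-weighting boxes whose labeled area is small.

    Boxes are (x1, y1, x2, y2). The per-box loss is an L1 distance (assumed).
    """
    total = 0.0
    for pred, gt in zip(pred_boxes, gt_boxes):
        # S42: classify by the area of the ground-truth (labeled) box.
        gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
        weight = small_weight if gt_area < area_threshold else 1.0
        # S43: per-box regression loss, multiplied by its weight.
        l1 = sum(abs(p - g) for p, g in zip(pred, gt))
        # S44: accumulate weighted and standard terms into one sum.
        total += weight * l1
    return total


gt = [(0, 0, 10, 10), (0, 0, 100, 100)]      # areas 100 (small) and 10000 (standard)
pred = [(1, 0, 10, 10), (1, 0, 100, 100)]    # each prediction is off by 1 pixel
loss = weighted_box_regression_loss(pred, gt)
```

Both predictions have the same 1-pixel error, but the small box contributes twice as much to the sum, which is the rebalancing effect the weighting is meant to produce.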
[0048] Furthermore, step S5 includes:
[0049] S51: Decode the bounding box prediction results at each resolution level, convert the prediction results into bounding box coordinates in the image coordinate system, and generate a set of candidate detection boxes;
[0050] S52: Sort the candidate detection boxes in descending order of classification score, select the box with the highest score as the reference box, and perform non-maximum suppression: remove every box whose intersection-over-union (IoU) with the reference box exceeds the preset IoU threshold. Repeat step S52 on the remaining boxes until all candidate detection boxes are processed, generating the final detection box set.
[0051] S53: Based on the final set of detection boxes, output the disease category label, bounding box coordinates and confidence score corresponding to each detection box to complete the detection and location of corn diseases.
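The suppression loop in S52 corresponds to greedy non-maximum suppression, sketched below in plain Python. This is a class-agnostic illustration under stated assumptions: boxes are (x1, y1, x2, y2), the 0.5 IoU threshold is a hypothetical value (the specification only says "preset"), and the helper names `iou` and `nms` are not taken from the source.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    # Sort candidate indices by descending classification score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        ref = order.pop(0)          # current highest-scoring box is the reference
        keep.append(ref)
        # Drop every remaining box that overlaps the reference too much.
        order = [i for i in order if iou(boxes[ref], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The first two boxes overlap with an IoU of 81/119 (about 0.68), so the lower-scoring one is removed and the distant third box is kept.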
[0052] Furthermore, the preset area threshold is 32×32 pixels, and the preset weight coefficient is 2.0.
Compared with the prior art, the present invention has the following advantages:
[0053] This application proposes a lightweight Transformer-based method for detecting maize diseases. Starting from the conventional SwinTransformer model, the method modifies and optimizes the model structure to address the limitations of traditional Convolutional Neural Networks (CNNs) in long-distance dependency modeling and small-target detection accuracy. This improves the accuracy of small-scale lesion recognition to over 85%, thereby achieving high-precision and rapid detection of maize diseases. It provides significant assistance in maize disease control and fills a research gap in maize pest and disease detection using improved models based on visual Transformers.
Attached Figure Description
[0054] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
[0055] Figure 1 is a flowchart of a corn disease detection method based on a lightweight Transformer in an embodiment of the present invention;
[0056] Figure 2 is a diagram of the SwinTransformer architecture in an embodiment of the present invention;
[0057] Figure 3 is a diagram of two consecutive SwinTransformer blocks in an embodiment of the present invention;
[0058] Figure 4 is a schematic diagram of the CGA module in an embodiment of the present invention.
Detailed Implementation
[0059] The preferred embodiments of the present invention will be described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described herein are for illustration and explanation only and are not intended to limit the present invention.
[0060] As shown in Figure 1, an embodiment of the present invention provides a method for detecting maize diseases based on a lightweight Transformer, including:
[0061] S1: Input the corn image into the feature extraction module with SwinTransformer as the backbone network, and generate multi-level feature representations through multi-level feature extraction;
[0062] S2: Based on multi-level feature representation, multi-scale feature fusion is performed through a feature pyramid network to generate fused multi-scale features;
[0063] S3: Based on the fused multi-scale features, attention processing is performed on the feature maps of each resolution level through a dynamic detection head to generate classification scores and bounding box prediction results for each resolution level;
[0064] S4: Based on the classification score and bounding box prediction results, the training loss is calculated using the weighted bounding box loss function. A weight coefficient higher than the standard weight is applied to bounding boxes with an area lower than the preset threshold. The model parameters are updated through backpropagation to generate the trained detection model.
[0065] S5: Input the maize image to be detected into the trained detection model and perform forward propagation. Perform non-maximum suppression processing on the output bounding box prediction results to generate the detection location and category results of maize diseases.
[0066] The following is a detailed description with reference to specific embodiments.
[0067] This embodiment provides a method for detecting maize diseases based on a lightweight Transformer, including steps S1 to S5, as follows:
[0068] Step S1: Input the corn image into the feature extraction module with SwinTransformer as the backbone network, and generate multi-level feature representations through multi-level feature extraction.
[0069] As shown in Figure 2, the overall architecture of the SwinTransformer backbone network is as follows: The input maize disease image first undergoes the PatchPartition operation, which divides the image into non-overlapping image blocks of 4×4 pixels. Each image block is flattened to obtain a 48-dimensional vector (4×4×3=48). Then, the 48-dimensional vector is mapped to the initial number of channels C of the model through a linear projection layer to generate the initial feature representation with a spatial resolution of H / 4×W / 4.
[0070] The initial feature representation is sequentially input into a stacked SwinTransformerBlock across four stages. At the end of each stage, a PatchMerging layer performs downsampling: concatenating features from 2×2 adjacent locations expands the number of channels from C to 4C, then reduces it to 2C via linear projection, halving the spatial resolution. The number of channels in the four stages are C, 2C, 4C, and 8C, respectively, corresponding to spatial resolutions of H / 4, H / 8, H / 16, and H / 32, thus constructing a multi-scale feature hierarchy and generating multi-level feature representations suitable for downstream dense prediction tasks, with computational complexity linearly related to image size.
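The channel and resolution bookkeeping described above can be traced with a short Python sketch. The 224×224 input size and C = 96 are illustrative assumptions, not values given in this specification, and `swin_stage_shapes` is a hypothetical helper name.

```python
def swin_stage_shapes(height, width, base_channels, patch_size=4, num_stages=4):
    """Return (channels, height, width) of the feature map at each backbone stage."""
    total_stride = patch_size * 2 ** (num_stages - 1)
    if height % total_stride or width % total_stride:
        raise ValueError("height and width must be divisible by %d" % total_stride)
    # Patch partition: each 4x4 patch becomes one token with C channels.
    channels, h, w = base_channels, height // patch_size, width // patch_size
    shapes = []
    for _ in range(num_stages):
        shapes.append((channels, h, w))
        # PatchMerging: 2x2 neighbours concatenated (C -> 4C), projected to 2C,
        # and the spatial resolution is halved.
        channels, h, w = channels * 2, h // 2, w // 2
    return shapes


patch_vector_length = 4 * 4 * 3              # flattened RGB patch, 48 values
stage_shapes = swin_stage_shapes(224, 224, base_channels=96)
# stage_shapes == [(96, 56, 56), (192, 28, 28), (384, 14, 14), (768, 7, 7)]
```

Under these assumptions the four stages run at strides 4, 8, 16, and 32, matching the C, 2C, 4C, 8C progression in the text.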
[0071] As shown in Figure 3, SwinTransformerBlocks alternate between regular window attention (W-MSA) and shifted window attention (SW-MSA) for feature extraction. W-MSA divides the feature map into non-overlapping local windows and computes multi-head self-attention independently within each window; SW-MSA shifts the window partition by half the window size relative to W-MSA, establishing connections between adjacent windows. Two consecutive SwinTransformerBlocks therefore use W-MSA and then SW-MSA, achieving cross-window global context modeling while maintaining local computational efficiency, and effectively capturing long-distance dependencies in maize leaf lesions.
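A toy Python sketch can show why the half-window shift connects neighbouring windows. It labels every token on a 4×4 grid with the index of its regular window, then re-partitions after a cyclic shift. The grid size, window size, and helper names are assumptions for illustration, and the attention computation itself (including the masking that Swin-style implementations apply to wrapped-around tokens) is omitted.

```python
def cyclic_shift(grid, shift):
    """Roll a 2D list up and to the left by `shift` positions."""
    rows, cols = len(grid), len(grid[0])
    return [[grid[(r + shift) % rows][(c + shift) % cols] for c in range(cols)]
            for r in range(rows)]


def window_partition(grid, window):
    """Split a 2D list into non-overlapping window x window blocks (flattened)."""
    blocks = []
    for r0 in range(0, len(grid), window):
        for c0 in range(0, len(grid[0]), window):
            blocks.append([grid[r][c]
                           for r in range(r0, r0 + window)
                           for c in range(c0, c0 + window)])
    return blocks


size, window = 4, 2
# Each token is labeled with the index of the regular window it belongs to.
grid = [[(r // window) * (size // window) + (c // window) for c in range(size)]
        for r in range(size)]

regular_windows = window_partition(grid, window)                               # W-MSA
shifted_windows = window_partition(cyclic_shift(grid, window // 2), window)    # SW-MSA
```

Every regular window holds tokens with a single label, whereas the first shifted window holds one token from each of the four regular windows, so attention inside it mixes information across the original window boundaries.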
[0072] Step S1 specifically includes:
[0073] S11: Divide the input corn image into non-overlapping image blocks. After flattening each image block, map it to the initial number of channels C through a linear projection layer to generate the initial feature representation.
[0074] S12: Input the initial feature representation into multiple stacked SwinTransformerBlocks. Each SwinTransformerBlock uses ordinary window attention (W-MSA) and shifted window attention (SW-MSA) alternately for feature extraction. The feature is downsampled step by step through the PatchMerging layer to build a multi-scale feature hierarchy with the number of channels doubling step by step and the spatial resolution halving step by step, generating a multi-level feature representation.
[0075] In step S12, the PatchMerging layer also includes multi-scale feature enhancement processing, specifically including:
[0076] S121: Input the feature tensor of shape [B, 4C, H / 2, W / 2] output from the PatchMerging layer into the Multi-Scale Feature Enhancement (MSFE) module. Here, B is the batch size, C is the number of input channels in the current stage, and H / 2 and W / 2 are the height and width of the downsampled feature map, respectively.
[0077] S122: The MSFE module uses three parallel convolutions with kernel sizes of 3×3, 5×5, and 7×7 to extract features from the input feature tensor. All three convolutions maintain the number of input and output channels at 4C and use appropriate padding to keep the feature map space size unchanged. They capture feature information within the small receptive field (3×3, corresponding to the edge of small lesions), the medium receptive field (5×5, corresponding to medium-scale lesion regions), and the large receptive field (7×7, corresponding to larger lesions and their context) to generate three feature maps.
[0078] S123: After concatenating the three feature maps along the channel dimension, channel fusion is performed through 1×1 convolution, compressing the number of channels from 12C back to 4C, generating a fused feature representation, and achieving efficient integration of multi-scale features.
[0079] S124: The fused feature representation is input into the squeeze-and-excitation (SE) channel attention module. The spatial dimension is squeezed through global average pooling, and then channel-level attention weights are generated through two fully connected layers to perform channel-level weighting on the fused feature representation, thereby enhancing the feature response to small target diseases on maize leaves (such as small lesions and aphid bodies) and generating an enhanced feature representation. The enhanced feature representation is then passed through a dimension reduction mapping layer to reduce the number of channels from 4C to 2C, and converted into a sequence of [B,H / 2×W / 2,2C] to be passed into the subsequent SwinTransformerBlock, ensuring that the output dimension is compatible with the SwinTransformer hierarchical structure.
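The squeeze-and-excitation weighting in S124 can be sketched in plain Python on nested lists. The ReLU after the first fully connected layer and the sigmoid gate after the second follow the conventional SE design and are assumptions here, since the text only names two fully connected layers. Biases are omitted, and the weight matrices `w_reduce` and `w_expand` are hand-picked hypothetical values rather than learned parameters.

```python
import math


def se_channel_attention(features, w_reduce, w_expand):
    """Reweight channels of `features` (list of 2D lists, one per channel).

    Returns (reweighted features, per-channel gates).
    """
    # Squeeze: global average pooling collapses each channel to one number.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in features]
    # Excitation, layer 1: fully connected reduction followed by ReLU (assumed).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    # Excitation, layer 2: fully connected expansion followed by sigmoid (assumed).
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(features, gates)]
    return scaled, gates


features = [[[2.0, 2.0], [2.0, 2.0]],    # channel 0
            [[1.0, 1.0], [1.0, 1.0]]]    # channel 1
w_reduce = [[1.0, -1.0]]                 # 2 channels -> 1 hidden unit
w_expand = [[1.0], [-1.0]]               # 1 hidden unit -> 2 channel gates
scaled, gates = se_channel_attention(features, w_reduce, w_expand)
```

With these toy weights the gate for channel 0 is larger than the gate for channel 1, so channel 0 is emphasized relative to channel 1 while the spatial layout is left unchanged.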
[0080] As shown in Figure 4, by using multiple independent attention heads, the model can capture different semantic relationships from different subspaces, with each head having different dimensions. Finally, the head outputs are concatenated and linearly projected to fuse the multi-head information. In Figure 4, Head 1, Head 2, and Head 3 are attention heads that operate in parallel. Each attention head independently performs query-key-value mapping and attention computation, thereby extracting diverse feature representations of the sequence from different subspaces. A token is the basic unit of the input sequence (such as a word, subword, or feature vector); it carries the query, key, and value in attention computation, and information is transferred between units through token interaction.
[0081] In step S12, SwinTransformerBlock also includes cross-scale self-attention processing after attention computation, specifically including:
[0082] S125: Receives the output features of the attention computation (W-MSA or SW-MSA) in the current odd-indexed SwinTransformerBlock and inputs them into the Cross-Scale Self-Attention (CSSA) module. CSSA is applied only in the odd-indexed SwinTransformerBlock to control computational overhead and maintain overall computational efficiency.
[0083] S126: The CSSA module performs parallel computation of attention features for two different window sizes (e.g., 7×7 and 14×14): The small window (7×7) captures fine texture information within the local window, corresponding to the detailed features of corn lesions; the large window (14×14) captures global context information across the window, corresponding to the relationship between lesions and surrounding leaf areas; two attention feature maps are generated respectively.
[0084] S127: Input the two attention feature maps into the fusion layer for cross-scale feature integration. The fusion layer concatenates the two feature maps along the channel dimension, and then compresses the number of channels back to the original dimension through a 1×1 convolution to generate cross-scale fused features, which makes up for the shortcomings of the original SwinTransformer local window attention (W-MSA / SW-MSA) in global context modeling.
[0085] S128: Input the cross-scale fused features into the SE channel attention module for channel-level weighting. This reuses an SEAttention module with the same structure as the one in the MSFE module to further enhance the fused feature representation, generate enhanced cross-scale attention features, and pass them to the next layer as the output of the current SwinTransformerBlock.
[0086] Step S2: Based on multi-level feature representation, multi-scale feature fusion is performed through feature pyramid network to generate fused multi-scale features.
[0087] Step S2 specifically includes:
[0088] S21: Receive the multi-level feature representation generated by S1, perform channel unification and scale alignment on each level of features through the neck module of the Feature Pyramid Network (FPN), and unify the number of channels of each level of feature map to a preset value (such as 256) through 1×1 convolution, generating a multi-scale feature list with a unified number of channels, with each element corresponding to a resolution level.
[0089] S22: In the top-down path of FPN, content-aware reassembly of features (CARAFE) is performed on high-resolution feature maps from shallow networks (feature maps with larger spatial resolution and containing details of small targets such as tiny lesions of corn diseases) to enhance the preservation of details of small targets; the original bilinear interpolation upsampling method remains unchanged for low-resolution feature maps from deep networks. The CARAFE operation specifically includes:
[0090] S221: Perform channel compression convolution on the high-resolution feature map from the shallow network. The number of channels is compressed to a smaller dimension through 1×1 convolution, generating compressed features for kernel prediction and reducing the computational cost of subsequent kernel prediction.
[0091] S222: Based on compressed features, an adaptive recombination kernel is generated through a convolutional layer (with an upsampling kernel size of 5×5). The adaptive recombination kernel is normalized using Softmax to ensure that the sum of the recombination weights is 1. It is then reshaped into a kernel tensor corresponding to the spatial location of the feature map, with a kernel size of 3×3 (recombination kernel size) and an upsampling factor of 2.
[0092] S223: Perform preliminary upsampling on the high-resolution feature map from the shallow network (using nearest neighbor interpolation) to double the spatial resolution and generate a preliminary upsampled feature map.
[0093] S224: Based on the normalized adaptive recombination kernel, the initial upsampled feature map is recombined positionally with weighted reassembly. Specifically, for each spatial location, the features of its neighboring regions are weighted and summed using the corresponding recombination kernel to generate a content-aware upsampled feature map. This operation enables the upsampling process to adaptively adjust the recombination weights according to the content of the input features, aggregating contextual information within a large receptive field, improving the semantic representation and spatial accuracy of the features. Compared to fixed-weight bilinear interpolation, it is more suitable for preserving small-target disease features against the complex background of maize leaves.
[0094] S225: Perform a post-processing convolution operation (1×1 convolution) on the content-aware upsampled feature map, adjust the number of channels, and generate an upsampled feature map that matches the size of the corresponding low-resolution feature map from the deep network.
[0095] S23: The upsampled high-resolution feature map from the shallow network generated in S22 is fused with the low-resolution feature map from the deep network at the corresponding level to generate fused multi-scale features. The output is a multi-scale feature list (feats), where each element corresponds to a fused feature map at a resolution level for use by the subsequent DynamicHead detection head.
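The reassembly in S222 to S224 can be illustrated with a single-channel Python sketch: nearest-neighbour upsampling by a factor of 2, followed by a per-location weighted sum over a 3×3 neighbourhood whose weights are Softmax-normalized. This follows the order of operations described in the steps above and is a simplification, not a full CARAFE implementation. In the method the kernel logits would come from the kernel-prediction convolution; here they are supplied by hand, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Normalize logits so the resulting weights sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nearest_upsample(fmap, factor=2):
    """Repeat each value `factor` times along both axes."""
    return [[fmap[r // factor][c // factor] for c in range(len(fmap[0]) * factor)]
            for r in range(len(fmap) * factor)]


def reassemble(upsampled, kernel_logits, k=3):
    """Weighted sum over each k x k neighbourhood (zero padding at the border).

    kernel_logits[r][c] holds the k*k logits predicted for output location (r, c).
    """
    rows, cols, half = len(upsampled), len(upsampled[0]), k // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            weights = softmax(kernel_logits[r][c])
            idx = 0
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        out[r][c] += weights[idx] * upsampled[rr][cc]
                    idx += 1
    return out


low_res = [[1.0, 2.0], [3.0, 4.0]]
upsampled = nearest_upsample(low_res)                 # 4 x 4 map
# Hand-made logits that put almost all weight on the centre tap.
center_peaked = [0.0] * 4 + [50.0] + [0.0] * 4
logits = [[center_peaked for _ in range(4)] for _ in range(4)]
content_aware = reassemble(upsampled, logits)
```

With centre-peaked logits the output reproduces the nearest-neighbour result, while other logit patterns would blend neighbouring values. That dependence on predicted, location-specific weights is what distinguishes this from fixed-weight bilinear interpolation.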
[0096] Step S3: Based on the fused multi-scale features, attention processing is performed on the feature maps of each resolution level through a dynamic detection head to generate classification scores and bounding box prediction results for each resolution level.
[0097] This invention replaces the original detection head in the existing SwinTransformer detection framework with a DynamicHead, which consists of an initialization part and a forward propagation part. The initialization part includes scale attention, spatial attention, task attention, a classification head, and a regression head; the forward propagation part is divided into three stages: input, processing, and output. The processing stage sequentially includes applying scale attention, applying spatial attention, calculating task attention, generating classification scores, and generating bounding box predictions.
[0098] The multi-scale output of the SwinTransformer backbone network is upsampled, downsampled, and channel unified by the Neck module, resulting in a multi-scale feature list (feats) with a unified number of channels. The forward method of DynamicHead directly receives this feature list, with each element corresponding to a resolution level. In the MMDetection configuration framework, end-to-end feature extraction and detection are achieved through configuration files.
[0099] Step S3 specifically includes:
[0100] S31: Receives the fused multi-scale feature list (feats) generated by S2, and independently performs three-level attention chain processing (scale attention, spatial attention, and task attention) on the feature map of each resolution level in the list. The multi-scale mechanism is implemented by iterating through the features in the forward loop. Each feature passes through the attention chain and head module independently, and the output is also a per-level list, which facilitates subsequent anchor box generation and non-maximum suppression (NMS) processing.
[0101] S32: Apply a scale attention mechanism to the feature map of each resolution level to dynamically adjust the response weights of features at different scales, enabling the model to adaptively focus on resolution levels that are more discriminative for corn disease detection and generate scale-weighted feature maps.
[0102] S33: Based on scale-weighted feature maps, spatial locations are weighted through spatial attention convolutional layers (3×3 convolution, padding=1 to keep the feature map size unchanged), capturing local spatial context (receptive field is 3×3). The weighting of important pixels is learned through convolutional weights, and spatial features are further refined after scale attention. This helps to suppress background noise in complex backgrounds of cornfields (light changes, leaf occlusion) and highlight disease target areas, generating a refined spatial feature map.
[0103] S34: Based on the spatially refined feature map, global average pooling compresses the spatial dimensions to obtain a channel-level global representation. Two 1×1 convolutional layers are then applied in sequence (the first layer reduces the dimension to ..., followed by ReLU activation; the second layer maps to dimension ...) to generate task-level feature representations and learn channel weights related to the classification task.
[0104] S35: Based on the task-level feature representation, classification scores and bounding box prediction results at each resolution level are generated by the classification head and the regression head, respectively. The classification head outputs the confidence score of each category (rust, large leaf spot, aphids, etc.), and the regression head outputs the position coordinate offsets of the bounding box.
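Two of the operations in S33 and S34 can be illustrated with a small pure-Python sketch. It is written under stated assumptions: feature maps are nested lists, the function names are hypothetical, and the channel reduction in the usage lines (2 channels to 1 and back) is chosen only for illustration, since the text does not give the actual dimensions. The text also does not say how the convolution output is applied as a weight, so the sketch shows only the convolution itself.

```python
from typing import List

Grid = List[List[float]]   # one channel: [height][width]
FeatureMap = List[Grid]    # [channels][height][width]


def conv3x3_same(channel: Grid, kernel: Grid) -> Grid:
    """3x3 cross-correlation with zero padding of 1, so H x W is unchanged."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:  # outside counts as zero
                        acc += kernel[di + 1][dj + 1] * channel[y][x]
            out[i][j] = acc
    return out


def global_average_pool(feat: FeatureMap) -> List[float]:
    """Compress each channel's H x W grid into its mean value."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feat]


def conv1x1(vec: List[float], weight: List[List[float]]) -> List[float]:
    """On a 1x1 spatial map, a 1x1 convolution is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def task_attention(feat: FeatureMap, w_reduce: List[List[float]],
                   w_expand: List[List[float]]) -> List[float]:
    """Pool, reduce with ReLU, then map back to one value per channel."""
    pooled = global_average_pool(feat)
    hidden = [max(0.0, v) for v in conv1x1(pooled, w_reduce)]  # ReLU
    return conv1x1(hidden, w_expand)


# Toy usage with made-up numbers.
ones_kernel = [[1.0] * 3 for _ in range(3)]
refined = conv3x3_same([[1.0, 2.0], [3.0, 4.0]], ones_kernel)

feat = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 4.0]]]
task_weights = task_attention(feat, w_reduce=[[1.0, -1.0]],
                              w_expand=[[2.0], [-1.0]])
```

The padded 3×3 convolution returns a grid of the same size as its input, which is the property S33 relies on to keep the feature map size unchanged.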
[0105] Step S4: Based on the classification score and bounding box prediction results, the training loss is calculated using the weighted bounding box loss function. Weight coefficients higher than the standard weights are applied to bounding boxes with areas below a preset threshold. The model parameters are updated through backpropagation to generate the trained detection model.
[0106] Standard loss functions (such as L1Loss) often ignore small bounding boxes because their pixel contribution is small, causing the model to favor the optimization of large objects. This invention applies higher weights to small target bounding boxes through a weighted bounding box loss function, amplifying their contribution to the total loss. This forces the model to focus more on optimizing small target diseases (such as tiny lesions and aphid bodies) during training, reducing the impact of object size imbalance.
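The effect of the weighting on the balance of the loss can be checked with simple arithmetic. The helper below is a hypothetical illustration, and the loss values passed to it are made-up numbers, not measurements from the invention.

```python
def small_loss_share(small_loss: float, large_loss: float,
                     weight: float = 1.0) -> float:
    """Fraction of the total regression loss that comes from small boxes."""
    weighted_small = weight * small_loss
    return weighted_small / (weighted_small + large_loss)


# Illustrative values only: small boxes contribute 1.0, large boxes 4.0.
share_unweighted = small_loss_share(1.0, 4.0)            # 1 / 5
share_weighted = small_loss_share(1.0, 4.0, weight=2.0)  # 2 / 6
```

With these example numbers, the share of the loss attributed to small boxes rises from one fifth to one third when the weight is 2.0, so their gradients carry proportionally more influence.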
[0107] Step S4 specifically includes:
[0108] S41: Calculate the area of each predicted bounding box in the current training batch.
[0109] S42: Compare the area value of the true labeled bounding box with the preset area threshold (default is 32×32=1024 square pixels). True labeled bounding boxes with an area value less than the area threshold are marked as small target bounding boxes, and true labeled bounding boxes with an area value greater than or equal to the area threshold are marked as standard weighted bounding boxes.
[0110] S43: Multiply the regression loss of the predicted bounding box labeled as a small target bounding box by a preset weight coefficient (default is 2.0) to generate a weighted bounding box regression loss; keep the weight coefficient of the regression loss of the predicted bounding box labeled as a standard weighted bounding box at 1.0 to generate a standard bounding box regression loss. By applying a high weight (default 2.0) to the prediction results of the real small targets, the regression error of the small targets is amplified, forcing the model to prioritize optimizing the localization accuracy of the real small targets. In actual training, the training logs show an increase in the proportion of small target loss, and the bounding box regression of small targets is more accurate after convergence.
[0111] S44: Sum the weighted bounding box regression loss and the standard bounding box regression loss to generate the overall weighted bounding box regression loss, which together with the classification loss constitutes the overall training loss. This balances the training focus and prevents the model from overfitting to large objects.
[0112] S45: Perform backpropagation based on the overall training loss and update all model parameters through a gradient descent algorithm. Stop iterating when the preset number of training epochs is reached or when the mean average precision (mAP) on the validation set no longer improves significantly over a certain number of consecutive epochs, and generate the trained detection model.
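Steps S41 to S44 can be sketched as follows. This is a minimal sketch under several assumptions: boxes are `(x1, y1, x2, y2)` tuples in pixels, each predicted box has already been matched to one labeled box, an L1 loss serves as the base regression loss, and the per-box losses are summed rather than averaged. The function names are hypothetical; the default threshold (32×32) and weight (2.0) are the defaults stated in the text.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def l1_box_loss(pred: Box, target: Box) -> float:
    """Sum of absolute coordinate differences (assumed base loss)."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def weighted_bbox_loss(preds: Sequence[Box], targets: Sequence[Box],
                       area_threshold: float = 32 * 32,
                       small_weight: float = 2.0) -> float:
    """Regression loss with a higher weight for small labeled boxes."""
    total = 0.0
    for pred, target in zip(preds, targets):
        # S42: the labeled box's area decides the weighting group.
        is_small = box_area(target) < area_threshold
        # S43: small targets get small_weight, all others keep 1.0.
        weight = small_weight if is_small else 1.0
        # S44: accumulate both groups into one regression loss.
        total += weight * l1_box_loss(pred, target)
    return total


# Toy usage: one small target (area 100) and one large target (area 10000).
targets = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 100.0, 100.0)]
preds = [(1.0, 0.0, 10.0, 10.0), (0.0, 0.0, 100.0, 98.0)]
loss_weighted = weighted_bbox_loss(preds, targets)
loss_plain = weighted_bbox_loss(preds, targets, small_weight=1.0)
```

A labeled box whose area equals the threshold exactly is treated as a standard box, consistent with the "greater than or equal to" wording of S42.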
[0113] Step S5: Input the maize image to be detected into the trained detection model and perform forward propagation. Perform non-maximum suppression processing on the output bounding box prediction results to generate the detection location and category results of maize diseases.
[0114] Step S5 specifically includes:
[0115] S51: The corn image to be detected is sequentially processed by the SwinTransformer backbone network feature extraction described in S1, the FPN multi-scale feature fusion described in S2, and the DynamicHead detection head processing described in S3. A complete forward propagation is performed, the bounding box prediction results at each resolution level are decoded, and the predicted position offset is combined with the preset anchor box to convert it into the absolute bounding box coordinates in the image coordinate system, generating a set of candidate detection boxes.
[0116] S52: Sort the candidate detection box set in descending order of classification score and perform non-maximum suppression (NMS): retain the detection box with the highest classification score, calculate the intersection-over-union (IoU) of each remaining detection box with that highest-scoring box, remove the overlapping detection boxes whose IoU exceeds a preset IoU threshold (e.g., 0.5), and repeat the process until all candidate boxes are processed. This generates the final detection box set and eliminates duplicate detections of the same disease target.
[0117] S53: Based on the final set of detection boxes, output the disease category label (such as rust, large leaf spot, aphids, etc.), the bounding box coordinates, and the confidence score for each detection box. This completes the detection and localization of corn diseases and provides accurate spatial location information and a basis for category judgment in the prevention and control of agricultural pests and diseases.
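The suppression procedure of S52 can be sketched in plain Python. The sketch assumes the candidate boxes have already been decoded to absolute `(x1, y1, x2, y2)` coordinates as in S51, and it suppresses boxes regardless of category for brevity; the text does not state whether the embodiment applies NMS per category. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Return the indices of the boxes kept, highest score first."""
    # Sort candidate indices by descending classification score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)  # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Toy usage: the first two boxes overlap heavily, the third is separate.
boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0),
         (20.0, 20.0, 30.0, 30.0)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])
```

In the toy usage the second box has an IoU of 81/119 (about 0.68) with the first, so it is removed at the 0.5 threshold while the distant third box is kept.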
[0118] In this embodiment, the above modules are integrated end-to-end through configuration files under the MMDetection framework: the multi-scale output of the SwinTransformer backbone network is converted into a multi-scale feature list with a unified number of channels by the FPN neck module with integrated CARAFE upsampling, and then directly used as the input of DynamicHead. The whole process realizes end-to-end feature extraction and disease detection.
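A configuration of this kind might look like the following MMDetection-style Python config. It is a hedged sketch, not the embodiment's actual file. The type names `SingleStageDetector`, `SwinTransformer`, and `FPN_CARAFE` exist in MMDetection, but whether the embodiment uses them is an assumption, and the stock backbone stands in for the modified backbone described in the text. `DynamicHead` and `WeightedBBoxLoss` are hypothetical custom modules that would have to be registered separately. The backbone hyperparameters are common Swin-T values, not values from the text, and `NUM_CLASSES` is a placeholder.

```python
# Hypothetical MMDetection-style config sketch; see the caveats above.
NUM_CLASSES = 3  # placeholder: the text names rust, large leaf spot, aphids, "etc."

model = dict(
    type='SingleStageDetector',  # assumption: the detector type is not stated
    backbone=dict(
        type='SwinTransformer',  # stock backbone standing in for the modified one
        embed_dims=96,
        depths=[2, 2, 6, 2],
        num_heads=[3, 6, 12, 24],
        window_size=7,
        out_indices=(0, 1, 2, 3),  # four levels, channels double per level
    ),
    neck=dict(
        type='FPN_CARAFE',  # FPN with CARAFE upsampling in the top-down path
        in_channels=[96, 192, 384, 768],
        out_channels=256,  # unified channel count for the feature list
        num_outs=4,
        upsample_cfg=dict(type='carafe', up_kernel=5, up_group=1,
                          encoder_kernel=3, encoder_dilation=1),
    ),
    bbox_head=dict(
        type='DynamicHead',  # hypothetical custom head, not a stock module
        num_classes=NUM_CLASSES,
        in_channels=256,
        loss_bbox=dict(
            type='WeightedBBoxLoss',  # hypothetical custom loss
            area_threshold=32 * 32,   # default threshold from the text
            small_weight=2.0,         # default weight from the text
        ),
    ),
)
```

The neck's `out_channels` matches the head's `in_channels`, reflecting the unified number of channels that lets the head consume the feature list directly.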
[0119] Through the synergistic effect of the above technical solutions, the present invention achieves the following technical effects: DynamicHead's multi-scale, spatial, and task-level attention mechanisms enhance the detection capability for small-target pests and diseases and are expected to improve the average precision of small targets by 5% to 8%; the application of the CARAFE module in the top-down path of the FPN improves the semantic consistency of feature maps, enhances multi-scale feature fusion performance, and further improves ... by approximately 5% to 8%; the MSFE module enhances the feature representation of small objects through multi-scale convolutions (3×3, 5×5, 7×7) and SE channel attention, with a computational cost increase of only about 5% to 10% in floating-point operations (FLOPs), making it suitable for deployment on edge devices such as GPUs and drones; the CSSA module compensates for the shortcomings of the original SwinTransformer's local window attention in global context modeling, enhancing the fusion capability of cross-scale features; the weighted bounding box loss function applies a 2.0x weight to the bounding boxes of small objects and is expected to improve average precision by 1% to 5%, and the overall mean average precision (mAP) improves accordingly, ultimately increasing the accuracy of small-scale lesion identification to over 85%.
[0120] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from the spirit and scope of this invention.
Claims
1. A method for detecting maize diseases based on a lightweight Transformer, characterized in that the method includes: S1: Input the corn image into the feature extraction module with SwinTransformer as the backbone network, and generate multi-level feature representations through multi-level feature extraction; S2: Based on the multi-level feature representations, perform multi-scale feature fusion through a feature pyramid network to generate fused multi-scale features; S3: Based on the fused multi-scale features, perform attention processing on the feature maps of each resolution level through a dynamic detection head to generate classification scores and bounding box prediction results for each resolution level; S4: Based on the classification scores and bounding box prediction results, calculate the training loss using the weighted bounding box loss function, apply a weight coefficient higher than the standard weight to bounding boxes with an area lower than the preset threshold, and update the model parameters through backpropagation to generate the trained detection model; S5: Input the maize image to be detected into the trained detection model and perform forward propagation, then perform non-maximum suppression on the output bounding box prediction results to generate the detection location and category results of maize diseases. Step S1 includes: S11: Divide the input corn image into non-overlapping image blocks. After flattening each image block, map it to the initial number of channels through a linear projection layer to generate the initial feature representation. S12: Input the initial feature representation into multiple stacked SwinTransformerBlocks. Each SwinTransformerBlock alternates between regular window attention and shifted window attention for feature extraction.
After downsampling through the PatchMerging layer, a multi-scale feature hierarchy is constructed with the number of channels doubling and the spatial resolution halving step by step, generating a multi-level feature representation. The PatchMerging layer in step S12 also includes multi-scale feature enhancement processing, which includes: S121: The PatchMerging layer performs 2×2 window merging and channel concatenation on the input features to generate intermediate features with 4C channels and a spatial resolution of H / 2×W / 2. The intermediate features are then input into the multi-scale feature enhancement module, where C represents the number of channels, H represents the feature map height, and W represents the feature map width. S122: The multi-scale feature enhancement module uses three parallel convolutions with kernel sizes of 3×3, 5×5, and 7×7 to extract features from the intermediate feature tensor, capturing feature information within different receptive fields and generating three feature maps. S123: The three feature maps are fused through 1×1 convolution to generate a fused feature representation; S124: Input the fused feature representation into the squeeze excitation channel attention module, enhance the feature response of small target diseases through channel-level weighting, and generate an enhanced feature representation; reduce the number of channels from 4C to 2C through the dimension reduction mapping layer to generate a part of the enhanced multi-level feature representation, and pass it into the subsequent SwinTransformerBlock; In step S12, after the attention calculation, the SwinTransformerBlock also includes cross-scale self-attention processing. Specifically, in the stacked SwinTransformerBlock sequence, cross-scale self-attention processing is applied to every other block. 
The cross-scale self-attention processing includes: S125: Receive the output features calculated by window attention W-MSA or shifted window attention SW-MSA in the current SwinTransformerBlock and input them into the cross-scale self-attention module; S126: The cross-scale self-attention module computes attention features of two different window sizes in parallel, capturing contextual information within the local window and across the window respectively, and generating two attention feature maps; S127: Input the two attention feature maps into the fusion layer to integrate cross-scale features and generate cross-scale fused features; S128: Input the cross-scale fusion features into the squeeze excitation channel attention module for channel-level weighting to generate enhanced cross-scale attention features, which are then passed to the next layer as the output of the current SwinTransformerBlock.
2. The method for detecting maize diseases based on lightweight Transformer according to claim 1, characterized in that, Step S2 includes: S21: Receive multi-level feature representation, which includes multiple feature levels with the number of channels doubling and the spatial resolution halving at each level. Through the neck module of the feature pyramid network, channel unification and scale alignment of each level of features are performed to generate a multi-scale feature list with a unified number of channels, including high-resolution features from shallow networks and low-resolution features from deep networks. S22: In the top-down path of the feature pyramid network, content-aware upsampling is performed on the high-resolution feature map. An adaptive recombination kernel is generated through kernel prediction. The high-resolution feature map is then weighted and recombined based on the adaptive recombination kernel to generate the upsampled high-resolution feature map. The original upsampling method is kept unchanged for the low-resolution feature map. S23: The upsampled high-resolution feature map is fused with the corresponding low-resolution feature map, and multi-scale features are generated by fusion at each level.
3. The method for detecting maize diseases based on a lightweight Transformer according to claim 2, characterized in that, Step S22 includes: S221: Perform channel compression convolution on the high-resolution feature map to generate compressed features for kernel prediction; S222: Based on compressed features, an adaptive recombination kernel is generated through a convolutional layer. The adaptive recombination kernel is normalized and reshaped into a kernel tensor corresponding to the spatial location of the feature map. S223: Perform preliminary upsampling on the high-resolution feature map to generate a preliminary upsampled feature map; S224: Based on the normalized adaptive recombination kernel, the initial upsampled feature map is recombined position by position to generate a content-aware upsampled feature map; S225: Perform a post-processing convolution operation on the content-aware upsampled feature map to generate an upsampled feature map that matches the size of the corresponding low-resolution feature map from the deep network.
4. The method for detecting maize diseases based on a lightweight Transformer according to claim 1, characterized in that, Step S3 includes: S31: Receive the fused multi-scale feature list, which contains multiple feature maps of different resolutions, and independently perform three-level attention chain processing of scale attention, spatial attention and task attention on the feature map of each resolution level in the list. S32: Apply a scale attention mechanism to the feature map of each resolution level to dynamically adjust the response weights of features at different resolutions and generate a scale-weighted feature map. S33: Based on the scale-weighted feature maps, weight the spatial locations through a 3×3 convolutional spatial attention layer to generate refined spatial feature maps; S34: Based on the spatially refined feature maps, compress the spatial dimension through global average pooling, and then use two 1×1 convolutional layers with a ReLU activation function to generate task-level feature representations. S35: Based on the task-level feature representations, generate classification scores and bounding box prediction results for each resolution level using the classification head and regression head, respectively.
5. The method for detecting maize diseases based on a lightweight Transformer according to claim 1, characterized in that, Step S4 includes: S41: Calculate the area value of each predicted bounding box in the current training batch; S42: Compare the area value of the actual labeled bounding box with the preset area threshold. Mark the bounding box with an area value less than the area threshold as a small target bounding box, and mark the actual labeled bounding box with an area value greater than or equal to the area threshold as a standard weighted bounding box. S43: Multiply the regression loss of the predicted bounding box that is actually labeled as a small target bounding box by a preset weight coefficient to generate a weighted bounding box regression loss; keep the weight coefficient of the regression loss of the predicted bounding box that is actually labeled as a standard weighted bounding box at 1.0 to generate a standard bounding box regression loss. S44: Summing the weighted bounding box regression loss with the standard bounding box regression loss generates the overall weighted bounding box regression loss, which together with the classification loss constitutes the overall training loss; S45: Perform backpropagation based on the overall training loss, update the model parameters, and repeat S41-S45 until the model converges, generating the trained detection model.
6. The method for detecting maize diseases based on a lightweight Transformer according to claim 1, characterized in that, Step S5 includes: S51: Decode the bounding box prediction results at each resolution level, convert the prediction results into bounding box coordinates in the image coordinate system, and generate a set of candidate detection boxes; S52: Sort the candidate detection box set in descending order of classification score, select the detection box with the highest classification score as the reference box, perform non-maximum suppression processing, remove the detection boxes whose intersection-over-union (IoU) with the reference box exceeds the preset IoU threshold, repeat the process of step S52 until all candidate detection boxes are processed, and generate the final detection box set. S53: Based on the final set of detection boxes, output the disease category label, bounding box coordinates and confidence score corresponding to each detection box to complete the detection and location of corn diseases.
7. The method for detecting maize diseases based on a lightweight Transformer according to claim 5, characterized in that, The preset area threshold is 32×32 pixels, and the preset weight coefficient is 2.0.