Crop leaf disease identification method based on visual-linguistic model

By generating structured semantic descriptions of diseases through a vision-language model and utilizing cross-attention and gating fusion mechanisms to achieve cross-modal alignment and adaptive fusion, this technology solves the problem of insufficient accuracy in single visual modality recognition in existing technologies, improves the accuracy and stability of crop leaf disease identification, and is suitable for mobile deployment.

CN122265698A · Pending · Publication Date: 2026-06-23 · NANTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANTONG UNIV
Filing Date
2026-02-25
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing methods for identifying crop leaf diseases rely on a single visual modality, resulting in insufficient recognition accuracy and robustness, unstable cross-domain generalization, and a lack of textual semantic description and cross-modal alignment mechanisms.

Method used

A visual-language model is used to generate structured semantic descriptions of diseases, and cross-modal alignment and adaptive fusion of image and text features are achieved through cross-attention and gating fusion mechanisms. A lightweight encoder is used to improve recognition performance.

Benefits of technology

It significantly improves the accuracy and stability of crop leaf disease identification, especially in complex environments, and can effectively suppress background interference and noise, making it suitable for mobile deployment.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a crop leaf disease identification method based on a vision-language model, and belongs to the technical field of intelligent identification of crop leaf disease images. First, an original image of a crop leaf is acquired and preprocessed; the preprocessed image is input into a vision-language model to generate a structured disease text description containing three layers of information, namely overall distribution, local lesions, and color and texture; image spatial features and text semantic features are extracted by an image encoder and a pre-trained text encoder, respectively; a cross-attention module aligns the image features under the guidance of the text semantic features; a gating fusion module adaptively weights and selects the multi-modal features; and a post-processing network and a classifier output the disease category. By introducing structured semantic information, the application improves recognition accuracy and adaptability to complex environments; its lightweight design is convenient to deploy on mobile terminals or edge devices, and it can be widely applied to crop disease early warning and precision pesticide application scenarios.

Description

Technical Field

[0001] This invention relates to the field of intelligent image recognition technology for crop leaf diseases, specifically to a method for recognizing crop leaf diseases based on a vision-language model.

Background Technology

[0002] Crop diseases can lead to decreased photosynthetic efficiency and reduced nutrient accumulation in leaves, resulting in yield reduction. Rapid and accurate disease identification is a crucial foundation for disease early warning and precision pesticide application. For a long time, disease identification has mainly relied on the experience of plant protection personnel for visual diagnosis, which suffers from problems such as high labor intensity, low efficiency, subjective influence on diagnostic results, and difficulty in large-scale implementation.

[0003] With the widespread adoption of mobile devices and the development of computer vision, automatic recognition based on leaf images has become a research hotspot. Early methods often combined handcrafted features such as color, texture, and shape with traditional classifiers. However, these handcrafted features are sensitive to lighting conditions, background complexity, and differences between crop varieties, and they struggle to capture the diverse variations in lesion morphology, resulting in limited generalization ability.

[0004] Deep convolutional neural networks can learn image features end-to-end and achieve high classification accuracy under controlled acquisition conditions. However, when the test images come from real field environments or differ from the distribution of the training data, the model performance often drops significantly, reflecting the obvious generalization limitations of relying on a single visual modality in cross-domain scenarios.

[0005] A search of domestic and international literature and patents revealed that the existing patent "A Method for Plant Leaf Disease Detection and Classification Based on the Vision Transformer (ViT) Model" (application number: CN202210713969.8) proposes to extract features from leaf images using a ViT network and combine this with a classification module to output disease category results. Utilizing the Transformer's strong ability to model global dependencies, it can, to some extent, improve feature representation and recognition accuracy in complex backgrounds. However, this approach still relies primarily on a single visual modality and lacks structured utilization of interpretable semantic information such as the "overall distribution, local morphology, color, and texture" of diseases. Furthermore, Transformer-type models are typically more sensitive to the scale of training data and computational resources, requiring a balance between model complexity and real-time performance when deployed on mobile or edge devices.

[0006] To further improve the stability of detection and identification in complex environments, existing technologies have proposed leaf disease detection schemes using lightweight backbone networks. The existing patent, "Plant Leaf Disease Detection Method Based on Improved MobileNetV3" (application number: CN202310741871.8), proposes to improve the lightweight backbone network to reduce the number of parameters and computational overhead, thereby achieving rapid diagnosis of plant diseases and improving engineering deployability. However, this type of method still mainly relies on visual feature learning, and is still a single-modal recognition framework. Its robustness under complex field conditions such as light changes, background interference, occlusion, and fine-grained similar diseases still has room for improvement. Furthermore, it lacks an explicit representation and interpretation output mechanism for lesion semantics, making it difficult to further improve cross-scenario generalization and interpretability while maintaining lightweight design.

[0007] In terms of academic research, the existing paper "A Method for Diagnosing Crop Leaf Diseases Based on Multi-Task Learning" (China Agricultural Science and Technology Herald, 2024, No. 26(1)) introduces channel / spatial attention into MobileNetV3 and combines it with feature pyramids to construct a multi-task network to simultaneously identify crop type, disease type and disease severity. Experiments show that the average accuracy and recall on the three tasks are improved compared to the original model and are better than the comparison models such as MobileNetV3, InceptionV3, and YOLOv7. Its shortcomings are that the method is still mainly based on single visual information, and multi-task learning requires additional disease severity labeling and task design. The data preparation and training process is more complicated, making it difficult to further improve in terms of "interpretable semantic introduction" and "cross-modal robustness".

[0008] The existing paper "Plant Leaf Disease Recognition Based on Contrastive Learning" (Zhejiang Journal of Agricultural Sciences, 2024, Vol. 36(1)) trains a ResNet50 encoder based on self-supervised contrastive learning (such as MoCo-v2, DeepCluster-v2, SwAV, BYOL) and verifies its recognition effect in Linear / Fine-tune mode on Plant Village and a self-built cotton disease dataset. The results show that the highest recognition accuracy can reach 99.83%, and some methods can reach 99.87% in Fine-tune mode. Its shortcomings are that it focuses on improving visual representation ability through contrastive learning, but does not explicitly integrate interpretable semantic information such as lesion morphology, color and texture. Moreover, the overall scheme relies on a heavy backbone network and training strategy, which has limited improvement on semantic alignment and noise resistance in complex field scenarios.

[0009] The existing paper "Lightweight Tomato Leaf Disease Recognition Based on Improved ShuffleNet v2" (published in May 2024, Issue 52(3)) achieves lightweight recognition by introducing improvements such as permutation attention (SA-stage), a lightweight feature fusion module (LFN), a lightweight feature enhancement module (RFB-s), and SPD-Conv. It achieves an accuracy of 96.55% on 10 classes of tomato leaf disease, and its weights occupy only about 1.51 MB, which is an advantage for mobile deployment. Its shortcomings are that the improvements mainly revolve around the single-modal convolutional network structure and feature enhancement. The method works well when adapted to specific crops or data distributions, but it still lacks an effective mechanism to support cross-crop and cross-scene generalization and "text semantic guidance for key lesion alignment".

[0010] In the field of crop leaf disease identification, existing methods still have the following shortcomings: (1) Single-modal visual information is easily affected by background, lighting and occlusion, and cross-domain generalization is unstable; (2) There is a lack of structured text description generation and standardization process for disease diagnosis, which makes it difficult for text semantics to participate stably in identification; (3) The fusion of image features and text features mostly adopts simple splicing or weighted addition, which makes it difficult to achieve fine-grained alignment and easily introduces noisy semantics; (4) There is a lack of mechanism for adaptively selecting more effective modal information based on sample differences, which is not robust enough in complex environments.

[0011] Therefore, there is an urgent need for an identification method that can generate structured semantic descriptions of diseases and achieve cross-modal alignment and adaptive fusion to improve the accuracy, stability and interpretability of crop leaf disease identification.

Summary of the Invention

[0012] Purpose of the invention: To address the problems of insufficient accuracy and robustness in disease identification, unstable cross-domain generalization, and poor interpretability caused by relying solely on single image features in existing technologies, this invention proposes a method for identifying crop leaf diseases based on a vision-language model. By introducing structured semantic information and constructing a cross-modal alignment and adaptive fusion mechanism, the method improves identification performance and engineering deployability.

[0013] Technical solution: The present invention provides a method for identifying crop leaf diseases based on a vision-language model, comprising the following steps:

[0014] Step 1: Obtain the original image of the crop leaf surface and preprocess the image;

[0015] Step 2: Input the preprocessed crop leaf images into the vision-language model to generate a structured text description of the disease that includes the overall distribution features of the disease, the morphological features of local lesions, and the color and texture features.

[0016] Step 3: Construct an image encoder and a pre-trained text encoder to extract spatial features of the image and semantic features of the text, respectively;

[0017] Step 4: Construct a cross-attention module, using image features as the query and text features as the key and value to calculate attention weights. Use these weights to weight the text features to obtain fused features. Finally, combine the original image features with the fused features through residual connections to achieve cross-modal information fusion.

[0018] Step 5: Construct a gated fusion module, dynamically calculate the weights of image and text features according to the fusion type, achieve multimodal feature fusion through weighted summation, and finally optimize the feature representation through a post-processing network.

[0019] Furthermore, step 1 specifically includes the following steps:

[0020] Step 1.1, Image Acquisition: Acquire raw images of crop leaves using a mobile phone camera, industrial camera, or visible light camera from a drone. The image format should be JPG, PNG, or BMP.

[0021] Step 1.2: Convert the input image to RGB three channels uniformly; if the original image is BGR, perform channel rearrangement;

[0022] Step 1.3, Size Normalization and Cropping: Scale and crop the RGB image to the preset input size H×W=224×224 to obtain an image with uniform size;

[0023] Step 1.4, Pixel Normalization: Linearly normalize the image pixel values from the range 0 to 255 to the range 0 to 1, and perform mean-variance normalization; then convert the image from H×W×C format to C×H×W format.

[0024] Furthermore, step 2 specifically includes the following steps:

[0025] Step 2.1, Visual-Language Model Input: The leaf image tensor X is used as the visual input. X has dimensions Batch×3×224×224, where Batch is the number of images input to the model at a time during training, 3 is the number of RGB channels, and 224×224 is the image size. The visual-language model receives this batch of images, combines them with prompt words to perform inference, and outputs a Chinese text description for each image;

[0026] Step 2.2, Prompt Word Template: The prompt word constrains the model output to Chinese and divides the output into three segments or three fields, corresponding to the overall distribution description, local lesion description, and color and texture description, respectively; the local lesion description must include lesion morphology and approximate size information; when the content of a field cannot be determined, the model outputs "no obvious features found" for that field so that every field is always present;

[0027] Step 2.3, Structured Output Format: The visual-language model output adopts fixed template splicing or JSON format output. Finally, the overall distribution description, local lesion description, and color and texture description are combined in a predetermined order to obtain a structured disease text description.

[0028] Furthermore, step 3 specifically includes the following steps:

[0029] Step 3.1, Image Feature Extraction: Construct an image encoder using ShuffleNetV2 as the backbone network, and input the standardized leaf image into it. The input image tensor X has dimensions Batch×C×H×W, where Batch is the number of images input to the model at a time during training, C=3 is the number of RGB channels, and H×W is the image height and width, i.e., 224×224 pixels. The classification output layer of the lightweight ShuffleNetV2 network is removed so that it serves only as a feature extractor, outputting one global image feature vector per image. Each vector is a one-dimensional feature representation of its sample, giving an overall dimension of Batch×d_v, where d_v is the feature dimension obtained after removing the classification layer from ShuffleNetV2;

[0030] Step 3.2, Image Feature Serialization and Dimension Alignment: Apply a projection mapping to the global image feature vector, changing its channel dimension from d_v to the preset unified dimension d=256, resulting in an aligned image feature vector. The aligned image feature is still a per-sample global vector, with dimension Batch×d. This global image vector is then expanded along the sequence dimension to form an image sequence of length N (N=1 in the described embodiment), resulting in an image sequence feature of dimension Batch×N×d.

[0031] Step 3.3, Text Semantic Feature Acquisition: Input the structured disease text description into the pre-trained text encoder for semantic encoding. The text encoder outputs a fixed 768-dimensional feature for each word, giving an overall dimension of Batch×L×768, where L is the length of the text sequence.

[0032] Step 3.4, Text Feature Dimension Alignment: Apply a channel-dimension mapping to the text feature sequence, transforming the feature of each word from 768 dimensions to 256 dimensions to obtain the aligned text feature sequence, with an overall dimension of Batch×L×d, where d is 256. The aligned text feature sequence is then aggregated along the word dimension to obtain a per-sample global text vector of dimension Batch×d.

[0033] Step 3.5, Output: Output the aligned text sequence features, with dimension Batch×L×d, and the image sequence features, with dimension Batch×1×d, as inputs to the cross-attention module and the gating fusion module.

[0034] Furthermore, step 4 specifically includes the following steps:

[0035] Step 4.1 Input Feature Preparation: The text word feature sequence and image feature sequence output from Step 3 are used as inputs to the cross-attention module, where the image feature sequence has a dimension of Batch×1×256 and the text feature sequence has a dimension of Batch×L×256.

[0036] Step 4.2 Calculate attention weights: Using image features as the query and text features as the key and value, calculate the correlation between the two to obtain the weight score matrix. Normalize the score matrix in the word dimension using the Softmax function to obtain the attention weight matrix.

[0037] Step 4.3, Cross-modal weighted aggregation: Use the attention weight matrix to perform weighted aggregation on the text value features to obtain fused features that can be aligned with image features;

[0038] Step 4.4, Residual Update and Normalization: The fused features and the original image features are added element by element to form a residual connection. After the residual addition, layer normalization is applied, and dropout is added to reduce overfitting. The updated image feature dimension is still Batch×1×256.

[0039] Step 4.5, Output Processing: Squeeze the sequence dimension of the three-dimensional image feature (Batch×1×256) to obtain a two-dimensional global vector of dimension Batch×256, and output the fused image features and the weighted aggregated text representation as inputs to Step 5.
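Steps 4.2 to 4.5 can be illustrated for a single sample with the minimal sketch below. It is not the patented implementation: the 1/sqrt(d) scaling, the absence of learned query/key/value projections, and the omission of dropout (as at inference time) are assumptions made for brevity, and the function names are hypothetical. Working per sample with one query vector also makes the Batch×1×256 to Batch×256 squeeze of Step 4.5 implicit.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(vec, eps=1e-5):
    """Layer normalisation without learnable scale or shift."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def cross_attention(img_query, txt_tokens):
    """One image query (length d) attends over L text tokens (each length d).

    Returns the updated image vector, the aggregated text vector, and the
    attention weights over the L tokens.
    """
    d = len(img_query)
    # Step 4.2: query-key scores, scaled by sqrt(d) (an assumed convention),
    # then softmax over the word dimension.
    scores = [sum(q * k for q, k in zip(img_query, tok)) / math.sqrt(d)
              for tok in txt_tokens]
    weights = softmax(scores)
    # Step 4.3: weighted sum of the text value vectors.
    attended = [sum(w * tok[i] for w, tok in zip(weights, txt_tokens))
                for i in range(d)]
    # Step 4.4: residual connection followed by layer normalisation.
    updated = layer_norm([q + a for q, a in zip(img_query, attended)])
    return updated, attended, weights


# Toy example with d=4 and L=3 instead of d=256.
query = [1.0, 0.0, 0.0, 0.0]
tokens = [[4.0, 0.0, 0.0, 0.0],   # the token most aligned with the query
          [0.0, 4.0, 0.0, 0.0],
          [0.0, 0.0, 4.0, 0.0]]
updated, attended, weights = cross_attention(query, tokens)
print([round(w, 3) for w in weights])
```

In the toy example the first text token receives the largest attention weight because it is the only one with a non-zero dot product with the image query.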

[0040] Furthermore, step 5 specifically includes the following steps:

[0041] Step 5.1, Input Features: Use the enhanced global image feature vector and the global text feature vector as inputs to the gated fusion module. Both have dimension Batch×256.

[0042] Step 5.2, Gating Weight Generation: Concatenate the image features and text features along the feature dimension to obtain a fused input vector of dimension Batch×512, and input it into a gating network consisting of a first MLP and a Softmax function. The first MLP is a fully connected layer, a ReLU activation function, layer normalization, and a second fully connected layer connected in sequence. The gating network outputs two weight scores, which Softmax normalizes into gating weights for the image features and text features respectively; the two weights sum to 1.

[0043] Step 5.3, Gated Weighted Fusion: The global feature vector of the image and the global feature vector of the text are weighted element by element using gating weights, and the weighted results are added together to obtain the fused feature vector;

[0044] Step 5.4, Feature Reconstruction: Input the fused feature vector into the second MLP for feature reconstruction. The second MLP contains a fully connected layer, a GELU activation function, layer normalization, and dropout. The dimension of the reconstructed feature vector remains 256.

[0045] Step 5.5, Classifier Output: Input the reconstructed fused features into the classifier, which consists of two fully connected layers. After nonlinear transformation and normalization, the classifier outputs the probability of each category and selects the category with the highest probability as the final disease identification result.
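The gating and classification path of Steps 5.2 to 5.5 can be sketched for one sample as follows. This is an illustrative sketch only: the weights are random stand-ins for trained parameters, the sizes are toy values (the text uses d=256), the hidden width, the ReLU between the two classifier layers, and the omission of dropout are assumptions, and all function names are hypothetical.

```python
import math
import random

random.seed(0)
D, HIDDEN, NUM_CLASSES = 8, 16, 3  # toy sizes; the text uses d=256


def make_linear(d_in, d_out):
    """Random (weight, bias) pair standing in for a trained fully connected layer."""
    w = [[random.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    return w, [0.0] * d_out


def linear(x, layer):
    w, b = layer
    return [sum(xi * wi for xi, wi in zip(x, row)) + bi for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def gelu(x):
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    total = sum(e)
    return [v / total for v in e]


gate_fc1, gate_fc2 = make_linear(2 * D, HIDDEN), make_linear(HIDDEN, 2)
recon_fc = make_linear(D, D)
cls_fc1, cls_fc2 = make_linear(D, HIDDEN), make_linear(HIDDEN, NUM_CLASSES)


def gated_fusion_classify(img_vec, txt_vec):
    # Step 5.2: concatenate, run the gate MLP, softmax into two weights summing to 1.
    h = layer_norm(relu(linear(img_vec + txt_vec, gate_fc1)))
    w_img, w_txt = softmax(linear(h, gate_fc2))
    # Step 5.3: weighted sum of the two modality vectors.
    fused = [w_img * a + w_txt * b for a, b in zip(img_vec, txt_vec)]
    # Step 5.4: reconstruction MLP (dropout omitted, as at inference time).
    fused = layer_norm(gelu(linear(fused, recon_fc)))
    # Step 5.5: two-layer classifier, softmax, then arg-max.
    probs = softmax(linear(relu(linear(fused, cls_fc1)), cls_fc2))
    return (w_img, w_txt), probs, probs.index(max(probs))


img_vec = [random.random() for _ in range(D)]
txt_vec = [random.random() for _ in range(D)]
gate, probs, label = gated_fusion_classify(img_vec, txt_vec)
print(gate, label)
```

Because the two gate weights come from a softmax, a sample whose text description is uninformative can in principle be handled by shifting weight toward the image vector, which is the adaptive behaviour the step describes.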

[0046] The present invention also discloses a computer device, including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the method of the present invention.

[0047] The present invention also discloses a computer-readable storage medium having a computer program / instructions stored thereon, which, when executed by a processor, implements the steps of the method of the present invention.

[0048] The present invention also discloses a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the method of the present invention.

[0049] Beneficial effects: Compared with the prior art, the present invention has the following significant advantages:

[0050] 1) This invention generates and standardizes the structured semantic description of diseases, which consists of “overall distribution, local morphology and color texture”, through a visual-language model. This enables the model to use interpretable semantic information to participate in the discrimination, breaking the limitations of a single visual modality and thus significantly improving the recognition accuracy, especially in fine-grained similar disease recognition scenarios.

[0051] 2) Employing a cross-attention mechanism to guide and enhance visual features through textual semantics enables image features to actively focus on key lesion information consistent with the disease description, effectively suppressing background interference and irrelevant features, improving the discriminativeness of feature expression, and enhancing cross-domain generalization ability.

[0052] 3) The gating fusion mechanism is used to adaptively weight text features, visual enhancement features and cross-attention features. It can dynamically select more effective modal information and suppress noise interference according to the feature differences of different samples, thereby improving the robustness of the model in complex field environments such as light changes, complex backgrounds and lesion occlusion.

[0053] 4) The image side adopts the lightweight encoder ShuffleNetV2, the text generation process can be executed offline and reused, the overall model has fewer parameters and lower computational overhead, making it easier to deploy and apply on mobile or edge devices, adapting to actual field detection scenarios, and has strong engineering practicality.

[0054] 5) During the training phase, only the cross-attention module, gating fusion module, and classifier are trained, while the parameters of the pre-trained visual-language model and text encoder are frozen. This reduces training complexity and computational resource consumption, shortens the training cycle, and leverages the powerful representational capabilities of the pre-trained model to improve model performance and stability.

Attached Figure Description

[0055] Figure 1 is a flowchart of the intelligent identification technique for crop leaf diseases;

[0056] Figure 2 is a structural diagram of the cross-attention module;

[0057] Figure 3 is a structural diagram of the gating feature fusion module;

[0058] Figure 4 shows the comparative experimental results of introducing semantics step by step on the soybean leaf disease dataset;

[0059] Figure 5 compares the module ablation experiment results on the soybean leaf disease dataset.

Detailed Implementation

[0060] The technical solution of the present invention will be further described below with reference to the accompanying drawings.

[0061] As shown in Figure 1, the present invention adopts the following technical solution: a method for identifying crop leaf diseases based on a vision-language model, comprising the following steps:

[0062] Step 1) Obtain the original image of the crop leaf surface and preprocess the image;

[0063] Step 2) Input the preprocessed crop leaf images into the vision-language model to generate a structured text description of the disease that includes the overall distribution features of the disease, the morphological features of local lesions, and the color and texture features.

[0064] Step 3) Construct an image encoder (such as ShuffleNetV2) and a pre-trained text encoder (such as BLIP2) to extract image spatial features and text semantic features respectively;

[0065] Step 4) Construct a cross-attention module, using image features as the query and text features as the key and value to calculate attention weights. Use these weights to weight the text features to obtain fused features. Finally, combine the original image features with the fused features through residual connections to achieve cross-modal information fusion.

[0066] Step 5) Construct a gated fusion module, dynamically calculate the weights of image and text features according to the fusion type, achieve multimodal feature fusion through weighted summation, and finally optimize the feature representation through a post-processing network.

[0067] Furthermore, the specific content of step 1) is as follows:

[0068] Step 1 includes image acquisition, color space processing, size normalization, data augmentation, and pixel normalization. The aim is to unify original leaf images from different sources and with different resolutions to a fixed input size and numerical range, so that the encoder can extract features stably in subsequent steps.

[0069] (1) Image acquisition: Obtain original images of crop leaves through mobile phone camera, industrial camera or drone visible light camera. The image format can be JPG, PNG or BMP, and the original resolution is not limited.

[0070] (2) Convert the input image to RGB three channels uniformly; when the original image is BGR, perform channel rearrangement.

[0071] (3) Size normalization and cropping: Scale and crop the RGB image to the preset input size H×W=224×224 to obtain an image with uniform size.

[0072] (4) Pixel normalization: The image pixel values are linearly normalized from the range 0 to 255 to the range 0 to 1, and mean-variance normalization is performed; then the image is converted from H×W×C format to C×H×W format.
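The preprocessing chain of Step 1 can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the image is a nested list in H×W×C layout, resizing uses nearest-neighbour sampling for brevity, the shorter side is scaled to 224 before a centre crop, and the mean and standard deviation are the commonly used ImageNet statistics, which the text does not specify. All function names are hypothetical.

```python
# Sketch of Step 1 preprocessing on a nested-list image (H x W x C, values 0..255).
# MEAN/STD are the common ImageNet statistics, assumed here; the source gives no values.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def bgr_to_rgb(img):
    """Reverse the channel order of every pixel (BGR -> RGB)."""
    return [[px[::-1] for px in row] for row in img]


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize; a real pipeline would likely use bilinear."""
    in_h, in_w = len(img), len(img[0])
    return [[img[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def center_crop(img, size):
    """Crop a size x size window from the centre of the image."""
    top = (len(img) - size) // 2
    left = (len(img[0]) - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def preprocess(img, is_bgr=False, size=224):
    if is_bgr:
        img = bgr_to_rgb(img)
    # Scale the shorter side to `size`, keeping the aspect ratio, then centre-crop.
    h, w = len(img), len(img[0])
    scale = size / min(h, w)
    new_h, new_w = max(size, round(h * scale)), max(size, round(w * scale))
    img = center_crop(resize_nearest(img, new_h, new_w), size)
    # [0, 255] -> [0, 1], then mean-variance normalisation, emitted as C x H x W.
    return [[[(px[c] / 255.0 - MEAN[c]) / STD[c] for px in row] for row in img]
            for c in range(3)]


# A dummy 300 x 400 pure-blue image stored in BGR order.
dummy = [[[255, 0, 0] for _ in range(400)] for _ in range(300)]
chw = preprocess(dummy, is_bgr=True)
print(len(chw), len(chw[0]), len(chw[0][0]))
```

Stacking several such C×H×W results along a new leading axis gives the Batch×3×224×224 tensor that the later steps consume.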

[0073] Furthermore, the specific content of step 2) is as follows:

[0074] A pre-trained visual-language model is used, chosen from large multimodal models capable of understanding images and outputting Chinese descriptions, such as GLM-4.6V. Pre-processed crop leaf images are input into the model to generate structured textual descriptions of diseases, including overall disease distribution features, local lesion morphology features, and color and texture features. During training, the parameters of the visual-language model and the pre-trained text encoder are frozen and not updated. Only the cross-attention module, the gating fusion module, and the classifier are trained. The implementation is as follows:

[0075] (1) Visual-Language Model Input: The leaf image tensor X obtained in step 1 is used as the visual input, where the dimension of X is Batch×3×224×224 (Batch is the number of images input to the model each time during model training, 3 is the number of RGB channels, and 224×224 is the image size). The visual-language model receives this batch of images and combines them with prompt words for inference, outputting Chinese text descriptions corresponding to each image. At this time, the model output is a string sequence (per sample), the length of which is variable and not represented by a fixed dimension. The visual-language model can be any multimodal model capable of processing images and text.

[0076] (2) Prompt word template: To ensure stable output and facilitate subsequent processing, the prompt words should at least constrain the model output to be in Chinese and divide the output into three segments or three fields, corresponding to "overall distribution description", "local lesion description" and "color and texture description" respectively. The "local lesion description" should include lesion morphology (such as dot-shaped, ring-shaped, or irregular) and approximate size information, with the size expressed in "mm". It is also recommended that the length of each segment be kept moderate to avoid redundancy. When the content of a field cannot be determined, the model outputs "no obvious features found" so that every field is always present.

[0077] (3) Structured output format: The visual-language model output can be spliced with a fixed template or output in JSON format. Finally, the "overall distribution description, local lesion description, color and texture description" are combined in a predetermined order to obtain a structured disease text description.
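The prompt constraints and JSON post-processing of Step 2 could look roughly like the sketch below. The field names, the prompt wording, and the "; " separator are illustrative assumptions, not the patent's text, and no real model is called: the reply is a hand-written stand-in, shown in English for readability even though the method constrains the model output to Chinese.

```python
import json

# Field names and prompt wording are illustrative assumptions, not the patent's text.
FIELDS = ("overall_distribution", "local_lesion", "color_texture")
PLACEHOLDER = "no obvious features found"

PROMPT = (
    "Describe the leaf disease in Chinese as a JSON object with exactly these keys: "
    + ", ".join(FIELDS)
    + ". In local_lesion, give the lesion shape and approximate size in mm. "
    + f"If a field cannot be determined, write '{PLACEHOLDER}'."
)


def build_description(raw_output):
    """Parse the model's JSON reply and join the three fields in a fixed order."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    parts = []
    for name in FIELDS:
        value = str(data.get(name) or "").strip()
        # Missing or empty fields fall back to the placeholder so all three remain.
        parts.append(value or PLACEHOLDER)
    return "; ".join(parts)


# Hypothetical reply from a vision-language model with one field missing.
reply = json.dumps({
    "overall_distribution": "lesions scattered over the whole leaf",
    "local_lesion": "round spots, about 2-4 mm across",
})
print(build_description(reply))
```

Filling the placeholder on the consumer side as well as in the prompt keeps the three-field structure intact even when the model ignores the instruction or returns malformed JSON.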

[0078] Furthermore, the specific content of step 3) is as follows:

[0079] Step 3 is used to extract features from crop leaf images and structured disease text respectively, providing a basic representation for subsequent cross-modal interaction and fusion. Its core lies in extracting image spatial features through an image encoder and extracting text semantic features through a text encoder. The specific implementation is as follows:

[0080] (1) Image Feature Extraction: An image encoder is constructed using ShuffleNetV2 as the backbone network, and the standardized leaf image obtained in step 1 is input into it. The input image tensor X has dimensions Batch×C×H×W, where Batch is the number of images input to the model at a time during training, C=3 is the number of RGB channels, and H×W is the image height and width, i.e., 224×224 pixels. The classification output layer of the lightweight ShuffleNetV2 network is removed so that it serves only as a feature extractor, outputting one global image feature vector per image. Each vector is a one-dimensional feature representation of its sample, giving an overall dimension of Batch×d_v, where d_v is the feature dimension obtained by removing the classification layer of ShuffleNetV2.

[0081] (2) Image Feature Serialization and Dimension Alignment: To facilitate cross-modal computation with the text features, the global image feature vector is projected so that its channel dimension changes from d_v to the preset unified dimension d=256, resulting in an aligned image feature vector. The aligned image feature is still a per-sample global vector, with dimension Batch×d. Because the subsequent cross-attention module expects its query input as a sequence, the global image vector is expanded along the sequence dimension into an image sequence of length N before entering the module, giving an image sequence feature of dimension Batch×N×d. The global image feature already contains the comprehensive discriminative information of the entire image, which is sufficient for it to act as the query that focuses on the relevant semantics in the text, and keeping a single image token limits computational complexity and improves inference efficiency. Therefore N=1 is used, giving an image sequence feature of dimension Batch×1×d.

[0082] (3) Text Semantic Feature Acquisition: The text sequence of the structured disease description obtained in step 2 is input into the pre-trained text encoder (such as BLIP2) for semantic encoding. Each word feature output by the text encoder has a fixed dimension of 768. Each text therefore corresponds to L words, each represented by a 768-dimensional semantic vector, and the text feature sequence has an overall dimension of Batch×L×768, where L is the length of the text sequence.

[0083] (4) Text Feature Dimension Alignment: To keep the text features consistent with the image features, a channel dimension mapping is applied to the text feature sequence, transforming the feature of each word from 768 dimensions to 256 dimensions. The aligned text feature sequence has an overall dimension of Batch×L×d, where d=256. To obtain the global text vector used in the subsequent fusion, the aligned text feature sequence is then aggregated along the word dimension (e.g., by averaging the L word features), giving a global text vector per sample with dimension Batch×d.
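The word-dimension aggregation mentioned above can be sketched as plain mean pooling. This is a hedged example: the function names are hypothetical, the toy vectors use d=2 instead of 256, and padding tokens are not masked because the source does not say whether the average should exclude them.

```python
def mean_pool(token_feats):
    """Average L token vectors (L x d) into a single d-dimensional vector."""
    length = len(token_feats)
    dim = len(token_feats[0])
    return [sum(tok[j] for tok in token_feats) / length for j in range(dim)]

def global_text_vectors(batch_tokens):
    """Reduce Batch x L x d aligned text features to Batch x d global vectors."""
    return [mean_pool(tokens) for tokens in batch_tokens]

# Two samples, L = 3 words each, d = 2 (toy values).
batch_tokens = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]],
]
pooled = global_text_vectors(batch_tokens)
print(pooled)  # [[3.0, 4.0], [2.0, 1.0]]
```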

[0084] (5) Output: Step 3 outputs the aligned text sequence features, with dimension Batch×L×d, and the image sequence features, with dimension Batch×1×d. These are used as inputs to the cross-attention module in step 4 and the gated fusion module in step 5.

[0085] Furthermore, the specific content of step 4) is as follows:

[0086] Step 4 is used to construct the cross-attention module. Attention weights are calculated using "image features as the query, text features as the key and value," and these weights are then used to selectively aggregate text semantic information to obtain fused features aligned with the image features. Finally, the original image features and the fused features are combined through residual connections to achieve cross-modal information fusion. The specific implementation is as follows:

[0087] (1) Input Feature Preparation: The text word feature sequence and the image feature sequence output from step 3 are used as inputs to the cross-attention module. The image feature sequence has a dimension of Batch×1×d and represents the discriminative information of the leaf image. The text feature sequence has a dimension of Batch×L×d and represents semantic information such as the overall distribution of the disease, the morphology of local lesions, and color texture. Each word is a semantic representation in the same dimensional space as the image features.

[0088] (2) Attention Weight Calculation: The relevance between the image query and each text key is used as the basis for the weights, resulting in a score matrix that reflects the "attention intensity of image features on different text words". This score matrix is then normalized with the Softmax function along the word dimension corresponding to the keys, so that the weight coefficients for each query sum to 1, which gives the attention weight matrix. The Softmax normalization converts the relevance scores into interpretable probabilistic weights for the subsequent weighted aggregation.

[0089] (3) Cross-modal Weighted Aggregation: The attention weights obtained in (2) are used to weight and aggregate the text value information, giving a fused representation related to the image query. For each image query, the text value information is summed with the corresponding weight coefficients, which amounts to "reading the information most relevant to the image query from the text features". The image features can thus actively focus on and extract the semantic information in the text that relates to the current image. This weighted aggregation can be implemented as a matrix multiplication of the weights and the values, i.e., a batched weighted summation of the text value information.
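The attention weighting and aggregation of paragraphs [0088] and [0089] can be illustrated for a single sample with one image query (N=1). This is a sketch under stated assumptions: relevance is computed as a scaled dot product, and the learned query/key/value projections are omitted, neither of which the source specifies. The names `softmax` and `cross_attend` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    """One image query (d) attends over L text tokens (L x d).

    Returns the attention weights over the L words and the weighted
    sum of the value vectors (the text information read by the image).
    """
    scale = math.sqrt(len(query))  # assumed scaled dot-product relevance
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)      # normalized along the word dimension
    dim = len(values[0])
    context = [sum(w * val[j] for w, val in zip(weights, values)) for j in range(dim)]
    return weights, context

# Toy example: d = 2, L = 3 text tokens used as both keys and values.
query = [1.0, 0.0]
text_tokens = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
weights, context = cross_attend(query, text_tokens, text_tokens)
print([round(w, 3) for w in weights])
```

The first token points in the same direction as the query and therefore receives the largest weight, while the third token, pointing the opposite way, receives the smallest.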

[0090] (4) Residual Update and Normalization: The fused features obtained in (3) are added element-wise to the original image features at the corresponding positions, forming a residual connection. After the residual addition, layer normalization is applied to stabilize training and suppress numerical fluctuations caused by scale differences between modalities. Dropout can also be applied before or after the residual addition or the normalization to reduce overfitting. The updated image feature dimension is still Batch×1×d.
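A minimal sketch of the residual addition followed by layer normalization, for one d-dimensional token. The learnable scale and shift of layer normalization and the optional dropout are left out for brevity. These omissions, the function names, and the toy values are assumptions for illustration only.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def residual_update(image_feat, fused_feat):
    """Add the attended text information to the image feature, then normalize."""
    summed = [a + b for a, b in zip(image_feat, fused_feat)]
    return layer_norm(summed)

updated = residual_update([1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.5, -0.5])
print([round(v, 3) for v in updated])
```

The output keeps the input dimension, which matches the statement that the updated image feature is still Batch×1×d.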

[0091] (5) Output Processing: The subsequent gated fusion module needs to process the global feature vector for weighted summation, but does not require sequence information. Therefore, the three-dimensional sequence tensor is compressed into a two-dimensional global vector to facilitate gated weighted fusion with the text global features. The three-dimensional sequence image features are compressed in the sequence dimension, restoring the global feature vector of dimension Batch×d from the sequence tensor of dimension Batch×1×d. The compressed image fusion features are output, along with the text weighted aggregation representation formed in the cross-attention module, as input to the subsequent gated fusion module.

[0092] Furthermore, the specific content of step 5 is as follows:

[0093] Step 5 constructs a gated fusion module that adaptively weights and fuses the text semantic features and the enhanced image features, and inputs the fusion result into a classifier to output the crop leaf disease type. The logic is as follows: global features from the two modalities are fed to a gating network, which dynamically calculates the importance weight of each feature from the input features; the two features are weighted and summed to obtain the fused feature; finally, the disease category prediction is obtained through a post-processing network and a classifier. The gated fusion module includes a first-layer MLP, a Softmax layer, and a second-layer MLP. Here the first-layer MLP (multilayer perceptron) consists of a fully connected layer, an activation function, layer normalization, and a fully connected layer connected in series.

[0094] (1) Input Features: The enhanced global image feature vector output in step 4 and the global text feature vector obtained in step 3 are used as inputs to the gated fusion module. Both are one-dimensional vectors per sample, with a dimension of Batch×d, so they are consistent in numerical scale and dimension, which facilitates the subsequent fusion calculations.

[0095] (2) Gated Weight Generation (First-Layer MLP + Softmax): The image features and text features are concatenated along the feature dimension to form a fused input vector of dimension Batch×2d. This vector is fed into a gating network consisting, in order, of a fully connected mapping, a ReLU activation function, layer normalization, and another fully connected mapping that outputs two weight scores. The two scores are normalized with the Softmax function so that they sum to 1, giving the gating weights for the image features and the text features respectively. These weights represent the relative importance of the two types of information for the current sample.

[0096] (3) Gated Weighted Fusion: Based on the gating weights, the two global feature vectors are weighted element-wise: the image gating weight is multiplied by the image feature, and the text gating weight is multiplied by the text feature. The two weighted results are then summed to obtain the final fused feature. This is equivalent to a weighted summation of features from different sources in a unified feature space, so that the feature carrying more information contributes a larger share of the fused result.
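The gating steps above (concatenate, fully connected, ReLU, layer normalization, fully connected, Softmax, weighted sum) can be sketched for a single sample as follows. This is an illustrative pure-Python sketch rather than the patented implementation: the hidden width, the random weights, the dictionary layout of `params`, and all helper names are assumptions, and toy d=4 replaces d=256.

```python
import math
import random

def linear(x, weight, bias):
    """Fully connected layer; weight has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(img, txt, params):
    """Compute two gating weights from [img; txt] and blend the two vectors."""
    joint = img + txt  # list concatenation along the feature dimension (length 2d)
    hidden = layer_norm(relu(linear(joint, params["w1"], params["b1"])))
    gate_img, gate_txt = softmax(linear(hidden, params["w2"], params["b2"]))
    fused = [gate_img * a + gate_txt * b for a, b in zip(img, txt)]
    return (gate_img, gate_txt), fused

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, hidden_dim = 4, 8  # hidden_dim is an assumed width
params = {
    "w1": random_matrix(rng, hidden_dim, 2 * d), "b1": [0.0] * hidden_dim,
    "w2": random_matrix(rng, 2, hidden_dim), "b2": [0.0, 0.0],
}
img_vec = [0.2, -0.1, 0.4, 0.9]
txt_vec = [0.5, 0.3, -0.2, 0.1]
gates, fused_vec = gated_fusion(img_vec, txt_vec, params)
print([round(g, 3) for g in gates])
```

Because the two gates are non-negative and sum to 1, every element of the fused vector lies between the corresponding image and text values. The gate decides which modality dominates for this sample.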

[0097] (4) Feature Reconstruction (Second-Layer MLP): The fused feature vectors are input into the feature reconstruction network (the post-processing network) for further transformation, which improves the expressive power and robustness of the fused features. This network consists of a fully connected layer, a GELU activation function, layer normalization, and a dropout operation, which reduces overfitting and enhances generalization. The dimension of the post-processed feature vectors remains d.

[0098] (5) Classifier output: The post-processed fusion features are input into the classifier for category prediction; the classifier consists of two fully connected layers. First, the fusion features are nonlinearly transformed by the first fully connected layer and the ReLU activation function, and then the prediction score corresponding to the number of disease categories is output by the second fully connected layer; the prediction score is normalized to obtain the probability of each category, and the category with the highest probability is selected as the final result of crop leaf disease identification.
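The final normalization and selection step can be sketched as a Softmax over the prediction scores followed by an argmax. The scores below are made-up numbers, the three class names are only a subset used for illustration, and `predict` is a hypothetical helper. The two fully connected layers that would produce the scores are omitted.

```python
import math

def softmax(scores):
    """Convert raw prediction scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores, class_names):
    """Return the most probable class name together with all class probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs

# Hypothetical scores for three categories.
label, probs = predict([0.2, 2.5, -1.0], ["brown spot", "downy mildew", "healthy"])
print(label)  # downy mildew
```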

[0099] Example:

[0100] In this embodiment, the publicly available soybean leaf disease dataset (Soybean Disease dataset) is selected to verify the method of the present invention. This dataset consists of field-captured images of soybean leaves and covers eight categories: bacterial leaf blight, brown spot, downy mildew, frog-eye leaf spot, target spot, soybean rust, potassium-deficient leaves, and healthy leaves, totaling 9648 leaf images. To ensure consistent evaluation, this embodiment splits the images into a training set (70%) and a test set (30%) by stratified random partitioning, so that the proportion of samples in each category is the same in both sets.
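A stratified 70/30 split of this kind can be sketched with the standard library alone. The function name, the fixed seed, the rounding rule for the per-class cut, and the toy label list are assumptions, since the source does not describe how the partition was implemented.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices so each class keeps roughly the same train/test ratio."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for lab in sorted(by_class):          # sorted for reproducible ordering
        idxs = by_class[lab]
        rng.shuffle(idxs)                 # random partition within the class
        cut = round(len(idxs) * train_fraction)
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return train, test

# Toy label list: 10 samples of one class, 20 of another.
labels = ["rust"] * 10 + ["healthy"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 21 9
```

Splitting inside each class, rather than over the whole list, is what keeps rare categories represented in both sets.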

[0101] During the training phase, data augmentation was performed on leaf images, including random cropping and random horizontal flipping, to simulate scale and shooting direction changes that may occur during field collection. During the testing phase, the images were uniformly scaled to 256×256 and then cropped from the center to obtain a standard input of 224×224 to ensure consistent evaluation conditions.
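The test-time center crop from 256×256 to 224×224 comes down to simple index arithmetic, sketched below on a nested-list "image". The function names are hypothetical, and the resizing step is assumed to have been done already by an image library.

```python
def center_crop_box(height, width, crop=224):
    """Return (top, left, bottom, right) of a centered crop x crop window."""
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop

def center_crop(image, crop=224):
    """Crop a row-major H x W image (list of rows) around its center."""
    top, left, bottom, right = center_crop_box(len(image), len(image[0]), crop)
    return [row[left:right] for row in image[top:bottom]]

# Dummy 256 x 256 image whose pixels store their own (row, col) position.
image = [[(r, c) for c in range(256)] for r in range(256)]
patch = center_crop(image)
box = center_crop_box(256, 256)
print(box)  # (16, 16, 240, 240)
```

For a 256×256 input the crop discards a 16-pixel border on every side, so every test image is evaluated on the same central region.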

[0102] To verify the effectiveness of the technical route of "structured semantic description generation, cross-modal attention enhancement and gating fusion" of the present invention, this embodiment designs a comparative experiment that introduces semantic information step by step: First, a single-modal benchmark model that only uses leaf images for recognition is constructed; on this basis, (1) a structured text description that only contains the overall distribution information of the disease is introduced, (2) local lesion morphology information is added in addition to the overall distribution information, and (3) three types of structured semantic information, namely overall distribution, local lesion morphology and color texture, are introduced at the same time, thereby forming a step-by-step comparison setting from weak to strong.

[0103] Experimental results show that on the soybean leaf disease dataset, the single-modal baseline model achieved an accuracy of 88.12%. After introducing the semantic description of "overall distribution," the accuracy improved to 95.53%. Further adding the semantic description of "local lesion morphology" increased the accuracy to 97.65%. When all three types of structured semantic descriptions (overall distribution, local lesion morphology, and color / texture) were introduced simultaneously, the model achieved a final accuracy of 99.04%. These results show that gradually introducing structured semantic information yields a steady improvement over the single-modal baseline model, for a total gain of 10.92 percentage points (from 88.12% to 99.04%).

[0104] To further verify the contribution of the cross-attention module and the gating fusion module to recognition performance, this embodiment conducts a module ablation experiment on the soybean leaf disease dataset with four model configurations: (1) using only the baseline model; (2) adding only the cross-attention module to the baseline model; (3) adding only the gating fusion module to the baseline model; (4) adding both the cross-attention module and the gating fusion module (i.e., the complete scheme of this invention).

[0105] Experimental results show that the baseline model achieves a recognition accuracy of 93.89%; adding only the cross-attention module improves the accuracy to 96.43%; adding only the gated fusion module improves the accuracy to 97.35%; and when both the cross-attention and gated fusion modules are used simultaneously, the model achieves a recognition accuracy of 99.04%. These ablation experiment results demonstrate that the cross-attention module effectively achieves cross-modal semantic alignment and enhances image discriminative features, while the gated fusion module adaptively adjusts the contribution of different modal information based on sample differences. The combined use of both can further suppress noise interference and achieve better recognition performance, thus verifying the effectiveness and necessity of the technical solution of this invention.

[0106] The comparative experiments with gradually introduced semantic information and the module ablation experiments show that this invention does not rely solely on color and texture differences in images for discrimination. Instead, it explicitly provides interpretable semantic cues such as the distribution characteristics of diseases, lesion morphology, and color texture through structured text descriptions, so that the model's performance keeps improving as more semantic information is incorporated. At the same time, the cross-attention module and the gated fusion module enhance the discriminability and robustness of the feature representation at two levels, "cross-modal alignment enhancement" and "adaptive weight selection" respectively, and show complementary gains when used together. In particular, the accuracy continues to improve after color and texture descriptions are added, indicating that the structured representation of fine-grained color gradients and texture information helps distinguish between disease types with similar appearances, which demonstrates the advantage of this invention in fine-grained leaf disease identification tasks.

[0107] The comparative experiments described above demonstrate that the method of this invention achieves an accuracy rate of over 99% on the soybean leaf disease dataset, and provides a stable improvement compared to the single-modal benchmark model that relies solely on image information. This indicates that the technical approach of "structured semantic description generation - cross-modal enhancement - gating fusion" proposed in this invention can effectively improve the accuracy, stability, and interpretability of crop leaf disease identification.

[0108] The above embodiments are merely preferred embodiments of the present invention. It should be noted that those skilled in the art can make several improvements and equivalent substitutions without departing from the principle of the present invention. All such improvements and equivalent substitutions to the claims of the present invention fall within the protection scope of the present invention.

Claims

1. A method for identifying crop leaf diseases based on a vision-language model, characterized in that it includes the following steps: Step 1: Obtain the original image of the crop leaf surface and preprocess the image; Step 2: Input the preprocessed crop leaf images into the vision-language model to generate a structured text description of the disease that includes the overall distribution features of the disease, the morphological features of local lesions, and the color and texture features; Step 3: Construct an image encoder and a pre-trained text encoder to extract spatial features of the image and semantic features of the text, respectively; Step 4: Construct a cross-attention module, using image features as the query and text features as the key and value to calculate attention weights; use these weights to weight the text features to obtain fused features; finally, combine the original image features with the fused features through residual connections to achieve cross-modal information fusion; Step 5: Construct a gated fusion module, dynamically calculate the weights of image and text features according to the fusion type, achieve multimodal feature fusion through weighted summation, and finally optimize the feature representation through a post-processing network.

2. The method for identifying crop leaf diseases based on a vision-language model according to claim 1, characterized in that Step 1 specifically includes the following steps: Step 1.1, Image Acquisition: Acquire raw images of crop leaves using a mobile phone camera, an industrial camera, or a visible light camera on a drone, the image format being JPG, PNG, or BMP; Step 1.2, Channel Conversion: Convert the input image uniformly to three RGB channels; if the original image is BGR, perform channel rearrangement; Step 1.3, Size Normalization and Cropping: Scale and crop the RGB image to the preset input size H×W=224×224 to obtain an image of uniform size; Step 1.4, Pixel Normalization: Linearly normalize the image pixel values from the range 0 to 255 to the range 0 to 1, perform mean-variance normalization, and then convert the image from H×W×C format to C×H×W format.

3. The method for identifying crop leaf diseases based on a vision-language model according to claim 1, characterized in that Step 2 specifically includes the following steps: Step 2.1, Vision-Language Model Input: The leaf image tensor X is used as the visual input, where X has dimension Batch×3×224×224, Batch is the number of images input to the model each time during model training, 3 is the number of RGB channels, and 224×224 is the image size; the vision-language model receives this batch of images and performs inference in combination with the prompt, outputting a Chinese text description for each image; Step 2.2, Prompt Template: The prompt constrains the model output to Chinese and divides the output into three segments or three fields, corresponding to the overall distribution description, the local lesion description, and the color and texture description, respectively; the local lesion description must include lesion morphology and approximate size information; when a certain item cannot be determined, "no obvious features" is output for it to ensure the integrity of the fields; Step 2.3, Structured Output Format: The vision-language model output adopts fixed template splicing or JSON format output; finally, the overall distribution description, the local lesion description, and the color and texture description are combined in a predetermined order to obtain a structured disease text description.

4. The method for identifying crop leaf diseases based on a vision-language model according to claim 1, characterized in that Step 3 specifically includes the following steps: Step 3.1, Image Feature Extraction: Construct an image encoder using ShuffleNetV2 as the backbone network, and input the standard input leaf image into the image encoder. The input image tensor X has dimension Batch×C×H×W, where Batch is the number of images input to the model each time during model training, C = 3 is the number of RGB channels, and H×W is the image height and width, i.e., 224×224 pixels. Use the lightweight ShuffleNetV2 network with its classification output layer removed, so that it is used only as a feature extractor, and output a global image feature vector for each image. The vector is a one-dimensional feature representation per sample, with a dimension of Batch×d_v, where d_v is the feature dimension obtained after removing the classification layer from ShuffleNetV2; Step 3.2, Image Feature Serialization and Dimension Alignment: Perform a projection mapping on the global image feature vector, transforming its channel dimension from d_v to a preset unified dimension d=256, resulting in an aligned image feature vector. The aligned image feature is still a global vector per sample, with a dimension of Batch×d. This global image vector is then expanded in the sequence dimension to form an image sequence of length N, resulting in an image sequence feature of dimension Batch×N×d; Step 3.3, Text Semantic Feature Acquisition: Input the text feature sequence into the pre-trained text encoder for semantic encoding. The dimension of each word feature output by the text encoder is fixed at 768, and the overall dimension is Batch×L×768, where L is the length of the text sequence; Step 3.4, Text Feature Dimension Alignment: Apply a channel dimension mapping to the text feature sequence, transforming the feature of each word from 768 dimensions to 256 dimensions, to obtain the aligned text feature sequence with an overall dimension of Batch×L×d, where d is 256. The aligned text feature sequence is aggregated along the word dimension to obtain the global text vector per sample, which has a dimension of Batch×d; Step 3.5, Output: Output the aligned text sequence features, with dimension Batch×L×d, and the image features, with dimension Batch×1×d, as inputs to the cross-attention module and the gating fusion module.

5. The method for identifying crop leaf diseases based on a vision-language model according to claim 1, characterized in that Step 4 specifically includes the following steps: Step 4.1, Input Feature Preparation: The text word feature sequence and the image feature sequence output from Step 3 are used as inputs to the cross-attention module, where the image feature sequence has a dimension of Batch×1×256 and the text feature sequence has a dimension of Batch×L×256; Step 4.2, Attention Weight Calculation: Using image features as the query and text features as the key and value, calculate the correlation between the two to obtain the weight score matrix, and normalize the score matrix along the word dimension using the Softmax function to obtain the attention weight matrix; Step 4.3, Cross-modal Weighted Aggregation: Use the attention weight matrix to perform weighted aggregation on the text value features to obtain fused features that can be aligned with the image features; Step 4.4, Residual Update and Normalization: The fused features and the original image features are added element by element to form a residual connection. After the residual addition, layer normalization is applied, and dropout is added to reduce overfitting. The updated image feature dimension is still Batch×1×256; Step 4.5, Output Processing: Compress the three-dimensional sequence image features into a two-dimensional global vector with a dimension of Batch×256, and output the image fusion features and the weighted aggregated text representation as input to Step 5.

6. The method for identifying crop leaf diseases based on a vision-language model according to claim 1, characterized in that Step 5 specifically includes the following steps: Step 5.1, Input Features: Use the enhanced global image feature vector and the global text feature vector as inputs to the gated fusion module, both having a dimension of Batch×256; Step 5.2, Gating Weight Generation: Concatenate the image features and text features along the feature dimension to obtain a fused input vector with dimension Batch×512; input the fused input vector into a gating network, which includes a first-layer MLP and a Softmax function. The first-layer MLP consists of a fully connected layer, a ReLU activation function, layer normalization, and a fully connected layer connected in series. The gating network outputs two weight scores, which are normalized by Softmax to obtain the gating weights corresponding to the image features and text features respectively, the sum of the two being 1; Step 5.3, Gated Weighted Fusion: The global image feature vector and the global text feature vector are weighted element by element using the gating weights, and the weighted results are added together to obtain the fused feature vector; Step 5.4, Feature Reconstruction: Input the fused feature vector into the second-layer MLP for feature reconstruction. The second-layer MLP contains a fully connected layer, a GELU activation function, layer normalization, and a dropout operation. The dimension of the reconstructed feature vector remains 256; Step 5.5, Classifier Output: Input the reconstructed fused features into the classifier, which consists of two fully connected layers. After nonlinear transformation and normalization, the classifier outputs the probability of each category and selects the category with the highest probability as the final disease identification result.

7. A computer device comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the steps of the method of claim 1.

8. A computer-readable storage medium having a computer program / instructions stored thereon, characterized in that, when the computer program / instructions are executed by a processor, they implement the steps of the method of claim 1.

9. A computer program product comprising a computer program / instructions, characterized in that, when the computer program / instructions are executed by a processor, they implement the steps of the method of claim 1.