Dense small target pest counting method based on two-stage vision-language model
By combining visual and linguistic modalities through a two-stage vision-language model and employing phased training and a multi-task loss function, the problems of insufficient counting accuracy and semantic understanding in counting dense small-target pests are solved, achieving high counting accuracy and generalization ability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HEFEI INSTITUTE OF PHYSICAL SCIENCE CHINESE ACADEMY OF SCIENCES
- Filing Date
- 2026-03-13
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies struggle to effectively combine visual and linguistic modalities to accurately identify and count densely packed small-target pests, especially in high-density scenarios where counting accuracy and semantic understanding are insufficient, and training strategies struggle to balance localization accuracy and semantic understanding.
We adopt a two-stage vision-language model-based approach, extracting features through an image encoder and a text encoder, combining a feature fusion module and a density map generator, using a large language model for semantic supervision, and employing a staged training strategy and a multi-task loss function to optimize the model.
It improves the counting accuracy and generalization ability of dense, small-target pests, the model training is stable and efficient, there is no additional computational overhead in the inference stage, and it is easy to apply in practice.
Figure CN122223752A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, specifically to a method for counting dense small-target pests based on a two-stage vision-language model, as well as a computer terminal and computer-readable storage medium for applying this method.
Background Technology
[0002] Accurate detection and counting of dense small-target pests is a key technical challenge in intelligent monitoring of agricultural pests and diseases. These pests typically include, but are not limited to, aphids, grain weevils, rice weevils, and booklice. They share common characteristics: small size, dense distribution, similar morphology, and easy confusion with the background. In scenarios including stored grain, greenhouse crops, and field agriculture, the rapid reproduction and heavy aggregation of these pests make accurate identification and counting difficult. Currently, computer vision-based methods for detecting dense small-target pests face the following technical bottlenecks. First, at the feature extraction level, the representation of densely packed small pests is severely insufficient. When a pest occupies less than 32×32 pixels, much of the detailed texture information carried by the shallow feature maps is lost as the depth of the convolutional neural network increases, while the spatial resolution of the deep feature maps is too low to preserve the morphological features of the pests. Although existing technologies attempt to enhance small-target features through methods such as feature pyramid networks and attention mechanisms, feature confusion remains prominent in densely distributed scenes. In particular, when the distance between pests is smaller than the size of an individual, the boundary features of adjacent pests interfere with each other, leading to a significant decrease in counting accuracy.
[0003] Secondly, at the model architecture design level, existing methods are mostly limited to a single visual modality and fail to exploit prior semantic knowledge such as the biological habits and morphological features of pests. Traditional detection frameworks, such as the single-stage detectors YOLO and SSD and the two-stage detector Faster R-CNN, rely mainly on visual appearance features for identification and lack an understanding of the species-specific semantics of pests. Different pest species with similar morphologies are difficult to distinguish from visual features alone, whereas introducing semantic information from textual descriptions can significantly improve the model's discriminative ability. Although Transformer-based detection models capture global dependencies through self-attention mechanisms, their computational complexity increases quadratically with the number of pests when dealing with high-density pest populations, which limits their practicality.
[0004] Furthermore, at the level of training strategy optimization, existing methods generally adopt a single-stage end-to-end training approach, which makes it difficult to balance localization accuracy and semantic understanding in dense small-target pest counting tasks. Single-stage training often leaves the model stuck in local optima in dense scenes and fails to fully exploit the knowledge transfer advantages of large-scale pre-trained language models. In real-world applications in particular, the data distribution across pest species is long-tailed, with few samples for rare species, so traditional training methods are prone to overfitting to the head categories. In addition, pest counting is essentially a discrete numerical regression problem, and traditional detection loss functions struggle to handle the extreme imbalance between positive and negative samples in high-density scenes.
[0005] In recent years, vision-language models (such as CLIP) have demonstrated powerful cross-modal understanding capabilities through contrastive learning pre-training on large-scale image-text pairs. However, directly applying existing general multimodal models to the task of counting dense small-target pests faces significant challenges: pre-training data lacks specialized annotations for tiny pests, so the model's ability to capture fine-grained pest features is insufficient; general image-text alignment strategies are difficult to adapt to the dense distributions unique to pest counting; and a single contrastive learning objective cannot simultaneously optimize pest localization accuracy and population counting accuracy. Therefore, developing a dedicated counting method that effectively combines the advantages of visual and linguistic modalities and is optimized for the characteristics of dense small-target pests has become a key technological requirement for improving the automation level of agricultural pest and disease monitoring.
Summary of the Invention
[0006] To address the technical problems existing in the prior art, this invention provides a method for counting dense small-target pests based on a two-stage vision-language model. Through an innovative two-stage training strategy and a multimodal fusion architecture, it effectively combines the advantages of visual and linguistic modalities, thereby improving the counting accuracy and generalization ability of dense small-target pests.
[0007] To achieve the above objectives, the present invention provides the following technical solution: a method for counting dense small-target pests based on a two-stage vision-language model, comprising the following steps:
S1. Collect pest images and perform block-level annotation to construct a dataset of dense small-target pest images.
S2. Construct a dense small-target pest counting model, which includes an image encoder, a text encoder, a feature fusion module, a density map generator, a projector, and a large language model. The image encoder receives images of dense small-target pests and extracts high-resolution visual features that retain spatial details. The text encoder extracts semantic features from input text prompts. The feature fusion module calculates the similarity between visual features and text semantic features and fuses them to output multimodal features. The density map generator maps the multimodal features to a pest density map and counting results. The projector establishes a mapping relationship between visual features and the language semantic space. The large language model generates semantic descriptions based on the mapped visual features, and this semantic supervision assists in model training.
S3. Train the dense small-target pest counting model with a two-stage strategy. The first stage performs vision-language modality pre-alignment training, and the second stage performs multi-task joint end-to-end training.
S4. Acquire the image to be detected and preprocess it, input it into the trained dense small-target pest counting model, and output the pest density map and counting results through the density map generator.
[0008] As a further improvement to the above scheme, in step S2, the image encoder is built based on the ResNet-50 architecture and consists of four convolutional stages. In the improved version, the global average pooling layer and fully connected projection layer at the end of the original network are removed to preserve high-resolution spatial feature information. The specific processing procedure of the image encoder is as follows: an input image of size 224×224×3 first passes through a convolutional layer with a kernel size of 7×7 and a stride of 2, then through a max pooling layer with a kernel size of 3×3 and a stride of 2, and subsequently through the four convolutional stages, finally yielding visual features with a spatial size of 1/32 of the original image and 2048 channels.
[0009] As a further improvement to the above scheme, in step S2, the text encoder adopts a pre-trained CLIP text encoder, and all parameters of the text encoder are kept frozen throughout the training process; the large language model adopts the LLaVA-OneVision-0.5B model, which generates image-level and region-level semantic descriptions during the training phase and is discarded during the inference phase to improve computational efficiency.
[0010] As a further improvement to the above scheme, the projector adopts a two-layer multilayer perceptron structure; the first layer maps the 2048-dimensional visual features to 1024-dimensional features and applies the GELU activation function; the second layer further maps the 1024-dimensional features to 512-dimensional features, consistent with the input dimension of the LLaVA-OneVision-0.5B model.
[0011] As a further improvement to the above scheme, the specific operation process of the feature fusion module is as follows: the 2048-dimensional visual features output by the image encoder are reduced to 512-dimensional through 1×1 convolution, while the text semantic features output by the text encoder are obtained, and then the softmax method is used for feature fusion.
[0012] As a further improvement to the above scheme, the density map generator adopts an average interval counting strategy, using the average count of each interval as its representative point. The calculation formula is as follows:
$$a_i = \frac{1}{|X_i|} \sum_{k=1}^{N} \mathbb{1}(c_k \in X_i)\, c_k, \quad i = 1, \dots, n$$
[0013] In the formula, $a_i$ denotes the representative counting point of the $i$-th interval $X_i$, and $n$ is the number of predefined intervals; $|X_i|$ denotes the cardinality of interval $X_i$ (the number of blocks whose count value falls in $X_i$); $N$ denotes the total number of blocks in the dataset; $\mathbb{1}(\cdot)$ is the indicator function; $c_k$ denotes the count value in the $k$-th block. For any $p \neq q$, the intersection of intervals $X_p$ and $X_q$ is empty, and $\bigcup_{i=1}^{n} X_i = S$, where $S$ is the support set of the count values and $\bigcup$ denotes the union. The predicted density map $\hat{Y}$ is calculated as:
$$\hat{Y} = \sum_{p=1}^{n} P_p\, a_p$$
[0014] In the formula, $P_p$ denotes the probability score of the $p$-th interval, and $a_p$ is the representative point of the $p$-th interval.
[0015] As a further improvement to the above scheme, in step S3, the pre-alignment training in the first stage specifically involves: freezing the image encoder and the large language model, and training only the parameters of the projector; using image-description pairs as training data, the projector is optimized by minimizing the following loss function $\mathcal{L}_{\mathrm{align}}$:
$$\mathcal{L}_{\mathrm{align}} = -\sum_{t=1}^{T} \log P(w_t \mid w_{<t}, V)$$
[0016] In the formula, $T$ denotes the length of the descriptive text; $w_t$ denotes the $t$-th token; $w_{<t}$ denotes the sequence of the preceding $t-1$ tokens; $V$ denotes the visual feature vector aligned by the projector; $P(w_t \mid w_{<t}, V)$ denotes the predicted probability of the current token given the visual features and the preceding tokens.
[0017] As a further improvement to the above scheme, in step S3, the multi-task joint end-to-end training in the second stage specifically involves: unfreezing the image encoder so that it participates in training together with the projector, feature fusion module, and density map generator; during forward propagation, the visual features extracted by the image encoder are simultaneously input into the following three branches. Counting branch: outputs a predicted density map through the feature fusion module and density map generator. Image description branch: generates image-level semantic descriptions through the projector and the large language model. Region description branch: extracts local region features and generates region-level descriptions through the large language model. The model is optimized with a multi-task joint loss function $\mathcal{L}_{\mathrm{total}}$:
$$\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{LCE}} + \lambda_2 \mathcal{L}_{\mathrm{img}} + \lambda_3 \mathcal{L}_{\mathrm{reg}}$$
[0018] In the formula, $\mathcal{L}_{\mathrm{LCE}}$ is the location-aware cross-entropy loss, $\mathcal{L}_{\mathrm{LCE}} = \mathcal{L}_{\mathrm{cls}} + \lambda \mathcal{L}_{\mathrm{cnt}}$; $\mathcal{L}_{\mathrm{cls}}$ is the block classification cross-entropy loss, which ensures that each image block is correctly classified into the corresponding counting interval; $\mathcal{L}_{\mathrm{cnt}}$ is the density map counting loss, which constrains the total predicted count to match the true value; $\lambda$, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are all weighting coefficients. $\mathcal{L}_{\mathrm{img}}$ is the image-level description loss:
$$\mathcal{L}_{\mathrm{img}} = -\sum_{t=1}^{T} \log P(w_t \mid w_{<t}, V)$$
where $w_t$ denotes the $t$-th token of the image-level description and $w_{<t}$ denotes its first $t-1$ tokens. $\mathcal{L}_{\mathrm{reg}}$ is the region-level description loss:
$$\mathcal{L}_{\mathrm{reg}} = -\sum_{r=1}^{R} \sum_{t=1}^{T_r} \log P(w_t^{r} \mid w_{<t}^{r}, V_r)$$
where $R$ is the number of regions, $T_r$ is the length of the $r$-th region description, $w_t^{r}$ denotes the $t$-th token of the $r$-th region-level description, $w_{<t}^{r}$ denotes its first $t-1$ tokens, and $V_r$ denotes the local visual features of the $r$-th region.
[0019] The present invention also discloses a computer terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the steps of the dense small target pest counting method based on a two-stage visual-language model as described above.
[0020] The present invention also discloses a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the dense small-target pest counting method based on a two-stage vision-language model as described above.
[0021] Compared with the prior art, the beneficial effects of the present invention are: 1. The present invention discloses a method for counting dense small-target pests based on a two-stage vision-language model. Through a phased training strategy, a semantic foundation is first established and then specialized tasks are optimized, making the training process more stable and efficient. Its multi-task loss function design maintains the original counting accuracy and enhances the semantic understanding ability of the model. The inference stage has no additional computational overhead, which facilitates practical deployment and application. Based on this, the counting accuracy and generalization ability of dense small-target pests are significantly improved.
[0022] 2. This invention discloses a method for counting dense small-target pests based on a two-stage vision-language model. It designs a lightweight cross-modal mapping and semantic supervision sub-architecture (projector + LLaVA-0.5B), introducing the lightweight LLaVA-OneVision-0.5B for semantic supervision. The projector is designed as a two-layer MLP structure, specifically establishing an accurate mapping from visual features to the LLaVA-0.5B linguistic semantic space. This design enhances the model's cross-modal alignment capability and ensures efficient semantic supervision, thereby improving the model's understanding of dense small-target pests and thus increasing the accuracy and generalization of the counting.
Attached Figure Description
[0023] Figure 1 is a flowchart of the dense small-target pest counting method based on a two-stage vision-language model in Embodiment 1 of the present invention.
[0024] Figure 2 is a schematic diagram of the structure of the dense small-target pest counting model in Embodiment 1 of the present invention.
[0025] Figure 3 shows test images of the dense small-target pest counting model of Embodiment 1 of the present invention, applied to some pest images.
[0026] Figure 4 is a schematic diagram of the structure of the computer terminal in Embodiment 2 of the present invention.
Detailed Implementation
[0027] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0028] Embodiment 1
[0029] Referring to Figure 1, this embodiment provides a method for counting dense small-target pests based on a two-stage vision-language model, including the following steps S1 to S4.
[0030] S1. Collect pest images and perform block-level annotation to construct a dataset of dense small-target pest images.
[0031] In this embodiment, before block-level labeling, the pest images are preprocessed to remove images that are out of focus, blurry, or have obvious halos.
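The text does not specify the format of the block-level annotation. As a minimal sketch, assuming each pest is marked by a center point in pixel coordinates and the image is divided into a regular grid of square blocks (the function name and the `block` size are hypothetical), per-block counts could be derived like this:

```python
def block_counts(points, width, height, block):
    """Count annotated pest centers falling in each block of a regular grid.

    points: iterable of (x, y) pixel coordinates (assumed annotation format).
    Returns a list of rows; entry [r][c] is the count for block (r, c).
    """
    cols = (width + block - 1) // block   # ceiling division covers edge pixels
    rows = (height + block - 1) // block
    grid = [[0] * cols for _ in range(rows)]
    for x, y in points:
        # Points outside the image are ignored rather than clamped.
        if 0 <= x < width and 0 <= y < height:
            grid[int(y) // block][int(x) // block] += 1
    return grid


example_points = [(3, 4), (10, 12), (40, 5), (63, 63)]
example_grid = block_counts(example_points, width=64, height=64, block=32)
```

Summing the grid recovers the image-level count, which gives a quick consistency check on the annotations.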
[0032] S2. Construct a dense small-target pest counting model, which includes an image encoder, a text encoder, a feature fusion module, a density map generator, a projector, and a large language model.
[0033] As shown in Figure 2, the image encoder receives images of dense small-target pests and extracts high-resolution visual features that retain spatial details; the text encoder extracts the text semantic features of input text prompts; the feature fusion module calculates the similarity between visual features and text semantic features and fuses them to output multimodal features; the density map generator maps the multimodal features to pest density maps and counting results; the projector establishes a mapping relationship between visual features and the language semantic space; the large language model generates semantic descriptions based on the mapped visual features, and this semantic supervision assists in training the model.
[0034] Each module is designed to deliver its expected output, and the input and output parameters of the modules are matched so that they integrate seamlessly. The large language model (LLM) and projector modules are used during the two-stage training process, primarily for pre-alignment. They are not needed for inference on an image once training is complete, which makes the trained model simpler to invoke and use.
[0035] (1) The image encoder is built on the ResNet-50 architecture and consists of 4 convolutional stages, including 49 convolutional layers and 1 global average pooling layer. In the improved version, the global average pooling layer and the final fully connected projection layer at the end of the original network are removed to preserve high-resolution spatial feature information. The input multi-density pest image has a size of 224×224×3. It first passes through a convolutional layer with a kernel size of 7×7 and a stride of 2, then through a max pooling layer with a kernel size of 3×3 and a stride of 2, and then through the 4 convolutional stages in sequence, finally yielding a feature map with a spatial size of 1/32 of the input and 2048 channels.
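The 1/32 downsampling can be checked with simple shape arithmetic. The sketch below traces the spatial size through the layers named above, assuming the standard ResNet-50 paddings (3 for the 7×7 convolution, 1 for the pooling and the strided 3×3 convolutions) and the usual per-stage strides of 1, 2, 2, 2; these values are not stated in the text.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet50_feature_shape(height=224, width=224):
    """Trace (H, W, C) through a ResNet-50 with avgpool and fc removed."""
    h, w = height, width
    # Stem: 7x7 convolution, stride 2 (padding 3 assumed).
    h, w = conv_out(h, 7, 2, 3), conv_out(w, 7, 2, 3)
    # 3x3 max pooling, stride 2 (padding 1 assumed).
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    channels = 64
    # Four stages: (output channels, stride of the first block in the stage).
    for out_channels, stride in [(256, 1), (512, 2), (1024, 2), (2048, 2)]:
        h, w = conv_out(h, 3, stride, 1), conv_out(w, 3, stride, 1)
        channels = out_channels
    return h, w, channels


feature_shape = resnet50_feature_shape()  # 224 / 32 = 7 per side
```

For a 224×224 input this gives a 7×7 grid of 2048-channel features, so each feature location summarizes a 32×32 pixel block of the input.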
[0036] (2) The text encoder is set to use the pre-trained CLIP text encoder, which is built based on the Transformer architecture and contains 12 attention layers, each with 12 attention heads and a hidden layer dimension of 768. During the entire training process, all parameters of the text encoder are kept frozen, including the weights and biases of the word embedding layer, the position encoding layer, the self-attention layer and the feedforward neural network layer.
[0037] (3) The large language model selected is the LLaVA-OneVision-0.5B model, which is based on the Vicuna-7B architecture for vision-language adaptation. It contains 32 Transformer decoder layers, each with a hidden dimension of 512 and 40 attention heads. The LLM is responsible for generating image-level and region-level semantic descriptions based on visual features, which can be discarded during the inference stage to improve computational efficiency.
[0038] (4) The projector adopts a two-layer multilayer perceptron (MLP) structure. The first layer maps the 2048-dimensional visual features to 1024 dimensions using the GELU activation function. The second layer further maps the 1024-dimensional features to 512 dimensions, consistent with the input dimension of the LLaVA-OneVision model. The projector remains trainable throughout the training process and is responsible for establishing the mapping relationship between visual features and the language semantic space.
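A minimal pure-Python sketch of such a projector is shown below. It assumes both layers carry a bias and uses random, untrained weights with small stand-in dimensions so the demo runs quickly; the helper names are hypothetical. The parameter count is evaluated for the 2048 → 1024 → 512 sizes given above.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    """y = W v + b, with the weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def make_linear(in_dim, out_dim, rng):
    """Randomly initialized layer (illustrative values, not trained weights)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def projector(vec, layer1, layer2):
    """Two-layer MLP: linear, GELU, linear."""
    hidden = [gelu(h) for h in linear(vec, *layer1)]
    return linear(hidden, *layer2)


def param_count(in_dim, hidden_dim, out_dim):
    """Weights plus biases of both layers (biases are an assumption)."""
    return (in_dim * hidden_dim + hidden_dim) + (hidden_dim * out_dim + out_dim)


rng = random.Random(0)
# Stand-in sizes 8 -> 4 -> 2; the text uses 2048 -> 1024 -> 512.
layer1 = make_linear(8, 4, rng)
layer2 = make_linear(4, 2, rng)
projected = projector([0.1 * i for i in range(8)], layer1, layer2)
full_size_params = param_count(2048, 1024, 512)
```

Under these assumptions the full-size projector has about 2.6 million parameters, small next to the image encoder and the language model, which fits the text's description of it as lightweight.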
[0039] (5) The feature fusion module is responsible for calculating the semantic similarity between image features and text features. The specific operation process is as follows: the feature map output by the improved CLIP image encoder is reduced to 512 dimensions using a 1×1 convolution; simultaneously, the text features output by the text encoder are obtained; the softmax method is then used for feature fusion.
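The text gives no further detail on the fusion step. One plausible reading, sketched below, is that each spatial location's 512-dimensional feature is compared with every text prompt feature by cosine similarity and the similarities are normalized with a softmax. The cosine similarity, the `temperature` parameter, and the toy 3-dimensional vectors are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(visual_feature, text_features, temperature=1.0):
    """Softmax over cosine similarities between one location and every prompt."""
    v = l2_normalize(visual_feature)
    sims = [sum(a * b for a, b in zip(v, l2_normalize(t))) for t in text_features]
    return softmax([s / temperature for s in sims])


# Toy 3-d features stand in for the 512-d ones described in the text.
location_feature = [0.9, 0.1, 0.0]
prompt_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probabilities = fuse(location_feature, prompt_features, temperature=0.1)
```

With one prompt per counting interval, the resulting probabilities can be read as the per-interval scores consumed by the density map generator.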
[0040] (6) The density map generator adopts an average interval counting strategy. The count support set is $S = \{0, 1, \dots, m\}$, where $m$ represents the maximum allowed count value. To address the uneven distribution of counts, the average count of each interval is used as its representative point. The calculation formula is as follows:
$$a_i = \frac{1}{|X_i|} \sum_{k=1}^{N} \mathbb{1}(c_k \in X_i)\, c_k, \quad i = 1, \dots, n$$
[0041] In the formula, $a_i$ denotes the representative point of the $i$-th interval $X_i$, and $n$ is the number of predefined intervals; $|X_i|$ denotes the cardinality of interval $X_i$ (the number of blocks whose count value falls in $X_i$); $N$ denotes the total number of blocks in the dataset; $\mathbb{1}(\cdot)$ is the indicator function; $c_k$ denotes the count value in the $k$-th block. Assume $\{X_i \mid i = 1, \dots, n\}$ is a predefined set of intervals satisfying the following condition: for any $p \neq q$, the intersection of intervals $X_p$ and $X_q$ is empty, and $\bigcup_{i=1}^{n} X_i = S$, where $S$ is the support set of the count values and $\bigcup$ denotes the union. The $n$-dimensional vector at spatial location $(i, j)$ gives the probability scores of the corresponding image region belonging to each interval. For each interval $X_p$, let $a_p$ be its representative count value. The predicted density map is then obtained from the probability map by a weighted average. The calculation formula is as follows:
$$\hat{Y} = \sum_{p=1}^{n} P_p\, a_p$$
[0042] In the formula, $P_p$ denotes the probability score of the $p$-th interval, and $a_p$ is the representative point of the $p$-th interval.
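A small worked example may help. The sketch below computes each interval's representative point as the mean count of the training blocks that fall in it, then turns a vector of interval probabilities into an expected count for one location. The interval boundaries, the sample counts, and the midpoint fallback for empty intervals are illustrative assumptions.

```python
def representative_points(counts_per_block, intervals):
    """Average count of the blocks falling in each interval.

    intervals: list of (low, high) inclusive integer ranges that partition
    the support set. Empty intervals fall back to their midpoint, a case
    the text does not cover.
    """
    points = []
    for low, high in intervals:
        members = [c for c in counts_per_block if low <= c <= high]
        if members:
            points.append(sum(members) / len(members))
        else:
            points.append((low + high) / 2)
    return points


def expected_count(probabilities, points):
    """Weighted average of representative points under interval probabilities."""
    return sum(p * a for p, a in zip(probabilities, points))


counts = [0, 0, 1, 2, 2, 3, 5, 9]          # hypothetical per-block counts
bins = [(0, 0), (1, 2), (3, 10)]           # hypothetical intervals
reps = representative_points(counts, bins)  # [0.0, 5/3, 17/3]
density_value = expected_count([0.1, 0.6, 0.3], reps)
```

Here the wide interval 3 to 10 is represented by 17/3 ≈ 5.67 rather than its midpoint 6.5, which reflects where the observed counts actually concentrate; the expected count for the example probabilities is 2.7.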
[0043] S3. A two-stage strategy is adopted to train the dense small-target pest counting model. The first stage is to perform visual-linguistic modality pre-alignment training, and the second stage is to perform multi-task joint end-to-end training.
[0044] Pre-alignment training: In this stage, the image encoder and LLM are frozen and only the projector parameters are trained. The training data consists of image-description pairs, where the descriptions are either expert-annotated or automatically generated using an existing VLM. The loss is:
$$\mathcal{L}_{\mathrm{align}} = -\sum_{t=1}^{T} \log P(w_t \mid w_{<t}, V)$$
[0045] In the formula, $T$ denotes the length of the descriptive text; $w_t$ denotes the $t$-th token; $w_{<t}$ denotes the sequence of the preceding $t-1$ tokens; $V$ denotes the visual feature vector aligned by the projector; $P(w_t \mid w_{<t}, V)$ denotes the predicted probability of the current token given the visual features and the preceding tokens.
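The pre-alignment objective is a standard autoregressive negative log-likelihood over the description tokens. As a sketch, the function below takes the probabilities that a language model would assign to each ground-truth token (supplied directly here as made-up values, since running an actual model is out of scope) and sums their negative logs.

```python
import math


def caption_nll(token_probs):
    """Negative log-likelihood of one description.

    token_probs[t] is the probability assigned to the ground-truth token t
    given the projected visual features and the preceding tokens.
    """
    return -sum(math.log(p) for p in token_probs)


# Hypothetical per-token probabilities for a three-token description.
alignment_loss = caption_nll([0.5, 0.25, 0.8])
```

Because the probabilities multiply to 0.1, the loss equals ln 10 ≈ 2.303; it falls toward zero as the model becomes more confident in every ground-truth token.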
[0046] Joint training: Building upon pre-alignment, this stage performs multi-task joint end-to-end optimization. The CLIP image encoder is unfrozen and trained alongside the projector, feature fusion module, and density map generator. The specific training process is as follows:
Step 1, data preparation: Use multimodal training data that includes images, density map ground truth, image-level descriptions, and region-level descriptions.
[0047] Step 2, forward propagation: The input image is processed by the unfrozen CLIP image encoder to extract visual features, which are simultaneously fed into three branches. Counting branch: outputs a predicted density map through the feature fusion module and density map generator. Image description branch: generates image-level semantic descriptions through the projector and the large language model. Region description branch: extracts local region features and generates region-level descriptions through the large language model.
[0048] Step 3, multi-task loss calculation: A multi-task joint loss function is adopted:
$$\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{LCE}} + \lambda_2 \mathcal{L}_{\mathrm{img}} + \lambda_3 \mathcal{L}_{\mathrm{reg}}$$
[0049] In the formula, $\mathcal{L}_{\mathrm{LCE}}$ is the location-aware cross-entropy loss, $\mathcal{L}_{\mathrm{LCE}} = \mathcal{L}_{\mathrm{cls}} + \lambda \mathcal{L}_{\mathrm{cnt}}$; $\mathcal{L}_{\mathrm{cls}}$ is the block classification cross-entropy loss, which ensures that each image block is correctly classified into the corresponding counting interval; $\mathcal{L}_{\mathrm{cnt}}$ is the density map counting loss, which constrains the total predicted count to match the true value. The weighting coefficients $\lambda_1$ and $\lambda$ are usually set to 1.0 to ensure that the counting accuracy does not decrease.
[0050] Image-level description loss:
$$\mathcal{L}_{\mathrm{img}} = -\sum_{t=1}^{T} \log P(w_t \mid w_{<t}, V)$$
[0051] Here $w_t$ denotes the $t$-th token of the image-level description and $w_{<t}$ denotes its first $t-1$ tokens. The weight coefficient $\lambda_2$ prevents the image encoder from forgetting the semantic knowledge learned in the first stage while it is optimized for the counting task; it is set to 0.3-0.5.
[0052] Region-level description loss:
$$\mathcal{L}_{\mathrm{reg}} = -\sum_{r=1}^{R} \sum_{t=1}^{T_r} \log P(w_t^{r} \mid w_{<t}^{r}, V_r)$$
[0053] In the formula, $R$ is the number of regions; $T_r$ is the length of the $r$-th region description; $w_t^{r}$ denotes the $t$-th token of the $r$-th region-level description; $w_{<t}^{r}$ denotes its first $t-1$ tokens; $V_r$ denotes the local visual features of the $r$-th region. The weight coefficient $\lambda_3$ is set to 0.1-0.3.
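The weighting described above can be sketched as a plain weighted sum. Exactly which coefficient multiplies which term is an assumption here, as are the parameter names; the defaults use 1.0 for the counting-related weights and mid-range values from the stated 0.3-0.5 and 0.1-0.3 ranges for the description weights.

```python
def total_loss(class_loss, count_loss, image_desc_loss, region_desc_loss,
               w_count=1.0, w_counting_branch=1.0, w_image=0.4, w_region=0.2):
    """Weighted sum of the counting and description losses.

    The counting term combines block classification and density counting;
    the two description terms act as auxiliary semantic supervision.
    """
    counting = class_loss + w_count * count_loss
    return (w_counting_branch * counting
            + w_image * image_desc_loss
            + w_region * region_desc_loss)


# Hypothetical loss values for one training batch.
loss_value = total_loss(class_loss=1.2, count_loss=0.5,
                        image_desc_loss=2.0, region_desc_loss=3.0)
```

With these inputs the counting term contributes 1.7 and the two description terms 0.8 and 0.6, so the counting objective dominates the gradient while the description losses act as regularizers.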
[0054] Step 4, parameter optimization: The parameters of the image encoder, projector, feature fusion module, and density map generator are updated simultaneously through backpropagation; the LLM parameters may either be trained as well or kept frozen.
[0055] Training objective: To optimize the pest counting accuracy of the model while maintaining semantic understanding capabilities, so that the model has both accurate counting ability and rich semantic understanding capabilities.
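The two-stage schedule amounts to switching which modules receive gradient updates. The sketch below records that schedule as plain data; the module names are hypothetical labels, and in a deep learning framework the resulting flags would be applied to each module's parameters.

```python
TRAINABLE = {
    # Stage 1 (pre-alignment): only the projector is trained.
    "stage1": {"projector"},
    # Stage 2 (joint training): encoder, projector, fusion, density generator.
    "stage2": {"image_encoder", "projector", "feature_fusion",
               "density_map_generator"},
}
ALL_MODULES = ["image_encoder", "text_encoder", "feature_fusion",
               "density_map_generator", "projector", "llm"]


def requires_grad(stage, train_llm=False):
    """Map each module name to whether its parameters are updated in `stage`."""
    trainable = set(TRAINABLE[stage])
    if stage == "stage2" and train_llm:
        trainable.add("llm")  # the text allows the LLM to be trained or frozen
    return {name: name in trainable for name in ALL_MODULES}


stage1_flags = requires_grad("stage1")
stage2_flags = requires_grad("stage2")
```

The text encoder stays frozen in both stages, consistent with the description of the CLIP text encoder above.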
[0056] S4. Acquire the image to be detected and preprocess it, input it into the trained dense small target pest counting model, and output the pest density map and counting results through the density map generator.
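At inference only the counting branch is needed. Assuming the counting result is obtained by summing the predicted density map, which is the usual convention for density-based counting and is not spelled out in the text, the final step can be sketched as:

```python
def count_from_density(density_map):
    """Total count as the sum of all entries of the predicted density map."""
    return sum(sum(row) for row in density_map)


# Hypothetical 2x3 density map with one value per image block.
predicted_density = [[0.0, 1.7, 2.7], [0.4, 0.0, 5.2]]
predicted_count = count_from_density(predicted_density)
reported_count = round(predicted_count)
```

Rounding to an integer is a presentation choice; the unrounded sum is what a counting loss would compare against the ground-truth total.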
[0057] Figure 3 shows test results of the dense small-target pest counting model of this invention on some pest images. In Figure 3, group (a) compares the model's counting performance on booklice, with an actual count of 48 and a model output of 50; group (b) compares the counting performance on rice weevils, with an actual count of 66 and a model output of 67; group (c) compares the counting performance on grain beetles, with an actual count of 84 and a model output of 86. As Figure 3 shows, the proposed model counts dense small-target pests accurately, with an error of 1 to 2 individuals in these three examples.
[0058] Table 1 shows the counting metrics of the dense small-target pest counting model for different pests, with mean absolute error (MAE) and root mean squared error (RMSE) used as test metrics. As can be seen from Table 1, the proposed model performs well across different pest species.
[0059] Table 1: Counting Indicators of Dense Small Target Pest Counting Model
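For reference, the two metrics can be computed as follows. The inputs are the three image-level examples quoted for Figure 3 (actual counts 48, 66, 84 and model outputs 50, 67, 86); they illustrate the formulas only and are not the values of Table 1.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between true and predicted counts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between true and predicted counts."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual_counts = [48, 66, 84]      # from the Figure 3 examples
predicted_counts = [50, 67, 86]
mae_value = mae(actual_counts, predicted_counts)    # 5/3
rmse_value = rmse(actual_counts, predicted_counts)  # sqrt(3)
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both shows whether errors are spread evenly or concentrated in a few images.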
[0060] Embodiment 2
[0061] This embodiment provides a computer terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the steps of the dense small target pest counting method based on a two-stage visual-language model as described in Embodiment 1.
[0062] As shown in Figure 4, the computer terminal provided in this embodiment includes at least one processor 101 and a memory 102 connected to the at least one processor 101. This embodiment does not limit the specific connection medium between the processor 101 and the memory 102; Figure 4 takes as an example a connection between the processor 101 and the memory 102 via a bus 100. In Figure 4, the bus 100 is drawn as a bold line, and the connections between other components are shown for illustrative purposes only and are not limiting. The bus 100 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, Figure 4 shows it as a single bold line, but this does not mean that there is only one bus or one type of bus. The processor 101 may also be called a controller; there is no restriction on the name.
[0063] In this embodiment, the memory 102 stores instructions that can be executed by at least one processor 101. The at least one processor 101 can execute the aforementioned method by executing the instructions stored in the memory 102.
[0064] The processor 101 is the control center of the device. It can connect to various parts of the control device through various interfaces and lines. By running or executing instructions stored in memory 102 and calling data stored in memory 102, the processor can perform various functions and process data, thereby monitoring the device as a whole.
[0065] In one possible design, processor 101 may include one or more processing units. Processor 101 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the modem processor may also not be integrated into processor 101. In some embodiments, processor 101 and memory 102 may be implemented on the same chip; in some embodiments, they may also be implemented on separate chips.
[0066] Processor 101 can be a general-purpose processor, such as a central processing unit (CPU), digital signal processor, application-specific integrated circuit, field-programmable gate array or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, capable of implementing or executing the methods, steps, and logic block diagrams disclosed in the embodiments. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the dense small-target pest counting method based on a two-stage vision-language model disclosed in Embodiment 1 can be directly implemented by the hardware processor, or implemented by a combination of hardware and software modules in processor 101.
[0067] Memory 102, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. Memory 102 may include at least one type of storage medium, such as flash memory, hard disk, multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic storage, magnetic disk, optical disk, etc. Memory 102 can be any other medium capable of carrying or storing desired program code in the form of instructions or data structures that can be accessed by a computer, but is not limited thereto. In this embodiment, memory 102 can also be a circuit or any other device capable of implementing storage functions for storing program instructions and / or data.
[0068] By designing and programming the processor 101, the code corresponding to the dense small-target pest counting method based on a two-stage vision-language model described in the foregoing embodiments can be embedded into the chip, so that the chip, at runtime, executes the steps of the method shown in Figure 1. How to design and program the processor 101 is a technique well known to those skilled in the art and will not be described further here.
[0069] Example 3
[0070] This embodiment provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements the steps of the dense small-target pest counting method based on a two-stage vision-language model as described in Embodiment 1.
[0071] The computer-readable storage medium may include flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the storage medium may be an internal storage unit of a computer device, such as the hard disk or memory of the computer device. In other embodiments, the storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, smart memory card, secure digital card, flash memory card, etc., provided on the computer device. Of course, the storage medium may include both internal storage units and external storage devices of the computer device. In this embodiment, the memory is typically used to store the operating system and various application software installed on the computer device. In addition, the memory can also be used to temporarily store various types of data that have been output or will be output.
[0072] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
Claims
1. A method for counting dense small-target pests based on a two-stage vision-language model, characterized in that it includes the following steps:
S1. Collect pest images and perform block-level annotation to construct a dataset of dense small-target pest images;
S2. Construct a dense small-target pest counting model, which includes an image encoder, a text encoder, a feature fusion module, a density map generator, a projector, and a large language model. The image encoder receives images of dense small-target pests and extracts high-resolution visual features that retain spatial details. The text encoder extracts semantic features from input text prompts. The feature fusion module calculates the similarity between the visual features and the text semantic features and fuses them to output multimodal features. The density map generator maps the multimodal features to a pest density map and counting results. The projector establishes a mapping between the visual features and the language semantic space. The large language model generates semantic descriptions based on the mapped visual features, providing semantic supervision that assists model training;
S3. Train the dense small-target pest counting model with a two-stage strategy: the first stage performs vision-language modality pre-alignment training, and the second stage performs multi-task joint end-to-end training;
S4. Acquire the image to be detected and preprocess it, input it into the trained dense small-target pest counting model, and output the pest density map and counting results through the density map generator.
2. The method for counting dense small-target pests based on a two-stage vision-language model according to claim 1, characterized in that, in step S2, the image encoder is built on the ResNet-50 architecture and consists of four convolutional stages. The global average pooling layer and the fully connected projection layer at the end of the original network are removed to preserve high-resolution spatial feature information. The image encoder processes its input as follows: the input image of size 224×224×3 first passes through a convolutional layer with a kernel size of 7×7 and a stride of 2, then through a max pooling layer with a kernel size of 3×3 and a stride of 2, and then through the four convolutional stages in sequence, finally yielding visual features with a spatial size of 1/32 of the original image and 2048 channels.
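As an illustrative check of the sizes stated in claim 2, the following sketch traces the spatial resolution through the stem and the four stages. The padding values and the per-stage strides and channel widths are those of the standard ResNet-50 design and are assumptions here; the claim itself states only the kernel sizes, the stem strides, and the final 1/32 scale with 2048 channels. The function names are hypothetical.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (stride, output channels) per stage: standard ResNet-50 values, assumed here.
STAGES = [(1, 256), (2, 512), (2, 1024), (2, 2048)]


def encoder_feature_shape(height: int = 224, width: int = 224) -> tuple:
    """Return (channels, height, width) after the stem and the four stages."""
    # Stem: 7x7 convolution with stride 2 (padding 3 is an assumption).
    height = conv_output_size(height, kernel=7, stride=2, padding=3)
    width = conv_output_size(width, kernel=7, stride=2, padding=3)
    # 3x3 max pooling with stride 2 (padding 1 is an assumption).
    height = conv_output_size(height, kernel=3, stride=2, padding=1)
    width = conv_output_size(width, kernel=3, stride=2, padding=1)
    channels = 64
    for stride, out_channels in STAGES:
        # Each stage is modelled by its single strided 3x3 convolution
        # (padding 1); the other layers keep the resolution unchanged.
        height = conv_output_size(height, kernel=3, stride=stride, padding=1)
        width = conv_output_size(width, kernel=3, stride=stride, padding=1)
        channels = out_channels
    return channels, height, width


print(encoder_feature_shape())  # (2048, 7, 7)
```

With a 224×224 input the final map is 7×7, which is 224/32, matching the 1/32 scale given in the claim.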
3. The method for counting dense small-target pests based on a two-stage vision-language model according to claim 2, characterized in that, in step S2, the text encoder is a pre-trained CLIP text encoder, and all parameters of the text encoder remain frozen throughout training; the large language model is the LLaVA-OneVision-0.5B model, which generates image-level and region-level semantic descriptions during the training phase and is discarded during the inference phase to improve computational efficiency.
4. The method for counting dense small-target pests based on a two-stage vision-language model according to claim 3, characterized in that the projector employs a two-layer multilayer perceptron structure; the first layer of the multilayer perceptron maps the 2048-dimensional visual features to 1024 dimensions and applies the GELU activation function; the second layer of the multilayer perceptron further maps the 1024-dimensional features to 512 dimensions, consistent with the input dimension of the LLaVA-OneVision-0.5B model.
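The two-layer projector of claim 4 can be sketched as follows. This is a minimal plain-Python illustration, not the applicant's implementation: the class names, the random initialisation, and the zero biases are assumptions, and a practical system would use a tensor library. The default sizes 2048, 1024, and 512 are the ones stated in the claim; the usage line uses much smaller sizes only to keep the demonstration fast.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


class Linear:
    """Dense layer y = W x + b with small random weights (illustrative only)."""

    def __init__(self, in_dim: int, out_dim: int, rng: random.Random) -> None:
        scale = 1.0 / math.sqrt(in_dim)
        self.weight = [
            [rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        return [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(self.weight, self.bias)
        ]


class Projector:
    """Two-layer MLP: in_dim -> hidden_dim (GELU) -> out_dim."""

    def __init__(self, in_dim=2048, hidden_dim=1024, out_dim=512, seed=0):
        rng = random.Random(seed)
        self.fc1 = Linear(in_dim, hidden_dim, rng)
        self.fc2 = Linear(hidden_dim, out_dim, rng)

    def __call__(self, visual_feature):
        hidden = [gelu(h) for h in self.fc1(visual_feature)]
        return self.fc2(hidden)


# Small sizes for a quick demonstration; the claim uses 2048 -> 1024 -> 512.
projector = Projector(in_dim=8, hidden_dim=4, out_dim=2)
print(len(projector([0.1] * 8)))  # 2
```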
5. The method for counting dense small-target pests based on a two-stage vision-language model according to claim 3, characterized in that the feature fusion module operates as follows: the 2048-dimensional visual features output by the image encoder are reduced to 512 dimensions through a 1×1 convolution, the text semantic features output by the text encoder are obtained, and the softmax method is then used for feature fusion.
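One possible reading of the fusion in claim 5 is sketched below: for every spatial location, the similarity between the reduced visual feature and each text feature is computed and then normalised with a softmax over the text prompts. The use of cosine similarity and the `temperature` value are assumptions that the claim does not state, the function names are hypothetical, and the 1×1 convolution is taken as already applied to the inputs.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(visual_features, text_features, temperature=0.07):
    """Per-location softmax over similarities to the text features.

    visual_features: one reduced feature vector per spatial location.
    text_features: one vector per text prompt, same dimension.
    temperature: hypothetical scaling factor, not taken from the source.
    """
    text_unit = [l2_normalize(t) for t in text_features]
    fused = []
    for feature in visual_features:
        unit = l2_normalize(feature)
        similarities = [
            sum(a * b for a, b in zip(unit, t)) / temperature
            for t in text_unit
        ]
        fused.append(softmax(similarities))
    return fused


# Two locations and two prompts in a toy 2-dimensional feature space.
scores = fuse([[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]])
print([row.index(max(row)) for row in scores])  # [0, 1]
```

Each output row sums to one, so it can be read as a probability score per text prompt at that location.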
6. The method for counting dense small-target pests based on a two-stage vision-language model according to claim 1, characterized in that the density map generator employs an average interval counting strategy, using the average count of each interval as its representative count point. The calculation formula is as follows:

$$c_p = \frac{1}{|B_p|} \sum_{i=1}^{N} \mathbb{1}(n_i \in B_p)\, n_i$$

In the formula, $c_p$ denotes the representative count point of the $p$-th interval $B_p$, $p = 1, \dots, P$, where $P$ is the number of predefined intervals; $|B_p|$ denotes the cardinality of interval $B_p$; $N$ denotes the total number of blocks in the dataset; $\mathbb{1}(\cdot)$ is the indicator function; and $n_i$ denotes the count value in the $i$-th block. For any $p \neq q$, the intersection of intervals $B_p$ and $B_q$ is empty, and $\bigcup_{p=1}^{P} B_p = \mathcal{S}$, where $\mathcal{S}$ denotes the support set of the count values and $\bigcup$ denotes the union. The predicted density map $\hat{D}$ is calculated over the intervals as:

$$\hat{D} = \sum_{p=1}^{P} s_p\, c_p$$

In the formula, $s_p$ denotes the probability score of the $p$-th interval, and $c_p$ is the representative point of the $p$-th interval.
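The average interval counting strategy of claim 6 can be illustrated with a short sketch. It assumes disjoint, inclusive integer intervals, takes the representative point of an interval as the mean of the block counts that fall inside it, and reads the prediction as the probability-weighted sum of representative points. The function names, the toy data, and the midpoint fallback for an interval with no blocks are assumptions, not part of the source.

```python
def representative_points(block_counts, intervals):
    """Mean block count per interval, used as that interval's representative.

    block_counts: count value of every annotated block in the dataset.
    intervals: disjoint (low, high) pairs, inclusive, covering all counts.
    """
    points = []
    for low, high in intervals:
        members = [n for n in block_counts if low <= n <= high]
        if members:
            points.append(sum(members) / len(members))
        else:
            # Assumption: fall back to the midpoint when no block lands here.
            points.append((low + high) / 2)
    return points


def expected_count(probabilities, points):
    """Probability-weighted sum of the representative points."""
    return sum(s * c for s, c in zip(probabilities, points))


counts = [0, 0, 1, 2, 2, 3, 5, 6]        # toy block-level annotations
bins = [(0, 0), (1, 2), (3, 6)]          # toy counting intervals
points = representative_points(counts, bins)
estimate = expected_count([0.2, 0.5, 0.3], points)
print([round(c, 3) for c in points], round(estimate, 3))
```

In this toy case the interval (1, 2) contains the counts 1, 2, 2, so its representative point is 5/3 rather than the midpoint 1.5, which is the effect of averaging over the observed data.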
7. The method for counting dense small-target pests based on a two-stage vision-language model according to claim 1, characterized in that, in step S3, the pre-alignment training in the first stage specifically involves: freezing the image encoder and the large language model, and training only the parameters of the projector; using image-description pairs as training data, and optimizing by minimizing the following loss function:

$$\mathcal{L}_{\text{align}} = -\sum_{t=1}^{T} \log P\left(w_t \mid w_{<t}, \mathbf{v}\right)$$

In the formula, $T$ denotes the length of the descriptive text; $w_t$ denotes the $t$-th token; $w_{<t}$ denotes the sequence of the preceding $t-1$ tokens; $\mathbf{v}$ denotes the visual feature vector aligned by the projector; and $P(w_t \mid w_{<t}, \mathbf{v})$ denotes the predicted probability of the current token given the visual features and the preceding tokens.
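The first-stage objective of claim 7 can be sketched as a token-level negative log-likelihood. Reading the loss as an unnormalised sum over tokens is an assumption, and the dictionary keys and function name are hypothetical; the freeze settings follow the claim, which trains only the projector in this stage.

```python
import math

# Which modules receive gradient updates in stage one, per claim 7.
STAGE_ONE_TRAINABLE = {
    "image_encoder": False,         # frozen
    "large_language_model": False,  # frozen
    "projector": True,              # the only module trained in this stage
}


def description_nll(token_probabilities):
    """Negative log-likelihood of one description.

    token_probabilities[t] is the model's probability of the t-th token
    given the preceding tokens and the projected visual feature.
    """
    return -sum(math.log(p) for p in token_probabilities)


# Three tokens predicted with probabilities 0.5, 0.25 and 0.8.
print(round(description_nll([0.5, 0.25, 0.8]), 4))  # 2.3026
```

Because the three probabilities multiply to 0.1, the loss equals -ln(0.1); raising any token probability lowers the loss, which is what drives the projector toward the language model's input space.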
8. The method for counting dense small-target pests based on a two-stage vision-language model according to claim 7, characterized in that, in step S3, the multi-task joint end-to-end training in the second stage specifically involves: unfreezing the image encoder so that it is trained together with the projector, the feature fusion module, and the density map generator; during forward propagation, the visual features extracted by the image encoder are fed simultaneously into the following three branches:
Counting branch: outputs a predicted density map through the feature fusion module and the density map generator;
Image description branch: generates image-level semantic descriptions through the projector and the large language model;
Region description branch: extracts local region features and generates region-level descriptions through the large language model;
and optimizing with a multi-task joint loss function that combines the following terms through weighting coefficients: a location-aware cross-entropy loss, which comprises a block classification cross-entropy loss, ensuring that each image block is classified into the correct counting interval, and a density map counting loss, constraining the total predicted count to match the true value; an image-level description loss, defined over the $t$-th token of the image-level description given its first $t-1$ tokens; and a region-level description loss, defined over the number of regions, the length of each region description, and the $t$-th token of each region-level description given its first $t-1$ tokens.
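How the second-stage loss terms of claim 8 might be combined is sketched below. The exact arrangement of the weighting coefficients is not recoverable from the claim text, so the form used here (a counting term built from the classification and counting losses, plus weighted image-level and region-level description losses) is an assumption, as are all names and default weight values.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    """Hypothetical weighting coefficients; the values are placeholders."""
    counting_branch: float = 1.0   # weight of the location-aware loss
    count_term: float = 1.0        # weight of the counting loss inside it
    image_description: float = 1.0
    region_description: float = 1.0


def location_aware_loss(classification_loss, counting_loss, count_term):
    """Block classification loss plus a weighted density map counting loss."""
    return classification_loss + count_term * counting_loss


def joint_loss(classification_loss, counting_loss,
               image_loss, region_loss, weights):
    """Weighted sum over the counting, image and region description branches."""
    counting_branch = location_aware_loss(
        classification_loss, counting_loss, weights.count_term
    )
    return (
        weights.counting_branch * counting_branch
        + weights.image_description * image_loss
        + weights.region_description * region_loss
    )


weights = LossWeights(counting_branch=1.0, count_term=0.1,
                      image_description=0.5, region_description=0.25)
total = joint_loss(0.5, 2.0, 1.0, 3.0, weights)
print(round(total, 4))
```

Keeping the weights in one dataclass makes it easy to switch the description branches off (weights of zero) and recover a counting-only objective for comparison.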
9. A computer terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the dense small-target pest counting method based on a two-stage vision-language model as described in any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps of the dense small-target pest counting method based on a two-stage vision-language model as described in any one of claims 1 to 8.