A honeybee counting method and device based on lightweight density regression and semi-supervised learning

By constructing the LBCNet network model and combining lightweight density regression and semi-supervised learning, the problems of occlusion and background interference in bee counting in complex scenarios were solved, achieving accuracy and efficiency in bee counting and meeting the needs of large-scale, digital beekeeping management.

CN122244807A · Pending · Publication Date: 2026-06-19 · JIANGXI APICULTURE RES INST

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JIANGXI APICULTURE RES INST
Filing Date
2026-05-22
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing bee image counting methods are easily affected by occlusion and background interference in natural scenes such as hive entrances, comb surfaces, and bee colony gathering areas, leading to missed detections, duplicate detections, and unstable positioning, making it difficult to meet the needs of large-scale, digital, and refined beekeeping management.

Method used

We construct an LBCNet network model using lightweight density regression and semi-supervised learning. Through data preprocessing and multi-scale context encoding, attention-guided low-level feature fusion, and density regression and counting loss optimization, we achieve bee counting.

Benefits of technology

Achieving accuracy and efficiency in bee counting under complex backgrounds and varying lighting conditions reduces the cost of manual labeling and enhances the level of intelligence in bee colony monitoring and management.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses a bee counting method and device based on lightweight density regression and semi-supervised learning, belonging to the field of deep learning technology. By preprocessing and dividing bee colony images from scenes such as the hive entrance and comb surface, the standardization of training data is ensured, laying the foundation for accurate counting. The constructed lightweight LBCNet network is adaptable to real-world beekeeping scenarios with dense occlusion, overlapping small objects, complex backgrounds, and variable lighting, balancing counting accuracy and operational efficiency, solving the problem that existing models are difficult to adapt to natural beekeeping scenarios. Simultaneously, by combining the characteristics of semi-supervised learning, the training difficulty of limited labeled samples is effectively alleviated, enabling efficient model training without extensive manual labeling. Ultimately, real-time accurate bee counting is achieved, meeting the needs of large-scale, digital, and refined beekeeping management, significantly reducing beekeeping management costs, and improving the intelligence level of bee colony monitoring and management.

Description

Technical Field

[0001] This application relates to the field of deep learning technology, specifically to a bee counting method and device based on lightweight density regression and semi-supervised learning.

Background Technology

[0002] Bee population is a crucial indicator for evaluating colony activity, reproductive status, pollination capacity, and apiary management effectiveness. In scenarios such as hive entrance monitoring, estimating bee counts on comb surfaces, analyzing colony aggregation, and evaluating pollination operations, rapid, accurate, and repeatable bee counts in images are typically required. Traditional statistical methods rely primarily on manual observation or experience-based estimation, which is heavily influenced by operator experience, observation angle, colony activity, and ambient lighting conditions. These methods suffer from high labor intensity, low efficiency, strong subjectivity, and poor repeatability, making them unsuitable for the demands of large-scale, digitalized, and refined beekeeping management.

[0003] With the development of computer vision and deep learning technologies, image-based methods for bee recognition, detection, and counting are increasingly being used in smart beekeeping. For example, Chinese patent CN117994634A discloses an improved lightweight visual detection method for bee pollination, comprising: acquiring a dataset of bee pollination images and preprocessing them; building an improved lightweight YOLOv5s convolutional neural network model; training the improved lightweight YOLOv5s convolutional neural network model; and inputting the bee pollination image to be tested into the trained YOLOv5s convolutional neural network model to obtain the bee pollination category and location information. This method can achieve high detection accuracy for bee pollination images in complex environmental backgrounds. However, in scenarios with dense bee occlusion, individual bee adhesion, and complex backgrounds, it is prone to missed detections, duplicate detections, and inaccurate localization.

[0004] For example, Chinese patent CN118918444B discloses a bee detection method, system, computer, and storage medium. It acquires bee image data, establishes a neural network model, and combines YOLOv8 and Transformer encoders for feature extraction and processing to ultimately identify the number and location of individual bees in the image. This scheme mainly belongs to a target-by-target recognition and counting method based on target detection and feature encoding, suitable for locating and detecting individual bees. However, in scenarios with a large number of bees, small individual scale, severe mutual occlusion, unclear bee body boundaries, and similar beehive textures to bee body textures, the target-by-target detection method is prone to problems such as missed detections, duplicate detections, and unstable localization, thus affecting the final counting accuracy.

[0005] It can be seen that existing bee image counting methods mainly focus on detection and recognition or monitoring of fixed beehive systems. Especially in natural scenes such as beehive entrances, comb surfaces, and bee colony gathering areas, bees often exhibit characteristics such as small scale, large numbers, large posture variations, significant mutual occlusion, complex background textures, and frequent changes in lighting. If only conventional target detection networks are used, they are easily affected by occlusion and background interference.

Summary of the Invention

[0006] To solve the above technical problems, this application proposes the following technical solution. In a first aspect, embodiments of this application provide a bee counting method based on lightweight density regression and semi-supervised learning, including:
preprocessing bee colony images collected at the hive entrance, on the comb surface, or in the colony gathering area, and dividing them into a training set, a validation set, and a test set;
constructing a lightweight bee counting network model, LBCNet, training LBCNet on the training set, and validating and testing the model on the validation set and the test set; and
using the trained LBCNet to process real-time bee colony images to count bees.

[0007] In one possible implementation, preprocessing the bee colony images collected at the hive entrance, on the comb surface, or in the colony gathering area and dividing them into training, validation, and test sets includes:
acquiring bee colony images in three typical density scenarios (sparse, medium, and high) by adjusting the shooting angle, distance, and background conditions, covering different perspectives of the hive entrance, the comb surface, and the colony gathering area;
mapping the manual point annotations of the acquired images to the image pixel coordinate system, correcting points with inconsistent scales, and removing out-of-bounds annotation points so that the annotations are strictly aligned with the image space;
constructing a ground-truth density map from the normalized point annotations such that the integral of the entire density map equals the total number of annotation points; and
dividing the preprocessed bee colony images into training, validation, and test sets according to a preset ratio for model training, parameter tuning, and model performance evaluation.

[0008] In one possible implementation, LBCNet includes a student network and a teacher network, and the teacher network is updated via an exponential moving average (EMA) during model training. The teacher network and the student network have the same structure, each comprising a lightweight backbone network (Backbone), a multi-scale context encoding module (MSCE), an attention-guided low-level feature fusion module (AGLF), and a density regression head (Head).
The Backbone receives the input image and outputs low-level features and high-level features. The low-level features preserve the edge, local texture, and fine-grained structural information of the bees, while the high-level features characterize the semantic and contextual information of the bee colony distribution area.
The high-level features are input into the MSCE, which models them in parallel through multiple branches with different receptive fields and outputs multi-scale contextual features.
The AGLF enhances the multi-scale contextual features and fuses them with the low-level features to output fused features.
The Head outputs a single-channel non-negative density map based on the fused features, and this density map is used for bee counting.

[0009] In one possible implementation, the backbone is a lightweight convolutional network structure. The shallower layer outputs of the network structure serve as low-level features, which have richer edge textures and local geometric information, and are used to recover local responses in dense scenes with small objects. The deeper layer outputs serve as high-level features, which have stronger semantic abstraction capabilities and are used to encode the global distribution pattern of the bee colony.

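[0010] In one possible implementation, the MSCE is composed of an atrous spatial pyramid pooling branch (ASPP) and a multi-scale dilated convolution branch (MSDC) connected in series. ASPP performs large-receptive-field context modeling on the high-level features, and MSDC further enhances the representation of local dense regions and fine-grained geometric structures. The output of ASPP is:

$$F_{\mathrm{ASPP}} = \mathrm{Conv}_{\mathrm{out}}\Big(\mathrm{Cat}\big(\mathrm{Conv}_{d_1}(F_h),\ \mathrm{Conv}_{d_2}(F_h),\ \mathrm{Conv}_{d_3}(F_h)\big)\Big)$$

where $F_h$ denotes the high-level features after backbone extraction and channel mapping; $d_1$, $d_2$, and $d_3$ denote different dilation rates; $\mathrm{Cat}(\cdot)$ denotes channel concatenation; and $\mathrm{Conv}_{d_k}$ and $\mathrm{Conv}_{\mathrm{out}}$ denote convolution operations.
The MSDC further enhances the colony geometry at different local scales:

$$F_{\mathrm{ms}} = \mathrm{Conv}_{\mathrm{fuse}}\Big(\mathrm{Cat}\big(\mathrm{DConv}_{r_1}(F_{\mathrm{ASPP}}),\ \mathrm{DConv}_{r_2}(F_{\mathrm{ASPP}}),\ \mathrm{DConv}_{r_3}(F_{\mathrm{ASPP}})\big)\Big)$$

where $r_1$, $r_2$, and $r_3$ denote different dilation rates; $\mathrm{DConv}_{r}$ denotes a dilated convolution with rate $r$; $\mathrm{Cat}(\cdot)$ denotes feature concatenation; $\mathrm{Conv}_{\mathrm{fuse}}$ denotes a fusion convolution operation; and $F_{\mathrm{ms}}$ denotes the multi-scale contextual features.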

[0011] In one possible implementation, enhancing the multi-scale contextual features with AGLF and fusing them with the low-level features to output fused features includes the following. First, a coordinate attention mechanism aggregates and encodes the feature map along the height and width directions, generating the corresponding directional attention maps. The features enhanced by coordinate attention are:

$$F_{\mathrm{ca}} = \mathrm{CA}(F_{\mathrm{ms}}) = F_{\mathrm{ms}} \odot A^{h} \odot A^{w}$$

where $F_{\mathrm{ms}}$ denotes the output features of the multi-scale context encoding module, $A^{h}$ and $A^{w}$ denote the attention maps generated along the height and width directions, respectively, and $\odot$ denotes element-wise multiplication with broadcasting.
Then, channel projection and upsampling are applied to the attention-enhanced high-level features $F_{\mathrm{ca}}$ to obtain the high-level enhanced features $\tilde{F}_h$:

$$\tilde{F}_h = \mathrm{Up}\big(\phi_h(F_{\mathrm{ca}})\big)$$

At the same time, channel projection is applied to the low-level features $F_l$ to obtain the projected low-level features $\tilde{F}_l$:

$$\tilde{F}_l = \phi_l(F_l)$$

Finally, $\tilde{F}_h$ and $\tilde{F}_l$ are concatenated along the channel dimension, and the final fused features $F_{\mathrm{fuse}}$ are obtained through a feature refinement function:

$$F_{\mathrm{fuse}} = \mathcal{R}\big(\mathrm{Cat}(\tilde{F}_h,\ \tilde{F}_l)\big)$$

where $\phi_h$ and $\phi_l$ denote 1×1 convolution projections followed by batch normalization and an activation function; $\mathcal{R}$ denotes a feature refinement function consisting of two 3×3 convolution layers; $\mathrm{CA}$ denotes the coordinate attention operation; $\mathrm{Cat}$ denotes the low-level feature fusion by channel concatenation; $\mathrm{Up}$ denotes upsampling; and $F_{\mathrm{fuse}}$ denotes the final fused features.

[0012] In one possible implementation, the Head outputting a single-channel non-negative density map based on the fused features, with the density map used for bee counting, includes the following. After the fused features $F_{\mathrm{fuse}}$ are obtained, the Head outputs a single-channel non-negative density map:

$$\hat{D} = \rho\Big(\mathrm{Conv}_2\big(\sigma(\mathrm{Conv}_1(F_{\mathrm{fuse}}))\big)\Big)$$

where $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ denote convolution operations, $\sigma$ denotes a non-linear activation function, $\rho$ denotes the non-negativity-constrained activation function, and $\hat{D}$ denotes the predicted density map.
Then, the number of bees in the current image is obtained by integrating all pixel values of the density map:

$$\hat{C} = \sum_{i=1}^{H}\sum_{j=1}^{W}\hat{D}(i,j)$$

where $\hat{D}(i,j)$ denotes the estimated value of the density map at location $(i,j)$, and $W$ and $H$ denote the width and height of the density map, respectively.

[0013] In one possible implementation, during the LBCNet training phase, a ground-truth density map is generated from the manually labeled points in the input image, and the model parameters are optimized with a combination of a density regression loss and a counting loss. The density regression loss $L_{\mathrm{den}}$ is defined as: [formula not reproduced in this text]. The counting loss $L_{\mathrm{cnt}}$ is defined as: [formula not reproduced in this text]. Here, $B$ is the number of labeled samples in a batch; $\hat{D}_b$ is the predicted density map of the $b$-th sample; $D_b$ is the ground-truth density map of the $b$-th sample; $\hat{C}_b$ is the predicted count of the $b$-th sample; $C_b$ is the ground-truth count of the $b$-th sample; $w_b$ is the counting-loss weight of the $b$-th sample; and $\varepsilon$ is a constant that prevents the denominator from being zero. The total loss on manually annotated samples is defined as:

$$L_{\mathrm{sup}} = L_{\mathrm{den}} + \lambda_{\mathrm{cnt}}\,L_{\mathrm{cnt}}$$

where $\lambda_{\mathrm{cnt}}$ is the counting loss weighting coefficient.

[0014] In one possible implementation, during the LBCNet training phase, for each unlabeled sample $u$, the teacher network receives a weakly augmented image and the student network receives the corresponding strongly augmented image, yielding the predicted density maps $\hat{D}^{t}_{u}$ and $\hat{D}^{s}_{u}$. A confidence region mask $M$ is constructed from the high-response regions of the teacher network's predicted density map, and a density consistency constraint $L_{\mathrm{dc}}$ is applied only within the high-confidence regions: [formula not reproduced in this text]. Here, $M_u(i,j)$ is the confidence region mask value of unlabeled sample $u$ at spatial location $(i,j)$; $\hat{D}^{s}_{u}(i,j)$ is the density value predicted by the student network for unlabeled sample $u$ at spatial location $(i,j)$; and $\hat{D}^{t}_{u}(i,j)$ is the density value predicted by the teacher network for unlabeled sample $u$ at spatial location $(i,j)$.
A counting consistency constraint $L_{\mathrm{cc}}$ is constructed from the integrated counts of the teacher network and the student network: [formula not reproduced in this text], where $\hat{C}^{s}_{u}$ and $\hat{C}^{t}_{u}$ denote the predicted counts output by the student network and the teacher network, respectively. The consistency loss of unlabeled samples can be expressed as:

$$L_{\mathrm{cons}} = \lambda_{\mathrm{dc}}\,L_{\mathrm{dc}} + \lambda_{\mathrm{cc}}\,L_{\mathrm{cc}}$$

where $\lambda_{\mathrm{dc}}$ and $\lambda_{\mathrm{cc}}$ are the weights of the density consistency term and the count consistency term. The total loss can be expressed as:

$$L = L_{\mathrm{sup}} + w(e)\,L_{\mathrm{cons}}$$

where the weight function $w(e)$ gradually increases with the training round $e$.
The teacher network parameters are updated using an exponential moving average of the student network parameters:

$$\theta_t \leftarrow \alpha\,\theta_t + (1-\alpha)\,\theta_s$$

where $\theta_t$ and $\theta_s$ denote the teacher network parameters and the student network parameters, respectively, and $\alpha$ is the exponential moving average decay coefficient.

[0015] In a second aspect, embodiments of this application provide a bee counting device based on lightweight density regression and semi-supervised learning, comprising an image acquisition device and a data processing device. The image acquisition device is embedded with a display interaction module and is communicatively connected to the data processing device. The image acquisition device acquires images of the beehive entrance, the comb surface, and the bee colony gathering area and transmits them to the data processing device, which then executes the method described in any possible implementation of the first aspect to count bees.

[0016] In this embodiment, preprocessing and partitioning bee colony images of scenes such as hive entrances and comb surfaces ensures the standardization of training data, laying the foundation for accurate counting. The constructed lightweight LBCNet network is adaptable to real-world beekeeping scenarios with dense occlusion, overlapping small targets, complex backgrounds, and variable lighting, balancing counting accuracy and operational efficiency, thus solving the problem of existing models being difficult to adapt to natural beekeeping scenarios. Simultaneously, by combining semi-supervised learning characteristics, the training challenge of limited labeled samples is effectively alleviated, enabling efficient model training without extensive manual labeling. Ultimately, this achieves real-time accurate bee counting, meeting the needs of large-scale, digital, and refined beekeeping management, significantly reducing beekeeping management costs, and improving the intelligence level of bee colony monitoring and management.

Attached Figure Description

[0017] Figure 1 is a schematic diagram of a bee counting method based on lightweight density regression and semi-supervised learning provided in an embodiment of this application;
Figure 2 is a schematic diagram of the LBCNet architecture provided in an embodiment of this application;
Figure 3 is a schematic diagram of the MSCE architecture provided in an embodiment of this application;
Figure 4 is a schematic diagram of the AGLF architecture provided in an embodiment of this application;
Figure 5 is a schematic diagram of a semi-supervised training process provided in an embodiment of this application;
Figure 6 is a visual comparison diagram of counting results under different annotation ratios provided in an embodiment of this application;
Figure 7 is a comparative schematic diagram of the original image, density heat map, and overlay image provided in an embodiment of this application;
Figure 8 is a schematic diagram comparing ablation experiment counts under different module configurations provided in an embodiment of this application;
Figure 9 is a front view of a bee counting device based on lightweight density regression and semi-supervised learning provided in an embodiment of this application;
Figure 10 is a rear view of a bee counting device based on lightweight density regression and semi-supervised learning provided in an embodiment of this application.

Detailed Implementation

[0018] The present solution will now be described in conjunction with the accompanying drawings and specific embodiments.

[0019] Referring to Figure 1, the bee counting method based on lightweight density regression and semi-supervised learning provided in this embodiment includes:
S101, preprocess the collected bee colony images of hive entrances, comb surfaces, or colony gathering areas and divide them into a training set, a validation set, and a test set.

[0020] The dataset for this embodiment was collected from the Shandong Agricultural University Science and Technology Innovation Park. Bee activity scenes were captured on-site using a mobile terminal device under natural lighting conditions. During data collection, periods of stable weather and low wind speed were chosen to minimize the impact of image blur and motion blur on the counting task.

[0021] By adjusting the shooting angle, distance, and background conditions, images of bee colonies in three typical scenarios—sparse, medium-density, and high-density—were acquired, covering different perspectives such as the hive entrance, comb surface, colony aggregation area, edge region, and overall frame region. The final dataset contains 586 high-resolution images and 34,869 fine-point annotations, which can be used to verify the counting robustness of the model under different crowding levels, background complexities, and shooting distances.

[0022] For labeled samples among the acquired bee colony images, the manually added point annotations are mapped to the image pixel coordinate system. Points with inconsistent scales are corrected, and out-of-bounds annotations are removed to ensure strict alignment between the annotations and the image space. A ground-truth density map is then constructed from the normalized point annotations. Preferably, an adaptive Gaussian kernel based on nearest-neighbor distances is used to generate the density supervision map, and a mass conservation correction is then applied so that the integral of the entire density map equals the total number of labeled points.
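The density map construction described here can be illustrated with a minimal, framework-free sketch. The neighbour count `k`, the scale factor `beta`, the floor `min_sigma`, and the function name `density_map` are illustrative assumptions; the text only states that the kernel adapts to nearest-neighbor distances and that the map is corrected to conserve mass.

```python
import math


def density_map(points, height, width, k=3, beta=0.3, min_sigma=1.0):
    """Build a ground-truth density map from (x, y) point annotations.

    k, beta and min_sigma are illustrative assumptions, not values from the text.
    """
    # Remove out-of-bounds annotation points.
    pts = [(x, y) for x, y in points if 0 <= x < width and 0 <= y < height]
    dmap = [[0.0] * width for _ in range(height)]
    for i, (px, py) in enumerate(pts):
        # Adaptive sigma: proportional to the mean distance to the k nearest neighbours.
        dists = sorted(math.hypot(px - qx, py - qy)
                       for j, (qx, qy) in enumerate(pts) if j != i)
        nearest = dists[:k]
        sigma = max(min_sigma, beta * sum(nearest) / len(nearest)) if nearest else min_sigma
        kernel = [[math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        # Mass conservation: rescale so that each point contributes exactly 1,
        # even when part of the Gaussian falls outside the image.
        mass = sum(sum(row) for row in kernel)
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / mass
    return dmap, len(pts)


# Four annotations, one of which lies outside a 32x32 image.
dmap, n_points = density_map([(5, 5), (6, 7), (20, 20), (100, 3)], height=32, width=32)
total = sum(sum(row) for row in dmap)
```

Normalizing each kernel by its in-image mass is one simple way to make the integral of the map equal the number of retained points; evaluating every kernel over the full image is only practical for small examples like this one.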

[0023] During the training phase, at least one of the following data augmentation operations can be performed: random cropping, horizontal flipping, brightness perturbation, contrast perturbation, color perturbation, and random Gaussian blur, to improve the model's adaptability to different lighting, background, and cluster density variations.

[0024] Finally, the preprocessed dataset was divided into training, validation, and test sets in a 7:2:1 ratio for model training, parameter tuning, and final performance evaluation. For semi-supervised training experiments, the training set was further divided into labeled and unlabeled subsets to simulate real-world application scenarios with low labeling ratios.
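As a rough sketch of this partitioning, the 7:2:1 split and the further labeled/unlabeled division of the training set might look as follows. The shuffling seed, the integer rounding rule, the file names, and the function name `split_dataset` are assumptions made for illustration.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), labeled_fraction=0.1, seed=0):
    """Split items into train/val/test by ratio, then split train into labeled/unlabeled."""
    rng = random.Random(seed)          # fixed seed (assumed) so the split is reproducible
    items = list(items)
    rng.shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]     # the remainder goes to the test set
    # Keep only a fraction of the training annotations to simulate a low labeling ratio.
    n_labeled = max(1, round(len(train) * labeled_fraction))
    return {
        "labeled": train[:n_labeled],
        "unlabeled": train[n_labeled:],
        "val": val,
        "test": test,
    }


# Hypothetical file names standing in for the 586 images of the dataset.
splits = split_dataset([f"img_{i:04d}.jpg" for i in range(586)], labeled_fraction=0.1)
sizes = {name: len(subset) for name, subset in splits.items()}
```

With 586 images and a 10% labeling ratio, this particular rounding rule yields 41 labeled and 369 unlabeled training images, 117 validation images, and 59 test images; the exact counts used in the embodiment are not stated.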

[0025] S102, construct a lightweight bee counting network model LBCNet, train the LBCNet using the training set, and validate and test the model using the validation set and test set.

[0026] Referring to Figure 2, the LBCNet constructed in this embodiment includes a student network and a teacher network. The teacher network is updated through EMA during model training. The teacher network and the student network have the same structure, each including a lightweight backbone network (Backbone), a multi-scale context encoding module (MSCE), an attention-guided low-level feature fusion module (AGLF), and a density regression head (Head).

[0027] Backbone receives the input image and outputs low-level and high-level features. Low-level features preserve the edges, local texture, and fine-grained structural information of the bees, while high-level features characterize the semantic and contextual information of the bee colony distribution area. In this embodiment, Backbone employs MobileNetV3-Large or a lightweight convolutional network structure with similar complexity, where shallower layer outputs serve as low-level features and deeper layer outputs serve as high-level features.

[0028] In one implementation, the fourth-stage features can be selected as low-level features and the seventh-stage features as high-level features. The high-level features are further mapped to a uniform channel representation via convolution and then fed into the subsequent multi-scale context encoding module, while the low-level features are retained for subsequent fusion to enhance the edge-detail recovery capability of the density map prediction.

[0029] In this backbone design, the shallower features carry richer edge texture and local geometric information, making them suitable for recovering local responses in dense scenes with small objects, while the deeper features have stronger semantic abstraction capabilities, making them suitable for encoding the global distribution patterns of the bee colony. Combining the two types of features supports both global statistics and local detail recovery.

[0030] The high-level features are input into the MSCE, which outputs multi-scale contextual features through parallel modeling by multiple branches with different receptive fields. Referring to Figure 3, the MSCE is composed of the atrous spatial pyramid pooling branch (ASPP) and the multi-scale dilated convolution branch (MSDC). ASPP performs large-receptive-field context modeling on the high-level features, while MSDC further enhances the representation of local dense regions and fine-grained geometric structures. The output of ASPP is:

$$F_{\mathrm{ASPP}} = \mathrm{Conv}_{\mathrm{out}}\Big(\mathrm{Cat}\big(\mathrm{Conv}_{d_1}(F_h),\ \mathrm{Conv}_{d_2}(F_h),\ \mathrm{Conv}_{d_3}(F_h)\big)\Big)$$

where $F_h$ denotes the high-level features after backbone extraction and channel mapping; $d_1$, $d_2$, and $d_3$ denote different dilation rates; $\mathrm{Cat}(\cdot)$ denotes channel concatenation; and $\mathrm{Conv}_{d_k}$ and $\mathrm{Conv}_{\mathrm{out}}$ denote convolution operations. With this structure, the receptive field can be expanded without significantly increasing the number of model parameters, improving the ability to model the global distribution patterns of bee colonies.

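[0031] Building on the ASPP output $F_{\mathrm{ASPP}}$, the MSDC further enhances the colony geometry at different local scales:

$$F_{\mathrm{ms}} = \mathrm{Conv}_{\mathrm{fuse}}\Big(\mathrm{Cat}\big(\mathrm{DConv}_{r_1}(F_{\mathrm{ASPP}}),\ \mathrm{DConv}_{r_2}(F_{\mathrm{ASPP}}),\ \mathrm{DConv}_{r_3}(F_{\mathrm{ASPP}})\big)\Big)$$

where $r_1$, $r_2$, and $r_3$ denote different dilation rates; $\mathrm{DConv}_{r}$ denotes a dilated convolution with rate $r$; $\mathrm{Cat}(\cdot)$ denotes feature concatenation; $\mathrm{Conv}_{\mathrm{fuse}}$ denotes a fusion convolution operation; and $F_{\mathrm{ms}}$ denotes the multi-scale contextual features. Compared with using only single-scale convolution, this structure simultaneously accounts for the neighborhood relationships between closely packed bees and the density variations of bee colonies over a larger area, thereby improving the sufficiency of feature representation in dense scenes.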
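To make concrete why dilation enlarges the receptive field without adding parameters, the following sketch computes the effective spatial extent of a dilated kernel using the standard relation k + (k - 1)(r - 1). The kernel size 3 and the rates 1, 2, 3, and 6 are illustrative assumptions; the text does not state the actual dilation rates.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated convolution kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def weights_per_filter(kernel_size, in_channels):
    """Number of weights in one 2D filter; it does not depend on the dilation rate."""
    return kernel_size * kernel_size * in_channels


# Assumed example: 3x3 kernels with hypothetical dilation rates.
extents = [effective_kernel_size(3, r) for r in (1, 2, 3, 6)]
params = weights_per_filter(3, in_channels=64)
```

Each branch covers a wider neighbourhood as the rate grows while its weight count stays the same, which is the property the multi-branch design relies on.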

[0032] After the multi-scale contextual features are obtained, AGLF enhances them and fuses them with the low-level features to output fused features. Referring to Figure 4, a coordinate attention mechanism first aggregates and encodes the feature map along the height and width directions, generating the corresponding directional attention maps. The features enhanced by coordinate attention are:

$$F_{\mathrm{ca}} = \mathrm{CA}(F_{\mathrm{ms}}) = F_{\mathrm{ms}} \odot A^{h} \odot A^{w}$$

where $F_{\mathrm{ms}}$ denotes the output features of the multi-scale context encoding module, $A^{h}$ and $A^{w}$ denote the attention maps generated along the height and width directions, respectively, and $\odot$ denotes element-wise multiplication with broadcasting. This operation enhances the spatial response related to bee distribution while suppressing irrelevant responses from the background region.

[0033] Then, channel projection and upsampling are applied to the attention-enhanced high-level features $F_{\mathrm{ca}}$ to obtain the high-level enhanced features $\tilde{F}_h$:

$$\tilde{F}_h = \mathrm{Up}\big(\phi_h(F_{\mathrm{ca}})\big)$$

At the same time, channel projection is applied to the low-level features $F_l$ to obtain the projected low-level features $\tilde{F}_l$:

$$\tilde{F}_l = \phi_l(F_l)$$

With this design, the high-level features can be effectively aligned and fused with the low-level edge texture information after upsampling.

[0034] Finally, the high-level enhanced features $\tilde{F}_h$ and the projected low-level features $\tilde{F}_l$ are concatenated along the channel dimension, and the final fused features $F_{\mathrm{fuse}}$ are obtained through a feature refinement function:

$$F_{\mathrm{fuse}} = \mathcal{R}\big(\mathrm{Cat}(\tilde{F}_h,\ \tilde{F}_l)\big)$$

where $\phi_h$ and $\phi_l$ denote 1×1 convolution projections followed by batch normalization and an activation function; $\mathcal{R}$ denotes a feature refinement function consisting of two 3×3 convolution layers; $\mathrm{CA}$ denotes the coordinate attention operation; $\mathrm{Cat}$ denotes the low-level feature fusion by channel concatenation; $\mathrm{Up}$ denotes upsampling; and $F_{\mathrm{fuse}}$ denotes the final fused features. By combining coordinate attention with low-level feature fusion, the model's local discriminative ability can be improved under conditions of blurred bee edges, small scale, and complex backgrounds.

[0035] The Head outputs a single-channel non-negative density map based on the fused features, and this density map is used for bee counting. After the fused features $F_{\mathrm{fuse}}$ are obtained, the Head outputs the density map as:

$$\hat{D} = \rho\Big(\mathrm{Conv}_2\big(\sigma(\mathrm{Conv}_1(F_{\mathrm{fuse}}))\big)\Big)$$

where $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ denote convolution operations, $\sigma$ denotes a non-linear activation function, $\rho$ denotes the non-negativity-constrained activation function, and $\hat{D}$ denotes the predicted density map.

[0036] Then, the number of bees in the current image is obtained by integrating all pixel values of the density map:

$$\hat{C} = \sum_{i=1}^{H}\sum_{j=1}^{W}\hat{D}(i,j)$$

where $\hat{D}(i,j)$ denotes the estimated value of the density map at location $(i,j)$, and $W$ and $H$ denote the width and height of the density map, respectively. This design estimates the total number of bees indirectly by learning the spatial distribution density of bees, avoiding the error accumulation caused by individual detection in densely occluded scenes.

[0037] In this embodiment, the density map output by the density regression head has a lower resolution than the original input image, thereby further reducing the computational load. Simultaneously, low-level feature fusion compensates for detail loss, ensuring the localization and response capabilities of the small target bee region.
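Because the count is the sum of the density map, a lower-resolution map can still carry the same count as long as the mass is preserved when changing resolution. The sketch below shows counting by summation and a sum-pooling step that keeps the total unchanged. Sum pooling as the way to bring a full-resolution ground-truth map down to the output resolution is an assumption; the text does not say how the two resolutions are matched.

```python
def count_from_density(density):
    """Estimate the bee count as the sum of all density values."""
    if any(v < 0 for row in density for v in row):
        raise ValueError("density map must be non-negative")
    return sum(sum(row) for row in density)


def sum_pool(density, factor):
    """Downsample by summing each factor x factor block, which preserves the total mass."""
    h, w = len(density), len(density[0])
    if h % factor or w % factor:
        raise ValueError("map size must be divisible by the pooling factor")
    return [[sum(density[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor))
             for x in range(w // factor)]
            for y in range(h // factor)]


full_res = [
    [0.25, 0.0, 0.0, 0.0],
    [0.0, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
low_res = sum_pool(full_res, factor=2)
count_full = count_from_density(full_res)
count_low = count_from_density(low_res)
```

Average pooling or plain interpolation would shrink the total by the area ratio, so a rescaling step would be needed in that case to keep the count consistent.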

[0038] To ensure the training performance of LBCNet, a ground-truth density map is generated from the manually labeled points in the input image, and the model parameters are optimized with a combination of a density regression loss and a counting loss. The density regression loss $L_{\mathrm{den}}$ is defined as: [formula not reproduced in this text]. The counting loss $L_{\mathrm{cnt}}$ is defined as: [formula not reproduced in this text]. Here, $B$ is the number of labeled samples in a batch; $\hat{D}_b$ is the predicted density map of the $b$-th sample; $D_b$ is the ground-truth density map of the $b$-th sample; $\hat{C}_b$ is the predicted count of the $b$-th sample; $C_b$ is the ground-truth count of the $b$-th sample; $w_b$ is the counting-loss weight of the $b$-th sample; and $\varepsilon$ is a constant that prevents the denominator from being zero. The total loss on manually annotated samples is defined as:

$$L_{\mathrm{sup}} = L_{\mathrm{den}} + \lambda_{\mathrm{cnt}}\,L_{\mathrm{cnt}}$$

where $\lambda_{\mathrm{cnt}}$ is the weighting coefficient of the counting loss. Through this joint optimization, the model learns the local density distribution while strengthening global counting consistency, thereby improving overall counting stability.
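The joint objective can be sketched as follows. The exact loss formulas are not available in this text, so the forms used here are assumptions chosen only to match the listed quantities: a per-pixel mean squared error for the density term, and a weighted relative absolute count error (with a small constant in the denominator) for the counting term. The coefficient value `lam=0.1` and the function name `supervised_loss` are likewise hypothetical.

```python
def supervised_loss(pred_maps, true_maps, weights=None, lam=0.1, eps=1e-6):
    """Assumed form: MSE density loss plus lam times a weighted relative count error."""
    batch = len(pred_maps)
    weights = weights if weights is not None else [1.0] * batch
    density_term, count_term = 0.0, 0.0
    for pred, true, w in zip(pred_maps, true_maps, weights):
        flat_pred = [v for row in pred for v in row]
        flat_true = [v for row in true for v in row]
        # Density regression: mean squared difference over all pixels (assumption).
        density_term += sum((p - t) ** 2 for p, t in zip(flat_pred, flat_true)) / len(flat_pred)
        # Counting: relative absolute error between integrated counts (assumption).
        c_pred, c_true = sum(flat_pred), sum(flat_true)
        count_term += w * abs(c_pred - c_true) / (c_true + eps)
    density_term /= batch
    count_term /= batch
    return density_term + lam * count_term, density_term, count_term


# One labeled sample with a 1x2 density map.
total, den, cnt = supervised_loss([[[1.0, 0.0]]], [[[0.5, 0.0]]])
perfect, _, _ = supervised_loss([[[0.5, 0.0]]], [[[0.5, 0.0]]])
```

The density term pushes the predicted map toward the right spatial distribution, while the count term penalizes errors in the integrated total even when per-pixel errors are small.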

[0039] Referring to Figure 5, for each unlabeled sample $u$, the teacher network receives a weakly augmented image and the student network receives the corresponding strongly augmented image, yielding the predicted density maps $\hat{D}^{t}_{u}$ and $\hat{D}^{s}_{u}$. A confidence region mask $M$ is constructed from the high-response regions of the teacher network's predicted density map, and a density consistency constraint $L_{\mathrm{dc}}$ is applied only within the high-confidence regions: [formula not reproduced in this text]. Here, $M_u(i,j)$ is the confidence region mask value of unlabeled sample $u$ at spatial location $(i,j)$; $\hat{D}^{s}_{u}(i,j)$ is the density value predicted by the student network for unlabeled sample $u$ at spatial location $(i,j)$; and $\hat{D}^{t}_{u}(i,j)$ is the density value predicted by the teacher network for unlabeled sample $u$ at spatial location $(i,j)$.
A counting consistency constraint $L_{\mathrm{cc}}$ is constructed from the integrated counts of the teacher network and the student network: [formula not reproduced in this text], where $\hat{C}^{s}_{u}$ and $\hat{C}^{t}_{u}$ denote the predicted counts output by the student network and the teacher network, respectively. The consistency loss of unlabeled samples can be expressed as:

$$L_{\mathrm{cons}} = \lambda_{\mathrm{dc}}\,L_{\mathrm{dc}} + \lambda_{\mathrm{cc}}\,L_{\mathrm{cc}}$$

where $\lambda_{\mathrm{dc}}$ and $\lambda_{\mathrm{cc}}$ are the weights of the density consistency term and the count consistency term. The total loss can be expressed as:

$$L = L_{\mathrm{sup}} + w(e)\,L_{\mathrm{cons}}$$

where the weight function $w(e)$ gradually increases with the training round $e$. A warm-up strategy is adopted so that training initially relies mainly on labeled supervision, and the consistency constraints on unlabeled samples are strengthened gradually once the teacher network's output tends to stabilize.
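A minimal sketch of the warm-up weighting is given below. The text only says that the weight increases with the training round, so the linear ramp is an assumption; the 40-round warm-up and the term weights 0.05 and 0.01 are taken from the training configuration given later in this embodiment, and the function names are hypothetical.

```python
def rampup_weight(epoch, warmup_epochs=40):
    """Linear ramp from 0 to 1 over the warm-up phase (the ramp shape is an assumption)."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, max(0.0, epoch / warmup_epochs))


def total_loss(supervised, density_consistency, count_consistency, epoch,
               w_density=0.05, w_count=0.01, warmup_epochs=40):
    """Supervised loss plus the ramped, weighted consistency terms."""
    consistency = w_density * density_consistency + w_count * count_consistency
    return supervised + rampup_weight(epoch, warmup_epochs) * consistency


schedule = [rampup_weight(e) for e in (0, 20, 40, 100)]
loss_start = total_loss(1.0, 2.0, 3.0, epoch=0)
loss_after_warmup = total_loss(1.0, 2.0, 3.0, epoch=40)
```

At the start of training the unlabeled terms contribute nothing, so the student is driven by labeled supervision alone; after the warm-up they contribute at full weight.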

[0040] The teacher network parameters are updated using an exponential moving average of the student network parameters:

$$\theta_t \leftarrow \alpha\,\theta_t + (1-\alpha)\,\theta_s$$

where $\theta_t$ and $\theta_s$ denote the teacher network parameters and the student network parameters, respectively, and $\alpha$ is the exponential moving average decay coefficient.
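The exponential moving average update can be sketched without any deep learning framework by treating each network as a dictionary of named scalar parameters. The dictionary representation and the function name `ema_update` are simplifications for illustration; in practice the same rule would be applied tensor by tensor.

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher parameter toward the student parameter in place."""
    for name, student_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_value
    return teacher


# Toy parameters: the teacher drifts slowly toward the student.
teacher_params = {"w": 1.0, "b": 0.0}
student_params = {"w": 0.0, "b": 2.0}
ema_update(teacher_params, student_params, decay=0.9)
```

A decay close to 1 makes the teacher a slowly varying average of past student states, which is what makes its predictions a relatively stable target for the consistency terms.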

[0041] In this embodiment, the confidence region mask is adaptively generated based on the high quantile response regions within the sample of the teacher network prediction density map. For example, the quantile parameter q can be set to 0.90 to utilize only the high response regions in the teacher predictions for consistency learning, and the minimum effective mask ratio can be further set to 0.02 to avoid the unlabeled supervised regions being too sparse.
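One way to read this masking rule is sketched below: threshold the teacher density map at its per-sample quantile, and if too few pixels survive, fall back to the top-ranked pixels so that at least the minimum ratio is kept. The empirical quantile rule, the top-k fallback, and the function name `confidence_mask` are assumptions; the text specifies only the example values q = 0.90 and 0.02.

```python
import math


def confidence_mask(teacher_density, q=0.90, min_ratio=0.02):
    """Binary mask of high-response teacher pixels with a minimum coverage guarantee."""
    flat = [v for row in teacher_density for v in row]
    n = len(flat)
    ordered = sorted(flat)
    # Simple empirical quantile: the value at rank floor(q * n).
    threshold = ordered[min(n - 1, int(q * n))]
    mask = [[1 if v >= threshold else 0 for v in row] for row in teacher_density]
    min_pixels = math.ceil(min_ratio * n)
    if sum(sum(row) for row in mask) < min_pixels:
        # Fallback (assumption): keep the min_pixels highest responses instead.
        cutoff = ordered[n - min_pixels]
        mask = [[1 if v >= cutoff else 0 for v in row] for row in teacher_density]
    return mask


# 10x10 toy map whose values 0..99 are all distinct.
toy = [[float(10 * i + j) for j in range(10)] for i in range(10)]
mask_default = confidence_mask(toy, q=0.90, min_ratio=0.02)
mask_fallback = confidence_mask(toy, q=0.995, min_ratio=0.02)
```

With tied values the mask can cover more pixels than the nominal ratio, since every pixel equal to the threshold is kept.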

[0042] This example provides a feasible training configuration. The experimental environment can use the Ubuntu 20.04 operating system, PyTorch 1.11.0, CUDA 11.3, and Python 3.8. The hardware platform can include an RTX 4090D GPU, an Intel Xeon Platinum 8474C CPU, and 80 GB of memory. The model is trained for 200 epochs with a batch size of 8 and an initial learning rate of... The optimizer uses AdamW with weight decay of 1 / 2. In semi-supervised training, the teacher network is updated via EMA with a decay coefficient of 0.999. The weights of the density consistency term and the count consistency term can be 0.05 and 0.01, respectively, and the warm-up phase can be set to 40 epochs. The training images are first scaled proportionally and then randomly cropped to 768×768; the validation phase uses 768×768 sliding-window inference with a stride of 384.
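To make the sliding-window setting concrete, the sketch below computes the window start offsets along one image dimension for a 768-pixel window and a stride of 384. The handling of the last window (shifting it back so that it ends at the image border) and the function name `window_starts` are assumptions; the text does not state how the border or the overlapping predictions are handled.

```python
def window_starts(length, window=768, stride=384):
    """Start offsets of sliding windows covering one image dimension.

    If the image is no larger than the window, a single window at 0 is
    returned (padding would then be needed, which is not covered here).
    """
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        # Add a final window flush with the border so no pixels are skipped.
        starts.append(length - window)
    return starts


# Example: a 2000-pixel-wide image is covered by five overlapping windows.
print(window_starts(2000))
```

Because the stride is half the window size, interior pixels are covered by two windows per dimension, so an implementation would need some rule, such as averaging, to merge the overlapping density predictions.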

[0043] To highlight the application value of this embodiment in reducing annotation costs, the training set was set to annotation ratios of 10%, 30%, and 50%, respectively, and compared with representative semi-supervised counting methods. The comparison results are shown in Table 1.

[0044] Table 1. Comparison results of different models under different annotation ratios

As shown in Table 1, under a 10% annotation ratio, the model in this application achieved an MAE of 10.134 and an RMSE of 13.489. In comparison, the MAE and RMSE of the Dream method were 11.597 and 14.350, respectively; those of the Calibrating method were 14.616 and 19.995; and those of MTCP were 14.114 and 17.286. The model in this application therefore has the lowest MAE and RMSE of the four methods even at a low annotation ratio.

[0045] With a 30% annotation ratio, the model described in this invention achieves an MAE of 7.191 and an RMSE of 9.820; Dream scores are 9.915 and 14.329, Calibrating scores are 8.565 and 12.223, and MTCP scores are 8.010 and 10.417, respectively. It is evident that the proposed solution maintains optimal performance as the annotation ratio increases.

[0046] With a 50% annotation ratio, the model described in this invention achieves an MAE of 6.605 and an RMSE of 9.589; Dream, Calibrating, and MTCP still lag behind the proposed solution. These results demonstrate that this invention maintains a strong accuracy advantage even with an increased annotation ratio, while also showcasing the effectiveness of semi-supervised strategies in utilizing unlabeled data.

[0047] The trend shows that as the annotation ratio increases from 10% to 30%, the error decreases significantly; when it continues to increase from 30% to 50%, the error decreases further, but the gain weakens relatively. This indicates that the model in this embodiment can achieve high counting stability under medium to low annotation ratios, which has practical significance in reducing the cost of manual annotation.
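For reference, the two error metrics used in this comparison, mean absolute error (MAE) and root mean square error (RMSE), can be computed from per-image predicted and true counts as sketched below. The sample counts in the example are hypothetical and are not taken from the experiments.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and true counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean square error; penalizes large per-image errors more than MAE."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Hypothetical counts for three test images.
predicted_counts = [10, 20, 33]
true_counts = [12, 20, 30]
print(mae(predicted_counts, true_counts), rmse(predicted_counts, true_counts))
```

RMSE is never smaller than MAE on the same data, and a large gap between them indicates that a few images carry most of the error.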

[0048] Referring to Figure 6, in this embodiment a test sample with a true count of 135 is used as an example to visually compare the prediction results of each semi-supervised method under different annotation ratios.

[0049] With a 10% annotation ratio, the prediction counts of Dream, Calibrating, MTCP, and LBCNet are 107, 106, 107, and 113, respectively, with corresponding relative errors of 20.74%, 21.48%, 20.74%, and 16.30%, among which LBCNet is closest to the true value.

[0050] With a 30% annotation ratio, the prediction counts of the four models are 114, 115, 118 and 122, respectively, with corresponding relative errors of 15.56%, 14.81%, 12.59% and 9.63%. LBCNet still has the smallest bias.

[0051] With a 50% annotation ratio, the predicted counts of the four models are 115, 119, 124, and 128, respectively, with corresponding relative errors of 14.81%, 11.85%, 8.15%, and 5.19%. It can be seen that LBCNet provides a more concentrated and stable colony response region under different annotation ratios, and its counting error is closer to the true value.
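The relative errors quoted for this sample follow directly from the predicted counts and the true count of 135. The short sketch below reproduces the LBCNet figures; the helper name `relative_error_pct` is illustrative.

```python
TRUE_COUNT = 135

# LBCNet predicted counts for the example sample at each annotation ratio.
lbcnet_predictions = {"10%": 113, "30%": 122, "50%": 128}


def relative_error_pct(predicted, actual):
    """Absolute counting error as a percentage of the true count."""
    return round(abs(predicted - actual) / actual * 100, 2)


for ratio, count in lbcnet_predictions.items():
    print(ratio, relative_error_pct(count, TRUE_COUNT))
```

The same function applied to the counts of the other three methods gives the remaining percentages listed above.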

[0052] The results show that LBCNet not only outperforms the comparison models in terms of numerical metrics, but also demonstrates clearer colony region response, more adequate background noise suppression, and more complete coverage of local high-density regions in density heatmap visualization, making it suitable for practical visualization-assisted detection scenarios.

[0053] S103, The trained LBCNet is used to process the real-time acquired bee colony image to count the bees.

[0054] In this embodiment, the trained LBCNet was deployed on a self-developed portable handheld bee colony monitoring device for data collection and inference testing in a real beekeeping environment. During field testing, the device successfully collected samples from edge areas, the entire frame, and close-range bee colony aggregation areas, as well as dynamic images during actual inspections. This demonstrates the device's good portability and scene adaptability, and its ability to stably acquire images required for subsequent inference under natural lighting conditions.

[0055] Further, referring to Figure 7, (a) is the original input image, (b) is the predicted density heatmap, and (c) is the overlay image. The three are consistent with one another: the high-response areas correspond closely to the actual bee colony distribution, indicating that the trained LBCNet can effectively focus on the bee colony area and suppress background interference in a real field setting.

[0056] This verifies that the present embodiment possesses real-time counting capabilities and engineering feasibility in a real-world deployment scenario. The data storage module can also simultaneously save the detection time, image number, quantity results, and visualized images, facilitating subsequent verification and historical trend analysis.

[0057] It should be noted that, in addition to counting bees entering and leaving the hive and assessing bee density on the comb surface, the method provided in this embodiment can also be extended to scenarios such as monitoring bee colony activity, analyzing pollination operation status, tracking trends in bee colony size, and digital management of beekeeping.

[0058] For example, in a hive entrance monitoring scenario, multiple frames can be acquired at fixed time intervals, and the changes in the counts across frames can be analyzed statistically to characterize the activity level of bees entering and leaving the hive. In a comb surface monitoring scenario, the trend of bee colony coverage can be estimated by periodically photographing the same area. In a pollination operation scenario, the activity density of bee colonies within the pollination area can be recorded and compared by combining timestamps and location information.
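One way to turn counts from frames taken at a fixed interval into a simple activity measure is sketched below. The choice of statistic (mean absolute frame-to-frame change) and the function name `activity_summary` are assumptions made for illustration; the text does not specify which statistics are used.

```python
def activity_summary(counts):
    """Summarize bee counts from frames captured at a fixed time interval.

    Returns the frame-to-frame changes, their mean absolute value (used
    here as an assumed activity index), and the mean count.
    """
    if len(counts) < 2:
        raise ValueError("need at least two frames")
    changes = [later - earlier for earlier, later in zip(counts, counts[1:])]
    mean_abs_change = sum(abs(c) for c in changes) / len(changes)
    mean_count = sum(counts) / len(counts)
    return {
        "changes": changes,
        "mean_abs_change": mean_abs_change,
        "mean_count": mean_count,
    }


# Hypothetical counts from four consecutive hive-entrance frames.
summary = activity_summary([12, 15, 11, 18])
```

A higher mean absolute change suggests more traffic at the entrance between frames, while the mean count tracks how many bees are typically visible.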

[0059] For those skilled in the art, without departing from the core ideas of this application, the solutions of this application can be adapted to other dense, small-target agricultural scenarios, such as estimating the number of dense insects, larvae, floral organs, or similar point targets. All of the above variations fall under the category of scalable applications of this invention.

[0060] Furthermore, this embodiment also verifies the contribution of each key module in LBCNet to the performance improvement. The model containing only the MobileNetV3 backbone network and the lightweight regression head is taken as the baseline and compared with the model that adds only the multi-scale context encoding module MSCE, the model that adds only the attention-guided low-level feature fusion module AGLF, and the complete model that adds both. The results are shown in Table 2.

[0061] Table 2. Comparison of module performance in LBCNet

As shown in Table 2, the baseline Backbone achieved an MAE of 7.064 and an RMSE of 9.194 on the test set. After adding MSCE, the MAE and RMSE decreased to 6.455 and 8.761, respectively, representing reductions of 8.62% and 4.71% compared to the baseline. This indicates that multi-scale context modeling helps improve global representation capability and overall robustness.

[0062] After adding AGLF alone, the MAE and RMSE decreased to 5.922 and 7.501, respectively, reductions of 16.17% and 18.41% relative to the baseline, indicating that coordinate attention and low-level feature fusion bring larger gains in local detail recovery, background suppression, and dense region discrimination.

[0063] When MSCE and AGLF are enabled simultaneously, the complete model achieves the best results with an MAE of 5.201 and an RMSE of 6.989. Compared with the baseline, MAE and RMSE are reduced by 26.37% and 23.98%, respectively; compared with the MSCE-only scheme, they are further reduced by 19.43% and 20.23%; and compared with the AGLF-only scheme, they are still further reduced by 12.17% and 6.83%.
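The percentage reductions reported for the ablation can be checked from the MAE and RMSE values themselves. The sketch below does this for the complete model against each of the other three configurations; the helper name `reduction_pct` and the dictionary layout are illustrative.

```python
# (MAE, RMSE) on the test set for each configuration, as reported in Table 2.
results = {
    "baseline": (7.064, 9.194),
    "msce_only": (6.455, 8.761),
    "aglf_only": (5.922, 7.501),
    "complete": (5.201, 6.989),
}


def reduction_pct(reference, value):
    """Relative reduction of value with respect to reference, in percent."""
    return round((reference - value) / reference * 100, 2)


full_mae, full_rmse = results["complete"]
for name in ("baseline", "msce_only", "aglf_only"):
    ref_mae, ref_rmse = results[name]
    print(name, reduction_pct(ref_mae, full_mae), reduction_pct(ref_rmse, full_rmse))
```

Each reduction is computed relative to the configuration being compared against, which is why the three pairs of percentages use different denominators.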

[0064] This embodiment demonstrates that the MSCE module and the AGLF module are significantly complementary. The former focuses on enriching multi-scale contextual information, while the latter focuses on detail enhancement and target region strengthening. Combining the two can significantly improve counting accuracy and robustness.

[0065] Referring to Figure 8, the visualization results show that as the modules are introduced one by one, the response over the bee colony region changes from scattered to continuous, the target boundaries become clearer, and the high-response coverage of dense areas becomes more complete. The complete model performs best in both global structure preservation and local detail recovery.

[0066] Corresponding to the bee counting method based on lightweight density regression and semi-supervised learning provided in the above embodiments, this application also provides an embodiment of a bee counting device based on lightweight density regression and semi-supervised learning.

[0067] Referring to Figure 9 and Figure 10, the bee counting device based on lightweight density regression and semi-supervised learning provided in this embodiment includes an image acquisition device and a data processing device. Specifically, the image acquisition device includes a device housing 1, an interactive display screen 2 embedded in the device housing 1, and a handle 3 fixedly connected to the device housing 1. A camera 4 is fixedly installed on the back of the interactive display screen 2. A communication connection cable 5 is provided inside the handle 3. One end of the communication connection cable 5 is communicatively connected to the image acquisition device, and the other end is communicatively connected to the data processing device. In this embodiment, the data processing device is a Jetson processing unit 6.

[0068] The images acquired by the image acquisition device are transmitted to the data processing device in real time. The data processing device uses the bee counting method based on lightweight density regression and semi-supervised learning in the above embodiment to count the number of bees in the honeycomb frame area.

[0069] In this application embodiment, "at least one" refers to one or more, and "more than one" refers to two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent the existence of A alone, the simultaneous existence of A and B, or the existence of B alone. A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one of the following" and similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, and c can represent: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple.

[0070] The above description is merely a specific embodiment of this application. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the protection scope of this application. The protection scope of this application should be determined by the protection scope of the claims.

Claims

1. A bee counting method based on lightweight density regression and semi-supervised learning, characterized in that it comprises: preprocessing bee colony images collected from the hive entrance, the comb surface, or the bee colony gathering area, and dividing them into a training set, a validation set, and a test set; constructing a lightweight bee counting network model, LBCNet, training LBCNet with the training set, and validating and testing the model with the validation set and the test set; and processing bee colony images acquired in real time with the trained LBCNet to count the bees.

2. The bee counting method based on lightweight density regression and semi-supervised learning according to claim 1, characterized in that preprocessing the bee colony images collected from the hive entrance, the comb surface, or the bee colony gathering area and dividing them into a training set, a validation set, and a test set comprises: obtaining, by adjusting the shooting angle, distance, and background conditions, bee colony images in three typical scenarios of sparse, medium, and high density, covering different perspectives of the hive entrance, the comb surface, and the bee colony gathering area; mapping the manual point annotations of the acquired bee colony images to the image pixel coordinate system, correcting points with inconsistent scales, and removing out-of-bounds annotation points so that the annotations are strictly aligned with the image space; constructing a true density map from the normalized point annotations such that the integral of the entire density map equals the total number of annotation points; and dividing the preprocessed bee colony images into the training set, the validation set, and the test set according to a preset ratio for model training, parameter tuning, and model performance evaluation.

3. The bee counting method based on lightweight density regression and semi-supervised learning according to claim 1, characterized in that the LBCNet comprises a student network and a teacher network, and during model training the teacher network is updated via EMA; the teacher network and the student network have the same network structure, each including a lightweight backbone network (Backbone), a multi-scale context encoding module (MSCE), an attention-guided low-level feature fusion module (AGLF), and a density regression head (Head); the Backbone receives the input image and outputs low-level features and high-level features, the low-level features being used to preserve the edge, local texture, and fine-grained structural information of the bees, and the high-level features being used to characterize the semantic and contextual information of the bee colony distribution area; the high-level features are input into the MSCE and modeled in parallel through multiple branches with different receptive fields to output multi-scale contextual features; the AGLF enhances the multi-scale contextual features and fuses them with the low-level features to output fused features; and the Head outputs a single-channel non-negative density map based on the fused features, the single-channel non-negative density map being used for bee counting.

4. The bee counting method based on lightweight density regression and semi-supervised learning according to claim 3, characterized in that the Backbone is a lightweight convolutional network structure; the outputs of its shallower layers serve as the low-level features, which carry richer edge texture and local geometric information and are used to recover local responses in dense scenes with small objects; and the outputs of its deeper layers serve as the high-level features, which have stronger semantic abstraction capability and are used to encode the global distribution pattern of the bee colony.

5. The bee counting method based on lightweight density regression and semi-supervised learning according to claim 3 or 4, characterized in that the MSCE is composed of an atrous spatial pyramid pooling branch (ASPP) and a multi-scale dilated convolution branch (MSDC); the ASPP performs large-receptive-field context modeling on the high-level features, and the MSDC further enhances the representation of local dense regions and fine-grained geometric structures. The output of the ASPP is: [formula not reproduced in the source text], where the input is the high-level features after the backbone network and channel mapping, the ASPP branches use different dilation rates, the branch outputs are joined by a channel concatenation operation, and the remaining operators are convolution operations. The MSDC further enhances the colony geometry at different local scales: [formula not reproduced in the source text], where r1, r2, and r3 denote different dilation rates, Cat denotes feature concatenation, the fusion operator is a fusion convolution operation, and the output is the multi-scale contextual features.

6. The bee counting method based on lightweight density regression and semi-supervised learning according to claim 5, characterized in that enhancing the multi-scale contextual features using the AGLF and fusing them with the low-level features to output fused features comprises: first, using a coordinate attention mechanism to aggregate and encode the feature map along the height and width directions, generating the corresponding directional attention maps, the features enhanced by coordinate attention being $F_{ca} = \mathrm{CA}(F_{msce}) = F_{msce} \odot A_{h} \odot A_{w}$, where $F_{msce}$ is the output feature of the multi-scale context encoding module, and $A_{h}$ and $A_{w}$ are the attention maps generated along the height and width directions, respectively; then, applying channel projection and upsampling to the attention-enhanced high-level features $F_{ca}$ to obtain the high-level enhanced features $F_{high} = \mathrm{Up}(P_{h}(F_{ca}))$; at the same time, applying channel projection to the low-level features $F_{low}$ to obtain the projected low-level features $F'_{low} = P_{l}(F_{low})$; and finally, concatenating the high-level enhanced features $F_{high}$ and the projected low-level features $F'_{low}$ along the channel dimension and passing them through a feature refinement function to obtain the final fused features $F_{fuse} = R(\mathrm{Cat}(F_{high}, F'_{low}))$; where $P_{h}$ and $P_{l}$ are 1×1 convolution projection operations with batch normalization and an activation function, $\mathrm{Up}$ is upsampling, $R$ is a feature refinement function consisting of two layers of 3×3 convolutions, $\mathrm{CA}$ is the coordinate attention operation, $\mathrm{Cat}$ is concatenation along the channel dimension as used in the low-level feature fusion, and $F_{fuse}$ is the final fused feature.

7. The bee counting method based on lightweight density regression and semi-supervised learning according to claim 6, characterized in that the Head outputting a single-channel non-negative density map based on the fused features, the single-channel non-negative density map being used for bee counting, comprises: after the fused features $F_{fuse}$ are obtained, the Head outputs a single-channel non-negative density map based on the fused features: $\hat{D} = \phi(\mathrm{Conv}_{2}(\delta(\mathrm{Conv}_{1}(F_{fuse}))))$, where $\mathrm{Conv}_{1}$ and $\mathrm{Conv}_{2}$ are convolution operations, $\delta$ is a non-linear activation function, $\phi$ is the non-negativity-constrained activation function, and $\hat{D}$ is the predicted density map; then, the number of bees in the current image is obtained by integrating all pixel values of the density map: $\hat{C} = \sum_{i=1}^{H}\sum_{j=1}^{W}\hat{D}(i,j)$, where $\hat{D}(i,j)$ is the estimated value of the density map at location $(i,j)$, and $W$ and $H$ are the width and height of the density map, respectively.

8. The bee counting method based on lightweight density regression and semi-supervised learning according to claim 7, characterized in that, during the LBCNet training phase, a true density map is generated from the manually labeled points in the input image, and the model parameters are optimized using a combination of a density regression loss and a counting loss. The density regression loss $\mathcal{L}_{den}$ is defined as: [formula not reproduced in the source text]. The counting loss $\mathcal{L}_{cnt}$ is defined as: [formula not reproduced in the source text], where $B$ is the number of labeled samples in a batch; $\hat{D}_{b}$ is the predicted density map of the b-th sample; $D_{b}$ is the true density map of the b-th sample; $\hat{C}_{b}$ is the predicted count of the b-th sample; $C_{b}$ is the true count of the b-th sample; $w_{b}$ is the counting loss weight of the b-th sample; and $\epsilon$ is a constant that prevents the denominator from being zero. The total loss on manually annotated samples is defined as: $\mathcal{L}_{sup} = \mathcal{L}_{den} + \lambda_{cnt}\,\mathcal{L}_{cnt}$, where $\lambda_{cnt}$ is the counting loss weighting coefficient.

9. The bee counting method based on lightweight density regression and semi-supervised learning according to claim 8, characterized in that, during the LBCNet training phase, for an unlabeled sample the teacher network receives a weakly augmented image and the student network receives the corresponding strongly augmented image, producing the predicted density maps $\hat{D}^{t}_{u}$ and $\hat{D}^{s}_{u}$, respectively; a confidence region mask $M$ is constructed from the high-response regions of the teacher network's predicted density map, and the density consistency constraint $\mathcal{L}_{dc}$ is applied only within the high-confidence regions: [formula not reproduced in the source text], where $M_{u}(i,j)$ is the confidence region mask value of unlabeled sample $u$ at spatial location $(i,j)$, $\hat{D}^{s}_{u}(i,j)$ is the density value predicted by the student network for unlabeled sample $u$ at spatial location $(i,j)$, and $\hat{D}^{t}_{u}(i,j)$ is the density value predicted by the teacher network at the same location; a counting consistency constraint $\mathcal{L}_{cc}$ is constructed from the integral counting results of the teacher network and the student network: [formula not reproduced in the source text], where $\hat{C}^{s}_{u}$ and $\hat{C}^{t}_{u}$ are the predicted counts output by the student network and the teacher network, respectively; the consistency loss of unlabeled samples can be expressed as $\mathcal{L}_{u} = \lambda_{dc}\,\mathcal{L}_{dc} + \lambda_{cc}\,\mathcal{L}_{cc}$; the total loss can be expressed as $\mathcal{L} = \mathcal{L}_{sup} + \lambda(e)\,\mathcal{L}_{u}$, where the weight function $\lambda(e)$ gradually increases with the training epoch $e$; and the teacher network parameters are updated as an exponential moving average of the student network parameters: $\theta_{t} \leftarrow \alpha\,\theta_{t} + (1-\alpha)\,\theta_{s}$, where $\theta_{t}$ and $\theta_{s}$ are the teacher network parameters and the student network parameters, respectively, and $\alpha$ is the exponential moving average decay coefficient.

10. A bee counting device based on lightweight density regression and semi-supervised learning, characterized in that it comprises an image acquisition device and a data processing device; the image acquisition device is embedded with a display and interaction module and is communicatively connected to the data processing device; the image acquisition device is used to acquire images of the hive entrance, the comb surface, and the bee colony gathering area; and after the images are transmitted to the data processing device, the data processing device executes the method described in any one of claims 1-9 to count the bees.