A deep learning-based multi-object detection method for dairy cows

By improving the YOLO11 model and introducing a high-resolution P2 detection branch, a receptive field adaptive convolution module RFAConv, and a large kernel separable attention module LSKA, the problems of multi-scale target detection and complex background interference in dairy cow detection are solved, achieving efficient and accurate multi-target detection, which is suitable for intelligent animal husbandry applications.

CN122244905A · Pending · Publication Date: 2026-06-19 · XINJIANG UNIVERSITY


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: XINJIANG UNIVERSITY
Filing Date: 2026-03-21
Publication Date: 2026-06-19

Patent Timeline

21 Mar 2026: Application
19 Jun 2026: Publication (CN122244905A)

IPC: G06V40/10; G06V10/25; G06V10/44; G06V10/80; G06V10/764; G06V10/82; G06V10/70; G06V10/42; G06V10/766; G06N3/0464; G06N3/045

AI Tagging

Application Domain

Biological models; Biometric pattern recognition




AI Technical Summary

Technical Problem

Existing technologies for dairy cow detection face challenges such as multi-scale target detection, complex background interference and occlusion, balancing model efficiency and accuracy, limitations of static receptive fields, and lack of domain adaptability optimization. These issues result in high false negative rates for calves, low positioning accuracy for adult cattle, and high computational costs, making it difficult to achieve efficient and accurate multi-target detection in complex natural pasture environments.

Method used

By introducing a high-resolution P2 detection branch, an adaptive receptive field convolution module RFAConv, and a large kernel separable attention module LSKA, the YOLO11 model is improved, enhancing the recall rate for extremely small targets and the localization accuracy for large targets, thereby improving the robustness and efficiency of the model in complex backgrounds.

Benefits of technology

It significantly improves the detection rate of extremely small targets and the localization accuracy of large targets, enhances the model's detection performance in complex scenarios, reduces false detection and false negative rates, and is suitable for deployment on resource-constrained edge devices on farms.

✦ Generated by Eureka AI based on patent content.

[Figure: CN122244905A_ABST]

Abstract

This invention discloses a deep learning-based method for multi-target detection of dairy cows, comprising: acquiring images of dairy cows to be detected; inputting the images into an improved YOLO11 model, which introduces a high-resolution P2 detection branch at the neck of the feature pyramid, fusing shallow feature maps downsampled by 4 times from the backbone network with deep semantic features to generate a high-resolution feature map for detecting extremely small targets; introducing a receptive field adaptive convolution module, which dynamically generates spatial attention weights and feature bases through parallel sub-networks to achieve adaptive receptive field adjustment; integrating an LSKA module before the P5 detection layer, using large-size depthwise separable convolutions to capture global context, and performing feature recalibration through channel attention; and finally outputting the detection results of multi-scale dairy cow targets. This invention, through multi-module collaborative optimization, significantly improves the detection accuracy and robustness of multi-scale dairy cows in complex natural scenes, reduces the false negative rate, and can be applied to automated monitoring and counting in intelligent ranches.


Description

Technical Field

[0001] This invention belongs to the technical field of image recognition and analysis, specifically relating to a deep learning-based method for multi-target detection of dairy cows.

Background Technology

[0002] In current large-scale dairy farming practices, rapid and accurate target detection of both the entire herd and individual cows is fundamental for assessing herd health, conducting feeding management, and optimizing farm utilization efficiency. However, traditional methods for counting and monitoring dairy cows largely rely on manual on-site inspections or contact sensors such as RFID. These methods are labor-intensive, difficult to achieve full-coverage real-time monitoring of large-scale farms, and prone to data loss or misinterpretation in complex scenarios such as dynamic grazing, herd gathering, and obstruction. Therefore, achieving contactless, automated, and high-precision target detection of individual dairy cows has become one of the core challenges in the field of smart farming.

[0003] In the relevant technical field, general object detection technologies based on deep learning have become increasingly mature. The YOLOv8 and YOLO11 models, continuously developed by the Ultralytics team, have achieved a good balance between speed and accuracy through the introduction of C2f modules and improved feature pyramid networks, becoming baseline models for many vision applications. However, when directly applied to complex natural pasture environments, they still face the following challenges:

[0004] 1) Multi-scale target detection problem: General detection models have uneven perception capabilities for dairy cow targets with large scale differences in natural scenes (such as adult cows in the foreground and calves in the background), especially for calves with extremely small pixel proportions, which have serious false negative problems.

[0005] 2) Complex background interference and occlusion problem: In the dense herds, complex vegetation, and dynamic lighting of real farms, dairy cow targets are easily partially occluded or confused with the background, which reduces the model's feature discrimination and leads to false detections and localization drift.

[0006] 3) The problem of balancing model efficiency and accuracy: Many improvement schemes that increase network depth or complex attention modules to improve performance have high computational overhead and slow inference speed, making them difficult to deploy in edge devices of farms that require real-time monitoring.

[0007] 4) Limitations of static receptive field: Existing models mostly use convolutional kernels with fixed size and weights, which are difficult to adapt to the changing postures of cows, different parts and complex spatial relationships with the background, resulting in insufficient flexibility in feature extraction.

[0008] 5) Global and local context fusion problem: The model must use large-scale scene context to locate the overall outline of adult dairy cows while also focusing on local details to identify calves or occluded parts. These two needs conflict, and the feature representation ability of a single scale is limited.

[0009] 6) Lack of domain-adaptive optimization: Directly applying the general object detection model to the dairy cow detection task lacks targeted structural optimization for the characteristics of similar target textures, regular shapes, and dense groups in the livestock industry scenario, thus limiting the performance ceiling.

[0010] Therefore, a technical problem that urgently needs to be solved in this field is how to systematically optimize the domain adaptability of multi-scale dairy cow detection so that, while the efficiency of the model is maintained, the recall rate of very small targets (calves) and the localization accuracy of large targets (adult cows) are improved and detection robustness under occlusion and complex backgrounds is enhanced.

Summary of the Invention

[0011] This invention aims to overcome the shortcomings of existing technologies and proposes a deep learning-based multi-target detection method for dairy cows. By introducing a collaborative improvement scheme of a high-resolution P2 detection branch, an adaptive receptive field convolution module RFAConv, and a large kernel separable attention module LSKA, it solves the technical problems in existing technologies, such as high false negative rates for extremely small targets like calves due to the significant differences in the scale of dairy cow targets, the difficulty of adapting static receptive fields to pose changes and occlusion scenarios, and insufficient global context modeling capabilities with high computational overhead. It achieves high-precision and robust detection of multi-scale dairy cow targets in complex natural pasture environments, significantly improving the recall rate of extremely small targets and the localization accuracy of large targets, while maintaining the efficiency of the model and the practicality of end-to-end deployment.

[0012] This invention proposes a deep learning-based multi-target detection method for dairy cows, comprising the following steps:

[0013] S01 Acquire the image of the cow to be detected;

[0014] S02 The cow image is input into the improved YOLO11 model, which is improved while maintaining the original backbone network as follows:

[0015] S0201 introduces a high-resolution P2 detection branch into the neck of the feature pyramid of the YOLO11 model. The P2 detection branch takes a shallow feature map with a downsampling factor of 4 in the backbone network as input and fuses it with high semantic features from the deep layer through an upsampling operation to generate a high-resolution feature map for detecting extremely small cow targets.

[0016] S0202 replaces the standard convolutional module in the downsampling path of the YOLO11 model with a receptive field adaptive convolutional module. The receptive field adaptive convolutional module dynamically generates spatial attention weights and feature bases from the input features through two parallel lightweight sub-networks. Based on the local context of the input features, it dynamically adjusts the effective receptive field and feature aggregation method at each spatial location.

[0017] S0203 integrates a large kernel separable attention module LSKA before the P5 detection layer of the YOLO11 model. The LSKA module uses large-size depthwise separable convolution to capture long-range spatial dependencies in order to obtain global context information covering large-size cow targets. The importance of the extracted global features is recalibrated through the channel attention submodule.

[0018] S03 Obtain the multi-target detection results of cows output by the improved YOLO11 model. The detection results include the bounding box position, confidence score, and class label of each cow target.

[0019] Preferably, the construction process of the P2 detection branch includes: taking as input the feature map output from the second layer of the backbone network, whose resolution is 1/4 of the original input image; concatenating this feature map along the channel dimension with the corresponding-level features from the neck of the feature pyramid that have undergone downsampling and fusion, to generate a fused high-resolution feature map; and connecting the fused map independently to a dedicated detection head, which performs bounding box regression and classification tasks.

[0020] Preferably, the forward process of the receptive field adaptive convolution module includes:

[0021] S020201 generates a spatial attention weight map through average pooling layers and point convolutional layers;

[0022] S020202 generates feature bases through depthwise separable convolutional layers;

[0023] S020203 performs element-wise multiplication of the spatial attention weight map with the feature base to achieve adaptive weighting;

[0024] S020204 performs rearrangement and standard convolution operations on the weighted features from step S020203 to complete feature integration and output.

[0025] Preferably, the LSKA module captures long-range spatial dependencies by decomposing standard two-dimensional large-size depthwise convolution kernels into cascaded one-dimensional convolution kernels.

[0026] Preferably, the forward process of the LSKA module includes:

[0027] S020301 performs horizontal depthwise convolution with a kernel size of 1×k and vertical depthwise convolution with a kernel size of k×1 on the input feature map in sequence to obtain a feature map that aggregates information from the k×k region.

[0028] S020302 generates a spatial attention map using a 1×1 convolutional layer and a sigmoid activation function;

[0029] S020303 multiplies the spatial attention map with the original input features to achieve adaptive feature recalibration.

[0030] Preferably, the feature pyramid of the improved YOLO11 model is expanded from the traditional P3-P5 to cover the P2-P5 scale spectrum with downsampling of 4x to 32x.

[0031] Preferably, in step S03, the multi-scale prediction results are subjected to weighted fusion and non-maximum suppression post-processing to generate the final unified dairy cow target detection result.

[0032] The technical solution of the present invention has at least the following technical effects:

[0033] 1. By introducing a high-resolution P2 detection branch, this invention expands the feature pyramid from the traditional P3-P5 to a P2-P5 scale spectrum covering 4 to 32 times downsampling, enhancing the utilization of shallow detail features and thus significantly improving the detection rate of extremely small dairy cow targets such as calves. This effectively alleviates the multi-scale detection problem and enhances the model's universal monitoring capability for dairy cows of different ages in the pasture.

[0034] 2. This invention employs an adaptive receptive field convolution module, which enables the convolution kernel weights to be dynamically generated based on the input content. This allows the network to adaptively adjust its receptive field and autonomously adjust the spatial focusing area of the convolution kernel according to local contexts such as cow pose changes and mutual occlusion. This improves the robustness and discriminative power of the model in feature extraction under occlusion, deformation, and complex backgrounds.

[0035] 3. This invention integrates a large-kernel separable attention module, which efficiently captures long-range spatial dependencies and global contextual information by decomposing large-size convolutional kernels into cascaded one-dimensional convolutions. This avoids the computational burden caused by large convolutional kernels and enhances the contour perception and localization accuracy of large targets such as adult dairy cows.

[0036] 4. This invention integrates the P2 high-resolution branch, the RFAConv module, and the LSKA module within a unified YOLO11 framework, achieving multi-level, multi-scale feature enhancement from adaptive extraction of local details to global context awareness. This collectively improves the model's overall detection performance for cow targets in complex natural scenes. Experiments show that the synergistic improvement of these three modules results in an mAP50 of 0.893, a 5.9 percentage point improvement over the baseline; and an mAP50-95 of 0.642, a 6.1 percentage point improvement over the baseline.

[0037] 5. Because both the RFAConv and LSKA modules of this invention are designed with computational efficiency in mind, the former achieves dynamic convolution through a lightweight parallel structure, while the latter uses depthwise separable convolution to decompose large kernel operations. This significantly improves detection accuracy while better controlling the overall computational complexity and inference latency of the model, which is beneficial for practical deployment on resource-constrained edge devices in livestock farms.

[0038] 6. This invention improves the model's feature discrimination capability in complex scenes through architectural improvements, thereby effectively reducing false detections and false negatives caused by factors such as cow aggregation, vegetation and fence obstruction, and changes in lighting, and improving the stability and reliability of the detection results.

Attached Figure Description

[0039] Figure 1 is a flowchart of the multi-target detection method for dairy cows proposed in the embodiments of this application;

[0040] Figure 2 is a schematic diagram of the improved YOLO11 model and of the P2 detection layer structure in Embodiment 1 of this application;

[0041] Figure 3 is a schematic diagram of the RFAConv module structure in Embodiment 2 of this application;

[0042] Figure 4 is a schematic diagram of the large-kernel convolution decomposition in Embodiment 2 of this application;

[0043] Figure 5 is a schematic diagram of the LSKA module structure in Embodiment 3 of this application;

[0044] Figure 6 shows the multi-object detection performance of the model proposed in this embodiment on a self-built dairy cow dataset;

[0045] Figure 7 is a graph showing the validation results of the model proposed in this embodiment on a self-built dairy cow dataset.

Detailed Implementation

[0046] This application proposes a technical solution for a multi-target detection method for dairy cows based on deep learning, which generally includes the following steps:

[0047] (1) Obtain images containing dairy cows to be detected. Read the original video or image data collected from natural farms, perform size normalization processing, and use data augmentation techniques such as random cropping and color jitter to expand the dataset and improve the model's generalization ability under complex lighting and background.

[0048] The images can be visible light images captured in real time by devices such as fixed cameras or drones set up in the ranch, or they can be pre-stored images or video frames.

[0049] (2) The preprocessed image is input into the improved YOLO11 network, which, while maintaining the original backbone network, has undergone three collaborative improvements: ① A high-resolution P2 detection branch is added to the neck of the feature pyramid to retain more detailed information for the detection of extremely small targets. ② The standard convolutional modules on the three key downsampling paths within the feature pyramid are all replaced with receptive field adaptive convolution (RFAConv) modules. ③ A large kernel separable attention (LSKA) module is integrated before the final P5 detection layer; the LSKA module captures long-range spatial dependencies through large-size depthwise separable convolution and performs feature calibration through a channel attention mechanism, enhancing the perception of the overall target contour and contextual information. Through the above improvements, the model can adaptively extract multi-scale features, enhancing the detection capability of cow targets in complex scenes. The improved network contains four detection heads of different scales: P2, P3, P4, and P5, which process feature maps of the corresponding scales respectively and independently perform bounding box regression and target classification tasks.

[0050] (3) Obtain the multi-target detection results of dairy cows output by the improved YOLO11 model. The multi-scale prediction results output by the four detection heads are weighted and fused and post-processed with non-maximum suppression (NMS) to generate the final unified dairy cow target detection results. The detection results include the bounding box position, confidence score and class label (at least including adult dairy cows and calves) of each detected dairy cow target, which can be directly applied to automated ranch monitoring, dairy cow number statistics and behavior analysis systems.
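The non-maximum suppression step mentioned above can be sketched as follows. This is a minimal greedy, per-class NMS in plain Python, not the patent's implementation: the box format `(x1, y1, x2, y2)`, the IoU threshold of 0.5, and the sample detections are assumptions for illustration, and the "weighted fusion" of the four heads is not specified in the source, so it is not modelled here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5):
    """Greedy per-class NMS. Each detection is (box, score, label)."""
    kept = []
    # Visit detections from highest to lowest confidence.
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        box, _, label = det
        # Keep the box unless it overlaps a kept box of the same class too much.
        if all(label != k[2] or iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept

# Hypothetical pooled predictions from the P2-P5 heads.
dets = [
    ((10, 10, 110, 110), 0.90, "cow"),
    ((12, 12, 112, 112), 0.80, "cow"),    # near-duplicate of the first box
    ((200, 200, 220, 220), 0.70, "calf"),
]
for box, score, label in nms(dets):
    print(label, score, box)
```

Each surviving tuple carries the three fields the text lists for a detection result: bounding box position, confidence score, and class label.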

[0051] To better understand the above technical solutions, the following will provide a detailed explanation of the technical solutions in conjunction with the accompanying drawings and specific implementation methods.

[0052] Example 1

[0053] This first embodiment adds a high-resolution P2 detection branch to the neck of the feature pyramid, aiming to expand the model's scale perception range and improve its ability to detect extremely small targets (such as calves). The core of the P2 detection layer is the construction of a new detection branch that specifically processes high spatial resolution features to capture more detailed information to deal with extremely small targets such as calves.

[0054] The P2 detection layer first takes as input the feature map output from the second layer of the backbone network (downsampled by 4 times), which has shape B×C×H×W, where B is the batch size, C is the number of channels, and H and W are the spatial dimensions at 1/4 of the original input image resolution. This feature map is rich in spatial details and edge texture information, but its semantic level is relatively shallow.

[0055] Subsequently, this feature map undergoes an upsampling operation, followed by channel concatenation with corresponding level features from the neck of the feature pyramid that have undergone downsampling and fusion. This cascaded concatenation serves as the feature map output by layer P2. Unlike traditional designs that use only P3 as the minimum detection layer, the introduction of layer P2 avoids prematurely discarding high-resolution detail information.
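As a rough illustration of this fusion, the sketch below brings a deeper, lower-resolution map to the stride-4 resolution and concatenates it with the shallow map along the channel axis. It assumes, following step S0201, that the deeper neck feature is the one being upsampled. The nearest-neighbour interpolation, the tensor sizes, and the helper names `upsample2x`, `concat_channels`, and `shape` are illustrative assumptions, not details given in the source.

```python
# Feature maps are nested lists shaped [channels][height][width].

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a [C][H][W] feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column twice
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row twice
        out.append(rows)
    return out

def concat_channels(a, b):
    """Concatenate two [C][H][W] maps along the channel axis."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0]), \
        "spatial sizes must match before concatenation"
    return a + b

def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))

# Hypothetical sizes for a 32x32 input: the stride-4 backbone map is 8x8,
# the stride-8 neck map is 4x4.
shallow_p2 = [[[1.0] * 8 for _ in range(8)] for _ in range(2)]  # 2 channels
deep_p3 = [[[0.5] * 4 for _ in range(4)] for _ in range(3)]     # 3 channels

fused_p2 = concat_channels(upsample2x(deep_p3), shallow_p2)
print(shape(fused_p2))  # (5, 8, 8)
```

The fused map keeps the 8×8 resolution of the shallow input while carrying channels from both sources, which is what the dedicated P2 head would then consume.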

[0056] The key to this mechanism lies in its construction of a parallel, specialized high-resolution detection path. By introducing this path, the model's feature pyramid expands from the traditional P3-P5 to P2-P5, forming a more complete scale spectrum covering 4x to 32x downsampling. This design not only significantly enhances the ability to represent tiny pixel regions in images but also ensures that the model maintains optimal detection sensitivity at different target scales, especially for extremely small targets with pixel areas less than 32x32.
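To make the scale spectrum concrete, the following sketch computes the grid size of each pyramid level and how many grid cells a 32×32-pixel object spans at each stride. The 640×640 input size is an assumption chosen for illustration; the source does not state the input resolution.

```python
def pyramid_grid_sizes(image_size, strides=(4, 8, 16, 32)):
    """Map each pyramid level (P2..P5) to its feature-map side length."""
    return {f"P{i + 2}": image_size // stride for i, stride in enumerate(strides)}

def cells_spanned(object_side_px, stride):
    """Number of grid cells along one side that an object of this size spans."""
    return object_side_px / stride

print(pyramid_grid_sizes(640))
# {'P2': 160, 'P3': 80, 'P4': 40, 'P5': 20}

for level, stride in zip(("P2", "P3", "P4", "P5"), (4, 8, 16, 32)):
    print(level, cells_spanned(32, stride))
# P2 8.0
# P3 4.0
# P4 2.0
# P5 1.0
```

Under these assumptions a 32×32 target covers an 8×8 patch of cells at P2 but only a single cell at P5, which is the intuition behind adding the stride-4 branch for calves.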

[0057] The resulting high-resolution feature map is then independently fed into a dedicated detection head. This detection head has the same structure as other scale detection heads and performs bounding box regression and classification tasks in parallel. The P2 detection layer is integrated between the neck and head of the YOLO11 model, significantly enhancing the model's ability to retain and utilize detailed information with only a slight increase in computational overhead, thereby improving the detection rate of very small dairy cows such as calves.

[0058] By introducing a high-resolution P2 detection layer into the feature pyramid, a systematic utilization of shallow, detailed features is achieved. This improvement expands the granularity of the model's multi-scale detection, enabling it to specifically handle and optimize the detection of extremely small targets. This mechanism significantly enhances the model's ability to perceive tiny targets at the pixel level, providing a crucial feature foundation for locating calves in dense, distant scenes, thereby improving the network's overall coverage and recall of multi-scale cow targets.

[0059] Example 2

[0060] This second embodiment replaces the standard convolutional modules on the three key downsampling paths within the feature pyramid with receptive field adaptive convolutional (RFAConv) modules. This aims to enhance the model's adaptive modeling ability to complex spatial contexts and address the problem of traditional convolutional kernels having fixed weights and difficulty adapting to changes in input features. The core of this second embodiment lies in using two parallel lightweight sub-networks to dynamically generate spatial attention weights and feature bases from the input features, achieving content-adaptive feature weighting and aggregation, simulating flexible visual receptive field adjustment.

[0061] Given input features of shape B×C×H×W, where B is the batch size, C is the number of channels, and H and W are the spatial dimensions, the forward process that produces the module output is as follows:

[0062] (i) Generation of dynamic spatial attention weights

[0063] Average pooling is applied to the input feature map to capture local contextual information, followed by a pointwise convolution that generates, for each spatial location, weights corresponding to the sub-locations within a K×K receptive field, where K=3, i.e., the convolution kernel size is 3×3. Softmax normalization is applied to the generated raw weights along the K×K sub-location dimension to obtain the spatial attention weights. This ensures that the weights at each position sum to 1, thus forming an adaptive attention distribution.

[0064] (ii) Parallel feature basis generation

[0065] Meanwhile, spatial features are extracted from the input features through a depthwise convolution, followed by a pointwise convolution for channel transformation and dimensionality expansion, outputting the feature basis, whose dimensions are aligned with those of the weights.

[0066] (iii) Adaptive feature weighting and aggregation

[0067] The spatial attention weights are multiplied element-wise with the feature basis to achieve content-adaptive feature selection and fusion.

[0068] (iv) Feature fusion and output

[0069] The K×K weighted feature blocks at each position in the result are rearranged to obtain a reconstructed feature map. Finally, a standard convolutional layer performs cross-channel fusion and spatial downsampling on the reconstructed feature map to output the final features.

[0070] The RFAConv module is integrated into the three key downsampling paths of the YOLO11 model feature pyramid, replacing the original standard convolution. Through a dynamic weight mechanism, the network can autonomously adjust the spatial focusing region of the convolution kernel according to the local context of the input content and dynamically allocate spatial attention weights. When faced with pose changes, occlusion, and background interference of cow targets, it significantly enhances the robustness and discriminative power of feature extraction, without introducing too many parameters, thus maintaining the efficiency of the model.
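The four stages of the RFAConv forward pass can be sketched on a single channel in plain Python. This is a simplified illustration under stated assumptions, not the patent's implementation: it works on one channel, uses stride 1 (no downsampling), builds the feature basis from K·K small kernels rather than the depthwise-plus-pointwise pair described above, and all parameter names (`weight_proj`, `basis_kernels`, `out_kernel`) and values are hypothetical.

```python
import math

K = 3  # receptive-field size stated in the text (3x3)

def conv_same(x, kernel):
    """KxK cross-correlation of a 2-D list, stride 1, zero padding."""
    k, h, w = len(kernel), len(x), len(x[0])
    p = k // 2
    def at(r, c):
        return x[r][c] if 0 <= r < h and 0 <= c < w else 0.0
    return [[sum(kernel[i][j] * at(r + i - p, c + j - p)
                 for i in range(k) for j in range(k))
             for c in range(w)] for r in range(h)]

def rfaconv(x, weight_proj, basis_kernels, out_kernel):
    """Single-channel RFAConv-style forward pass (stride 1)."""
    h, w = len(x), len(x[0])
    # (i) average pooling gives local context for the attention branch
    pooled = conv_same(x, [[1.0 / (K * K)] * K for _ in range(K)])
    # (ii) feature basis: one map per slot of the KxK receptive field
    basis = [conv_same(x, kern) for kern in basis_kernels]
    unfolded = [[0.0] * (w * K) for _ in range(h * K)]
    for r in range(h):
        for c in range(w):
            # pointwise projection to K*K logits, then softmax over the slots
            logits = [p * pooled[r][c] for p in weight_proj]
            top = max(logits)
            exps = [math.exp(v - top) for v in logits]
            total = sum(exps)
            for n in range(K * K):
                # (iii) weight the basis, (iv) place it in this position's KxK block
                unfolded[r * K + n // K][c * K + n % K] = exps[n] / total * basis[n][r][c]
    # (iv) a KxK convolution with stride K fuses each block into one output value
    return [[sum(out_kernel[i][j] * unfolded[r * K + i][c * K + j]
                 for i in range(K) for j in range(K))
             for c in range(w)] for r in range(h)]

# Sanity check: zero projection weights give uniform attention (1/9 per slot),
# and one-hot basis kernels simply copy the 3x3 neighbourhood, so the module
# reduces to an ordinary 3x3 convolution scaled by 1/9.
x = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
one_hot = [[[1.0 if i * K + j == n else 0.0 for j in range(K)] for i in range(K)]
           for n in range(K * K)]
ones = [[1.0] * K for _ in range(K)]
y = rfaconv(x, [0.0] * (K * K), one_hot, ones)
reference = conv_same(x, [[1.0 / (K * K)] * K for _ in range(K)])
print(all(abs(a - b) < 1e-9 for ra, rb in zip(y, reference) for a, b in zip(ra, rb)))
```

With non-zero `weight_proj` values the softmax weights differ from position to position, which is the content-dependent behaviour the text attributes to the module.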

[0071] Example 3

[0072] Please refer to Figures 4-5. In this third embodiment, a large kernel separable attention (LSKA) module is integrated before the final P5 detection layer. This aims to improve the model's ability to model global context information while avoiding the sharp increase in computational complexity caused by large convolutional kernels.

[0073] The core of LSKA lies in decomposing the standard two-dimensional large-size depthwise convolution kernel (such as k×k) into two cascaded one-dimensional convolution kernels (1×k and k×1), thereby achieving a better balance between performance and efficiency.

[0074] The LSKA module is designed and built upon the standard large kernel attention module. Given an input feature map $X$, its core forward process is as follows:

[0075] (i) Separable depthwise convolution decomposition. This is key to LSKA's reduction of computational complexity. For a k×k depthwise convolution, LSKA equivalently replaces it with two consecutive, separable one-dimensional depthwise convolution operations:

[0076] Horizontal convolution: Apply a depthwise convolution (DepthwiseConv2d) with a kernel size of 1×k to aggregate features in the horizontal direction.

[0077] $Z_h = \mathrm{DWConv}_{1 \times k}(X)$

[0078] Vertical convolution: Apply a depthwise convolution with a kernel size of k×1 to the above result to aggregate features in the vertical direction.

[0079] $Z = \mathrm{DWConv}_{k \times 1}(Z_h)$

[0080] After the above cascaded operations, the output feature map $Z$ is obtained. Each location aggregates information from its surrounding k×k region, achieving the same effective receptive field as a standard large kernel convolution, but with the per-position cost of each channel reduced from $O(k^2)$ to $O(k)$.
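A small numerical check can illustrate the decomposition. The sketch below applies a 1×k pass followed by a k×1 pass and compares the result with a single k×k pass whose kernel is the outer product of the two 1-D kernels. The cascade reproduces a k×k kernel exactly only in that outer-product case; for a general k×k kernel the two share the receptive field but not the weights. Kernel values and the 8×8 input are hypothetical.

```python
def conv2d_same(x, kernel):
    """2-D cross-correlation with zero padding; kernel is kh x kw (odd sizes)."""
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * x[rr][cc]
            out[r][c] = acc
    return out

row = [1.0, 2.0, 3.0, 2.0, 1.0]   # 1 x k kernel, k = 5
col = [0.5, 1.0, 2.0, 1.0, 0.5]   # k x 1 kernel
x = [[float((3 * r + 5 * c) % 7) for c in range(8)] for r in range(8)]

horizontal = conv2d_same(x, [row])                       # 1 x k pass
cascaded = conv2d_same(horizontal, [[v] for v in col])   # k x 1 pass
outer = [[cv * rv for rv in row] for cv in col]          # equivalent k x k kernel
full = conv2d_same(x, outer)

max_diff = max(abs(a - b) for ra, rb in zip(cascaded, full) for a, b in zip(ra, rb))
k = len(row)
print(max_diff < 1e-9, k * k, 2 * k)  # True 25 10
```

The last two numbers are the multiplications per output position and channel: k² for the full kernel against 2k for the cascade, which is the linear-versus-quadratic saving the paragraph describes.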

[0081] (ii) Compatibility with dilated convolutions. To model longer-range dependencies, LSKA applies the same decomposition to the dilated depthwise convolution component. For a large kernel with a dilation rate d and an equivalent receptive field of k′×k′, LSKA decomposes it into a cascade of a 1×k′ and a k′×1 dilated depthwise convolution, which significantly improves computational efficiency while maintaining the advantage of a large receptive field.
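The reach of a dilated one-dimensional kernel follows from standard convolution arithmetic: a kernel with n taps and dilation d spans d·(n−1)+1 pixels. The short sketch below computes that span and the sampled offsets; the example values (7 taps, dilation 3) are hypothetical and are not taken from the source.

```python
def effective_extent(kernel_size, dilation):
    """Span in pixels covered by a 1-D kernel with the given dilation."""
    return dilation * (kernel_size - 1) + 1

def dilated_taps(kernel_size, dilation):
    """Offsets, relative to the centre, that a dilated 1-D kernel samples."""
    half = (kernel_size - 1) // 2
    return [dilation * (i - half) for i in range(kernel_size)]

print(effective_extent(7, 3))  # 19
print(dilated_taps(7, 3))      # [-9, -6, -3, 0, 3, 6, 9]
```

So seven weights per direction are enough to reach across a 19-pixel span, which is how dilation extends the receptive field without adding parameters.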

[0082] (iii) Attention map generation and feature recalibration. The features $Z$ processed by the separable convolutions are passed through a 1×1 convolutional layer to generate a spatial attention map $A$:

[0083] $A = \sigma(\mathrm{Conv}_{1 \times 1}(Z))$

[0084] Here, σ represents the Sigmoid activation function. Finally, this attention map is multiplied element-wise with the original input features $X$ to achieve adaptive feature recalibration:

[0085] $Y = A \otimes X$

[0086] The LSKA module is integrated before the final P5 detection layer of the improved YOLO11 model. Through efficient large-kernel operations, it enables the network to capture global contextual information covering large targets such as adult cows at a lower computational cost, thereby enhancing the model's ability to understand the overall outline of the target and its relationship to the scene.
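Putting the stages of this embodiment together, a minimal plain-Python sketch of the LSKA forward pass might look as follows. It follows the three steps described for the module (cascaded 1×k and k×1 depthwise convolutions, a 1×1 convolution with sigmoid, and multiplication with the input) and omits the dilated component. The parameter names `h_taps`, `v_taps`, `pointwise`, and `bias`, and all numeric values, are illustrative assumptions.

```python
import math

def conv1d_rows(x, taps):
    """Horizontal (1 x k) pass over a 2-D map with zero padding."""
    half, w = len(taps) // 2, len(x[0])
    return [[sum(t * row[c + j - half] for j, t in enumerate(taps)
                 if 0 <= c + j - half < w)
             for c in range(w)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

def conv1d_cols(x, taps):
    """Vertical (k x 1) pass, done as a horizontal pass on the transpose."""
    return transpose(conv1d_rows(transpose(x), taps))

def lska(x, h_taps, v_taps, pointwise, bias):
    """x: [C][H][W]; h_taps, v_taps: one 1-D kernel per channel (depthwise);
    pointwise: C x C matrix of the 1x1 convolution; bias: length-C list."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    # Step 1: cascaded depthwise 1 x k and k x 1 convolutions, channel by channel.
    z = [conv1d_cols(conv1d_rows(x[ch], h_taps[ch]), v_taps[ch])
         for ch in range(channels)]
    out = []
    for o in range(channels):
        plane = []
        for r in range(height):
            row = []
            for c in range(width):
                # Step 2: 1x1 convolution across channels, then sigmoid.
                logit = bias[o] + sum(pointwise[o][i] * z[i][r][c]
                                      for i in range(channels))
                attention = 1.0 / (1.0 + math.exp(-logit))
                # Step 3: recalibrate the original input with the attention map.
                row.append(attention * x[o][r][c])
            plane.append(row)
        out.append(plane)
    return out

x = [[[1.0, -2.0, 3.0], [0.5, 4.0, -1.0], [2.0, 0.0, 1.5]],
     [[0.0, 1.0, 2.0], [3.0, -3.0, 1.0], [-0.5, 2.5, 0.0]]]
taps = [[0.25, 0.5, 0.25], [0.25, 0.5, 0.25]]  # hypothetical 1-D kernels, k = 3
zero_mix = [[0.0, 0.0], [0.0, 0.0]]

# With a zero 1x1 convolution the attention is sigmoid(0) = 0.5 everywhere.
y = lska(x, taps, taps, zero_mix, [0.0, 0.0])
print(y[0][0])  # [0.5, -1.0, 1.5]
```

Because the attention values lie strictly between 0 and 1, the module can only scale each input value down by a position-dependent factor; it never changes its sign.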

[0087] Example 4

[0088] To verify the effectiveness of this invention, ablation experiments were conducted on a self-built dairy cow dataset, using YOLOv11n as the baseline model and the same training hyperparameters for every configuration; the three modules were then combined to form the final dairy cow multi-target detection model. Precision, recall, and mean average precision (mAP50 and mAP50-95) were used as evaluation metrics, and the experimental results are shown in Table 1. It can be seen that the proposed method significantly improves recognition accuracy.

[0089] Using YOLOv11n (official weights) as the baseline model, comparisons were made under the same dataset and training settings.

[0090] Table 1 Ablation Experiment

[0091]
YOLOv11n  P2  RFAConv  LSKA  Precision  Recall  mAP50  mAP50-95
√         -   -        -     0.786      0.775   0.834  0.581
√         -   √        √     0.795      0.781   0.852  0.587
√         √   -        √     0.786      0.808   0.846  0.562
√         √   √        -     0.816      0.799   0.847  0.568
√         √   √        √     0.83       0.811   0.893  0.642

[0092] As shown in Table 1, introducing the RFAConv and LSKA modules slightly improved precision, recall, and mAP50; introducing the P2 branch and LSKA module mainly improved recall and mAP50, while mAP50-95 decreased; introducing the P2 branch and RFAConv module slightly improved precision, recall, and mAP50, while mAP50-95 decreased. Table 1 also shows that the complete technical solution, which simultaneously introduces the P2 branch, RFAConv, and LSKA modules, achieves optimal performance across all metrics, with mAP50 reaching 0.893 (a 5.9 percentage point improvement over the baseline) and mAP50-95 reaching 0.642 (a 6.1 percentage point improvement over the baseline). This invention significantly improves the detection performance of the YOLO11 model for multi-scale dairy cow targets in natural pasture environments by introducing the P2 high-resolution branch, the RFAConv dynamic convolution module, and the LSKA large kernel attention module. It has the advantages of high accuracy, high robustness, and high efficiency, and is suitable for practical applications in intelligent animal husbandry.
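The quoted gains can be checked directly from the Table 1 values. The sketch below takes the baseline row and the full-model row and computes the absolute gain of each metric in percentage points; only the dictionary and function names are introduced here.

```python
baseline = {"precision": 0.786, "recall": 0.775, "mAP50": 0.834, "mAP50-95": 0.581}
full_model = {"precision": 0.830, "recall": 0.811, "mAP50": 0.893, "mAP50-95": 0.642}

def gains_in_points(before, after):
    """Absolute gain per metric, expressed in percentage points."""
    return {name: round((after[name] - before[name]) * 100, 1) for name in before}

print(gains_in_points(baseline, full_model))
# {'precision': 4.4, 'recall': 3.6, 'mAP50': 5.9, 'mAP50-95': 6.1}
```

The mAP50 and mAP50-95 gains match the 5.9 and 6.1 percentage points stated in the text; precision and recall rise by 4.4 and 3.6 points respectively.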

[0093] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A deep learning-based multi-target detection method for dairy cows, characterized in that it includes the following steps:
S01: acquiring the image of the cow to be detected;
S02: inputting the cow image into an improved YOLO11 model, which is improved as follows while maintaining the original backbone network:
S0201: a high-resolution P2 detection branch is introduced into the neck of the feature pyramid of the YOLO11 model; the P2 detection branch takes a shallow feature map with a downsampling factor of 4 in the backbone network as input and fuses it, through an upsampling operation, with high-level semantic features from the deep layers to generate a high-resolution feature map for detecting extremely small cow targets;
S0202: the standard convolutional module in the downsampling path of the YOLO11 model is replaced with a receptive field adaptive convolutional module; the receptive field adaptive convolutional module dynamically generates spatial attention weights and feature bases from the input features through two parallel lightweight sub-networks and, based on the local context of the input features, dynamically adjusts the effective receptive field and the feature aggregation method at each spatial location;
S0203: a large kernel separable attention module LSKA is integrated before the P5 detection layer of the YOLO11 model; the LSKA module uses large-size depthwise separable convolution to capture long-range spatial dependencies in order to obtain global context information covering large-size cow targets, and the importance of the extracted global features is recalibrated through a channel attention submodule;
S03: obtaining the multi-target detection results of cows output by the improved YOLO11 model, the detection results including the bounding box position, confidence score, and class label of each cow target.

2. The method for multi-target detection in dairy cows as described in claim 1, characterized in that the construction process of the P2 detection branch includes: taking as input the feature map output from the second layer of the backbone network, whose resolution is 1/4 of the original input image; concatenating this feature map with the corresponding layer features from the neck of the feature pyramid after downsampling and fusion to generate a fused high-resolution feature map; and connecting the fused feature map independently to a dedicated detection head, which performs bounding box regression and classification tasks.

3. The method for multi-target detection in dairy cows as described in claim 1, characterized in that the forward process of the receptive field adaptive convolutional module includes:
S020201: generating a spatial attention weight map through average pooling layers and point convolutional layers;
S020202: generating feature bases through depthwise separable convolutional layers;
S020203: performing element-wise multiplication of the spatial attention weight map with the feature bases to achieve adaptive weighting;
S020204: performing rearrangement and standard convolution operations on the weighted features from step S020203 to complete feature integration and output.
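Steps S020201 to S020204 can be sketched for a single channel in plain Python. This is only an illustration under stated assumptions, not the claimed implementation: the softmax normalisation of the attention weights, the zero padding, the single-channel simplification, and every name (`window`, `rfaconv_single_channel`, `point_w`, `base_filters`, `out_kernel`) are assumptions that do not appear in the claim.

```python
import math


def window(x, i, j, k):
    """Zero-padded k x k neighbourhood of x centred at (i, j), flattened row-major."""
    pad, h, w = k // 2, len(x), len(x[0])
    return [x[i + di][j + dj] if 0 <= i + di < h and 0 <= j + dj < w else 0.0
            for di in range(-pad, k - pad) for dj in range(-pad, k - pad)]


def rfaconv_single_channel(x, point_w, base_filters, out_kernel, k=3):
    """Single-channel sketch of S020201-S020204 (hypothetical helper)."""
    h, w, n = len(x), len(x[0]), k * k
    big = [[0.0] * (w * k) for _ in range(h * k)]
    for i in range(h):
        for j in range(w):
            win = window(x, i, j, k)
            # S020201: average pooling, then a point convolution with k*k outputs.
            pooled = sum(win) / n
            logits = [pw * pooled for pw in point_w]
            # Assumption: softmax over the k*k positions normalises the weights.
            m = max(logits)
            exps = [math.exp(v - m) for v in logits]
            total = sum(exps)
            attn = [e / total for e in exps]
            # S020202: k*k depthwise filters produce the feature bases.
            bases = [sum(f * v for f, v in zip(filt, win)) for filt in base_filters]
            # S020203 + rearrangement: weighted bases fill a k x k tile.
            for t in range(n):
                big[i * k + t // k][j * k + t % k] = attn[t] * bases[t]
    # S020204: standard k x k convolution with stride k over the rearranged map.
    out = [[sum(out_kernel[a][b] * big[i * k + a][j * k + b]
                for a in range(k) for b in range(k))
            for j in range(w)] for i in range(h)]
    return big, out


# Toy example: identity base filters, uniform attention, all-ones output kernel.
k = 3
x = [[1.0] * 4 for _ in range(4)]
point_w = [0.0] * (k * k)
base_filters = [[1.0 if s == t else 0.0 for s in range(k * k)] for t in range(k * k)]
out_kernel = [[1.0] * k for _ in range(k)]
big, out = rfaconv_single_channel(x, point_w, base_filters, out_kernel, k)
```

With these toy parameters the module reduces to a zero-padded 3×3 mean filter, which makes the data flow easy to check by hand. In a trained module the point-convolution weights, base filters, and output kernel would all be learned.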

4. The method for multi-target detection in dairy cows as described in claim 1, characterized in that the LSKA module captures long-range spatial dependencies by decomposing standard large-size two-dimensional depthwise convolutional kernels into cascaded one-dimensional convolutional kernels.

5. The method for multi-target detection in dairy cows as described in claim 4, characterized in that the forward process of the LSKA module includes:
S020301: performing, in sequence, horizontal depthwise convolution with a kernel size of 1×k and vertical depthwise convolution with a kernel size of k×1 on the input feature map to obtain a feature map that aggregates information from the k×k region;
S020302: generating a spatial attention map using a 1×1 convolutional layer and a sigmoid activation function;
S020303: multiplying the spatial attention map with the original input features to achieve adaptive feature recalibration.
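The cascaded 1×k and k×1 convolutions of steps S020301 to S020303 can be illustrated on a single channel in plain Python. This is a sketch under assumptions: zero padding, a single channel (so the 1×1 convolution reduces to one scalar weight and bias), and the helper names `conv_rows`, `conv_cols`, and `lska_single_channel` are all hypothetical.

```python
import math


def conv_rows(x, kernel):
    """1 x k depthwise convolution along each row, zero padded, same output size."""
    k, pad = len(kernel), len(kernel) // 2
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for t in range(k):
                jj = j + t - pad
                if 0 <= jj < w:
                    out[i][j] += kernel[t] * x[i][jj]
    return out


def conv_cols(x, kernel):
    """k x 1 depthwise convolution along each column (transpose, reuse, transpose)."""
    transposed = [list(col) for col in zip(*x)]
    return [list(col) for col in zip(*conv_rows(transposed, kernel))]


def lska_single_channel(x, k_row, k_col, point_w=1.0, point_b=0.0):
    """Single-channel sketch of S020301-S020303 (hypothetical helper)."""
    agg = conv_cols(conv_rows(x, k_row), k_col)                      # S020301
    attn = [[1.0 / (1.0 + math.exp(-(point_w * v + point_b))) for v in row]
            for row in agg]                                          # S020302
    out = [[a * v for a, v in zip(ra, rx)] for ra, rx in zip(attn, x)]  # S020303
    return agg, attn, out


# A unit impulse shows which region the two 1-D kernels aggregate.
k = 3
impulse = [[0.0] * 7 for _ in range(7)]
impulse[3][3] = 1.0
agg, attn, out = lska_single_channel(impulse, [1.0] * k, [1.0] * k)

# Weights per channel: one full k x k kernel versus the 1 x k plus k x 1 pair.
full_kernel_weights, separable_weights = k * k, 2 * k
```

The impulse response spreads over a k×k block, which is the region a full k×k depthwise kernel would cover, while the number of weights per channel falls from k² to 2k.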

6. The method for multi-target detection in dairy cows as described in claim 1, characterized in that the feature pyramid of the improved YOLO11 model is expanded from the conventional P3-P5 levels to the P2-P5 levels, covering downsampling factors from 4x to 32x.
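To make the scale range concrete, the detection grid size at each pyramid level follows directly from its downsampling factor. The sketch below assumes a square 640×640 input, which the source does not specify; `STRIDES` and `grid_sizes` are illustrative names.

```python
# Downsampling factor of each pyramid level: P2 is 4x, P5 is 32x.
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}


def grid_sizes(image_size):
    """Side length of the detection grid at each level for a square input."""
    return {level: image_size // stride for level, stride in STRIDES.items()}


sizes = grid_sizes(640)  # 640 is an assumed input size, not stated in the source
for level, side in sizes.items():
    print(f"{level}: {side} x {side} cells")
```

Under this assumption the P2 grid has four times as many cells as the P3 grid, which is why the added branch helps with very small targets and also adds computation.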

7. The method for multi-target detection in dairy cows as described in claim 1, characterized in that, in step S03, the multi-scale prediction results are subjected to weighted fusion and non-maximum suppression post-processing to generate the final unified dairy cow target detection result.
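The non-maximum suppression step can be sketched as a greedy procedure over boxes pooled from all detection heads. This is a minimal illustration, not the claimed post-processing: the weighted fusion step is not specified in the source and is omitted here, and the IoU threshold of 0.5, the box format, and the sample detections are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS. Each detection is (box, score); returns the kept detections."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Drop every lower-scoring box that overlaps the kept box too much.
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_threshold]
    return kept


# Hypothetical predictions pooled from several detection heads.
pooled = [
    ((10, 10, 110, 110), 0.90),
    ((12, 12, 112, 112), 0.75),   # near-duplicate of the first box
    ((200, 50, 260, 100), 0.60),  # a separate cow
]
result = nms(pooled)
```

In this example the second box overlaps the first with an IoU above 0.9 and is suppressed, so two detections remain.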