Free-range chicken multi-scale target detection method and device based on large model
By employing a multi-scale target detection method based on a large model, and utilizing SDTM and CoT modules to extract and fuse features from chicken flock images, the problem of small target loss and occlusion in cageless chicken detection is solved, achieving accurate identification in complex environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA AGRICULTURAL UNIVERSITY
- Filing Date
- 2023-12-26
- Publication Date
- 2026-06-23
AI Technical Summary
Target detection of cageless chickens faces challenges in complex environments, such as the easy loss of small target features and the difficulty in detection and differentiation caused by occlusion due to flock aggregation. Existing technologies are unable to achieve accurate identification.
A multi-scale target detection method based on a large model is adopted. By extracting and fusing features at different levels from chicken flock images, and combining SDTM and CoT modules, multi-scale feature maps are generated to determine the location and category information of each chicken in the flock images.
It achieves accurate identification of cageless chickens in complex scenarios, solves the detection difficulties caused by the loss of small target features and occlusion caused by chicken flocks, and improves the accuracy and generalization ability of detection.
Figure CN118675199B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of image processing technology, and more specifically, to a method and apparatus for multi-scale target detection of free-range chickens based on a large model.
Background Technology
[0002] Due to its high protein and low cholesterol content, poultry consumption is increasing. Compared to caged chickens, cageless chickens have more space and freedom to express natural behaviors, resulting in significantly lower abdominal fat percentage and higher muscle mass. However, cageless rearing may lead to a higher probability of injury and disease; therefore, research on health monitoring and welfare management of cageless chickens has significant practical implications.
Summary of the Invention
[0003] The purpose of this disclosure is to provide a method and apparatus for multi-scale target detection of free-range chickens based on a large model, so as to solve the above-mentioned technical problems.
[0004] To achieve the above objectives, the first aspect of this disclosure provides a multi-scale target detection method for free-range chickens based on a large model, the method comprising:
[0005] Obtain images of cageless chickens;
[0006] The chicken flock image is input into the target detection model to obtain the location and category information of each chicken in the chicken flock image;
[0007] The target detection model is used to obtain the location and category information of each chicken in the chicken flock image through the following operations:
[0008] The chicken flock image is subjected to feature extraction at different levels to obtain hierarchical feature maps at different levels;
[0009] Feature fusion processing is performed on the hierarchical feature maps of different levels to obtain scale feature maps of different scales, and feature stitching processing is performed on the scale feature maps of different scales to obtain the target multi-scale feature map.
[0010] The location and category information of each chicken in the chicken flock image are determined based on the target multi-scale feature map.
[0011] Optionally, the step of extracting features from the chicken flock image at different levels to obtain hierarchical feature maps at different levels includes:
[0012] The first-level feature extraction is performed on the chicken flock image to obtain the first-level feature map;
[0013] The first-level feature map is subjected to second-level feature extraction to obtain the second-level feature map;
[0014] The second-level feature map is subjected to third-level feature extraction to obtain the third-level feature map;
[0015] The first-level feature map, the second-level feature map, and the third-level feature map are determined as hierarchical feature maps of different levels.
[0016] Optionally, the step of performing a first-level feature extraction on the chicken flock image to obtain a first-level feature map includes:
[0017] The chicken flock image is downsampled to obtain a first feature map;
[0018] Extract features from different levels in the first feature map, and perform feature fusion processing on the extracted features to obtain the second feature map;
[0019] The depth features in the second feature map are extracted, and the extracted depth features are subjected to feature fusion processing to obtain the third feature map;
[0020] Different levels of features are extracted from the third feature map, and feature fusion processing is performed on the extracted different levels of features to obtain the first level feature map.
[0021] Optionally, the step of performing second-level feature extraction on the first-level feature map to obtain a second-level feature map includes:
[0022] Extract the depth features from the first-level feature map, and perform feature fusion processing on the extracted depth features to obtain the fourth feature map;
[0023] Different levels of features are extracted from the fourth feature map, and feature fusion processing is performed on the extracted features to obtain the fifth feature map;
[0024] Different levels of features are extracted from the fifth feature map, and feature fusion processing is performed on the extracted different levels of features to obtain the second level feature map.
[0025] Optionally, the step of performing third-level feature extraction on the second-level feature map to obtain a third-level feature map includes:
[0026] Extract features from different levels in the second-level feature map, and perform feature fusion processing on the extracted features to obtain the sixth feature map;
[0027] A convolution operation is performed on each input channel of the sixth feature map to obtain multiple seventh feature maps, and a pointwise convolution operation is performed on each of the seventh feature maps to obtain multiple eighth feature maps. The multiple eighth feature maps are then linearly combined to obtain a ninth feature map.
[0028] Different levels of features are extracted from the ninth feature map, and feature fusion processing is performed on the extracted features to obtain the third level feature map.
[0029] Optionally, the step of extracting depth features from the second feature map and performing feature fusion processing on the extracted depth features to obtain a third feature map includes:
[0030] Feature extraction is performed on the second feature map to obtain a first feature sub-map, and the first feature sub-map is downsampled to obtain a second feature sub-map;
[0031] Feature extraction is performed on the second feature map to obtain a third feature sub-map. The third feature sub-map is then downsampled to obtain a fourth feature sub-map. Finally, feature extraction is performed on the fourth feature sub-map to obtain a fifth feature sub-map.
[0032] The fourth feature sub-map is downsampled to obtain the sixth feature sub-map. Key features in the sixth feature sub-map are extracted to obtain the seventh feature sub-map. The seventh feature sub-map is then deconvolved to obtain the eighth feature sub-map.
[0033] The second feature sub-map, the fifth feature sub-map, and the eighth feature sub-map are subjected to feature splicing processing to obtain the third feature map.
[0034] Optionally, the feature fusion processing of hierarchical feature maps at different levels to obtain scale feature maps at different scales includes:
[0035] Extract the static context features and dynamic context features from the third-level feature map, and perform feature fusion processing on the static context features and dynamic context features to obtain the fourth-level feature map;
[0036] The second-level feature map and the fourth-level feature map are subjected to feature fusion processing to obtain a first-scale feature map;
[0037] The first scale feature map and the first level feature map are subjected to feature fusion processing to obtain the second scale feature map;
[0038] The first-scale feature map and the second-scale feature map are fused to obtain the third-scale feature map.
[0039] The fourth-level feature map and the third-scale feature map are subjected to feature fusion processing to obtain the fourth-scale feature map;
[0040] The first scale feature map, the second scale feature map, the third scale feature map, and the fourth scale feature map are determined as scale feature maps of different scales.
[0041] Optionally, the step of performing feature stitching on feature maps of different scales to obtain a target multi-scale feature map includes:
[0042] The second-scale feature map, the third-scale feature map, and the fourth-scale feature map are subjected to feature stitching to obtain a target multi-scale feature map.
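The fusion and stitching order described in the preceding optional steps can be read as a small dependency graph. The sketch below only traces which backbone levels contribute to each map; it does not implement the fusion operators themselves, and all identifiers (FUSION_STEPS, STITCH_INPUTS, trace_neck) are hypothetical names chosen for illustration.

```python
# Hypothetical names: each entry maps an output feature map to the inputs
# that are fused to produce it, in the order given in the text.
FUSION_STEPS = [
    ("fourth_level", ("third_level",)),                 # static + dynamic context fusion
    ("first_scale", ("second_level", "fourth_level")),
    ("second_scale", ("first_scale", "first_level")),
    ("third_scale", ("first_scale", "second_scale")),
    ("fourth_scale", ("fourth_level", "third_scale")),
]
STITCH_INPUTS = ("second_scale", "third_scale", "fourth_scale")


def trace_neck(backbone_outputs):
    """Return, for every map, the set of backbone levels it depends on."""
    deps = {name: {name} for name in backbone_outputs}
    for out, inputs in FUSION_STEPS:
        # A KeyError here would mean a step uses a map that is not yet computed.
        deps[out] = set().union(*(deps[i] for i in inputs))
    deps["target_multi_scale"] = set().union(*(deps[i] for i in STITCH_INPUTS))
    return deps


deps = trace_neck(["first_level", "second_level", "third_level"])
for name in ("first_scale", "second_scale", "third_scale",
             "fourth_scale", "target_multi_scale"):
    print(name, sorted(deps[name]))
```

Under this reading, the first-scale map draws only on the second and third levels, while the stitched target multi-scale feature map draws on all three hierarchical levels.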
[0043] Optionally, determining the location and category information of each chicken in the chicken flock image based on the target multi-scale feature map includes:
[0044] For each feature in the target multi-scale feature map, determine the intersection-union score and classification score of the feature;
[0045] The target features are determined based on a preset intersection-union score threshold, a preset classification score threshold, the intersection-union score, and the classification score.
[0046] Based on the target features, determine the location and category information of each chicken in the chicken flock image.
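A minimal sketch of the threshold-based selection described above is given below. The text does not state how the two thresholds are combined, so the sketch assumes that a feature is kept only when both scores reach their thresholds; the Candidate structure, the box layout, and the default threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    box: tuple          # (x_min, y_min, x_max, y_max), hypothetical layout
    category: str
    iou_score: float    # predicted intersection-over-union score
    cls_score: float    # predicted classification score


def select_targets(candidates, iou_threshold=0.5, cls_threshold=0.5):
    """Keep candidates whose two scores both reach their preset thresholds (assumed rule)."""
    return [
        c for c in candidates
        if c.iou_score >= iou_threshold and c.cls_score >= cls_threshold
    ]


candidates = [
    Candidate((10, 20, 60, 90), "chicken", iou_score=0.82, cls_score=0.91),
    Candidate((15, 25, 58, 88), "chicken", iou_score=0.30, cls_score=0.95),
    Candidate((200, 40, 240, 70), "chicken", iou_score=0.75, cls_score=0.20),
]
targets = select_targets(candidates)
for t in targets:
    print(t.category, t.box)
```

Only the first candidate passes both thresholds; its box and category then serve as the location and category information.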
[0047] A second aspect of this disclosure provides a multi-scale target detection device for free-range chickens based on a large model, the device comprising:
[0048] The acquisition module is used to acquire images of cageless chicken flocks;
[0049] The processing module is used to input the chicken flock image into the target detection model to obtain the location information and category information of each chicken in the chicken flock image;
[0050] The target detection model includes:
[0051] The feature extraction module is used to extract features from the chicken flock image at different levels to obtain hierarchical feature maps at different levels.
[0052] The feature fusion module is used to perform feature fusion processing on hierarchical feature maps of different levels to obtain scale feature maps of different scales, and to perform feature stitching processing on scale feature maps of different scales to obtain target multi-scale feature maps.
[0053] The determination module is used to determine the location and category information of each chicken in the chicken flock image based on the target multi-scale feature map.
[0054] The above technical solution allows images of cageless chickens to be input directly into a target detection model, thereby obtaining the location and category information of each chicken in the image. This enables end-to-end detection of each chicken in the flock. Furthermore, when detecting chickens in a flock image, the target detection model first extracts features at different levels from the image, then fuses these features to obtain a target multi-scale feature map. Finally, the location and category information of each chicken are obtained by performing detection on the target multi-scale feature map. This enables accurate identification of cageless chickens of varying sizes in complex scenarios, solving the problems of easily lost small target features and difficulty in detection and differentiation caused by occlusion due to flock aggregation.
[0055] Other features and advantages of this disclosure will be described in detail in the following detailed description section.
Attached Figure Description
[0056] The accompanying drawings are provided to further illustrate the present disclosure and form part of the specification. They are used together with the following detailed description to explain the present disclosure, but do not constitute a limitation thereof. In the drawings:
[0057] Figure 1 is a flowchart illustrating a multi-scale target detection method for free-range chickens based on a large model, according to an exemplary embodiment of this disclosure;
[0058] Figure 2 is a schematic diagram of the model structure of a target detection model according to an exemplary embodiment of the present disclosure;
[0059] Figure 3 is a schematic diagram of the structure of an SDTM module according to an exemplary embodiment of the present disclosure;
[0060] Figure 4 is a schematic diagram of the structure of a BiFormer module according to an exemplary embodiment of the present disclosure;
[0061] Figure 5 is a schematic diagram of the structure of a CoT module according to an exemplary embodiment of the present disclosure;
[0062] Figure 6 is a schematic diagram illustrating detailed val loss and AP50 variation curves in an ablation experiment according to an exemplary embodiment of this disclosure;
[0063] Figure 7 is a schematic diagram illustrating the EMSC-DETR visualization results of nine scenarios in test set 1 according to an exemplary embodiment of this disclosure;
[0064] Figure 8 is a schematic diagram illustrating the EMSC-DETR and RT-DETR-L visualization results of three scenarios in test set 2 according to an exemplary embodiment of this disclosure;
[0065] Figure 9 is a schematic diagram of an image with added noise at different ratios, according to an exemplary embodiment of the present disclosure;
[0066] Figure 10 is a schematic diagram illustrating the detection performance of different models on test sets 1 and 2 containing different proportions of noise, according to an exemplary embodiment of the present disclosure;
[0067] Figure 11 is a schematic diagram illustrating fine-tuning experimental results obtained with different numbers of training images and training rounds according to an exemplary embodiment of the present disclosure;
[0068] Figure 12 is a schematic diagram illustrating the detection results of EMSC-DETR on test set 3 after fine-tuning experiments, according to an exemplary embodiment of the present disclosure;
[0069] Figure 13 is a schematic diagram illustrating the detection results of EMSC-DETR on test set 4 after fine-tuning experiments, according to an exemplary embodiment of this disclosure;
[0070] Figure 14 is a structural block diagram of a multi-scale target detection device for free-range chickens based on a large model, according to an exemplary embodiment of this disclosure.
Detailed Implementation
[0071] The specific embodiments of this disclosure will be described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for illustration and explanation only and are not intended to limit this disclosure.
[0072] It should be noted that all actions involving the acquisition of signals, information, or data in this disclosure are carried out in compliance with the relevant data protection laws and policies of the country where the location is situated, and with authorization from the owner of the relevant device.
[0073] As mentioned in the background section, poultry consumption is increasing due to its high protein and low cholesterol content. Compared to caged chickens, cageless chickens have more space and freedom to express their natural behaviors, resulting in significantly lower abdominal fat and higher muscle mass, making their meat more appealing to consumers. Currently, cageless chicken accounts for about one-seventh of the meat market in China, and consumption is expected to continue to increase over the next decade. However, cageless farming may lead to higher rates of injury and disease, making research on health monitoring and welfare management of cageless chickens of significant practical importance.
[0074] With the development of precision livestock farming, computer vision applications such as target tracking, behavior analysis, disease early warning, and weight prediction have become research hotspots. Target detection, as the underlying supporting technology for these applications and affecting the performance of subsequent tasks, is of significant research importance. However, target detection for cageless chickens is more challenging than that for caged chickens and large-sized livestock, especially under the commonly used camera side-view shooting conditions. First, the cageless environment is more complex and variable, and chickens may resemble the background and ground, making them difficult to distinguish. Second, the behavior of cageless chickens can change drastically in a short period, and their shape, posture, and orientation can be arbitrary at any time. Third, chickens are gregarious animals, leading to severe occlusion problems. Fourth, the varying distances of chickens from the camera result in targets of different sizes, increasing the difficulty of feature extraction. To address these challenges, this proposal conducts research on target detection for cageless chickens in commercial farming environments, which is fundamental to realizing intelligent health and welfare monitoring such as behavior analysis and disease early warning.
[0075] Over the past decade, machine learning has been widely studied in object detection tasks due to its powerful feature extraction and pattern recognition capabilities. Convolutional neural networks (CNNs), as one of the representative algorithms of deep learning, have been successfully applied to intelligent poultry farming monitoring. Subedi et al. established a deep learning model based on You Only Look Once Version 5 (YOLOv5) to identify feather-pecking behavior in cageless chickens. Cuan et al. collected the calls of healthy chickens and chickens with avian influenza and proposed a chicken sound convolutional neural network (CSCNN) for detecting avian influenza in chickens. Nasiri et al. focused on lameness in broiler chickens and proposed a deep convolutional neural network-long short-term memory (CNN-LSTM) model to detect and track seven key points on the body of walking broiler chickens. Yang et al. developed a YOLOv5-based convolutional neural network for target detection in cageless chicken farming using 1000 and 200 images as training and testing sets, respectively, achieving an accuracy exceeding 95%. However, because the receptive field of convolution operations is unavoidably limited and convolutional structures carry an inherent inductive bias, CNNs are at a disadvantage in establishing long-range dependencies and global context. Because chickens are non-rigid and their behavior varies, these CNN-based methods do not adapt easily to individual differences in size, shape, and texture. Furthermore, while convolution and pooling operations are effective at extracting features from large-scale targets, their accuracy on small-scale targets is lower, meaning that CNN models may not be well suited to detecting chicken targets at various scales in cageless chicken farming environments.
[0076] Unlike CNNs, transformers can capture complex spatial variations and global relationships between image elements through self-attention mechanisms. The excellent generalization and robustness of transformer-based models have been proven. The Detection Transformer (DETR) introduces the transformer to object detection tasks, using a CNN as the backbone network and combining it with a transformer encoder and decoder. Compared to one-stage detectors (such as the YOLO series) and two-stage detectors (such as Faster Regions with CNN Features, Faster R-CNN), DETR is a fully end-to-end trainable architecture. It requires neither a Non-Maximum Suppression (NMS) post-processing step nor hand-designed priors and constraints such as anchor boxes, greatly simplifying the object detection process.
[0077] However, DETR has drawbacks such as slow training convergence speed and poor detection capability for small targets.
[0078] In view of this, the present disclosure provides a method and apparatus for multi-scale target detection of free-range chickens based on a large model, in order to overcome the above-mentioned technical problems.
[0079] The embodiments of this disclosure will be further explained below with reference to the accompanying drawings.
[0080] Figure 1 is a flowchart illustrating a multi-scale target detection method for free-range chickens based on a large model, according to an exemplary embodiment of this disclosure. Referring to Figure 1, the method may include the following steps:
[0081] S101: Obtain images of cageless chickens.
[0082] The image of the chickens can be obtained by taking a picture of the chickens from the side.
[0083] S102: Input the chicken flock image into the target detection model to obtain the location information and category information of each chicken in the chicken flock image.
[0084] The target detection model is used to obtain the location and category information of each chicken in the chicken flock image through the following operations:
[0085] The chicken flock image is subjected to feature extraction at different levels to obtain hierarchical feature maps at different levels; feature fusion processing is performed on the hierarchical feature maps at different levels to obtain scale feature maps at different scales; and feature stitching processing is performed on the scale feature maps at different scales to obtain a target multi-scale feature map; the location information and category information of each chicken in the chicken flock image are determined based on the target multi-scale feature map.
[0086] The above technical solution allows images of cageless chickens to be input directly into a target detection model, thereby obtaining the location and category information of each chicken in the image. This enables end-to-end detection of each chicken in the flock. Furthermore, when detecting chickens in a flock image, the target detection model first extracts features at different levels from the image, then fuses these features to obtain a target multi-scale feature map. Finally, the location and category information of each chicken are obtained by performing detection on the target multi-scale feature map. This enables accurate identification of cageless chickens of varying sizes in complex scenarios, solving the problems of easily lost small target features and difficulty in detection and differentiation caused by occlusion due to flock aggregation.
[0087] To better understand the multi-scale target detection method for free-range chickens based on a large model in this disclosure, the following details the processing of chicken flock images by the target detection model.
[0088] In a possible implementation, the step of extracting features from the chicken flock image at different levels to obtain hierarchical feature maps at different levels may include:
[0089] The chicken flock image is subjected to first-level feature extraction to obtain a first-level feature map; the first-level feature map is subjected to second-level feature extraction to obtain a second-level feature map; the second-level feature map is subjected to third-level feature extraction to obtain a third-level feature map; the first-level feature map, the second-level feature map, and the third-level feature map are determined as hierarchical feature maps of different levels.
[0090] Specifically, the first-level feature extraction performed on the chicken flock image to obtain the first-level feature map may proceed as follows. As shown in the backbone network section of Figure 2, the chicken flock image is input into the first HGStem (High-Resolution Global Stem Module) module, which performs 4× downsampling on the chicken flock image to obtain the first feature map. Next, the first feature map is input into the first HGBlock (High-Resolution Global Block Module) module, which extracts features from different levels within the first feature map and performs feature fusion processing on the extracted features to obtain the second feature map. Then, the second feature map is input into the first SDTM (space-to-depth transformer module) module, which extracts depth features from the second feature map and performs feature fusion processing on the extracted depth features to obtain the third feature map. Finally, the third feature map is input into the second HGBlock module, which extracts features from different levels within the third feature map and performs feature fusion processing on the extracted features to obtain the first-level feature map.
[0091] That is, according to one embodiment of this disclosure, the step of performing a first-level feature extraction on the chicken flock image to obtain a first-level feature map may include:
[0092] The chicken image is downsampled to obtain a first feature map; different levels of features are extracted from the first feature map, and feature fusion processing is performed on the extracted different levels of features to obtain a second feature map; depth features are extracted from the second feature map, and feature fusion processing is performed on the extracted depth features to obtain a third feature map; different levels of features are extracted from the third feature map, and feature fusion processing is performed on the extracted different levels of features to obtain a first level feature map.
[0093] The structure of the SDTM module may be as shown in Figure 3. SDTM consists of two branches. One branch contains only the DWConv (Depthwise Convolution) operation. DWConv is a channel-wise convolutional structure that reduces model parameters and computational cost compared to ordinary convolution. For the input second feature map $X \in \mathbb{R}^{H \times W \times C}$ (H represents the height of the second feature map, W represents its width, and C represents its number of channels), feature extraction and 2× downsampling in the first branch output the second feature sub-map $X_1$. In the other branch, SDTM shrinks the feature map while preserving as much detail as possible through DWConv and SPD (space-to-depth) operations, obtaining the fourth feature sub-map $X_2$. Next, the SPD operation is performed again to further transform the spatial information of the feature map into depth information, fusing spatial and depth features; the resulting feature, of size …, is the sixth feature sub-map. Subsequently, the BiFormer (Bi-directional Transformer) module is used to find key features from long-distance regions and supplement global information. The structure of BiFormer may be as shown in Figure 4. BiFormer uses bi-level routing attention (BRA) as its basic building block, supporting sparse and fine-grained attention. BRA first divides the feature map into S×S non-overlapping regions, each containing … feature vectors. By performing a linear projection on the reconstructed vectors, the query vector, key vector, and value vector are obtained according to the following formulas.
[0094] $Q = X^{r} W^{q}$
[0095] $K = X^{r} W^{k}$
[0096] $V = X^{r} W^{v}$
[0097] Here, $W^{q}$ represents the weight matrix of the linear projection corresponding to Q, $W^{k}$ represents the weight matrix of the linear projection corresponding to K, and $W^{v}$ represents the weight matrix of the linear projection corresponding to V; Q represents the query vector, K represents the key vector, V represents the value vector, and $X^{r}$ represents the second feature map after being divided into different regions.
[0098] To reduce the overhead of calculating relationships with irrelevant regions, BRA employs a region-level routing mechanism to filter out irrelevant key-value pairs at a coarse-grained level. To achieve this, for each non-overlapping region, the averages of Q and K are calculated to obtain $Q^{r}$ and $K^{r}$. Then, the adjacency matrix $A^{r}$ is calculated according to the following formula. This matrix captures the semantic relevance between the $S^{2}$ regions.
[0099] $A^{r} = Q^{r} (K^{r})^{T}$
[0100] Where T represents the matrix transpose.
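The region partition, the linear projections, and the adjacency matrix described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function names, the random weights, and the 8×8×4 input are hypothetical, and the partition assumes H and W are divisible by S.

```python
import numpy as np


def partition_regions(x, s):
    """Reshape an (H, W, C) map into S*S non-overlapping regions of tokens."""
    h, w, c = x.shape
    assert h % s == 0 and w % s == 0, "H and W must be divisible by S"
    x = x.reshape(s, h // s, s, w // s, c)
    x = x.transpose(0, 2, 1, 3, 4)               # (S, S, H/S, W/S, C)
    return x.reshape(s * s, (h // s) * (w // s), c)


def region_adjacency(x, w_q, w_k, s):
    """Compute the region adjacency matrix A^r = Q^r (K^r)^T."""
    x_r = partition_regions(x, s)                # X^r
    q = x_r @ w_q                                # Q = X^r W^q
    k = x_r @ w_k                                # K = X^r W^k
    q_r = q.mean(axis=1)                         # per-region average of Q
    k_r = k.mean(axis=1)                         # per-region average of K
    return q_r @ k_r.T                           # shape (S^2, S^2)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
w_q = rng.standard_normal((4, 4))
w_k = rng.standard_normal((4, 4))
a_r = region_adjacency(x, w_q, w_k, s=2)
print(a_r.shape)  # (4, 4)
```

Entry (i, j) of the result scores how relevant region j is to region i, which is the quantity the routing step ranks.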
[0101] To minimize the interaction time and computational cost between regions, BRA searches for the top k most relevant regions and focuses on their relationships to generate an index matrix $I^{r}$, as shown in the following formula:
[0102] $I^{r} = \mathrm{topIndex}(A^{r})$
[0103] Subsequently, BRA applies fine-grained token-to-token attention within the relevant regions, using an operation g(·) integrated into the computer's graphics processing unit, as shown in the following equations:
[0104] $K^{g} = g(K, I^{r})$
[0105] $V^{g} = g(V, I^{r})$
[0106] $O = \mathrm{Attention}(Q, K^{g}, V^{g})$
[0107] Here, $K^{g}$ represents the matrix formed by concatenating the key vectors corresponding to each most relevant region in the index matrix, $V^{g}$ represents the matrix formed by concatenating the value vectors corresponding to each most relevant region in the index matrix, O represents the output of the softmax function shown in Figure 4, and g(·) represents the operation, integrated into the computer's graphics processing unit, that collects the key and value vectors of the indexed regions.
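A minimal NumPy sketch of the routing, gathering, and attention steps follows. It assumes that the index matrix holds, for each region, the indices of its top-k rows of the adjacency matrix, that g(·) is an index-based gather, and that Attention is standard scaled dot-product attention; the scaling by the square root of the channel count and all function names are assumptions not stated in the text.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def routed_attention(q, k, v, a_r, top_k):
    """q, k, v: (R, N, C) per-region tokens; a_r: (R, R) region adjacency."""
    r, n, c = q.shape
    # I^r: indices of the top_k most relevant regions per region, shape (R, top_k)
    i_r = np.argsort(-a_r, axis=1)[:, :top_k]
    # K^g, V^g: gather the routed regions' tokens and concatenate them
    k_g = k[i_r].reshape(r, top_k * n, c)
    v_g = v[i_r].reshape(r, top_k * n, c)
    # Token-to-token attention restricted to the gathered keys and values
    weights = softmax(q @ k_g.transpose(0, 2, 1) / np.sqrt(c))
    return weights @ v_g, i_r


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 16, 8)) for _ in range(3))
a_r = q.mean(axis=1) @ k.mean(axis=1).T
out, i_r = routed_attention(q, k, v, a_r, top_k=2)
print(out.shape, i_r.shape)  # (4, 16, 8) (4, 2)
```

With top_k equal to the number of regions, the result reduces to ordinary attention over all tokens; smaller values restrict each query to the tokens of its most relevant regions.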
[0108] Finally, the eighth feature sub-map $X_3$ is obtained through the Deconvolution (DeConv) operation. Furthermore, the fusion part of SDTM fuses $X_1$, $X_2$, and $X_3$ along the channel dimension through a concatenation operation, and finally outputs the third feature map $X'$ after passing through DWConv.
[0109] That is, according to one embodiment of this disclosure, the step of extracting depth features from the second feature map and performing feature fusion processing on the extracted depth features to obtain a third feature map may include:
[0110] Feature extraction is performed on the second feature map to obtain a first feature sub-map, and downsampling is performed on the first feature sub-map to obtain a second feature sub-map; feature extraction is performed on the second feature map to obtain a third feature sub-map, and downsampling is performed on the third feature sub-map to obtain a fourth feature sub-map, and feature extraction is performed on the fourth feature sub-map to obtain a fifth feature sub-map; downsampling is performed on the fourth feature sub-map to obtain a sixth feature sub-map, key features are extracted from the sixth feature sub-map to obtain a seventh feature sub-map, and upsampling is performed on the seventh feature sub-map to obtain an eighth feature sub-map; feature concatenation is performed on the second feature sub-map, the fifth feature sub-map, and the eighth feature sub-map to obtain a third feature map.
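The space-to-depth (SPD) rearrangement used for downsampling in the SDTM branches can be sketched as below. This is one common formulation, assuming a block size of 2 and a channels-last layout; the channel ordering inside each output vector is an implementation choice and may differ from the one used in the disclosure.

```python
import numpy as np


def space_to_depth(x, block=2):
    """Rearrange an (H, W, C) map into (H/block, W/block, C*block^2)."""
    h, w, c = x.shape
    assert h % block == 0 and w % block == 0
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)               # (H/b, W/b, b, b, C)
    return x.reshape(h // block, w // block, block * block * c)


x = np.arange(4 * 4 * 1).reshape(4, 4, 1)
y = space_to_depth(x)
print(y.shape)   # (2, 2, 4)
print(y[0, 0])   # [0 1 4 5]
```

Each 2×2 spatial block is moved into the channel dimension, so the spatial size is halved without discarding any value, in contrast to strided convolution or pooling.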
[0111] Through the above approach, the SDTM module can transform spatial information of features into depth information and fuse local and global features in the shallow layers of the network. While improving the computational efficiency of the transformer, the SDTM module significantly enhances the model's detection performance and generalization ability.
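To make the data flow of the three concatenated sub-maps concrete, the following sketch traces tensor shapes only. Every learned operator is replaced by a shape-compatible stand-in (strided slicing for DWConv with 2× downsampling, identity for BiFormer, nearest-neighbour repetition for DeConv), so the values and channel counts are artifacts of these placeholders rather than properties of the SDTM module.

```python
import numpy as np


def space_to_depth(x, block=2):
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h // block, w // block, block * block * c)


def upsample2x(x):
    """Nearest-neighbour stand-in for the DeConv step (not a learned deconvolution)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def sdtm_shapes(x):
    """Trace the branches with placeholder operators; only shapes are meaningful."""
    x1 = x[::2, ::2, :]                # stand-in for DWConv + 2x downsampling
    x2 = space_to_depth(x)             # stand-in for DWConv + SPD
    x6 = space_to_depth(x2)            # second SPD (sixth feature sub-map)
    x7 = x6                            # stand-in for BiFormer (shape-preserving here)
    x3 = upsample2x(x7)                # stand-in for DeConv (eighth feature sub-map)
    return np.concatenate([x1, x2, x3], axis=-1)   # fuse along the channel dimension


out = sdtm_shapes(np.zeros((16, 16, 3)))
print(out.shape)  # (8, 8, 63)
```

The point of the trace is that all three sub-maps reach the same spatial size (half the input) before concatenation, which is what allows fusion along the channel dimension.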
[0112] Specifically, the second-level feature extraction performed on the first-level feature map to obtain the second-level feature map may proceed as follows. As shown in the backbone network section of Figure 2, the first-level feature map is input into the second SDTM module, which extracts depth features from the first-level feature map and performs feature fusion processing on the extracted depth features to obtain the fourth feature map. Next, the fourth feature map is input into the third HGBlock module, which extracts features from different levels in the fourth feature map and performs feature fusion processing on the extracted features to obtain the fifth feature map. Then, the fifth feature map is input into the fourth HGBlock module, which extracts features from different levels in the fifth feature map and performs feature fusion processing on the extracted features to obtain the second-level feature map.
[0113] That is, according to one embodiment of this disclosure, the step of performing second-level feature extraction on the first-level feature map to obtain a second-level feature map may include:
[0114] Extract the depth features from the first-level feature map and perform feature fusion processing on the extracted depth features to obtain the fourth feature map; extract different-level features from the fourth feature map and perform feature fusion processing on the extracted different-level features to obtain the fifth feature map; extract different-level features from the fifth feature map and perform feature fusion processing on the extracted different-level features to obtain the second-level feature map.
[0115] The process of the second SDTM module extracting depth features from the first-level feature map and performing feature fusion processing on the extracted depth features can be referred to the process of the first SDTM module processing the second feature map, and will not be repeated here.
[0116] Specifically, the third-level feature extraction on the second-level feature map to obtain the third-level feature map can proceed as follows. As shown in the backbone network section of Figure 2, the second-level feature map is input into the fifth HGBlock module, which extracts features of different levels from the second-level feature map and performs feature fusion processing on them to obtain the sixth feature map. Next, the sixth feature map is input into the DWConv module. The DWConv module performs a convolution operation on each input channel of the sixth feature map to obtain multiple seventh feature maps. A pointwise convolution operation is then performed on each seventh feature map to obtain multiple eighth feature maps, and these eighth feature maps are linearly combined to obtain the ninth feature map. Finally, the ninth feature map is input into the sixth HGBlock module, which extracts features of different levels from the ninth feature map and performs feature fusion processing on them to obtain the third-level feature map.
[0117] That is, according to one embodiment of this disclosure, the step of performing third-level feature extraction on the second-level feature map to obtain a third-level feature map may include:
[0118] Different levels of features are extracted from the second-level feature map, and feature fusion processing is performed on the extracted features to obtain the sixth feature map; convolution operation is performed on each input channel of the sixth feature map to obtain multiple seventh feature maps, and pointwise convolution operation is performed on each seventh feature map to obtain multiple eighth feature maps, and the multiple eighth feature maps are linearly combined to obtain the ninth feature map; different levels of features are extracted from the ninth feature map, and feature fusion processing is performed on the extracted features to obtain the third-level feature map.
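The per-channel convolution followed by pointwise convolution and linear combination described above can be sketched as a depthwise separable convolution. This is a hedged illustration: the function names, the nested-list `[C][H][W]` layout, valid padding, and stride 1 are assumptions, and the pointwise step here folds the per-map pointwise convolution and the linear combination into a single weighted sum across channels.

```python
def depthwise_conv(fmap, kernels):
    """Apply one k x k kernel per input channel (valid padding, stride 1)."""
    out = []
    for channel, kernel in zip(fmap, kernels):
        k = len(kernel)
        rows = len(channel) - k + 1
        cols = len(channel[0]) - k + 1
        out.append([
            [sum(channel[y + i][x + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for x in range(cols)]
            for y in range(rows)
        ])
    return out


def pointwise_conv(fmap, weights):
    """1 x 1 convolution: each output channel is a weighted sum of input channels."""
    rows, cols = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(w * fmap[c][y][x] for c, w in enumerate(channel_weights))
          for x in range(cols)]
         for y in range(rows)]
        for channel_weights in weights
    ]


# Two 3 x 3 input channels (toy values, not from the disclosure).
sixth = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[2, 2, 2], [2, 2, 2], [2, 2, 2]],
]
kernels = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # channel 0: sum over the 3 x 3 window
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],  # channel 1: keep the centre pixel
]
seventh = depthwise_conv(sixth, kernels)
ninth = pointwise_conv(seventh, [[1, 1], [1, -1]])
```

The depthwise step never mixes channels; all cross-channel mixing happens in the pointwise step, which is what keeps the parameter count low compared with a full convolution.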
[0119] After obtaining hierarchical feature maps at different levels, these hierarchical feature maps can be fused to obtain target-scale feature maps with multi-scale information, thereby improving the accuracy of target detection based on the target-scale feature maps.
[0120] In a possible implementation, the feature fusion processing of hierarchical feature maps at different levels to obtain scale feature maps at different scales may include:
[0121] Static and dynamic context features are extracted from the third-level feature map, and feature fusion processing is performed on the static and dynamic context features to obtain a fourth-level feature map. Feature fusion processing is performed on the second-level and fourth-level feature maps to obtain a first-scale feature map. Feature fusion processing is performed on the first-scale feature map and the first-level feature map to obtain a second-scale feature map. Feature fusion processing is performed on the first-scale feature map and the second-scale feature map to obtain a third-scale feature map. Feature fusion processing is performed on the fourth-level and third-scale feature maps to obtain a fourth-scale feature map. The first-scale feature map, the second-scale feature map, the third-scale feature map, and the fourth-scale feature map are determined as scale feature maps of different scales.
[0122] For example, as shown in the transformer encoder section of Figure 2, the output of the last HGBlock module in the backbone network (the third-level feature map) is input into the Conv (Convolutional) module. After processing by the Conv module, the result is passed to the CoT (Contextual Transformer) module to obtain a new feature map. The output of the CoT module (the fourth-level feature map) and the output of the 6th layer HGBlock in the backbone network (the second-level feature map) are each passed through a Conv module and then concatenated to obtain the output of the 13th layer (the first-scale feature map). The output of the 4th layer HGBlock (the first-level feature map) is passed through a Conv module and then concatenated with the output of the 13th layer to obtain the output of the 14th layer (the second-scale feature map). Subsequently, the output of the 14th layer is concatenated with the output of the 13th layer to obtain the output of the 15th layer (the third-scale feature map). Finally, the output of the 12th layer is concatenated with the output of the 15th layer to obtain the output of the 16th layer (the fourth-scale feature map).
[0123] The structure of the CoT module is shown in Figure 5, and its processing steps are as follows. For each query pixel x, CoT first applies a 3×3 convolution operation to all adjacent keys in the 3×3 grid, thereby extracting feature information k1 from the neighboring regions of the keys. k1 reflects the static context information of local neighboring positions and serves as the static context representation of x. Subsequently, the query and k1 are concatenated, and the attention matrix W is obtained from the concatenated result through two consecutive 1×1 convolutions. W therefore captures the interrelationship between query features and adjacent key features, so that self-attention learning relies on static context information rather than on isolated query-key pairs. Furthermore, the feature map k2 is calculated by aggregating W with the input value V; it captures the dynamic feature interactions between the inputs and is referred to as the dynamic context representation of x. Finally, the sum of k1 and k2 forms the fused output of the CoT block.
[0124] Since the CoT module can achieve high-level semantic feature representation that incorporates contextual information, the above approach can further enhance the model's generalization ability.
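The CoT data flow described above (static context k1, attention from the concatenated query and k1, dynamic context k2, and their sum) can be traced with a deliberately reduced sketch. Everything below is an assumption for illustration: a single channel, zero padding, a ReLU between the two 1×1 convolutions, and a scalar attention weight per pixel instead of the local attention matrix used by a full CoT block. The names `conv3x3`, `cot_block`, `w_a`, `w_b`, and `w_v` are hypothetical.

```python
def conv3x3(x, kernel):
    """3 x 3 convolution with zero padding on a single-channel map."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for i in range(3):
                for j in range(3):
                    rr, cc = r + i - 1, c + j - 1
                    if 0 <= rr < h and 0 <= cc < w:
                        total += x[rr][cc] * kernel[i][j]
            out[r][c] = total
    return out


def cot_block(x, key_kernel, w_a, w_b, w_v):
    """Reduced CoT sketch: output = static context k1 + dynamic context k2."""
    k1 = conv3x3(x, key_kernel)  # static context from the 3 x 3 neighbouring keys
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            query = x[r][c]
            # concat(query, k1) followed by the first 1 x 1 convolution (ReLU assumed)
            hidden = max(0.0, w_a[0] * query + w_a[1] * k1[r][c])
            attn = w_b * hidden        # second 1 x 1 convolution: attention weight W
            value = w_v * x[r][c]      # value embedding V (assumed 1 x 1 convolution)
            k2 = attn * value          # dynamic context: W aggregated with V
            out[r][c] = k1[r][c] + k2  # fuse static and dynamic context
    return out


x = [[1, 2], [3, 4]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # toy key kernel: k1 equals x
fused = cot_block(x, identity, w_a=(1, 1), w_b=0.5, w_v=1)
static_only = cot_block(x, identity, w_a=(1, 1), w_b=0.0, w_v=1)
```

Setting `w_b` to zero removes the dynamic term and returns k1 alone, which makes the contribution of each branch easy to inspect in isolation.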
[0125] In a possible implementation, the step of performing feature stitching on feature maps of different scales to obtain a target multi-scale feature map may include:
[0126] The second-scale feature map, the third-scale feature map, and the fourth-scale feature map are subjected to feature stitching to obtain a target multi-scale feature map.
[0127] For example, as shown in the transformer encoder section of Figure 2, the target multi-scale feature map is obtained by concatenating the outputs of layers 14, 15, and 16, which correspond to three different scales.
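The fusion path described in the preceding paragraphs can be written down as a small dependency graph to check which inputs reach each scale feature map. This is a wiring sketch only: it ignores the Conv modules, any resampling, and channel counts, the node names are hypothetical labels, and treating the 12th-layer output as the fourth-level (CoT) feature map is inferred from the method steps rather than stated for the layer itself.

```python
# Each fused node lists the feature maps that are concatenated to produce it.
FUSION_GRAPH = {
    "scale1": ("level4", "level2"),            # layer 13
    "scale2": ("level1", "scale1"),            # layer 14
    "scale3": ("scale2", "scale1"),            # layer 15
    "scale4": ("level4", "scale3"),            # layer 16 (layer 12 assumed = level4)
    "target": ("scale2", "scale3", "scale4"),  # target multi-scale feature map
}


def sources(node, graph=FUSION_GRAPH):
    """Return the set of backbone / CoT outputs that contribute to `node`."""
    if node not in graph:
        return {node}  # a leaf: a level feature map that is not itself fused
    found = set()
    for parent in graph[node]:
        found |= sources(parent, graph)
    return found


scale1_sources = sources("scale1")
target_sources = sources("target")
```

Under these assumptions, the first-scale map draws only on the second-level and fourth-level maps, while the target multi-scale map draws on the first-level, second-level, and fourth-level maps.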
[0128] In a possible implementation, determining the location and category information of each chicken in the chicken flock image based on the target multi-scale feature map may include:
[0129] For each feature in the target multi-scale feature map, determine the intersection-over-union (IoU) score and the classification score of the feature; determine the target feature based on the preset IoU score threshold, the preset classification score threshold, the IoU score, and the classification score; determine the location information and category information of each chicken in the chicken flock image based on the target feature.
[0130] The IoU score threshold and the classification score threshold can be set according to the actual situation, and this embodiment does not impose any restrictions on them.
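The threshold-based selection of target features can be sketched as follows. This is an assumption-laden illustration: the disclosure does not state how the two scores are combined, so the sketch keeps a candidate only when both scores reach their thresholds; the dictionary keys, function names, and threshold values are hypothetical. The `iou` helper is included only to show what an intersection-over-union value measures for two boxes.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_targets(candidates, iou_threshold, cls_threshold):
    """Keep candidates whose IoU score and classification score both pass.

    Assumed rule: a logical AND of the two threshold checks.
    """
    return [
        c for c in candidates
        if c["iou_score"] >= iou_threshold and c["cls_score"] >= cls_threshold
    ]


# Toy candidates with made-up scores.
candidates = [
    {"box": (10, 10, 50, 60), "iou_score": 0.82, "cls_score": 0.91},
    {"box": (12, 11, 52, 58), "iou_score": 0.40, "cls_score": 0.95},
    {"box": (200, 80, 240, 120), "iou_score": 0.77, "cls_score": 0.30},
]
kept = select_targets(candidates, iou_threshold=0.5, cls_threshold=0.5)
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Raising either threshold trades recall for precision, which is why the paragraph above leaves both values to be set according to the actual situation.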
[0131] To verify the beneficial effects of the multi-scale target detection method for free-range chickens based on a large model disclosed in this embodiment, data from 14 scenes were collected at two sites and divided into a training set, a validation set, a test set, and three independent test sets containing unknown scenes, in order to verify the detection accuracy and generalization performance of the target detection model in complex environments.
[0132] 1. Ablation experiment of SDTM module
[0133] 1.1 Location Comparison of SDTM Modules
[0134] The backbone network of the original RT-DETR (Real-Time Detection Transformer) model uses a simple 2x downsampling operation (DWConv module) in layers 2, 4, and 8, which may lead to the loss of information about small targets. This proposal attempts to replace the DWConv module of the original model with SDTM, resulting in 7 strategies. The AP50 and AP95 results obtained for each strategy, as well as their average growth on 4 test sets, are shown in Table 1.
[0135] Table 1 Results of using different replacement strategies with SDTM
[0136]
[0137] Replacing DWConv with a single SDTM yields three strategies, with the SDTM located at layer 2, 4, or 8. As shown in rows 2-4 of Table 1, all of these strategies achieve better results than the original model (row 1 of Table 1). Specifically, the single-SDTM strategies perform well on test sets 1 and 2, with the SDTM at layer 8 achieving the best results on both sets. However, these strategies do not perform well on test sets 3 and 4, whose distributions differ significantly from that of the training set. The SDTMs at layers 4 and 8 perform slightly better on test set 3 than the SDTM at layer 2, but significantly worse on test set 4. This is mainly because test set 4 is an indoor scene in which the chickens are densely packed and heavily occluded, making detection more challenging than on test set 3. In terms of average growth, the SDTM at layer 2 outperformed those at layers 4 and 8, with an average growth of 15.43%. This suggests that using SDTM in deep layers may harm the generalization of the model, and that SDTM is better suited to the shallower layers of the network.
[0138] When two SDTMs are used (rows 5-7 of Table 1), the model improves to varying degrees. The SDTMs located in layers 2 and 4 achieved the best average growth and were optimal on test set 4, demonstrating a good balance between accuracy and generalization. The SDTMs located in layers 2 and 8, and in layers 4 and 8, exhibited poorer generalization on test sets 2, 3, and 4, further illustrating that SDTMs are better suited to shallower layers of the network. Furthermore, using three SDTMs (row 8 of Table 1) placed in layers 2, 4, and 8 yielded the second-best average growth and the best result on test set 3, but significantly increased the number of model parameters.
[0139] In summary, placing SDTM in shallow layers demonstrates greater advantages, primarily because SDTM not only captures detailed features from shallow layers but also supplements the entire network with global information. However, placing SDTM in deeper layers may lead to gradient explosion, affecting the model's generalization ability. Based on the performance of different improvement strategies on four test sets and the statistical average growth, this proposal chooses to replace the DWConv nodes in layers 2 and 4 of the original model with two SDTMs as the improved model. The improved model achieved the highest average growth of 22.45% with an increase of 5.85M parameters compared to the original model. The improved model not only achieved the best results on test set 4 but also showed significant improvements on the other three test sets, indicating stronger convergence and generalization capabilities.
[0140] 1.2 Effectiveness of the SDTM module structure
[0141] To further demonstrate the effectiveness of the SDTM construction strategy, an ablation experiment of SDTM was performed, and the results are shown in Table 2.
[0142] Table 2 Ablation experiment results of the SDTM module
[0143]
[0144] Here, SDTM w/o BiFormer means removing the BiFormer part from SDTM, SDTM w/o SPD means removing the SPD part from SDTM, and SDTM w/o Fusion means removing the Fusion part from SDTM.
[0145] When the BiFormer part is removed from SDTM, compared to the full SDTM model, the AP50 decreases by 0.7%, 1.4%, and 19.8% on test sets 2, 3, and 4, respectively. These results demonstrate the beneficial contribution of the BiFormer part to the model's generalization ability, improving its detection capability in unknown scenarios. When the SPD and deconvolution operations are removed from SDTM, the model's FPS decreases by 47.1%. SPD shrinks the feature map while increasing the number of channels, which accelerates the transformer's computation because the number of attention loops and inner product iterations is reduced. The full SDTM design is therefore shown to effectively improve the computational efficiency of BiFormer. Furthermore, the model without SPD loses a significant amount of detailed features, resulting in poor detection performance on test sets 3 and 4, where AP50 falls to 57.2% (-15.6%) and 48.8% (-21.0%), respectively. This further demonstrates that the full SDTM design effectively balances the model's computational efficiency and performance. Experimental results show that the fusion module of SDTM also plays an important role, enabling the interaction and exploration of local and global features. Therefore, the SDTM designed in this proposal can not only fully mine the local and global features of the target, but also improve the model's generalization ability in unknown scenarios and enhance the computational efficiency of the transformer.
[0146] After improving the RT-DETR backbone network based on SDTM, the model's AP (Average Precision) significantly improved across the four test sets. However, the prediction capabilities on test sets 3 and 4 require further improvement. The original RT-DETR model uses a traditional self-attention mechanism in the transformer encoder part of the network, which may suffer from insufficient feature map representation capabilities, especially for dense occlusion prediction tasks. Therefore, to further enhance the model's generalization ability while maintaining the improved model's prediction capabilities, a context-sensitive self-attention module (CoT) was selected and added to the improved model. Table 3 shows the impact of different attention modules on the model's recognition results.
[0147] Table 3 Comparison results of different attention modules used in the transformer encoder
[0148]
[0149] On test set 1, all attention modules maintained the predictive ability of the improved model. On test set 2, the detection performance of PSA and EMA decreased by approximately 1%. On test set 3, only CoT improved AP50, reaching 75.7% (+2.9%). Notably, on test set 4, which features highly variable and dense scenes, only CoT maintained detection accuracy, while the other attention modules declined to varying degrees. Overall, adding the CoT module increased the model's AP50 by a total of 3.1% across the four test sets and further improved the model's generalization ability.
[0150] 3. Ablation experiments of the proposed EMSC-DETR model (i.e., the improved model of RT-DETR, also known as the target detection model in this disclosure).
[0151] The results of the ablation experiment are shown in Table 4.
[0152] Table 4 Ablation Experiment Results of EMSC-DETR Model
[0153]
[0154] Table 4 shows that all modules and combinations of modules are effective. Compared to the original RT-DETR-L, EMSC-DETR improved AP50 by 1.2%, 8.2%, 9.2%, and 28.3% on the four test sets, respectively, with increases of 5.63M parameters and 13.31 GFLOPs. Since the original RT-DETR-L already demonstrated high detection performance with an AP50 of 97.4% on test set 1, it was not easy for EMSC-DETR to further improve detection performance and achieve an AP50 of 98.6% on test set 1. Clearly, EMSC-DETR achieved significant performance improvements on the other three independent datasets, demonstrating excellent generalization ability compared to the original model.
[0155] Replacing the original model's backbone network with the proposed SDTM significantly improves model performance. RT-DETR-L+SDTM achieved the best detection results on test set 1 and demonstrated good adaptability to object detection in unknown scenes on other test sets. When only the transformer encoder is modified to use CoT, RT-DETR-L+CoT improves the accuracy of individual chicken identification in densely occluded flocks, achieving improvements of 0.6%, 3.2%, and 8.8% on test sets 1, 2, and 4, respectively. EMSC-DETR utilizes both SDTM and CoT modules, achieving deep fusion of local and global information at the shallow layer and enhancing the representation of deep feature maps through contextual information. Therefore, EMSC-DETR not only maintained its outstanding performance on test set 1 but also achieved the best detection performance on the other three test sets containing complex unknown scenes.
[0156] The training process of the four models is shown in Figure 6. The original RT-DETR-L exhibited significant oscillations in the later stages of training, which may be detrimental to model convergence. Furthermore, RT-DETR-L+CoT also showed a few large oscillations in the early stages of model training. However, the RT-DETR-L+SDTM and EMSC-DETR models maintained a steady decrease in val loss and a steady increase in AP50, which may be beneficial for the model's continued learning and convergence.
[0157] 4. Comparison of EMSC-DETR with different models
[0158] To further validate the advantages of the proposed EMSC-DETR model in detection performance and generalization, 15 detection models were subjected to the same experiments and their overall performance was compared with EMSC-DETR. The comparison models included both CNN-based and Transformer-based models. YOLOv7, YOLOF, PPYOLOE+, Faster R-CNN, Sparse R-CNN, TOOD, DDOD, RetinaNet, and VFNet were selected as CNN-based models. For Transformer-based models, ViTDet, Swin Transformer, DETR, Deformable DETR, and DINO were selected. To ensure fairness, the selected comparison models had a similar number of parameters or GFLOPs to EMSC-DETR. The detection results obtained by all comparison models on the four test sets are shown in Table 5.
[0159] Table 5 Comparison of EMSC-DETR with different models
[0160]
[0161] As shown in Table 5, among CNN-based models, the YOLO series models consistently underperformed the EMSC-DETR model in both detection capability and generalization performance. Specifically, YOLOv7 achieved AP50 scores of 76.0% and 66.4% on test sets 1 and 2, respectively, significantly lower than EMSC-DETR (98.6% and 95.7%). On test sets 3 and 4, which contained a large number of large targets, YOLOv7 (62.5% and 46.7%) performed relatively well, significantly outperforming YOLOF (40.9% and 41.4%) and PPYOLOE+-L (42.9% and 17%), but was still 13.2% and 23.1% lower than EMSC-DETR, respectively. Notably, Faster R-CNN-R101 demonstrated good generalization performance on test sets 3 and 4, achieving AP50 scores of 79.3% and 67.7%, respectively. However, based on the AP95 metric, Faster R-CNN-R101 performs worse than EMSC-DETR on test set 3, and its GFLOPs are nearly twice those of EMSC-DETR. These results indicate that CNN-based models are easily influenced by large targets during training, leading to poor detection of small targets when faced with multiple targets of varying sizes. Therefore, many CNN-based models perform poorly on test sets 1 and 2 (containing many small targets), but perform better on test sets 3 and 4 (containing many large targets). Furthermore, compared to CNN-based models, the proposed EMSC-DETR demonstrates advantages in detection capability, generalization to new scenes, and GFLOPs. Looking at individual test sets, EMSC-DETR outperforms YOLOF (ranked second in test sets 1 and 2) by 3.2% and 4.1% respectively, and AP95 (ranked second in test sets 3 and 4) by 2.9% and 6.6% respectively.
[0162] Among Transformer-based models, DINO outperforms ViTDet-B, Swin Transformer, DETR, and Deformable DETR on the four test sets, but its AP50 is 3.2%, 4.8%, 0.6%, and 7.7% lower than that of the proposed EMSC-DETR, respectively, and it has more than twice as many GFLOPs as EMSC-DETR. DETR converges slowly when trained for the same number of epochs as the other models and performs poorly on test sets 1 and 2, which contain many small targets, but it performs competitively on test sets 3 and 4, which contain many large targets. ViTDet performs well but has a large number of parameters and GFLOPs. Swin Transformer achieves poor detection results on test sets 1 and 2, possibly because its shifted-window operation disperses region features into non-overlapping windows, making it difficult for the model to learn complete region features. In summary, among all Transformer-based models, EMSC-DETR exhibits the best detection performance and the best generalization performance, along with the lowest number of parameters and GFLOPs.
[0163] Comparative experiments demonstrate that the proposed EMSC-DETR has the best overall detection performance and the strongest generalization ability under different complex background conditions. Figure 7 visualizes the detection results of EMSC-DETR in the nine scenes included in test set 1. EMSC-DETR effectively identifies small, distant chickens, demonstrating that the model minimizes the loss of features for small targets. Even with interference from clustering occlusion, ground leaves, light spots, and shadows, EMSC-DETR accurately distinguishes individual chickens and achieves superior detection results in complex shooting environments (different shooting equipment, lighting, height, and angles).
[0164] Figure 8 visualizes local detection results of EMSC-DETR and RT-DETR in the three scenes included in test set 2. These three scenes, like the previous nine scenes, were all captured at the same location A. However, the backgrounds of these three scenes never appeared in the training set, so they serve to test the model's generalization ability in unknown scenarios. RT-DETR is prone to missed detections and inaccurate location predictions for occluded targets, targets at image edges, and small targets in the distance. In contrast, EMSC-DETR showed good detection results even in unseen scenes; that is, EMSC-DETR effectively distinguishes small targets from the background in dark, distant environments and accurately identifies occluded individual chickens.
[0165] 5. Robustness test
[0166] Robustness experiments test a model's ability to remain stable in the face of uncertainty and noise. Since EMSC-DETR achieved superior detection performance on test set 1 and test set 2, with AP50 scores of 98.6% and 95.7% respectively, robustness experiments were conducted on these two test sets. Furthermore, the YOLOF model, which performed second-best on both test sets, was also selected for robustness experiments. Adding image noise is a common form of interference that can be used in robustness testing. For the total of 950 images in test set 1 and test set 2, this proposal adds 5%, 10%, 15%, 20%, 25%, and 30% salt noise to each image as robustness test data. Figure 9 shows images with 10%, 20%, and 30% salt noise added.
[0167] The test results of EMSC-DETR, RT-DETR-L, and YOLOF on noisy images are shown in Figure 10. The detection performance of all models declined as the noise level increased. For test set 1, the accuracy of EMSC-DETR and YOLOF decreased at similar rates when the noise level was less than 15%, but the accuracy of EMSC-DETR decreased significantly more slowly than that of YOLOF when the noise level was greater than 15%. Compared to RT-DETR-L, EMSC-DETR's accuracy decreased more slowly across all noise levels. For test set 2, RT-DETR-L and YOLOF had similar rates of accuracy decline, and EMSC-DETR's accuracy decreased more slowly than that of RT-DETR-L when the noise level was less than 10%. Overall, EMSC-DETR demonstrated better stability than both RT-DETR-L and YOLOF.
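The construction of the noisy robustness data can be sketched as follows. This is a minimal illustration under stated assumptions: "x% salt noise" is interpreted as setting that fraction of pixels to white, the image is a single-channel nested list, and the function name `add_salt_noise` and its parameters are hypothetical. The exact noise procedure used in the experiments is not given in this disclosure.

```python
import random


def add_salt_noise(image, fraction, rng=None, white=255):
    """Return a copy of a grayscale image with `fraction` of its pixels set to white."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    count = round(fraction * height * width)
    noisy = [row[:] for row in image]  # copy so the input image is left untouched
    # Sample distinct pixel positions so exactly `count` pixels are overwritten.
    for index in rng.sample(range(height * width), count):
        noisy[index // width][index % width] = white
    return noisy


# Build the six noise levels for one all-black 10 x 10 toy image.
clean = [[0] * 10 for _ in range(10)]
levels = (0.05, 0.10, 0.15, 0.20, 0.25, 0.30)
noisy_images = {
    level: add_salt_noise(clean, level, rng=random.Random(0)) for level in levels
}
```

Passing a seeded `random.Random` instance keeps the noisy test data reproducible across models, so that each model is evaluated on identical corrupted images.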
[0168] 6. Fine-tuning experiment
[0169] While EMSC-DETR demonstrates excellent generalization ability, it has some limitations. Compared with its results on test sets 1 and 2, EMSC-DETR shows lower detection accuracy (AP50 below 80%) on test sets 3 and 4. Considering the needs of practical detection applications, some improvement experiments were performed on test sets 3 and 4. Bommasani et al. (2021) proposed a method based on fine-tuning pre-trained weights to adapt the model to downstream tasks. In this proposal, the purpose of the fine-tuning experiments is to explore whether better chicken target detection can be achieved with less new training data and fewer training epochs using existing pre-trained weights. Therefore, the optimal model weights obtained from the previous training set are used as pre-trained weights, and fine-tuning experiments are performed on test sets 3 and 4 to further improve the model's detection performance. In addition, Faster R-CNN-R101 performs well on test sets 3 and 4, even outperforming the proposed model in AP50 on test set 3. Therefore, fine-tuning experiments were carried out on both EMSC-DETR and Faster R-CNN-R101.
[0170] Specifically, from the 500 images in test set 3, a fine-tuning validation set (30 images) and a fine-tuning test set (70 images) were first randomly selected. Then, the remaining 400 images were used to form five fine-tuning training sets with different numbers of images: 50, 100, 200, 300, and 400 images. Test set 4 was divided in the same way. Each fine-tuning training set was used to train the model for 100, 200, and 300 epochs, starting from the obtained pre-trained weights. Finally, the model's detection results were evaluated on the fine-tuning test sets for the different numbers of training images and epochs. Figure 11 presents the fine-tuning experimental results for EMSC-DETR and Faster R-CNN-R101. To achieve ideal detection performance (AP50 above 97.5%), EMSC-DETR can use fewer fine-tuning training images with more training epochs, or more fine-tuning training images with fewer training epochs. The Faster R-CNN-R101 model converges quickly, reaching its optimum at almost 100 epochs. However, its detection results are significantly worse than those of EMSC-DETR, especially in AP50 on test set 4, which is about 3% lower than that of EMSC-DETR. In summary, in the fine-tuning experiments, EMSC-DETR can achieve very good detection results with less new labeled data or less training cost (i.e., fewer epochs).
[0171] Figure 12 and Figure 13 visualize the results of the fine-tuning experiments on test sets 3 and 4, respectively. Before the fine-tuning experiments, the model exhibited missed detections, false detections, and inaccurate localization. However, after training on 50 images for 300 epochs or on 400 images for 100 epochs, the fine-tuned model demonstrated outstanding chicken target detection capabilities in new scenarios. Therefore, EMSC-DETR has the potential to adapt to other complex farming environments in the future and achieve accurate chicken target detection.
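The data split used in the fine-tuning experiments can be sketched as follows. This is an illustration under assumptions: the function name and parameters are hypothetical, image identifiers stand in for image files, and the five training sets are drawn as nested prefixes of the same shuffled pool, which the disclosure does not specify.

```python
import random


def split_for_fine_tuning(image_ids, rng, val_size=30, test_size=70,
                          train_sizes=(50, 100, 200, 300, 400)):
    """Randomly split one test set into validation, test, and several training sets."""
    ids = list(image_ids)
    rng.shuffle(ids)
    val = ids[:val_size]
    test = ids[val_size:val_size + test_size]
    pool = ids[val_size + test_size:]  # the remaining images, 400 when 500 are given
    # Assumed design: each smaller training set is a prefix of the larger ones.
    train_sets = {size: pool[:size] for size in train_sizes}
    return val, test, train_sets


val, test, train_sets = split_for_fine_tuning(range(500), random.Random(0))
# Each training set would then be paired with each epoch budget.
runs = [(size, epochs) for size in train_sets for epochs in (100, 200, 300)]
```

With five training-set sizes and three epoch budgets, this yields 15 fine-tuning runs per model and per test set, all evaluated on the same 70 held-out images.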
[0172] Based on the same concept, this disclosure also provides a multi-scale target detection device 1400 for free-range chickens based on a large model, the device 1400 including:
[0173] The acquisition module 1401 is used to acquire images of cageless chickens.
[0174] Processing module 1402 is used to input the chicken flock image into the target detection model to obtain the location information and category information of each chicken in the chicken flock image;
[0175] The target detection model may include:
[0176] The feature extraction module is used to extract features from the chicken flock image at different levels to obtain hierarchical feature maps at different levels.
[0177] The feature fusion module is used to perform feature fusion processing on hierarchical feature maps of different levels to obtain scale feature maps of different scales, and to perform feature stitching processing on scale feature maps of different scales to obtain target multi-scale feature maps.
[0178] The determination module is used to determine the location and category information of each chicken in the chicken flock image based on the target multi-scale feature map.
[0179] In a possible implementation, the feature extraction module may include:
[0180] The first-level feature extraction submodule is used to perform first-level feature extraction on the chicken flock image to obtain a first-level feature map;
[0181] The second-level feature extraction submodule is used to perform second-level feature extraction on the first-level feature map to obtain a second-level feature map;
[0182] The third-level feature extraction submodule is used to perform third-level feature extraction on the second-level feature map to obtain the third-level feature map;
[0183] The first determining submodule is used to determine the first-level feature map, the second-level feature map, and the third-level feature map as hierarchical feature maps of different levels.
[0184] In a possible implementation, the first-level feature extraction submodule may include:
[0185] A downsampling unit is used to downsample the chicken flock image to obtain a first feature map;
[0186] The first feature fusion unit is used to extract features at different levels from the first feature map and perform feature fusion processing on the extracted features at different levels to obtain the second feature map.
[0187] The second feature fusion unit is used to extract depth features from the second feature map and perform feature fusion processing on the extracted depth features to obtain a third feature map.
[0188] The third feature fusion unit is used to extract features of different levels from the third feature map and perform feature fusion processing on the extracted features of different levels to obtain the first level feature map.
[0189] In a possible implementation, the second-level feature extraction submodule may include:
[0190] The fourth feature fusion unit is used to extract the depth features from the first-level feature map and perform feature fusion processing on the extracted depth features to obtain the fourth feature map.
[0191] The fifth feature fusion unit is used to extract features at different levels from the fourth feature map and perform feature fusion processing on the extracted features at different levels to obtain the fifth feature map.
[0192] The sixth feature fusion unit is used to extract features of different levels from the fifth feature map and perform feature fusion processing on the extracted features of different levels to obtain the second-level feature map.
[0193] In a possible implementation, the third-level feature extraction submodule may include:
[0194] The seventh feature fusion unit is used to extract features from different levels in the second-level feature map and perform feature fusion processing on the extracted features to obtain the sixth feature map.
[0195] The feature combination unit is used to perform a convolution operation on each input channel of the sixth feature map to obtain multiple seventh feature maps, and to perform a pointwise convolution operation on each of the seventh feature maps to obtain multiple eighth feature maps, and to linearly combine the multiple eighth feature maps to obtain a ninth feature map.
[0196] The eighth feature fusion unit is used to extract features at different levels from the ninth feature map and perform feature fusion processing on the extracted features to obtain the third-level feature map.
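The feature combination unit of paragraph [0195] reads like a depthwise separable convolution: a spatial convolution applied to each input channel separately, followed by a pointwise weighting and a linear combination across channels. The following is a minimal pure-Python sketch under that assumption. The function names, the zero padding, the stride of 1, and the odd kernel size are illustrative choices and are not specified in this disclosure. As is usual in neural networks, the "convolution" is computed without flipping the kernel.

```python
from typing import List

Map2D = List[List[float]]      # one channel, H x W
FeatureMap = List[Map2D]       # C channels


def conv2d_same(channel: Map2D, kernel: Map2D) -> Map2D:
    """Slide one square, odd-sized kernel over one channel (zero padding, stride 1)."""
    h, w, k = len(channel), len(channel[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:  # outside the map counts as zero
                        acc += channel[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def depthwise_separable(sixth: FeatureMap,
                        depth_kernels: List[Map2D],
                        point_weights: List[List[float]]) -> FeatureMap:
    # Step 1: one spatial kernel per input channel -> the "seventh feature maps".
    seventh = [conv2d_same(ch, k) for ch, k in zip(sixth, depth_kernels)]
    h, w = len(seventh[0]), len(seventh[0][0])
    ninth: FeatureMap = []
    for weights in point_weights:  # one row of weights per output channel
        # Step 2: pointwise (1x1) weighting of each seventh map -> "eighth feature maps".
        eighth = [[[wt * v for v in row] for row in m]
                  for wt, m in zip(weights, seventh)]
        # Step 3: linear combination (element-wise sum) -> one channel of the ninth map.
        ninth.append([[sum(m[i][j] for m in eighth) for j in range(w)]
                      for i in range(h)])
    return ninth
```

For C input channels, C_out output channels, and a k x k kernel, this factorization uses C·k² + C·C_out weights (ignoring biases), compared with C·C_out·k² for a standard convolution, which is the usual motivation for the design.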
[0197] In a possible implementation, the second feature fusion unit may include:
[0198] The first processing subunit is used to extract features from the second feature map to obtain a first feature submap, and to perform downsampling processing on the first feature submap to obtain a second feature submap.
[0199] The second processing subunit is used to extract features from the second feature map to obtain a third feature submap, perform downsampling processing on the third feature submap to obtain a fourth feature submap, and perform feature extraction processing on the fourth feature submap to obtain a fifth feature submap.
[0200] The third processing subunit is used to downsample the fourth feature submap to obtain a sixth feature submap, extract the key features from the sixth feature submap to obtain a seventh feature submap, and perform a deconvolution operation on the seventh feature submap to obtain an eighth feature submap.
[0201] The fourth processing subunit is used to perform feature splicing processing on the second feature submap, the fifth feature submap, and the eighth feature submap to obtain the third feature map.
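The data flow of the four processing subunits can be traced with a small pure-Python sketch. It only illustrates the wiring and the resolution bookkeeping: the learned feature-extraction blocks are replaced by an identity placeholder, downsampling is assumed to be 2x2 average pooling, and the deconvolution is replaced by nearest-neighbour upsampling. None of these operators is specified in this disclosure, and all function names are hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # channels x height x width


def extract(fm: FeatureMap) -> FeatureMap:
    """Placeholder for a learned feature-extraction block (identity copy here)."""
    return [[row[:] for row in ch] for ch in fm]


def downsample2(fm: FeatureMap) -> FeatureMap:
    """Halve height and width with 2x2 average pooling (assumed operator)."""
    return [[[(ch[i][j] + ch[i][j + 1] + ch[i + 1][j] + ch[i + 1][j + 1]) / 4.0
              for j in range(0, len(ch[0]) - 1, 2)]
             for i in range(0, len(ch) - 1, 2)]
            for ch in fm]


def upsample2(fm: FeatureMap) -> FeatureMap:
    """Double height and width by repetition (stand-in for the deconvolution)."""
    return [[[v for v in row for _ in range(2)]
             for row in ch for _ in range(2)]
            for ch in fm]


def second_feature_fusion(second: FeatureMap) -> FeatureMap:
    # Branch 1: extract, then downsample once.
    sub1 = extract(second)
    sub2 = downsample2(sub1)
    # Branch 2: extract, downsample once, extract again.
    sub3 = extract(second)
    sub4 = downsample2(sub3)
    sub5 = extract(sub4)
    # Branch 3: downsample sub4 a second time, extract key features, upsample back.
    sub6 = downsample2(sub4)
    sub7 = extract(sub6)
    sub8 = upsample2(sub7)
    # Splice the three branches along the channel axis -> third feature map.
    return sub2 + sub5 + sub8
```

If each downsampling step halves the resolution, the second and fifth submaps sit at half the input resolution and the sixth at a quarter, so the deconvolution has to double the resolution for the three branches to be spliced channel-wise. That is an inference from the text, not a stated parameter.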
[0202] In a possible implementation, the feature fusion module may include:
[0203] The first feature fusion submodule is used to extract static context features and dynamic context features from the third-level feature map, and to perform feature fusion processing on the static context features and the dynamic context features to obtain the fourth-level feature map.
[0204] The second feature fusion submodule is used to perform feature fusion processing on the second-level feature map and the fourth-level feature map to obtain a first-scale feature map.
[0205] The third feature fusion submodule is used to perform feature fusion processing on the first-scale feature map and the first-level feature map to obtain a second-scale feature map.
[0206] The fourth feature fusion submodule is used to perform feature fusion processing on the first-scale feature map and the second-scale feature map to obtain a third-scale feature map.
[0207] The fifth feature fusion submodule is used to perform feature fusion processing on the fourth-level feature map and the third-scale feature map to obtain a fourth-scale feature map.
[0208] The second determining submodule is used to determine the first-scale feature map, the second-scale feature map, the third-scale feature map, and the fourth-scale feature map as scale feature maps of different scales.
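The wiring of the five fusion steps above can be written down as a dependency graph and checked mechanically. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The node names such as `level4` and `scale1` are hypothetical labels for the maps named in the text, and the code models only which map feeds which fusion step, not the fusion operators themselves.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each computed map -> the maps it is fused from, as listed in the text.
FUSION_INPUTS: Dict[str, Set[str]] = {
    "level4": {"level3"},            # static + dynamic context fusion
    "scale1": {"level2", "level4"},
    "scale2": {"scale1", "level1"},
    "scale3": {"scale1", "scale2"},
    "scale4": {"level4", "scale3"},
}


def evaluation_order(inputs: Dict[str, Set[str]]) -> List[str]:
    """Return the computed maps in an order where every input is ready first."""
    order = TopologicalSorter(inputs).static_order()  # raises CycleError on a loop
    return [name for name in order if name in inputs]


if __name__ == "__main__":
    print(evaluation_order(FUSION_INPUTS))
```

Read this way, the scale feature maps form a chain (each one depends on the previous one), with the first-level, second-level, and fourth-level feature maps entering as skip connections.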
[0209] In a possible implementation, the feature fusion module may further include:
[0210] The feature stitching module is used to perform feature stitching processing on the second-scale feature map, the third-scale feature map, and the fourth-scale feature map to obtain a target multi-scale feature map.
[0211] In a possible implementation, the determining module may include:
[0212] The third determining submodule is used to determine the intersection-over-union (IoU) score and the classification score of each feature in the target multi-scale feature map.
[0213] The fourth determining submodule is used to determine the target features based on a preset IoU score threshold, a preset classification score threshold, the IoU score, and the classification score.
[0214] The fifth determining submodule is used to determine the location and category information of each chicken in the chicken flock image based on the target features.
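A minimal sketch of the two-threshold selection follows. It assumes that a feature is kept as a target feature only when both its intersection-union (IoU) score and its classification score reach their preset thresholds; the disclosure does not state how the two comparisons are combined. The `Candidate` type, the `select_targets` function, the default threshold values, and the box layout are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout
    category: str
    iou_score: float   # predicted intersection-union score of this feature
    cls_score: float   # predicted classification score of this feature


def select_targets(candidates: List[Candidate],
                   iou_threshold: float = 0.5,
                   cls_threshold: float = 0.5) -> List[Candidate]:
    """Keep candidates whose two scores both reach their thresholds (assumed rule)."""
    return [c for c in candidates
            if c.iou_score >= iou_threshold and c.cls_score >= cls_threshold]


if __name__ == "__main__":
    demo = [
        Candidate((0.0, 0.0, 10.0, 10.0), "chicken", 0.9, 0.8),
        Candidate((5.0, 5.0, 9.0, 9.0), "chicken", 0.3, 0.95),
    ]
    for target in select_targets(demo):
        print(target.category, target.box)
```

The location and category of each chicken would then be read from the `box` and `category` fields of the retained candidates.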
[0215] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.
[0216] The preferred embodiments of this disclosure have been described in detail above with reference to the accompanying drawings. However, this disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of this disclosure, various simple modifications can be made to the technical solutions of this disclosure, and these simple modifications all fall within the protection scope of this disclosure.
[0217] It should also be noted that the various specific technical features described in the above specific embodiments can be combined in any suitable manner without contradiction. In order to avoid unnecessary repetition, this disclosure will not describe the various possible combinations separately.
[0218] Furthermore, the different embodiments of this disclosure may be combined in any manner. Provided that such combinations do not depart from the spirit of this disclosure, they should likewise be regarded as content disclosed herein.
Claims
1. A multi-scale target detection method for free-range chickens based on a large model, characterized in that the method includes:
obtaining a chicken flock image of cageless chickens;
inputting the chicken flock image into a target detection model to obtain the location and category information of each chicken in the chicken flock image;
wherein the target detection model obtains the location and category information of each chicken in the chicken flock image through the following operations:
performing feature extraction at different levels on the chicken flock image to obtain hierarchical feature maps of different levels, including: downsampling the chicken flock image to obtain a first feature map; extracting features at different levels from the first feature map and performing feature fusion processing on the extracted features to obtain a second feature map; extracting features from the second feature map to obtain a first feature sub-map, and downsampling the first feature sub-map to obtain a second feature sub-map; extracting features from the second feature map to obtain a third feature sub-map, downsampling the third feature sub-map to obtain a fourth feature sub-map, and extracting features from the fourth feature sub-map to obtain a fifth feature sub-map; downsampling the fourth feature sub-map to obtain a sixth feature sub-map, extracting key features from the sixth feature sub-map to obtain a seventh feature sub-map, and upsampling the seventh feature sub-map to obtain an eighth feature sub-map; performing feature fusion processing on the second feature sub-map, the fifth feature sub-map, and the eighth feature sub-map to obtain a third feature map; extracting features at different levels from the third feature map and performing feature fusion processing on the extracted features to obtain a first-level feature map; performing second-level feature extraction on the first-level feature map to obtain a second-level feature map; performing third-level feature extraction on the second-level feature map to obtain a third-level feature map; and determining the first-level feature map, the second-level feature map, and the third-level feature map as the hierarchical feature maps of different levels;
performing feature fusion processing on the hierarchical feature maps of different levels to obtain scale feature maps of different scales, and performing feature stitching processing on the scale feature maps of different scales to obtain a target multi-scale feature map, wherein the feature fusion processing includes: extracting static context features and dynamic context features from the third-level feature map and performing feature fusion processing on the static context features and the dynamic context features to obtain a fourth-level feature map; performing feature fusion processing on the second-level feature map and the fourth-level feature map to obtain a first-scale feature map; performing feature fusion processing on the first-scale feature map and the first-level feature map to obtain a second-scale feature map; performing feature fusion processing on the first-scale feature map and the second-scale feature map to obtain a third-scale feature map; performing feature fusion processing on the fourth-level feature map and the third-scale feature map to obtain a fourth-scale feature map; and determining the first-scale feature map, the second-scale feature map, the third-scale feature map, and the fourth-scale feature map as the scale feature maps of different scales; and
determining the location and category information of each chicken in the chicken flock image based on the target multi-scale feature map.
2. The method according to claim 1, characterized in that the step of performing second-level feature extraction on the first-level feature map to obtain a second-level feature map includes: extracting depth features from the first-level feature map and performing feature fusion processing on the extracted depth features to obtain a fourth feature map; extracting features at different levels from the fourth feature map and performing feature fusion processing on the extracted features to obtain a fifth feature map; and extracting features at different levels from the fifth feature map and performing feature fusion processing on the extracted features to obtain the second-level feature map.
3. The method according to claim 1, characterized in that the step of performing third-level feature extraction on the second-level feature map to obtain a third-level feature map includes: extracting features at different levels from the second-level feature map and performing feature fusion processing on the extracted features to obtain a sixth feature map; performing a convolution operation on each input channel of the sixth feature map to obtain multiple seventh feature maps, performing a pointwise convolution operation on each of the seventh feature maps to obtain multiple eighth feature maps, and linearly combining the multiple eighth feature maps to obtain a ninth feature map; and extracting features at different levels from the ninth feature map and performing feature fusion processing on the extracted features to obtain the third-level feature map.
4. The method according to claim 1, characterized in that the step of performing feature stitching processing on the scale feature maps of different scales to obtain a target multi-scale feature map includes: performing feature stitching processing on the second-scale feature map, the third-scale feature map, and the fourth-scale feature map to obtain the target multi-scale feature map.
5. The method according to any one of claims 1-4, characterized in that the step of determining the location and category information of each chicken in the chicken flock image based on the target multi-scale feature map includes: for each feature in the target multi-scale feature map, determining the intersection-over-union (IoU) score and the classification score of the feature; determining target features based on a preset IoU score threshold, a preset classification score threshold, the IoU score, and the classification score; and determining the location and category information of each chicken in the chicken flock image based on the target features.
6. A multi-scale target detection device for free-range chickens based on a large model, characterized in that the device includes:
an acquisition module, used to acquire a chicken flock image of cageless chickens;
a processing module, used to input the chicken flock image into a target detection model to obtain the location information and category information of each chicken in the chicken flock image;
wherein the target detection model includes:
a feature extraction module, used to perform feature extraction at different levels on the chicken flock image to obtain hierarchical feature maps of different levels, including: downsampling the chicken flock image to obtain a first feature map; extracting features at different levels from the first feature map and performing feature fusion processing on the extracted features to obtain a second feature map; extracting features from the second feature map to obtain a first feature sub-map, and downsampling the first feature sub-map to obtain a second feature sub-map; extracting features from the second feature map to obtain a third feature sub-map, downsampling the third feature sub-map to obtain a fourth feature sub-map, and extracting features from the fourth feature sub-map to obtain a fifth feature sub-map; downsampling the fourth feature sub-map to obtain a sixth feature sub-map, extracting key features from the sixth feature sub-map to obtain a seventh feature sub-map, and upsampling the seventh feature sub-map to obtain an eighth feature sub-map; performing feature fusion processing on the second feature sub-map, the fifth feature sub-map, and the eighth feature sub-map to obtain a third feature map; extracting features at different levels from the third feature map and performing feature fusion processing on the extracted features to obtain a first-level feature map; performing second-level feature extraction on the first-level feature map to obtain a second-level feature map; performing third-level feature extraction on the second-level feature map to obtain a third-level feature map; and determining the first-level feature map, the second-level feature map, and the third-level feature map as the hierarchical feature maps of different levels;
a feature fusion module, used to perform feature fusion processing on the hierarchical feature maps of different levels to obtain scale feature maps of different scales, and to perform feature stitching processing on the scale feature maps of different scales to obtain a target multi-scale feature map, wherein the feature fusion processing includes: extracting static context features and dynamic context features from the third-level feature map and performing feature fusion processing on the static context features and the dynamic context features to obtain a fourth-level feature map; performing feature fusion processing on the second-level feature map and the fourth-level feature map to obtain a first-scale feature map; performing feature fusion processing on the first-scale feature map and the first-level feature map to obtain a second-scale feature map; performing feature fusion processing on the first-scale feature map and the second-scale feature map to obtain a third-scale feature map; performing feature fusion processing on the fourth-level feature map and the third-scale feature map to obtain a fourth-scale feature map; and determining the first-scale feature map, the second-scale feature map, the third-scale feature map, and the fourth-scale feature map as the scale feature maps of different scales; and
a determination module, used to determine the location and category information of each chicken in the chicken flock image based on the target multi-scale feature map.