Plateau livestock quantity intelligent identification method
By using a dual-spectral imaging device and cross-modal feature interaction technology, the robustness problem of livestock number identification under extreme lighting conditions in plateau areas has been solved, achieving high-precision, all-weather, adaptive intelligent monitoring that can adapt to environmental changes and meet real-time processing requirements.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- QINGHAI PROVINCIAL BRANCH OF PICC PROPERTY & CASUALTY CO LTD
- Filing Date
- 2026-02-14
- Publication Date
- 2026-06-12
AI Technical Summary
Traditional livestock counting methods fail to achieve robust target detection and counting in high-altitude areas due to extreme optical interference that renders visible light images ineffective. Furthermore, existing multimodal fusion methods cannot adapt to drastic fluctuations in lighting conditions, and embedded systems are sensitive to computational resources and power consumption, making it difficult to meet real-time processing requirements.
A dual-spectral imaging device is used to simultaneously acquire visible light and thermal infrared images. Through a spatiotemporally aligned dual-channel feature extraction mechanism, combined with a cross-modal feature interaction module and a target detection network, robust detection and accurate counting of livestock targets are achieved, and an online adaptive update mechanism is used to adapt to environmental changes.
It achieves high-precision identification of livestock numbers under extreme lighting conditions, improves recall rate and positioning accuracy, has all-weather monitoring capabilities, and maintains model generalization ability through an adaptive update mechanism to meet real-time processing requirements.
Abstract
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence and image recognition, and specifically relates to a method for intelligent identification of the number of livestock on plateaus.
Background Technology
[0002] With the rapid development of smart animal husbandry in complex geographical environments such as plateaus and frigid regions, the demand for automated and intelligent livestock counting is becoming increasingly urgent. Traditional livestock counting mainly relies on manual inspections or image acquisition and analysis using a single visible light camera, based on the core assumption of stable lighting conditions and clear target outlines. However, plateau regions are characterized by intense solar radiation, frequent low light levels at dawn and dusk, diffuse fog, and surface glare, leading to problems such as overexposure, underexposure, sudden drops in contrast, or blurred targets in visible light images. This severely weakens the robustness and accuracy of vision-based target detection and counting algorithms.
[0003] Multimodal sensing technology offers a new approach to overcoming the limitations of single sensors. Thermal imaging sensors, by capturing the infrared radiation of objects themselves, can effectively distinguish living organisms from the background in complete darkness or strong glare, possessing excellent all-weather capabilities. Visible light sensors, on the other hand, provide rich texture and color information under normal lighting conditions, which is beneficial for fine-grained recognition. Theoretically, fusing the two can balance environmental adaptability and recognition accuracy. However, existing fusion methods mostly employ simple weighting, feature stitching, or fixed rule-based decision-making, lacking the ability to perceive the dynamic changes in modal reliability under different environmental conditions, making it difficult to achieve adaptive, high-confidence livestock number determination in high-altitude scenes with drastic lighting fluctuations.
[0004] The following problems still exist in the identification of livestock on the plateau: the visible light mode suffers severe feature degradation under strong glare or fog, and the thermal imaging mode is prone to adhesion and missed detection when the temperature difference is small or the herd is dense; the fixed fusion strategy cannot dynamically adjust the mode weights according to the real-time environment, resulting in a sharp decline in the performance of the fusion results under certain working conditions; in addition, embedded systems deployed in the field are highly sensitive to computing resources and power consumption, and existing deep fusion models often have a large number of parameters and high inference latency, making it difficult to meet the real-time processing requirements of the edge.
Summary of the Invention
[0005] This invention provides an intelligent method for identifying the number of livestock on plateaus, aiming to solve the problem of the failure of single visible light images under extreme lighting conditions (strong glare, twilight, fog). This method constructs a multimodal perception fusion architecture, simultaneously acquiring visible light and thermal infrared image data, and based on a spatiotemporally aligned dual-channel feature extraction mechanism, achieves robust detection and accurate counting of livestock targets under complex lighting conditions on plateaus.
[0006] As one embodiment of the present invention, the intelligent identification method for livestock numbers in high-altitude areas includes the following steps: a dual-spectrum imaging device deployed in the monitoring area of high-altitude pastures simultaneously acquires visible light image sequences and thermal infrared image sequences within the same field of view; the visible light image sequences and thermal infrared image sequences are spatiotemporally synchronized and calibrated to generate a dual-modal image pair with timestamp alignment and consistent spatial coordinates; the visible light image and thermal infrared image in the dual-modal image pair are independently feature-encoded to obtain visible light feature maps and thermal infrared feature maps; the visible light feature maps and thermal infrared feature maps are input into a cross-modal feature interaction module to perform channel attention weighting and spatial position alignment operations to generate a joint feature map that fuses complementary information from both modes; based on the joint feature map, the bounding boxes of all individual livestock in the image are located through a target detection network, and duplicate detections and false detections are eliminated according to the spatial distribution density and morphological constraints of the bounding boxes; the number of effective bounding boxes after post-processing is counted, and the final livestock number identification result is output.
[0007] Furthermore, the dual-spectrum imaging device includes a coaxially mounted visible light camera and a thermal infrared camera, both sharing the same optical center. Their lens focal lengths are calibrated and matched to ensure that the overlap of the imaging field of view is not less than 95% within a preset monitoring distance. The visible light camera uses a global shutter photosensitive element and has a wide dynamic range imaging capability of not less than 120 dB, used to preserve image details under strong glare or low illumination conditions. The thermal infrared camera uses an uncooled vanadium oxide microbolometer focal plane array with a thermal sensitivity of not more than 50 millikelvin and a spatial resolution of 640×480 pixels, used to capture the thermal radiation signal of the livestock body surface in dawn, dusk, fog, or nighttime environments.
[0008] Furthermore, the spatiotemporal synchronization calibration specifically includes: synchronously starting the frame exposure timing of the visible light camera and the thermal infrared camera using a hardware trigger signal, so that the timestamp deviation between the two images is controlled within 10 milliseconds; performing an affine transformation on the thermal infrared image based on a pre-calibrated binocular extrinsic matrix, so that its pixel coordinate system is consistent with that of the visible light image; and using an image registration algorithm based on maximizing mutual information to perform sub-pixel-level spatial alignment on each frame of dual-modal image pair, ensuring that the center position offset of the same livestock target in the two images does not exceed one pixel.
[0009] Furthermore, the feature encoding process employs a dual-branch convolutional neural network structure. The visible light branch consists of an improved ResNet-50 backbone network, with its first convolutional layer replaced by a learnable multi-scale receptive field convolutional module to enhance the response capability to livestock targets of different sizes. The thermal infrared branch consists of a lightweight MobileNetV3 network, with the dilation rate of its depthwise separable convolutional layers set to 2 to suit the low-texture characteristics of thermal imaging. Both branches output feature maps in the fourth stage, with a spatial resolution of 1/16 of the input image and channel dimensions of 2048 and 512, respectively.
[0010] Furthermore, the cross-modal feature interaction module includes a channel attention fusion unit and a spatial alignment refinement unit. The channel attention fusion unit first performs global average pooling on the visible light feature map and the thermal infrared feature map respectively to generate their respective channel description vectors. Then, the two description vectors are concatenated and input into a two-layer fully connected network to output the weight coefficients of each channel, which are applied to the original feature map to achieve importance weighting between modes. The spatial alignment refinement unit adopts a deformable convolution structure and uses the weighted visible light feature map as a reference to guide the thermal infrared feature map to perform local geometric deformation compensation, eliminating edge misalignment caused by the difference in physical characteristics of dual-spectrum imaging.
[0011] Furthermore, the target detection network is an improved MMYOLO target detection framework, in which a bidirectional feature pyramid structure is introduced in the neck network, fusing multi-scale joint feature maps from the cross-modal interaction module; the detection head predicts the target center point offset, bounding box size, and confidence score respectively; during the training phase, a weighted combination of the focal loss function and the generalized intersection over union (GIoU) loss function is adopted, with a weight ratio of 1:1.5, to alleviate the problem of positive and negative sample imbalance caused by the dense arrangement of livestock in plateau scenes.
[0012] Furthermore, the specific operations for removing duplicate and false detection targets include: performing non-maximum suppression on all detected bounding boxes, with the intersection over union (IoU) threshold set to 0.4; calculating the aspect ratio and area of each retained bounding box, and if the aspect ratio is less than 0.3 or greater than 3.0, or the area is less than a preset minimum threshold (corresponding to 30% of the projected area of an adult yak), it is determined to be a false detection and removed; for cases where the center distance between adjacent bounding boxes is less than a preset threshold (corresponding to 1.2 times the shoulder width of an adult yak), the temperature continuity in the thermal infrared image is used to determine whether they belong to the same individual animal, and if the temperature gradient change is gradual, they are merged into a single individual.
[0013] Furthermore, the method also includes an online adaptive update mechanism for the model: during continuous monitoring, samples of the true number of livestock that have been manually verified are collected periodically to construct an incremental training set; when the number of new samples accumulates to a preset batch size, the backbone network parameters are frozen, and only the weights of the detection head and the cross-modal interaction module are finely adjusted. The momentum gradient descent method is used for local optimization for no more than 10 iterations to adapt to changes in livestock size, coat color and background vegetation caused by seasonal changes.
[0014] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0015] By simultaneously fusing visible light and thermal infrared dual-modal images, the problem of information loss in single visible light imaging under extreme lighting conditions such as strong glare at high altitudes, low illumination at dawn and dusk, and fog scattering is effectively overcome.
[0016] The constant thermal radiation characteristics provided by thermal infrared images are unaffected by ambient light, ensuring the continuity and stability of all-weather monitoring.
[0017] The cross-modal feature interaction mechanism enables fine-grained alignment and complementarity of the two modalities in the channel and spatial dimensions, significantly improving the recall rate and localization accuracy of target detection.
[0018] A post-processing strategy combining morphological constraints and temperature continuity criteria effectively suppresses duplicate counting and interference from non-biological heat sources in dense scenes.
[0019] The online adaptive update mechanism ensures the model's generalization ability in long-term operation and avoids performance degradation caused by dynamic changes in the environment.
Attached Figure Description
[0020] Figure 1 is a schematic diagram of the overall technical solution architecture of the intelligent identification method for the number of livestock on plateaus proposed in this invention;
[0021] Figure 2 is a schematic diagram of the core principle framework of the cross-modal feature interaction module in this invention;
[0022] Figure 3 is a flowchart of the dual-modal image spatiotemporal synchronization calibration and feature encoding in this invention;
[0023] Figure 4 is a flowchart of the target detection and post-processing stage in this invention;
[0024] Figure 5 is a schematic diagram of the multi-level interaction relationship and data flow between the dual-spectrum imaging device and the plateau pasture monitoring system in this invention;
[0025] Figure 6 is a flowchart of the online adaptive update mechanism for the model in this invention.
Detailed Implementation
[0026] Example 1: Referring to Figures 1 to 6, this invention provides an intelligent method for identifying the number of livestock on plateaus, addressing the problem of the failure of single visible light images under extreme lighting conditions. This method constructs a multimodal perception fusion architecture, simultaneously acquiring visible light and thermal infrared image data, and utilizes a spatiotemporally aligned dual-channel feature extraction mechanism to achieve robust detection and accurate counting of livestock targets in complex lighting environments on plateaus. The specific embodiments of this invention are described in detail below with reference to the accompanying drawings.
[0027] A dual-spectral imaging device deployed in the monitoring area of high-altitude pastures simultaneously acquires visible light and thermal infrared image sequences within the same field of view. The dual-spectral imaging device includes a coaxially mounted visible light camera and a thermal infrared camera, both sharing the same optical center. Their lens focal lengths are calibrated and matched to ensure that the overlap of the imaging field of view is no less than 95% within a preset monitoring distance. The visible light camera uses a global shutter sensor with a wide dynamic range of no less than 120 dB, used to preserve image details under strong glare or low-light conditions. The thermal infrared camera uses an uncooled vanadium oxide microbolometer focal plane array with a thermal sensitivity of no more than 50 millikelvin and a spatial resolution of 640×480 pixels, used to capture the thermal radiation signals from livestock surfaces in twilight, fog, or nighttime environments. The two cameras initiate frame exposure through a hardware synchronization trigger mechanism, ensuring that the timestamp deviation between each frame of visible light and thermal infrared image is controlled within 10 milliseconds, thus forming a strictly time-aligned dual-modal image pair.
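The 10-millisecond tolerance can be illustrated with a small software-side check. The sketch below is not part of the described hardware trigger; it only assumes that each camera delivers a sorted list of frame timestamps in milliseconds, and the function name `pair_frames` is hypothetical. It pairs each visible frame with the nearest thermal frame and drops pairs whose deviation exceeds the tolerance.

```python
from bisect import bisect_left

MAX_DEVIATION_MS = 10.0  # tolerance stated in the description


def pair_frames(visible_ts, thermal_ts, max_dev_ms=MAX_DEVIATION_MS):
    """Pair each visible frame with the nearest thermal frame.

    Both inputs are sorted lists of timestamps in milliseconds.
    Returns (visible_index, thermal_index) tuples whose deviation is
    within max_dev_ms; frames without a close enough partner are dropped.
    """
    pairs = []
    used = set()
    for i, t in enumerate(visible_ts):
        pos = bisect_left(thermal_ts, t)
        # Candidates: the thermal frames just before and just after t.
        candidates = [j for j in (pos - 1, pos) if 0 <= j < len(thermal_ts)]
        if not candidates:
            continue
        best = min(candidates, key=lambda j: abs(thermal_ts[j] - t))
        if abs(thermal_ts[best] - t) <= max_dev_ms and best not in used:
            pairs.append((i, best))
            used.add(best)
    return pairs


# Illustrative timestamps (hypothetical values, not measured data).
visible = [0.0, 40.0, 80.0, 120.0]
thermal = [3.0, 41.5, 95.0, 121.0]
pairs = pair_frames(visible, thermal)
print(pairs)  # [(0, 0), (1, 1), (3, 3)]
```

The third visible frame is dropped because its nearest thermal frame is 15 ms away, which is outside the stated tolerance.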
[0028] The visible light image sequence and the thermal infrared image sequence are spatiotemporally synchronized and calibrated to generate a dual-modal image pair with aligned timestamps and consistent spatial coordinates. The spatiotemporal synchronization calibration process first relies on a hardware trigger signal to synchronously start the frame exposure timing of the visible light camera and the thermal infrared camera, ensuring that the timestamp deviation between the two images is controlled within 10 milliseconds. Then, based on a pre-calibrated binocular extrinsic matrix, an affine transformation is performed on the thermal infrared image to unify its pixel coordinate system with that of the visible light image. On this basis, an image registration algorithm based on maximizing mutual information is used to perform sub-pixel-level spatial alignment on each frame of the dual-modal image pair, ensuring that the center position offset of the same livestock target in the two images does not exceed one pixel. The mutual information maximization registration algorithm iteratively optimizes the spatial transformation parameters of the thermal infrared image to achieve maximum consistency with the visible light image in terms of grayscale distribution statistical characteristics, thereby compensating for geometric deviations caused by lens distortion, installation errors, and atmospheric refraction.
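A minimal sketch of mutual-information-based registration follows. It is a toy simplification, not the registration algorithm of the invention: it assumes small grayscale images stored as 2D lists, searches only integer horizontal shifts exhaustively (the description calls for iterative sub-pixel optimization of the transformation parameters), and the helper names `mutual_information`, `shift_right`, and `best_shift` are hypothetical. It does show why mutual information suits cross-modal alignment: the score depends on the statistical co-occurrence of intensities, not on their absolute values.

```python
import math
from collections import Counter


def mutual_information(img_a, img_b, bins=8, max_value=256):
    """Mutual information (bits) between two equally sized grayscale images.

    Images are 2D lists of integer intensities in [0, max_value).
    """
    width = max_value // bins
    a = [v // width for row in img_a for v in row]
    b = [v // width for row in img_b for v in row]
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    n = len(a)
    joint = Counter(zip(a, b))
    count_a = Counter(a)
    count_b = Counter(b)
    mi = 0.0
    for (i, j), count in joint.items():
        # p(i, j) * log2( p(i, j) / (p(i) * p(j)) )
        mi += (count / n) * math.log2(count * n / (count_a[i] * count_b[j]))
    return mi


def shift_right(img, dx):
    """Shift an image dx pixels to the right with wrap-around (toy transform)."""
    out = []
    for row in img:
        k = dx % len(row)
        out.append(row[-k:] + row[:-k] if k else list(row))
    return out


def best_shift(reference, moving, max_shift=3):
    """Exhaustive search for the horizontal shift that maximizes MI."""
    return max(range(max_shift + 1),
               key=lambda dx: mutual_information(reference, shift_right(moving, dx)))


# Toy pair: the "thermal" image is an intensity-inverted copy of the
# "visible" image, displaced two pixels to the left.
visible = [[10, 10, 200, 120, 10, 10, 10, 10] for _ in range(6)]
inverted = [[255 - v for v in row] for row in visible]
moving = [row[2:] + row[:2] for row in inverted]

print(best_shift(visible, moving))  # 2
```

Although the two toy images have opposite contrast, the search still recovers the two-pixel displacement.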
[0029] The visible light image and thermal infrared image in the dual-modal image pair are independently feature-encoded to obtain visible light feature maps and thermal infrared feature maps. The feature encoding process employs a dual-branch convolutional neural network structure. The visible light branch consists of an improved ResNet-50 backbone network, with its first convolutional layer replaced by a learnable multi-scale receptive field convolutional module to enhance the response capability to livestock targets of different sizes. The thermal infrared branch consists of a lightweight MobileNetV3 network, with the dilation rate of its depthwise separable convolutional layers set to 2 to suit the low-texture characteristics of thermal imaging. Both branches output feature maps in the fourth stage, with a spatial resolution of 1/16 of the input image and channel dimensions of 2048 and 512, respectively. The multi-scale receptive field convolutional module consists of three parallel dilated convolutional paths with dilation rates set to 1, 2, and 3, respectively. The outputs of the paths are concatenated and then compressed using one-dimensional convolution to generate an initial feature representation with multi-scale context awareness. In the thermal infrared branch, depthwise separable convolution with a dilation rate of 2 effectively expands the receptive field, compensating for the lack of high-frequency texture details in thermal imaging, while maintaining computational efficiency.
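The effect of the three dilation rates and the 1/16 output resolution can be checked with simple arithmetic. The sketch below assumes a 3×3 kernel and a 640×480 input; neither value is stated for the visible light branch in the description, so both are illustrative assumptions.

```python
def effective_kernel(kernel_size, dilation):
    """Extent covered along one axis by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stage4_shape(height, width, channels, stride=16):
    """Feature-map shape (C, H, W) at 1/16 of the input resolution."""
    return (channels, height // stride, width // stride)


KERNEL = 3  # assumed; the description does not state the kernel size

extents = [effective_kernel(KERNEL, d) for d in (1, 2, 3)]
print(extents)  # [3, 5, 7]

# Assumed 640x480 input for both branches (illustrative only).
print(stage4_shape(480, 640, 2048))  # (2048, 30, 40)
print(stage4_shape(480, 640, 512))   # (512, 30, 40)
```

Under these assumptions the three parallel paths cover 3, 5, and 7 pixels per axis, and both branches deliver 30×40 feature maps that differ only in channel count.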
[0030] The visible light feature map and the thermal infrared feature map are input into the cross-modal feature interaction module, where channel attention weighting and spatial alignment operations are performed to generate a joint feature map that fuses complementary information from both modalities. The cross-modal feature interaction module includes a channel attention fusion unit and a spatial alignment refinement unit. The channel attention fusion unit first performs global average pooling on both the visible light and thermal infrared feature maps to generate their respective channel description vectors. These two description vectors are then concatenated and input into a two-layer fully connected network, which outputs the weight coefficient of each channel; the coefficients are applied to the original feature maps to achieve inter-modal importance weighting. The first layer of this two-layer fully connected network has 2560 neurons with a rectified linear unit (ReLU) activation function, and the second layer has the same number of neurons as the total number of channels (2560) with a sigmoid activation function. The output value serves as the normalized weight for each channel. The spatial alignment refinement unit uses a deformable convolutional structure and takes the weighted visible light feature map as a reference to guide the thermal infrared feature map through local geometric deformation compensation, eliminating edge misalignment caused by differences in the physical properties of dual-spectrum imaging. The offsets of the deformable convolution are predicted by a small convolutional subnetwork that receives the concatenated bimodal features as input and outputs a two-dimensional offset vector for each sampling point, thereby achieving pixel-level spatially adaptive alignment.
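The channel attention fusion unit can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the trained module: it uses toy sizes (3 visible and 2 thermal channels instead of 2048 and 512), random untrained weights, and hypothetical function names. The deformable-convolution alignment unit is not covered.

```python
import math
import random


def global_average_pool(feature_map):
    """Mean of each channel; feature_map is a list of 2D lists (C x H x W)."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]


def dense(vector, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def scale_channels(feature_map, channel_weights):
    """Multiply every value of channel c by channel_weights[c]."""
    return [[[v * w for v in row] for row in ch]
            for ch, w in zip(feature_map, channel_weights)]


def channel_attention(visible, thermal, w1, b1, w2, b2):
    """Return the re-weighted visible and thermal maps plus the channel weights."""
    descriptor = global_average_pool(visible) + global_average_pool(thermal)
    hidden = [max(0.0, h) for h in dense(descriptor, w1, b1)]  # ReLU layer
    weights = [1.0 / (1.0 + math.exp(-z))                      # sigmoid layer
               for z in dense(hidden, w2, b2)]
    n_vis = len(visible)
    return (scale_channels(visible, weights[:n_vis]),
            scale_channels(thermal, weights[n_vis:]),
            weights)


# Toy sizes: 3 visible + 2 thermal channels, 4x4 spatial resolution.
random.seed(0)
visible = [[[random.random() for _ in range(4)] for _ in range(4)] for _ in range(3)]
thermal = [[[random.random() for _ in range(4)] for _ in range(4)] for _ in range(2)]
total = len(visible) + len(thermal)
w1 = [[random.uniform(-1, 1) for _ in range(total)] for _ in range(total)]
w2 = [[random.uniform(-1, 1) for _ in range(total)] for _ in range(total)]
b1 = [0.0] * total
b2 = [0.0] * total

vis_w, thr_w, weights = channel_attention(visible, thermal, w1, b1, w2, b2)
print([round(w, 3) for w in weights])
```

Because the second layer ends in a sigmoid, every channel weight lies strictly between 0 and 1, so a degraded modality can be attenuated without being amplified elsewhere.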
[0031] Based on the joint feature map, the bounding boxes of all individual livestock in the image are located using an object detection network. Duplicate and falsely detected targets are eliminated based on the spatial distribution density and morphological constraints of the bounding boxes. The object detection network is an improved MMYOLO object detection framework, whose neck network introduces a bidirectional feature pyramid structure, fusing multi-scale joint feature maps from the cross-modal interaction module. The detection head predicts the target center point offset, bounding box size, and confidence score. During training, a weighted combination of the focal loss function and the generalized intersection over union (GIoU) loss function is used, with a weight ratio of 1:1.5, to alleviate the positive-negative sample imbalance caused by the dense arrangement of livestock in plateau scenes. The focal loss function $L_{\mathrm{focal}}$ is defined as follows:
[0032] $$L_{\mathrm{focal}} = -\alpha \,(1 - p_t)^{\gamma} \log(p_t);$$
[0033] where $p_t$ represents the model's predicted probability for the true class, $\alpha$ is the weight that balances positive and negative samples and is set to 0.75, and $\gamma$ is the focusing parameter and is set to 2. The generalized intersection over union loss function $L_{\mathrm{GIoU}}$ is defined as follows:
[0034] $$L_{\mathrm{GIoU}} = 1 - \mathrm{GIoU};$$
[0035] where $\mathrm{GIoU}$ is the generalized intersection over union, whose calculation considers the smallest enclosing region of the predicted box and the ground-truth box, enhancing the regression ability for non-overlapping boxes. The total loss function is $L_{\mathrm{total}} = L_{\mathrm{focal}} + 1.5\,L_{\mathrm{GIoU}}$.
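The loss combination can be illustrated numerically. The sketch below assumes the standard published forms of focal loss and GIoU loss, applies the balance weight of 0.75 directly to the true-class term as a simplification, and reads the 1:1.5 ratio as weights of 1.0 and 1.5; the function names are hypothetical.

```python
import math

ALPHA = 0.75                 # balance weight from the description
GAMMA = 2.0                  # focusing parameter from the description
W_FOCAL, W_GIOU = 1.0, 1.5   # 1:1.5 weight ratio from the description


def focal_loss(p_t, alpha=ALPHA, gamma=GAMMA):
    """Focal loss for one sample; p_t is the predicted true-class probability (> 0)."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2), x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose


def total_loss(p_t, pred_box, true_box):
    """Weighted sum of focal loss and GIoU loss."""
    return W_FOCAL * focal_loss(p_t) + W_GIOU * (1.0 - giou(pred_box, true_box))


same = giou((0, 0, 2, 2), (0, 0, 2, 2))       # identical boxes
disjoint = giou((0, 0, 1, 1), (2, 0, 3, 1))   # no overlap
print(same, round(disjoint, 4))               # 1.0 -0.3333
print(round(total_loss(0.5, (0, 0, 2, 2), (0, 0, 2, 2)), 4))
```

For non-overlapping boxes plain IoU is zero regardless of distance, whereas GIoU turns negative, which keeps the regression term informative.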
[0036] The specific operations for removing duplicate and false detections include: performing non-maximum suppression on all detected bounding boxes, with an intersection over union (IoU) threshold set to 0.4; calculating the aspect ratio and area of each retained bounding box; if the aspect ratio is less than 0.3 or greater than 3.0, or the area is less than a preset minimum threshold (corresponding to 30% of the projected area of an adult yak), the box is considered a false detection and removed; for cases where the center distance between adjacent bounding boxes is less than a preset threshold (corresponding to 1.2 times the shoulder width of an adult yak), the temperature continuity in the thermal infrared image is used to determine whether they belong to the same individual animal; if the temperature gradient change is gradual, they are merged into a single individual. Temperature continuity is determined by extracting the temperature distribution curves of the areas covered by the two bounding boxes from the thermal infrared image and calculating their Pearson correlation coefficient; if the correlation coefficient is greater than 0.85, they are considered to be the same heat source, and the bounding box merging operation is performed.
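The three post-processing rules can be sketched as below. This is a simplified illustration: boxes are `(x1, y1, x2, y2)` tuples, the aspect ratio is taken as width divided by height (the description does not say which way it is computed), temperature profiles are assumed to be equal-length lists, and `min_area` plus all sample values are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.4):
    """Greedy non-maximum suppression over (box, score) pairs."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept


def plausible_shape(box, min_area, min_ratio=0.3, max_ratio=3.0):
    """Keep boxes whose width/height ratio and area are within bounds."""
    w, h = box[2] - box[0], box[3] - box[1]
    return min_ratio <= w / h <= max_ratio and w * h >= min_area


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def same_heat_source(profile_a, profile_b, threshold=0.85):
    """True when two temperature profiles correlate above the threshold."""
    return pearson(profile_a, profile_b) > threshold


detections = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8),
              ((30, 0, 40, 10), 0.7), ((50, 0, 100, 5), 0.6)]
after_nms = nms(detections)
valid = [d for d in after_nms if plausible_shape(d[0], min_area=50)]
print(len(after_nms), len(valid))  # 3 2

print(same_heat_source([30, 32, 35, 33], [30.5, 32.4, 35.2, 33.1]))  # True
print(same_heat_source([30, 32, 35, 33], [35, 33, 30, 32]))          # False
```

In the sample data, the second box is suppressed as a duplicate of the first (IoU about 0.68), and the fourth is rejected because its width-to-height ratio of 10 exceeds the upper bound of 3.0.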
[0037] The system counts the number of valid bounding boxes after post-processing and outputs the final livestock count result. This statistical process is performed in real time after each frame of image processing is completed, and the results are uploaded to the ranch management platform via a wireless communication module for remote monitoring and decision-making.
[0038] The method also includes an online adaptive update mechanism for the model: during continuous monitoring, manually verified samples of livestock numbers are periodically collected to construct an incremental training set; when the number of new samples accumulates to a preset batch size, the backbone network parameters are frozen, and only the weights of the detection head and cross-modal interaction module are fine-tuned. Momentum gradient descent is used for local optimization over no more than 10 iterations to adapt to seasonal changes in livestock size, coat color, and background vegetation. The incremental training set is constructed by having pasture inspectors randomly select five monitoring periods daily to manually verify the system output, marking incorrect samples and recording the correct number, forming labeled fine-tuning data. During fine-tuning, the learning rate is set to 1/10 of that used in the initial training phase, the momentum coefficient is 0.9, and the weight decay coefficient is 0.0005, ensuring that new environmental features are absorbed without destroying existing knowledge.
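The fine-tuning schedule can be sketched with a toy momentum gradient descent loop. The code below is an illustration under assumptions, not the training procedure of the invention: the model is a plain dictionary of parameter lists, the group names `backbone`, `head`, and `interaction` are hypothetical, and the quadratic toy loss stands in for the detection loss.

```python
def sgd_momentum_step(params, grads, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One momentum-SGD update with L2 weight decay; returns (params, velocity)."""
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        g = g + weight_decay * p      # weight decay as an L2 gradient term
        v = momentum * v + g          # momentum accumulation
        new_velocity.append(v)
        new_params.append(p - lr * v)
    return new_params, new_velocity


def fine_tune(model, grad_fn, base_lr, max_iters=10):
    """Update only the head and interaction groups; the backbone stays frozen."""
    lr = base_lr / 10.0               # one tenth of the initial learning rate
    trainable = ("head", "interaction")
    velocity = {name: [0.0] * len(model[name]) for name in trainable}
    for _ in range(max_iters):
        grads = grad_fn(model)
        for name in trainable:
            model[name], velocity[name] = sgd_momentum_step(
                model[name], grads[name], velocity[name], lr)
    return model


def toy_grad(model):
    """Gradient of the toy loss sum((p - 1)^2) for the trainable groups."""
    return {name: [2.0 * (p - 1.0) for p in model[name]]
            for name in ("head", "interaction")}


model = {"backbone": [0.5, -0.2], "head": [0.0], "interaction": [0.0, 2.0]}
tuned = fine_tune(model, toy_grad, base_lr=0.1)
print(tuned["backbone"])  # [0.5, -0.2]
print(tuned["head"], tuned["interaction"])
```

With the reduced learning rate and the cap of 10 iterations, the trainable parameters move toward the new optimum without reaching it, which reflects the stated aim of absorbing new conditions while limiting drift from existing knowledge.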
[0039] In real-world testing at typical high-altitude pastures above 3000 meters, with an average annual sunshine duration exceeding 3000 hours and a diurnal temperature range of 30 degrees Celsius, the livestock number identification accuracy of this method reached 96.7%, an improvement of 28 percentage points compared to the single visible light method. Furthermore, the identification success rate exceeded 90% during foggy weather and twilight. This performance improvement is primarily attributed to the stable thermal radiation characteristics of the thermal infrared mode in low-light conditions, and the refined fusion of information from the two modes through a cross-modal interaction mechanism. The channel attention mechanism effectively suppresses interference from overexposed areas of visible light under strong glare, while the spatial alignment refinement unit corrects the contour blurring caused by atmospheric attenuation in thermal imaging, enabling the joint feature map to possess both high semantic discriminative power and precise localization capability.
[0040] The entire method begins with dual-spectral data acquisition, proceeds through spatiotemporal calibration, dual-branch feature encoding, cross-modal interaction, target detection, and post-processing filtering, and finally outputs the counting results, forming a closed-loop intelligent recognition chain. Seamless integration between the steps is achieved through strictly defined data interfaces, ensuring that processing latency is controlled within 200 milliseconds per frame to meet real-time monitoring requirements. During system operation, all intermediate feature maps and detection results are stored on local solid-state storage, supporting post-event backtracking analysis and model diagnostics.
[0041] In extreme weather conditions, such as sandstorms or dense fog, visible light imagery may become completely ineffective, making thermal infrared modality the only reliable information source. This method addresses this by implementing a modal confidence assessment mechanism within the cross-modal interaction module. When the average gradient magnitude of the visible light feature map falls below a preset threshold, its channel weights are automatically reduced, or even set to zero, relying entirely on thermal infrared features for detection, thus ensuring system availability under extreme conditions. This confidence assessment calculates the gradient energy of the visible light imagery using the Sobel operator; if the energy value is less than 10% of that in a normal daytime scene, a single-modal operation mode is triggered.
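The confidence check can be sketched with a Sobel-based gradient energy measure. The code below is a minimal illustration: it defines gradient energy as the mean squared Sobel magnitude over interior pixels (the description does not give the exact formula), works on small 2D lists, and uses hypothetical names and sample images.

```python
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def gradient_energy(image):
    """Mean squared Sobel gradient magnitude over the interior of a 2D list."""
    h, w = len(image), len(image[0])
    total = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    v = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * v
                    gy += SOBEL_Y[dy][dx] * v
            total += gx * gx + gy * gy
    return total / ((h - 2) * (w - 2))


def select_mode(visible_image, reference_energy, ratio=0.10):
    """Fall back to thermal-only when visible energy is below ratio * reference."""
    if gradient_energy(visible_image) < ratio * reference_energy:
        return "thermal_only"
    return "dual_modal"


# Hypothetical samples: a sharp daytime edge and a low-contrast foggy ramp.
daytime = [[0, 0, 200, 200, 200] for _ in range(5)]
foggy = [[100, 101, 102, 103, 104] for _ in range(5)]
reference = gradient_energy(daytime)

print(select_mode(daytime, reference))  # dual_modal
print(select_mode(foggy, reference))    # thermal_only
```

The foggy sample has a gradient energy of 64 against a reference of roughly 426,667, far below the 10% threshold, so the single-modal mode is selected.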
[0042] In addition, to cope with frequent power fluctuations and communication interruptions in high-altitude areas, the system has built-in breakpoint resume and local caching mechanisms. When the network connection is interrupted, the identification results are temporarily stored locally and uploaded in batches after the connection is restored; when the power supply voltage is lower than the safe threshold, the system automatically enters a low-power standby state, only maintaining the periodic wake-up of the core sensors to ensure that critical data is not lost.
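The local caching and batch upload behavior can be sketched as a small store-and-forward buffer. This is an illustrative assumption about how such a mechanism might look: the class name `ResultUploader`, the `send` callback, and the batch size are hypothetical, and persistence to disk and power management are omitted.

```python
from collections import deque
from itertools import islice


class ResultUploader:
    """Buffers counting results locally while the network link is down."""

    def __init__(self, send, batch_size=50):
        # send: callable taking a list of results; raises ConnectionError on failure
        self._send = send
        self._pending = deque()
        self._batch_size = batch_size

    def submit(self, result):
        """Queue a result and try to upload everything that is pending."""
        self._pending.append(result)
        return self.flush()

    def flush(self):
        """Upload pending results in batches; keep them if the link is down."""
        while self._pending:
            batch = list(islice(self._pending, self._batch_size))
            try:
                self._send(batch)
            except ConnectionError:
                return False
            for _ in batch:
                self._pending.popleft()
        return True

    @property
    def pending(self):
        return len(self._pending)


# Simulated link; a real system would call its wireless module here.
link = {"up": False}
sent = []


def send(batch):
    if not link["up"]:
        raise ConnectionError("link down")
    sent.extend(batch)


uploader = ResultUploader(send, batch_size=2)
for count in (41, 42, 40):
    uploader.submit({"livestock_count": count})
pending_while_down = uploader.pending   # 3 results cached locally

link["up"] = True
uploader.flush()
print(pending_while_down, uploader.pending, len(sent))  # 3 0 3
```

Results are removed from the buffer only after a batch is sent successfully, so an interruption in the middle of an upload does not lose data.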
[0043] In summary, this embodiment constructs a set of intelligent livestock number identification methods suitable for extreme plateau environments by using four core technologies: multimodal perception, refined feature interaction, robust post-processing, and online adaptation. It solves the fundamental problem of the failure of traditional single visible light solutions under complex lighting conditions and achieves high-precision, all-weather, and adaptive intelligent monitoring capabilities.
[0044] Example 2: This example also provides another intelligent identification method for livestock numbers on plateaus. This method uses a drone equipped with multimodal sensing equipment, integrates an optimized target detection algorithm, and constructs a complete chain of "data acquisition - intelligent analysis - performance verification" to achieve high-precision and high-efficiency intelligent counting of Tibetan cattle and sheep in the complex environment of the plateau. The following is a detailed description of the specific implementation method.
[0045] The aerial work platform utilizes a multi-rotor UAV with excellent flight stability, boasting a maximum endurance of no less than 40 minutes and wind resistance up to level 6. It can stably hover or cruise in high-altitude pastures above 3000 meters. The UAV is equipped with a coaxially mounted dual-spectrum imaging device, comprising a visible light camera and a thermal infrared camera, both sharing the same optical center. The lens focal lengths are calibrated and matched to ensure high overlap of the imaging field of view within a preset monitoring distance range of 10-50 meters, adapting to the angle and range requirements of aerial UAV photography.
[0046] The visible light camera employs a global shutter sensor with a wide dynamic range imaging capability of no less than 120 dB. This allows it to clearly preserve details such as the coat color and outline of Tibetan cattle and sheep under conditions of strong glare at high altitudes or low illumination at dawn and dusk. The thermal infrared camera uses an uncooled vanadium oxide microbolometer focal plane array, with a thermal sensitivity of no more than 50 millikelvin and a spatial resolution of 640×480 pixels. This enables it to accurately capture the thermal radiation signals from the surface of Tibetan cattle and sheep in low-light environments such as fog and nighttime. The UAV autonomously flies along a preset route, simultaneously triggering the dual-spectrum imaging device to acquire visible light and thermal infrared image sequences within the same field of view. Hardware trigger signals synchronously activate the frame exposure sequence of the two cameras, ensuring that the timestamp deviation between the two images is controlled within 10 milliseconds, forming a strictly time-aligned dual-modal image pair and providing a high-quality data foundation for subsequent analysis.
[0047] Spatiotemporal synchronization calibration is performed on visible light image sequences and thermal infrared image sequences acquired by UAVs to generate dual-modal image pairs with aligned timestamps and consistent spatial coordinates. The specific process includes: first, relying on the hardware triggering mechanism onboard the UAV, ensuring that the timestamp deviation between the two images is controlled within 10 milliseconds; then, based on a pre-calibrated binocular extrinsic matrix, an affine transformation is performed on the thermal infrared image to align its pixel coordinate system with that of the visible light image; furthermore, an image registration algorithm based on maximizing mutual information is used to perform sub-pixel-level spatial alignment on each frame of the dual-modal image pair, ensuring that the center position offset of the same Tibetan cattle / sheep target in the two images does not exceed one pixel, compensating for geometric deviations caused by slight fluctuations in UAV flight attitude or atmospheric refraction.
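The registration criterion in [0047] can be illustrated with a minimal sketch. It estimates mutual information from a joint intensity histogram and searches a small window of integer translations for the shift that maximizes it. This is a simplification under stated assumptions: the method described above works at sub-pixel level after an affine pre-alignment, whereas the hypothetical helpers `mutual_information`, `shift`, and `best_shift` below handle whole-pixel translation only, and the bin count and search radius are illustrative choices.

```python
import math
from collections import Counter

def mutual_information(a, b, bins=8, max_val=256):
    """Mutual information (nats) of two equally sized grayscale images."""
    step = max_val / bins
    joint = Counter()
    n = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            joint[(int(pa // step), int(pb // step))] += 1
            n += 1
    hist_a, hist_b = Counter(), Counter()
    for (i, j), c in joint.items():
        hist_a[i] += c
        hist_b[j] += c
    # sum of p(i,j) * log(p(i,j) / (p(i) * p(j)))
    return sum((c / n) * math.log(c * n / (hist_a[i] * hist_b[j]))
               for (i, j), c in joint.items())

def shift(img, dx, dy, fill=0):
    """Translate img by (dx, dy) pixels; exposed pixels are set to fill."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out

def best_shift(visible, thermal, radius=2):
    """Integer (dx, dy) that maximizes mutual information with the visible image."""
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            score = mutual_information(visible, shift(thermal, dx, dy))
            if best is None or score > best[0]:
                best = (score, dx, dy)
    return best[1], best[2]

# Synthetic pair: one warm animal-sized blob, with different intensity
# mappings in the two modalities (values are made up for illustration).
visible = [[20] * 12 for _ in range(12)]
for y in range(4, 8):
    for x in range(3, 7):
        visible[y][x] = 220
thermal_true = [[240 if v == 220 else 50 for v in row] for row in visible]
thermal_observed = shift(thermal_true, -1, 2, fill=50)  # simulated misalignment

offset = best_shift(visible, thermal_observed)  # should undo the (-1, +2) displacement
```

Mutual information suits this cross-modal case because it rewards statistical dependence between intensities rather than equal intensities, so a bright thermal blob can still be matched to a dark or light visible one.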
[0048] Independent feature encoding is performed on the visible light image and the thermal infrared image in the dual-modal image pair to obtain visible light feature maps and thermal infrared feature maps. The feature encoding process adopts a dual-branch convolutional neural network structure, wherein:
[0049] The visible light branch is composed of an improved ResNet-50 backbone network, whose first convolutional layer is replaced by a learnable multi-scale receptive field convolutional module. This module consists of three parallel dilated convolutional paths (dilation rates of 1, 2, and 3, respectively). The outputs of each path are concatenated and then compressed by one-dimensional convolution, which can effectively capture the feature differences of Tibetan cattle and sheep from juvenile to adult.
[0050] The thermal infrared branch is composed of a lightweight MobileNetV3 network whose depthwise separable convolutional layers use a dilation rate of 2. This expands the receptive field, compensates for the lack of high-frequency texture details in thermal imaging, and maintains computational efficiency, meeting the needs of UAV edge computing or rapid ground-based processing.
[0051] Both branches output feature maps in the fourth stage, with a spatial resolution of 1 / 16 of the input image and channel dimensions of 2048 and 512 respectively, providing structured feature support for cross-modal fusion.
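The dimensions stated in [0049] to [0051] can be checked with a short calculation. The sketch below assumes a 640×480 network input (the thermal camera resolution from [0046]; the actual input size is not stated) and a 3×3 kernel in each dilated path (also not stated). The helper names are hypothetical.

```python
def feature_map_shape(height, width, stride=16):
    """Spatial size of a feature map at 1/16 of the input resolution."""
    return height // stride, width // stride

def effective_kernel(kernel, dilation):
    """Span covered by a dilated convolution kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)

VISIBLE_CHANNELS = 2048   # fourth-stage output of the visible light branch
THERMAL_CHANNELS = 512    # fourth-stage output of the thermal infrared branch

rows, cols = feature_map_shape(480, 640)             # assumed 640x480 input
total_channels = VISIBLE_CHANNELS + THERMAL_CHANNELS
spans = [effective_kernel(3, d) for d in (1, 2, 3)]  # assumed 3x3 kernels

print(rows, cols, total_channels, spans)
```

Under these assumptions the three parallel paths cover 3, 5, and 7 pixels per axis, which is how a single module can respond to animals of different apparent sizes. The concatenated channel count of 2560 matches the width of the fully connected layers in the channel attention fusion unit.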
[0052] Visible light and thermal infrared feature maps are input into the cross-modal feature interaction module, where channel attention weighting and spatial alignment operations are performed to generate a joint feature map that fuses complementary information from both modes. This module includes a channel attention fusion unit and a spatial alignment refinement unit.
[0053] The channel attention fusion unit first performs global average pooling on the two feature maps to generate their respective channel description vectors. After concatenation, the vectors are input into a two-layer fully connected network (the first layer has 2560 neurons with a rectified linear unit (ReLU) activation function; the second layer has 2560 neurons, matching the 2048 + 512 concatenated channels, with a sigmoid activation function). The output is a weight coefficient for each channel, which is applied to the original feature maps to achieve dynamic importance weighting between the modes. For example, in strong glare scenes the weight of the thermal infrared mode is automatically increased, and in sunny weather the texture feature contribution of the visible light mode is enhanced.
[0054] The spatial alignment refinement unit adopts a deformable convolution structure. Using the weighted visible light feature map as a reference, it guides the thermal infrared feature map to perform local geometric deformation compensation, eliminates edge misalignment caused by the difference in physical properties of dual-spectrum imaging, and ensures that the outline features of Tibetan cattle and sheep are accurately aligned in the joint feature map.
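The channel attention fusion unit of [0053] can be sketched in plain Python at toy scale. The example uses 4 visible and 2 thermal channels instead of 2048 and 512, and random weights instead of trained ones; both are assumptions made only to keep the sketch small. The deformable convolution step of [0054] is not covered here.

```python
import math
import random

def global_average_pool(feature_map):
    """One mean value per channel; feature_map is a list of 2D channel grids."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]

def dense(vector, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def scale_channel(channel, factor):
    return [[factor * v for v in row] for row in channel]

def channel_attention(visible, thermal, w1, b1, w2, b2):
    """Weight every channel of both modalities by a learned coefficient in (0, 1)."""
    descriptor = global_average_pool(visible) + global_average_pool(thermal)
    weights = sigmoid(dense(relu(dense(descriptor, w1, b1)), w2, b2))
    n_vis = len(visible)
    vis_out = [scale_channel(ch, s) for ch, s in zip(visible, weights[:n_vis])]
    th_out = [scale_channel(ch, s) for ch, s in zip(thermal, weights[n_vis:])]
    return vis_out, th_out, weights

# Toy inputs: 4 visible channels and 2 thermal channels, each 2x3.
visible = [[[float(c + 1)] * 3 for _ in range(2)] for c in range(4)]
thermal = [[[0.5 * (c + 1)] * 3 for _ in range(2)] for c in range(2)]

random.seed(0)  # untrained, random weights for illustration only
n = len(visible) + len(thermal)
w1 = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
w2 = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
b1 = [0.0] * n
b2 = [0.0] * n

vis_out, th_out, weights = channel_attention(visible, thermal, w1, b1, w2, b2)
```

Because the weights are computed from both pooled descriptors jointly, a change in one modality (for example, saturated visible channels under glare) can shift the coefficients assigned to the other.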
[0055] A target detection network based on the YOLOv8 model is optimized and custom-trained for the morphological characteristics of Tibetan cattle and sheep (such as robust body shape, thick hair, and dense herd distribution) and for the complex plateau environment (strong glare, low illumination, and monotonous background vegetation).
[0056] A bidirectional feature pyramid structure is introduced into the network neck to enhance the feature fusion ability of Tibetan cattle and sheep at different scales (young sheep and adult yaks have significant differences in body size);
[0057] The detection head optimizes the prediction logic of target center point offset, bounding box size and confidence score to adapt to the outline ratio of Tibetan cattle and sheep (the aspect ratio is mostly between 0.6 and 1.5).
[0058] During the training phase, a weighted combination of the focal loss function and the generalized intersection over union (GIoU) loss function (weight ratio 1:1.5) is used to alleviate the imbalance between positive and negative samples caused by the dense arrangement of Tibetan cattle and sheep in plateau scenes. The training dataset contains more than 50,000 dual-modal images of Tibetan cattle and sheep captured on the plateau under different seasons and lighting conditions, to ensure the generalization ability of the model.
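The 1:1.5 weighted loss of [0058] can be sketched for a single prediction as follows. The focal loss parameters `alpha=0.25` and `gamma=2.0` are commonly used defaults and are assumptions here, since the text does not state them. Boxes are `(x1, y1, x2, y2)` tuples with positive area.

```python
import math

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction; p is the predicted positive probability."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma down-weights examples that are already classified well
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

def box_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def giou_loss(pred, truth):
    """1 - GIoU, where GIoU = IoU - (hull area - union area) / hull area."""
    iw = max(0.0, min(pred[2], truth[2]) - max(pred[0], truth[0]))
    ih = max(0.0, min(pred[3], truth[3]) - max(pred[1], truth[1]))
    inter = iw * ih
    union = box_area(pred) + box_area(truth) - inter
    hull = ((max(pred[2], truth[2]) - min(pred[0], truth[0]))
            * (max(pred[3], truth[3]) - min(pred[1], truth[1])))
    return 1.0 - (inter / union - (hull - union) / hull)

def detection_loss(p, target, pred_box, truth_box, w_focal=1.0, w_giou=1.5):
    """Weighted combination with the 1:1.5 ratio from the text."""
    return w_focal * focal_loss(p, target) + w_giou * giou_loss(pred_box, truth_box)

perfect = giou_loss((0, 0, 2, 2), (0, 0, 2, 2))
disjoint = giou_loss((0, 0, 1, 1), (2, 0, 3, 1))
combined = detection_loss(0.9, 1, (0, 0, 2, 2), (0, 0, 2, 2))
```

Unlike plain IoU, the GIoU term still changes with distance when two boxes do not overlap, which keeps a useful training signal for badly placed predictions in dense herds.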
[0059] Based on the joint feature map, the bounding boxes of all Tibetan cattle and sheep individuals in the image are located using the optimized YOLOv8 model. Duplicate and false detections are then eliminated based on the spatial distribution density, morphological constraints, and thermal radiation continuity criteria of the bounding boxes. Specific operations include:
[0060] Non-maximum suppression is performed on all detected bounding boxes, with the intersection over union (IoU) threshold set to 0.4, and overlapping redundant boxes are removed.
[0061] Calculate the aspect ratio and area of each retained bounding box. If the aspect ratio is less than 0.3 or greater than 3.0, or the area is less than the preset minimum threshold, it is judged as a false detection and removed.
[0062] For cases where the center distance between adjacent bounding boxes is less than a preset threshold (corresponding to 1.2 times the shoulder width of an adult Tibetan cattle), the temperature continuity in the thermal infrared image is used to determine whether they belong to the same individual animal: the temperature distribution curves of the areas covered by the two bounding boxes are extracted, and their Pearson correlation coefficient is calculated. If the correlation coefficient is greater than 0.85, they are considered to be the same heat source (to avoid duplicate counting due to dense populations), and the bounding box merging operation is performed.
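The three post-processing operations in [0060] to [0062] can be sketched as one pipeline. Several details are assumptions made for illustration: the aspect ratio is taken as width divided by height, each detection carries a hypothetical `profile` list of temperatures sampled across its box (all of equal length), merging is greedy, and the thresholds `min_area` and `max_center_dist` are passed in pixels rather than derived from animal dimensions.

```python
import math

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(dets, threshold=0.4):
    """Keep the highest-scoring box of any group overlapping above the threshold."""
    kept = []
    for det in sorted(dets, key=lambda d: d["score"], reverse=True):
        if all(iou(det["box"], k["box"]) <= threshold for k in kept):
            kept.append(det)
    return kept

def plausible(box, min_area):
    """Morphological check: aspect ratio within [0.3, 3.0] and area large enough."""
    w, h = box[2] - box[0], box[3] - box[1]
    if w <= 0 or h <= 0:
        return False
    return 0.3 <= w / h <= 3.0 and w * h >= min_area

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var) if var > 0 else 0.0

def center_distance(a, b):
    return math.hypot((a[0] + a[2] - b[0] - b[2]) / 2, (a[1] + a[3] - b[1] - b[3]) / 2)

def count_livestock(dets, min_area, max_center_dist, corr_threshold=0.85):
    kept = [d for d in nms(dets) if plausible(d["box"], min_area)]
    merged = []
    for det in kept:
        for other in merged:
            close = center_distance(det["box"], other["box"]) < max_center_dist
            if close and pearson(det["profile"], other["profile"]) > corr_threshold:
                a, b = det["box"], other["box"]  # same heat source: merge the boxes
                other["box"] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                break
        else:
            merged.append(dict(det))
    return len(merged)

# Made-up detections: box, confidence score, temperature profile in degrees C.
detections = [
    {"box": (0, 0, 10, 10), "score": 0.90, "profile": [30, 34, 38, 34, 30]},
    {"box": (1, 1, 11, 11), "score": 0.80, "profile": [30, 34, 38, 34, 30]},
    {"box": (12, 0, 22, 10), "score": 0.85, "profile": [30.5, 34.2, 37.9, 34.1, 30.2]},
    {"box": (40, 0, 50, 10), "score": 0.70, "profile": [38, 34, 30, 34, 38]},
    {"box": (60, 0, 62, 20), "score": 0.90, "profile": [30, 31, 30, 31, 30]},
    {"box": (70, 0, 73, 3), "score": 0.60, "profile": [30, 31, 32, 31, 30]},
]
count = count_livestock(detections, min_area=25, max_center_dist=15)
```

In this toy input the second box is suppressed by non-maximum suppression, the fifth fails the aspect ratio check, the sixth fails the area check, and the first and third are merged as one heat source, leaving two animals.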
[0063] The system counts the number of valid bounding boxes after post-processing and outputs the final identification result of the Tibetan cattle and sheep population. This statistical process is performed in real time after each frame of image processing is completed. The image data collected by the UAV can be transmitted back to the ground station in real time via the wireless communication module, and the identification results are simultaneously uploaded to the ranch management platform for remote monitoring and decision-making by managers. In the event of a communication interruption, the system uses a breakpoint resume and local caching mechanism to temporarily store the identification results in the UAV's local storage module and upload them in batches after the network is restored, ensuring that no data is lost.
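The local caching behavior in [0063] can be sketched as a store-and-forward queue. The class name `ResultUploader` and the `send` callback are hypothetical, and an in-memory `deque` stands in for the UAV's local storage module; a real system would persist the queue so results survive a restart.

```python
from collections import deque

class ResultUploader:
    """Buffers results locally and uploads them in order once the link works."""

    def __init__(self, send):
        self.send = send        # callable returning True when the upload succeeds
        self.pending = deque()  # stand-in for the local storage module

    def submit(self, result):
        self.pending.append(result)
        return self.flush()

    def flush(self):
        """Upload queued results oldest first; stop at the first failure."""
        sent = 0
        while self.pending:
            if not self.send(self.pending[0]):
                break
            self.pending.popleft()
            sent += 1
        return sent

link = {"up": False}  # simulated communication state
delivered = []

def send(result):
    if link["up"]:
        delivered.append(result)
    return link["up"]

uploader = ResultUploader(send)
uploader.submit({"frame": 1, "count": 42})  # link down: cached locally
uploader.submit({"frame": 2, "count": 40})  # link down: cached locally
link["up"] = True
sent = uploader.submit({"frame": 3, "count": 41})  # link restored: batch upload
```

A result is removed from the queue only after `send` reports success, so an interruption in the middle of a batch leaves the remaining results cached for the next attempt.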
[0064] During continuous monitoring, samples with the true number of Tibetan cattle and sheep were collected periodically through manual verification to construct an incremental training set. Pasture inspectors randomly selected five monitoring periods each day to manually verify the system output, marking incorrect samples and recording the correct numbers, thus creating labeled fine-tuning data. When the number of new samples reached a preset batch size (e.g., 1000 images), the backbone network parameters were frozen, and only the weights of the YOLOv8 detection head and the cross-modal interaction module were fine-tuned. Momentum gradient descent was used for local optimization over no more than 10 iterations, with the learning rate set to 1/10 of that used in the initial training phase, the momentum coefficient to 0.9, and the weight decay coefficient to 0.0005. This adapts the model to seasonal changes in the body size (e.g., thick winter coat) and coat color of Tibetan cattle and sheep and in the background vegetation (e.g., withered meadows), ensuring the model's long-term robustness.
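The fine-tuning schedule in [0064] can be sketched with scalar parameters. The initial learning rate of 0.01 is an assumption (the text only gives the 1/10 ratio), the parameter names and the constant gradient are placeholders, and the update rule follows the common formulation in which weight decay is added to the gradient before the momentum step.

```python
INITIAL_LR = 0.01               # assumed; not stated in the text
FINE_TUNE_LR = INITIAL_LR / 10  # 1/10 of the initial training learning rate
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005
BATCH_SIZE = 1000               # preset number of newly verified samples
MAX_ITERATIONS = 10

def ready_to_fine_tune(new_samples):
    return new_samples >= BATCH_SIZE

def sgd_momentum_step(params, grads, velocity, lr,
                      momentum=MOMENTUM, weight_decay=WEIGHT_DECAY, frozen=()):
    """One momentum SGD update in place; parameters named in frozen are skipped."""
    for name, value in params.items():
        if name in frozen:
            continue
        grad = grads[name] + weight_decay * value
        velocity[name] = momentum * velocity.get(name, 0.0) + grad
        params[name] = value - lr * velocity[name]

# Hypothetical parameter groups, one scalar each for illustration.
params = {"backbone.w": 2.0, "head.w": 1.0, "interaction.w": -1.0}
frozen = {"backbone.w"}  # backbone stays fixed during fine-tuning
velocity = {}

if ready_to_fine_tune(1000):
    for _ in range(MAX_ITERATIONS):
        grads = {name: 0.5 for name in params}  # placeholder gradients
        sgd_momentum_step(params, grads, velocity, FINE_TUNE_LR, frozen=frozen)
```

Freezing the backbone keeps the general features learned from the large initial dataset intact, while the small learning rate and short schedule limit how far the detection head and interaction module can drift on a batch of only about 1000 images.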
[0065] In addition, to cope with the frequent power fluctuations in high-altitude areas, the power supply system on the UAV is equipped with a voltage stabilization module. When the power supply voltage is lower than the safety threshold, the system automatically enters a low-power standby state, only maintaining the periodic wake-up of the core sensors to ensure that critical data is not lost. All intermediate feature maps and detection results of the system are stored in local solid-state memory, supporting post-event backtracking analysis and model diagnosis, further improving the integrity of the operation closed loop.
[0066] In summary, this embodiment, through a complete operational loop of "precise data acquisition layer - intelligent analysis core layer - application performance verification," deeply integrates the advanced capabilities of high-performance UAV hardware and optimized YOLOv8 algorithm to construct an intelligent counting method suitable for the extreme environment of the plateau and specifically adapted for the identification of Tibetan cattle and sheep. It solves the problems of the failure of traditional single visible light solutions under complex lighting conditions and the low efficiency of manual counting, and achieves high-precision, all-weather, adaptive, and high-efficiency intelligent monitoring capabilities, providing strong technical support for the development of smart animal husbandry on the plateau.
Claims
1. A method for intelligent identification of livestock numbers on plateaus, characterized in that it includes: A dual-spectral imaging device deployed in the monitoring area of the plateau pasture simultaneously acquires visible light image sequences and thermal infrared image sequences within the same field of view; The visible light image sequence and the thermal infrared image sequence are spatiotemporally synchronized and calibrated to generate a dual-modal image pair with aligned timestamps and consistent spatial coordinates; The visible light image and the thermal infrared image in the dual-modal image pair are independently feature-encoded to obtain visible light feature maps and thermal infrared feature maps; The visible light feature map and the thermal infrared feature map are input into the cross-modal feature interaction module, and channel attention weighting and spatial position alignment operations are performed to generate a joint feature map that fuses dual-modal complementary information; Based on the joint feature map, the bounding boxes of all livestock individuals in the image are located by the target detection network, and duplicate detections and false detections are eliminated according to the spatial distribution density and morphological constraints of the bounding boxes; The number of valid bounding boxes after post-processing is counted, and the final livestock number recognition result is output.
2. The intelligent identification method for the number of livestock on the plateau according to claim 1, characterized in that, The dual-spectrum imaging device includes a visible light camera and a thermal infrared camera mounted coaxially, both sharing the same optical center. Their lens focal lengths are calibrated and matched to ensure that the overlap of the imaging field of view is not less than 95% within a preset monitoring distance. The visible light camera uses a global shutter photosensitive element; The thermal infrared camera uses an uncooled vanadium oxide microbolometer focal plane array.
3. The intelligent identification method for the number of livestock on the plateau according to claim 1, characterized in that, The visible light image sequence and the thermal infrared image sequence are spatiotemporally synchronized and calibrated to generate a dual-modal image pair with aligned timestamps and consistent spatial coordinates, including: The frame exposure sequence of the visible light camera and the thermal infrared camera is started synchronously using hardware trigger signals, so that the timestamp deviation between the two images is controlled within 10 milliseconds. Based on the pre-calibrated binocular extrinsic matrix, an affine transformation is performed on the thermal infrared image to make its pixel coordinate system the same as that of the visible light image. An image registration algorithm based on maximizing mutual information is adopted to perform sub-pixel-level spatial alignment on each frame of dual-modal image, ensuring that the center position offset of the same livestock target in the two images does not exceed one pixel.
4. The intelligent identification method for the number of livestock on the plateau according to claim 1, characterized in that, Independent feature encoding is performed on the visible light image and the thermal infrared image in the dual-modal image pair to obtain visible light feature maps and thermal infrared feature maps, including: A dual-branch convolutional neural network structure is used to extract features from visible light images and thermal infrared images respectively; The visible light branch consists of an improved ResNet-50 backbone network, whose first convolutional layer has been replaced with a learnable multi-scale receptive field convolutional module. The thermal infrared branch is composed of a lightweight MobileNetV3 network, and the dilation rate of its depth-separable convolutional layers is set to 2. Both branches output feature maps in the fourth stage, with a spatial resolution of 1 / 16 of the input image and channel dimensions of 2048 and 512, respectively.
5. The intelligent identification method for the number of livestock on the plateau according to claim 4, characterized in that, The multi-scale receptive field convolution module consists of three parallel dilated convolution paths with dilation rates of 1, 2, and 3, respectively. The outputs of each path are concatenated and then compressed through one-dimensional convolution.
6. The intelligent identification method for the number of livestock on the plateau according to claim 1, characterized in that, The visible light feature map and the thermal infrared feature map are input into the cross-modal feature interaction module, where channel attention weighting and spatial alignment operations are performed to generate a joint feature map that fuses complementary information from both modes, including: Global average pooling is performed on the visible light feature map and the thermal infrared feature map respectively to generate their respective channel description vectors; The two channel description vectors are concatenated and input into a two-layer fully connected network. The weight coefficients of each channel are output and applied to the original feature map to achieve intermodal importance weighting. Using the weighted visible light feature map as a reference, a deformable convolutional structure is used to guide the thermal infrared feature map to perform local geometric deformation compensation, thereby eliminating edge misalignment caused by the difference in physical properties of dual-spectral imaging.
7. The intelligent identification method for the number of livestock on the plateau according to claim 6, characterized in that, The first layer of the two-layer fully connected network has 2560 neurons and uses the rectified linear unit (ReLU) activation function. The second layer has 2560 neurons, which is equal to the total number of channels, and uses the sigmoid activation function.
8. The intelligent identification method for the number of livestock on the plateau according to claim 1, characterized in that, Based on the joint feature map, the bounding boxes of all livestock individuals in the image are located using a target detection network. Then, based on the spatial distribution density and morphological constraints of the bounding boxes, duplicate and false detections are eliminated, including: An improved MMYOLO object detection framework is adopted as the object detection network, and its neck network introduces a bidirectional feature pyramid structure to fuse multi-scale joint feature maps; The detection head predicts the target center point offset, bounding box size, and confidence score respectively; During the training phase, a weighted combination of the focal loss function and the generalized intersection over union (GIoU) loss function is used, with a weight ratio of 1:1.5.
9. The intelligent identification method for the number of livestock on the plateau according to claim 8, characterized in that, The specific steps for eliminating duplicate and false detections include: Non-maximum suppression is performed on all detected bounding boxes, with the intersection over union (IoU) threshold set to 0.4; The aspect ratio and area of each retained bounding box are calculated. If the aspect ratio is less than 0.3 or greater than 3.0, or the area is less than the preset minimum threshold, the box is judged as a false detection and removed. For cases where the center distance between adjacent bounding boxes is less than a preset threshold, the temperature continuity in the thermal infrared image is used to determine whether they belong to the same individual animal. If the temperature gradient changes gradually, they are merged into a single individual.
10. The intelligent identification method for the number of livestock on the plateau according to claim 9, characterized in that, The preset minimum threshold corresponds to 30% of the projected area of an adult yak, and the preset threshold corresponds to 1.2 times the shoulder width of an adult yak; the temperature continuity judgment is made by calculating the Pearson correlation coefficient of the temperature distribution curves of the two bounding box covered areas. If the correlation coefficient is greater than 0.85, the bounding box merging operation is performed.