Animal behavior recognition method and system based on multi-scale semantic perception time sequence network

By using a multi-scale semantic-aware temporal network (MSTANet) to collaboratively model long-term stable behaviors and short-term transient actions at the edge, the real-time and accuracy problems of animal behavior recognition in traditional methods are solved, achieving efficient and low-latency animal behavior monitoring, which is suitable for large-scale breeding scenarios.

CN122196652A · Pending · Publication Date: 2026-06-12 · ZHONGKE JIEYUN (BEIJING) INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHONGKE JIEYUN (BEIJING) INFORMATION TECH CO LTD
Filing Date
2026-01-29
Publication Date
2026-06-12

Abstract

The application discloses an animal behavior recognition method and system based on a multi-scale semantic perception time sequence network. The method comprises the following steps: (1) collecting original motion data of an animal through an inertial sensor, and pre-processing the original motion data to obtain cleaned inertial time sequence signals; (2) segmenting the inertial time sequence signals by using a fixed-length sliding time window to construct model input samples; and (3) inputting the model input samples into a pre-trained multi-scale semantic perception time sequence network MSTANet model, and outputting an animal behavior recognition result by the MSTANet model. The MSTANet model comprises: ① a large-scale time sequence modeling module LTM; ② a small-scale time sequence refinement module STR; and ③ a semantic perception supervision module SAS.

Description

Technical Field

[0001] This invention relates to the field of intelligent livestock farming technology, specifically an animal behavior recognition method and system based on a multi-scale semantic perception temporal network.

Background Technology

[0002] Continuous monitoring of animal behavior is a key data collection pathway for achieving smart livestock farming. Behavioral patterns are important indicators of animal health and welfare levels, and can reflect farming efficiency. In modern large-scale farms, timely identification of animals exhibiting abnormal or high-risk behaviors helps enable early intervention and scientific decision-making. However, traditional monitoring methods relying on manual inspections are inefficient and difficult to implement continuously in large-scale farming environments, prompting researchers to explore automated animal behavior identification methods.

[0003] Vision-based behavior recognition methods have been extensively studied in animal behavior analysis, but their application in actual farming environments is often limited by factors such as occlusion, changes in lighting, and camera position. In contrast, wearable or implantable inertial sensors offer advantages such as small size, low power consumption, and low environmental dependence, enabling continuous monitoring of individual animals. With the widespread availability of low-cost accelerometers and gyroscopes, inertial sensor-based animal behavior recognition is gradually becoming a research hotspot in the field of precision farming.

[0004] Accelerometers, due to their strong ability to capture linear motion, have been widely validated as an effective tool for animal activity recognition. However, they are less effective at detecting rotational motion, often leading to classification errors. Gyroscopes, used to measure angular velocity, can directly acquire information about the rotational dynamics of an object. Therefore, employing an inertial measurement unit that combines accelerometers and gyroscopes can leverage the sensory advantages of both sensors to achieve more comprehensive and accurate recognition of various animal behaviors.

[0005] Machine learning, with its powerful data processing and analysis capabilities, has been widely applied to animal behavior classification based on inertial sensor data. Classic machine learning algorithms such as Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbors, and Logistic Regression (LR) have been successfully applied to behavior monitoring tasks in various livestock such as cattle, sheep, and pigs. However, traditional machine learning requires complex and time-consuming manual feature extraction and heavily relies on expert domain knowledge, which leads to challenges in feature engineering.

[0006] Deep learning algorithms have demonstrated significant advantages in animal behavior classification due to their ability to bypass complex feature engineering. Research on deep learning processing based on inertial sensing devices has also yielded promising results. Current research on animal activity recognition using deep networks is largely based on image data. In contrast, research on animal behavior recognition using deep networks to process data from inertial sensing devices is relatively limited, and related work is still focused on improving recognition accuracy, with insufficient attention paid to the model's real-time inference capabilities and system adaptability for real-world farm applications.

[0007] Furthermore, deploying deep behavior recognition models to inertial sensing devices or edge nodes still faces significant challenges. On the one hand, terminal devices, such as electronic ear tags, are limited by hardware resources, with very limited computing power, battery capacity, and storage space. On the other hand, in large-scale farming scenarios, continuously transmitting sensor data to the cloud or a central server for processing significantly increases communication energy consumption and system load. In contrast, performing behavior recognition at the edge and uploading only compact behavior results is an effective way to achieve energy-efficient, low-latency monitoring.

[0008] More importantly, animal behavior exhibits significant inherent heterogeneity. From a temporal perspective, behaviors such as walking and head shaking naturally possess short-term, transient characteristics, while behaviors like eating and lying down demonstrate long-term, stable, overall movements. From a feature representation perspective, the inertial feature distributions of similar semantic behaviors often show high similarity. Under fixed-window segmentation, existing temporal modeling methods based on pure CNNs, RNNs, or Transformers struggle to simultaneously achieve both sensitive capture of short-term key actions and robust modeling of long-term stable behaviors. Furthermore, the single supervision target further exacerbates the problem of inter-class semantic confusion.

[0009] In summary, continuous and accurate automatic identification of animal behavior is crucial for assessing health status, welfare levels, and achieving precision feeding management in large-scale farms. Traditional manual inspection methods are inefficient and cannot meet the real-time monitoring needs of large-scale farming. Computer vision-based methods are easily affected by factors such as changes in lighting and occlusion within the pigsty, limiting their applicability. In recent years, wearable inertial sensors have provided an effective solution for individual-level behavior monitoring. However, animal behavior exhibits significant heterogeneity over time: behaviors such as "walking" and "head shaking" have short-term, transient characteristics, while behaviors such as "eating" and "lying down" are characterized by longer durations and stable overall movements. Furthermore, semantically similar behaviors (such as "eating" and "biting," which belong to the category of "mouth and nose movements") have highly similar inertial feature distributions, making misjudgment highly likely. Existing methods based on machine learning or single deep learning models (such as CNN and RNN) struggle to simultaneously model long-term stable behaviors and sensitively capture short-term key actions within a fixed time window, and they tend to ignore the hierarchical semantic structure of behaviors, resulting in limited recognition accuracy. Meanwhile, existing research focuses on improving the recognition accuracy of cloud processing, while paying insufficient attention to the real-time inference capabilities of models on resource-constrained edge devices and the feasibility of system deployment.

[0010] Therefore, in the context of smart agriculture, accurate identification of animal behavior is crucial for animal health monitoring and precision farming management. Traditional manual patrols struggle to meet the precise, real-time monitoring needs of large-scale farming, and vision- and acoustic-based detection methods have limited applicability in complex farming environments. In contrast, wearable or implantable inertial sensors can achieve individual-level behavioral perception without relying on environmental conditions, providing an effective technological alternative. However, animal behavior exhibits significant heterogeneity in terms of time scale and movement patterns, posing challenges to accurate temporal modeling and edge deployment with limited computing resources. There is an urgent need for an animal behavior recognition solution capable of collaboratively modeling multi-scale temporal features, incorporating semantic prior knowledge, and suitable for low-power edge deployment.

Summary of the Invention

[0011] The present invention aims to overcome the shortcomings of the prior art and provide an animal behavior recognition method and system based on a multi-scale semantic perception temporal network.

[0012] To achieve the above objectives, the technical solution provided by this invention is as follows: The animal behavior recognition method based on multi-scale semantic-aware temporal networks includes the following steps: (1) Collect raw motion data through an inertial sensor and preprocess the raw motion data to obtain a clean inertial time series signal; (2) The inertial time series signal is segmented using a sliding time window of fixed length to construct model input samples; (3) Input the model input samples into the pre-trained multi-scale semantic perception temporal network MSTANet model, and output the animal behavior recognition results by the MSTANet model; the MSTANet model includes: ① large-scale temporal modeling module LTM, which is used to downsample the input sequence and expand the receptive field to capture long-term stable behavioral features; ② small-scale temporal refinement module STR, which is used to adaptively reweight the features output by the LTM module in the time dimension to enhance the capture of short-term transient action features; ③ semantic perception supervision module SAS, which is used to apply regularization constraints to the shared feature representation of the model using the hierarchical semantic structure of animal behavior during the model training stage.

[0013] Preferably, the preprocessing in step (1) involves performing median filtering and Butterworth low-pass filtering sequentially on the triaxial acceleration or triaxial acceleration and triaxial angular velocity signals to remove noise, and performing high-pass filtering on the acceleration signal to separate the gravity component, thereby obtaining the body coordinate system acceleration signal that reflects the dynamic motion of the body.

[0014] Preferably, the length of the fixed-length sliding time window in step (2) is 20-30 seconds, and the overlap ratio between adjacent windows is 40-60%.

[0015] Preferably, the large-scale temporal modeling module LTM in step (3) adopts a lightweight convolutional network structure that includes a depthwise separable one-dimensional convolution and a grouped pointwise convolution GLB, and compresses the time dimension through two stages of strided temporal downsampling.

[0016] More preferably, the number of groups G in the grouped pointwise convolution GLB is 1-8.
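The LTM structure described above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the sequence length, channel widths, kernel size, the number of groups, the use of ReLU, and the placement of the two stride-2 stages (one in the stem, one in the grouped block) are all assumptions made for illustration, and batch normalization and dropout are omitted.

```python
import numpy as np

def depthwise_conv1d(x, w, stride=1):
    """Channel-wise 1-D convolution. x: (T, C); w: (K, C), one kernel per channel."""
    T, _ = x.shape
    K = w.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, K - 1 - pad), (0, 0)))  # zero "same" padding in time
    return np.stack([(xp[t:t + K] * w).sum(axis=0) for t in range(0, T, stride)])

def grouped_pointwise(x, w_groups):
    """Pointwise (1x1) convolution restricted to channel groups.
    x: (T, C_in); w_groups: G arrays of shape (C_in / G, C_out / G)."""
    chunks = np.split(x, len(w_groups), axis=1)
    return np.concatenate([c @ w for c, w in zip(chunks, w_groups)], axis=1)

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
T, C, D, K, G = 200, 10, 32, 5, 4   # all sizes are illustrative assumptions

x = rng.standard_normal((T, C))
# Stem: depthwise conv (stride 2) followed by a full pointwise conv C -> D.
stem = relu(depthwise_conv1d(x, rng.standard_normal((K, C)), stride=2))
stem = relu(stem @ rng.standard_normal((C, D)))
# Grouped block: depthwise conv (stride 2), then grouped pointwise conv D -> D.
w_groups = [rng.standard_normal((D // G, D // G)) for _ in range(G)]
glb = relu(depthwise_conv1d(stem, rng.standard_normal((K, D)), stride=2))
glb = relu(grouped_pointwise(glb, w_groups))

grouped_params = sum(w.size for w in w_groups)   # D * D / G weights
full_params = D * D                              # ungrouped pointwise conv
print(glb.shape, grouped_params, full_params)
```

With G groups, the pointwise stage needs 1/G of the weights of an ungrouped pointwise convolution, and two stride-2 stages shorten the sequence to a quarter of its input length.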

[0017] Preferably, the small-scale temporal refinement module STR in step (3) adopts a multi-head self-attention mechanism, and its calculation process is expressed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices obtained by linear mapping of the input features, respectively, and $\sqrt{d_k}$ is the scaling factor.
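As a rough illustration of the attention computation in the STR module, the following NumPy sketch implements a single head of scaled dot-product attention over time, with a mask that hides zero-padded time steps. The function name, shapes, and random projection matrices are assumptions; a multi-head version would run several such heads in parallel on separate projections. The sketch assumes at least one valid time step.

```python
import numpy as np

def masked_attention(h, w_q, w_k, w_v, valid):
    """Single-head scaled dot-product attention over time.
    h: (T, d) features; valid: (T,) bool, False at zero-padded steps."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])              # QK^T / sqrt(d_k)
    scores = np.where(valid[None, :], scores, -np.inf)   # hide padded keys
    scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d, d_k = 6, 8, 4                                      # illustrative sizes
h = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d_k)) for _ in range(3))
valid = np.array([True, True, True, True, False, False])
out, weights = masked_attention(h, w_q, w_k, w_v, valid)
```

Each row of `weights` sums to 1 and places zero weight on the padded steps, so the refined features aggregate only real sensor samples.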

[0018] Preferably, in step (3), the semantic awareness supervision module SAS constructs a fine-grained main classification head and at least one auxiliary classification head corresponding to a coarse-grained semantic level during the training phase; the input features of the auxiliary classification head are calculated through a progressive gradient backfeeding mechanism:

$$h_{aux} = \alpha\, h + (1 - \alpha)\, \mathrm{stopgrad}(h)$$

where $h$ is the shared feature representation, $\alpha$ is the refeedback coefficient, which gradually increases from 0 to 1 as the training progresses, and $\mathrm{stopgrad}(\cdot)$ is an operator that blocks gradient propagation.
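The effect of the progressive gradient backfeeding can be sketched numerically. The example below assumes the auxiliary representation has the form h_aux = α·h + (1 − α)·stopgrad(h), so that its forward value equals h while only a fraction α of the auxiliary gradient reaches the shared representation. The linear ramp for α, the function names, and the two-class auxiliary head are illustrative assumptions; the gradient is written out by hand rather than obtained from an autodiff framework.

```python
import numpy as np

def alpha_schedule(step, total_steps):
    """Linear ramp of the refeedback coefficient from 0 to 1 (assumed shape)."""
    return min(1.0, max(0.0, step / total_steps))

def aux_head_loss_and_grad(h, w, label, alpha):
    """Cross-entropy of a linear auxiliary head on h_aux, plus the gradient
    that reaches the shared representation h.

    With h_aux = alpha * h + (1 - alpha) * stopgrad(h), the forward value of
    h_aux equals h, while d h_aux / d h = alpha."""
    logits = w @ h                      # forward pass uses h_aux == h
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss = -np.log(p[label])
    onehot = np.zeros_like(p)
    onehot[label] = 1.0
    grad_h_aux = w.T @ (p - onehot)     # softmax cross-entropy gradient
    return loss, alpha * grad_h_aux     # only a fraction alpha flows back to h

rng = np.random.default_rng(0)
h = rng.standard_normal(16)
w = rng.standard_normal((2, 16))        # 2 coarse classes, illustrative
loss_early, grad_early = aux_head_loss_and_grad(h, w, 1, alpha_schedule(0, 100))
loss_late, grad_late = aux_head_loss_and_grad(h, w, 1, alpha_schedule(100, 100))
```

At the start of training the auxiliary head still learns from its own loss, but contributes no gradient to the shared features; by the end of the ramp the full auxiliary gradient is passed back.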

[0019] Preferably, the MSTANet model employs a weighted multi-task cross-entropy loss:

$$\mathcal{L} = \mathcal{L}_{main} + \sum_{k} \lambda_k\, \mathcal{L}_{aux}^{(k)}$$

where $\mathcal{L}_{main}$ is the final fine-grained classification loss, in which class weighting is introduced to alleviate the class imbalance problem; $\mathcal{L}_{aux}^{(k)}$ denotes the cross-entropy loss of the $k$-th auxiliary semantic task; and $\lambda_k$ is the corresponding weighting coefficient.
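A minimal NumPy sketch of such a weighted multi-task objective follows, assuming the total loss is the class-weighted main loss plus a λ-weighted sum of auxiliary cross-entropy losses. The class weights, the λ values, the auxiliary head sizes (2 and 3 classes), and the weighted-mean normalization are illustrative assumptions, not values from the source.

```python
import numpy as np

def cross_entropy(logits, labels, class_weights=None):
    """Mean cross-entropy over a batch; optional per-class weights
    (weighted mean, i.e. sum(w_i * loss_i) / sum(w_i))."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_p[np.arange(len(labels)), labels]
    w = np.ones(len(labels)) if class_weights is None else class_weights[labels]
    return float((w * nll).sum() / w.sum())

def multitask_loss(main_logits, main_labels, class_weights, aux, lambdas):
    """Main loss plus sum over k of lambda_k times the k-th auxiliary loss.
    aux: list of (logits, labels) pairs, one per auxiliary semantic level."""
    total = cross_entropy(main_logits, main_labels, class_weights)
    for lam, (logits, labels) in zip(lambdas, aux):
        total += lam * cross_entropy(logits, labels)
    return total

batch = 4
main_logits = np.zeros((batch, 7))                  # 7 fine-grained behaviors
main_labels = np.array([0, 3, 3, 6])
class_weights = np.array([1.0, 1.0, 1.0, 0.5, 1.0, 1.0, 2.0])  # hypothetical
aux = [(np.zeros((batch, 2)), np.array([0, 1, 1, 1])),         # coarse level
       (np.zeros((batch, 3)), np.array([0, 2, 2, 1]))]         # middle level
loss = multitask_loss(main_logits, main_labels, class_weights, aux, [0.5, 0.5])
```

With all-zero logits every head predicts a uniform distribution, so the total reduces to log 7 + 0.5·log 2 + 0.5·log 3, which is a convenient sanity check.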

[0020] This invention also provides an animal behavior detection system based on the above method. The animal behavior detection system includes: ① a wearable or implantable intelligent sensing terminal containing inertial sensors, preferably a wearable intelligent sensing ear tag, for collecting inertial sensor data from animals; ② an edge computing gateway, deployed at the animal breeding facility, for receiving data uploaded by the intelligent sensing terminal, executing the above method, and outputting structured behavior recognition results; or, deployed on a management platform, for executing the method described in any one of claims 1-8 on the data collected and transmitted to the management platform, and outputting structured behavior recognition results; ③ a cloud management platform, for receiving and storing the behavior recognition results uploaded by the edge computing gateway, and performing statistical analysis, visualization, and historical data backtracking.

[0021] The present invention also provides a computer-readable storage medium having a computer program stored thereon, the computer program, when executed, implementing the above-described method.

[0022] Preferably, the inertial sensor described in this invention is a three-axis inertial sensor or a six-axis inertial sensor, and more preferably a six-axis inertial sensor.

[0023] The present invention will be further described below: This invention proposes a multi-scale semantic-aware temporal behavior recognition network (MSTANet) based on inertial sensors. This network achieves collaborative modeling of long-term stable activities and short-term sudden actions through a large-scale temporal modeling (LTM) module and a small-scale temporal refinement (STR) module. Simultaneously, a semantic-aware supervision mechanism (SAS) is introduced during the training phase to alleviate confusion between semantically similar behaviors. Using pigs as the research object for animal behavior recognition, the proposed method was validated in a real-world, large-scale pig farm environment. Results show that the method achieves a 93.69% accuracy rate in recognizing seven types of daily pig behaviors, with a single-sample inference latency of only 1.34 ms at the edge. This scheme maintains recognition accuracy while having low resource consumption, providing an effective and scalable technical solution for the construction of continuous animal behavior monitoring and edge intelligent farming systems based on wearable or implantable inertial sensors.

[0024] The method described in this invention starts from the deployment requirements of real-world farming scenarios and constructs a unified temporal modeling paradigm that simultaneously addresses differences in behavioral time scales and semantic structural constraints.

[0025] This invention proposes a lightweight multi-scale temporal modeling network, MSTANet, for edge deployment. It can collaboratively model long-term stable behaviors and short-term transient actions within a fixed time window, achieving end-to-end animal behavior recognition based on inertial sensor data. A semantic-aware supervision strategy is designed, implicitly regularizing shared temporal features using a behavioral-level semantic structure, significantly improving the stability of discriminating semantically similar behaviors without introducing additional inference costs. A complete edge-cloud collaborative animal behavior monitoring system is constructed and deployed and tested in a real-world large-scale pig farm environment. Experimental results show that the proposed method can operate stably under conditions of limited edge computing resources and achieves a good balance between recognition accuracy and inference latency.

[0026] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. High accuracy and strong discriminative power: Through the collaboration of the LTM and STR modules, the problem of unified modeling of long-term and short-term behaviors under a fixed time window is effectively solved; through the SAS mechanism, the confusion between semantically similar behaviors is significantly reduced, and the overall recognition accuracy reaches 93.69%.

[0027] 2. Lightweight and low latency: The model structure is designed for edge computing, with only about 0.15M parameters and a size of 0.56MB. The single-sample inference latency on the edge gateway is as low as 1.34ms, which meets the real-time requirements.

[0028] 3. Low communication bandwidth requirements: Adopting an edge-cloud collaborative architecture, behavior recognition is completed on the gateway side, and only a very small amount of recognition results are uploaded. Compared with uploading all the original data, the amount of communication data is reduced by about 99.6%, which greatly alleviates network pressure.

[0029] 4. High practicality: This invention has been deployed and verified in a real animal farm environment. The system has good robustness and provides reliable technical support for large-scale intelligent farming.

[0030] In summary, the method described in this invention, while ensuring the accuracy of behavior recognition, keeps model complexity and inference latency within an acceptable range at the edge, making the collaborative processing flow of data acquisition, edge inference, and cloud analysis practically deployable and better suited to the real-world needs of large-scale farms for long-term, low-power, and low-maintenance operation. By completing data acquisition at the data acquisition terminal side (such as ear tags) and performing inference at the edge gateway side, only uploading structured behavior results and necessary statistical information, this architecture effectively reduces the communication and energy consumption overhead caused by continuous transmission of raw time-series data, and provides direct data support for management decisions such as behavior duration statistics, abnormal pattern recognition, and individual behavior comparison. Furthermore, this lightweight time-series modeling and edge-cloud collaborative paradigm does not rely on specific visual assumptions and has the potential to be transferred to other wearable behavior recognition tasks for livestock, providing a replicable technical path for the construction of multi-livestock intelligent farming systems. Based on the technology of this invention, through animal behavior monitoring and data analysis, it can be further used for dynamic monitoring of animal growth, health status, and nutritional status.

Attached Figure Description

[0031] Figure 1: Flowchart of the method of this invention.

Detailed Implementation

[0032] To make the objectives, technical solutions, and beneficial effects of the embodiments of the present invention clearer, further descriptions will be provided below in conjunction with specific implementations of the present invention. These embodiments are for illustrative purposes only and are not intended to limit the scope of protection of the present invention. Other implementation methods obtained by those skilled in the art based on the disclosed embodiments of the present invention without creative effort should all fall within the scope of protection of the present invention. Specific steps in the embodiments are applied to pigs, but this does not indicate that the technical solutions described in the present invention can only be used for pigs; they can also be used for animals including cattle, sheep, and poultry. The original experimental data involved in the embodiments are stored in the independently developed Zhongke Jieyun Intelligent Breeding Platform.

[0033] Example 1. Referring to Figure 1, the animal behavior recognition method based on multi-scale semantic-aware temporal networks includes the following steps: (1) Raw motion data is collected by a wearable six-axis inertial sensor worn on the pig, and the raw motion data is preprocessed to obtain a clean inertial time series signal: This invention constructs a pig behavior recognition dataset based on wearable six-axis inertial sensing data from 10 experimental pigs. Data collection lasted for 5 consecutive days, with effective collection periods from 6:00 to 22:00 each day. To systematically evaluate the model's generalization ability across different individuals and to closely approximate the application requirements of the model for unknown individuals in real-world farming scenarios, the dataset was rigorously divided along the individual dimension: data from 9 pigs were selected as the modeling dataset, divided into training, validation, and test sets at a ratio of 70%, 15%, and 15%, respectively, for model training, hyperparameter optimization, and performance evaluation. The complete behavioral data of the remaining pig was used to construct a separate cross-individual independent test set to evaluate the model's recognition and segmentation performance under conditions of unknown individuals and continuous daily behavior flow, thereby more realistically simulating the actual application scenario in a large-scale farming environment. All pig daily behaviors were manually labeled based on synchronously collected video data.
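The dataset division described above can be sketched as follows. The source does not state how the data from the 9 modeling pigs were assigned to the three subsets, so the random shuffle of pooled samples, the function name, and the toy data are assumptions made purely for illustration.

```python
import random

def split_by_individual(windows_by_pig, holdout_pig, ratios=(0.70, 0.15, 0.15), seed=0):
    """windows_by_pig: dict pig_id -> list of samples.
    Returns train, val, test (from the modeling pigs) and the held-out
    pig's samples in their original order."""
    pooled = [s for pig, ws in windows_by_pig.items() if pig != holdout_pig for s in ws]
    random.Random(seed).shuffle(pooled)
    n_train = round(ratios[0] * len(pooled))
    n_val = round(ratios[1] * len(pooled))
    train = pooled[:n_train]
    val = pooled[n_train:n_train + n_val]
    test = pooled[n_train + n_val:]
    holdout = list(windows_by_pig[holdout_pig])   # kept as a continuous stream
    return train, val, test, holdout

# Toy data: 10 pigs with 100 placeholder samples each.
data = {pig: [(pig, i) for i in range(100)] for pig in range(10)}
train, val, test, holdout = split_by_individual(data, holdout_pig=9)
print(len(train), len(val), len(test), len(holdout))
```

Keeping the held-out individual entirely outside the shuffled pool is what makes the fourth set a cross-individual test.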

[0034] By considering the differences in pigs' daily behavior across overall movement patterns, action execution sites, and functional semantics, this invention constructs a three-level hierarchical pig behavior classification system based on fine-grained behavior annotation. This hierarchical definition is used for data annotation and serves as a priori basis for subsequent semantic perception supervision.

[0035] At Level 1 (Global Motion Level), pig behavior is categorized into displacement and non-displacement behaviors based on whether they are accompanied by significant overall displacement, thus characterizing the fundamental differences in macroscopic motion patterns. At Level 2 (Functional Semantic Level), semantic constraints based on the primary site of action are introduced for non-displacement behaviors, further subdividing them into snout movements and posture / head movements. This level aims to mitigate the confusion caused by the high similarity of different fine-grained behaviors in the inertial signal space. At Level 3 (Fine-Grained Behavioral Level), based on the above two-level semantic structures, seven typical daily pig behaviors are defined for model training and performance evaluation. The behavior categories and their specific definitions at each level are shown in Table 1.

[0036] Table 1. Hierarchical Definition of Daily Behaviors in Pigs

Because the raw readings from accelerometers and gyroscopes contain a significant amount of noise, this invention employs a staged filtering strategy to preprocess the raw inertial signals. First, median filtering is applied to the triaxial acceleration and triaxial angular velocity signals to eliminate pulse noise caused by equipment jitter or communication interference. Then, a third-order Butterworth low-pass filter is used to suppress high-frequency interference components while retaining the main dynamic information related to the pig's movement. For the acceleration signal, a low-order high-pass Butterworth filter is further introduced to separate the gravity component, obtaining a body coordinate system acceleration signal that reflects only the dynamic motion of the body, thereby reducing the impact of attitude changes on subsequent feature modeling. After the above processing, a multi-axis inertial time series is obtained for subsequent feature construction.
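The staged filtering can be sketched with SciPy as below. The sampling rate, median kernel size, cutoff frequencies, and the first-order high-pass are assumptions, since the source gives only the filter types and the third order of the low-pass stage. The sketch uses zero-phase `filtfilt`, which suits offline processing of a buffered window; a streaming edge implementation might use a causal filter instead.

```python
import numpy as np
from scipy.signal import butter, filtfilt, medfilt

def preprocess(acc, gyro, fs=25.0, median_k=5, lp_cutoff=5.0, hp_cutoff=0.3):
    """acc, gyro: (T, 3) arrays. fs and all cutoffs are illustrative assumptions."""
    def denoise(x):
        # Median filter per axis removes isolated spikes.
        x = np.column_stack([medfilt(x[:, i], median_k) for i in range(x.shape[1])])
        # Third-order Butterworth low-pass, applied forward and backward.
        b, a = butter(3, lp_cutoff, btype="low", fs=fs)
        return filtfilt(b, a, x, axis=0)

    acc_clean, gyro_clean = denoise(acc), denoise(gyro)
    # Low-order Butterworth high-pass strips the slowly varying gravity component.
    b, a = butter(1, hp_cutoff, btype="high", fs=fs)
    acc_dynamic = filtfilt(b, a, acc_clean, axis=0)
    return acc_dynamic, gyro_clean

fs = 25.0
t = np.arange(0, 20, 1 / fs)                       # one 20 s stretch of samples
acc = np.column_stack([np.sin(2 * np.pi * t),      # 1 Hz body motion on x
                       np.zeros_like(t),
                       np.full_like(t, 9.81)])     # constant gravity on z
acc[100, 0] += 50.0                                # synthetic spike
gyro = np.zeros_like(acc)
acc_dynamic, gyro_clean = preprocess(acc, gyro, fs=fs)
```

On this synthetic input the constant gravity offset on the z axis is removed and the single-sample spike on the x axis is suppressed, while the 1 Hz motion component is largely retained.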

[0037] (2) The inertial time series signal is segmented using a sliding time window of fixed length to construct model input samples: After signal cleaning, continuous inertial sequences are trimmed according to manually labeled time intervals, and fixed-length sliding time windows are used to segment the action segments to construct a uniform sequence input format. A 50% overlap ratio is set between adjacent windows to improve sample utilization while maintaining temporal continuity.

[0038] This invention selects 20 to 30 seconds, preferably 20 seconds, as the uniform time window length. To support batch training of deep learning models, all window sequences are adjusted to a fixed length: sequences with insufficient length are padded with zeros, while sequences exceeding the preset length are truncated.
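The windowing step can be sketched as follows, using a 20-second window with 50% overlap, zero-padding a short final window and truncating by slicing. The 10 Hz sampling rate in the usage example and the function name are assumptions; the source does not state the sampling frequency.

```python
import numpy as np

def sliding_windows(signal, fs, window_s=20.0, overlap=0.5):
    """signal: (N, C) array for one labeled behavior segment.
    Returns (num_windows, L, C) with L = window_s * fs samples per window."""
    length = int(round(window_s * fs))
    hop = max(1, int(round(length * (1.0 - overlap))))
    windows, start = [], 0
    while True:
        w = signal[start:start + length]            # slicing truncates to `length`
        if len(w) < length:                         # zero-pad a short window
            w = np.vstack([w, np.zeros((length - len(w), signal.shape[1]))])
        windows.append(w)
        if start + length >= len(signal):           # all samples are covered
            break
        start += hop
    return np.stack(windows)

fs = 10.0                                           # assumed sampling rate
signal = np.arange(450 * 10, dtype=float).reshape(450, 10)  # 45 s, 10 channels
windows = sliding_windows(signal, fs)
print(windows.shape)
```

For the 45-second segment above, the windows start at 0, 10, 20, and 30 seconds, and the last one is padded with 5 seconds of zeros.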

[0039] This invention selects 10-dimensional inertial features as the final input feature set. This feature set consists of two parts: rotational motion features and linear motion features. The rotational motion features include the denoised three-axis angular velocity components and their L1 and L2 norms (five dimensions), used to characterize the local rotation patterns and overall rotation intensity of the animal's body. The linear motion features consist of the three-axis acceleration signal after removal of the gravity component and its L1 and L2 norms (five dimensions), used to characterize the linear motion patterns and overall motion intensity of the animal's body.
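A small sketch of the 10-dimensional feature construction follows, taking the two norms to be the L1 and L2 norms of each three-axis vector. The channel ordering and the function name are assumptions.

```python
import numpy as np

def build_features(acc_dynamic, gyro):
    """acc_dynamic, gyro: (T, 3). Returns (T, 10) ordered as
    [gx, gy, gz, L1(g), L2(g), ax, ay, az, L1(a), L2(a)]."""
    def with_norms(x):
        l1 = np.abs(x).sum(axis=1, keepdims=True)
        l2 = np.sqrt((x ** 2).sum(axis=1, keepdims=True))
        return np.hstack([x, l1, l2])
    return np.hstack([with_norms(gyro), with_norms(acc_dynamic)])

gyro = np.array([[3.0, -4.0, 0.0]])
acc = np.array([[1.0, 2.0, -2.0]])
features = build_features(acc, gyro)
print(features)   # [[ 3. -4.  0.  7.  5.  1.  2. -2.  5.  3.]]
```

The per-axis components preserve direction-specific patterns, while the two norms summarize overall intensity regardless of how the sensor is oriented.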

[0040] (3) Input the model input samples into the pre-trained multi-scale semantic-aware temporal network MSTANet model, and the MSTANet model outputs the pig behavior recognition result. The MSTANet model includes:

① The large-scale temporal modeling (LTM) module, which downsamples the input sequence and expands the receptive field to capture long-term stable behavioral features. Let the input sequence be $X \in \mathbb{R}^{T \times C}$, where $T$ is the time window length and $C$ is the number of feature channels. First, a depthwise separable one-dimensional convolutional backbone (Stem) is applied to the input sequence, calculated as follows:

$$H_0 = \mathrm{PWConv}(\mathrm{DWConv}(X))$$

where $\mathrm{DWConv}(\cdot)$ denotes a channel-wise one-dimensional convolution, used to extract local temporal dependencies within a single channel, and $\mathrm{PWConv}(\cdot)$ denotes a pointwise convolution, used to achieve cross-channel feature fusion. Considering the differences in physical meaning and dynamic response among different inertial channels, a lightweight GroupLiteBlock (GLB) convolutional network is further introduced to perform cross-channel information interaction within a restricted channel subspace. Its form is as follows:

$$\mathrm{GLB}(H) = \mathrm{GPWConv}_G(\mathrm{DWConv}(H))$$

where $\mathrm{DWConv}(\cdot)$ denotes a channel-wise one-dimensional convolution, which models local temporal dependencies only within a single inertial channel and is used to extract fine-grained dynamic features of different behaviors in the time dimension, and $\mathrm{GPWConv}_G(\cdot)$ denotes pointwise convolution performed within $G$ predefined channel groups, used to achieve constrained cross-channel information fusion within a physically consistent feature subspace. Each convolutional sublayer is followed by batch normalization and nonlinear activation, and dropout is applied at the end of the module to enhance generalization ability. Through two levels of temporal downsampling with a stride of 2, the time dimension is compressed to one-quarter of the original length, significantly reducing the computational complexity of subsequent modeling while preserving the overall dynamic contour.

② The small-scale temporal refinement (STR) module, which adaptively reweights the features output by the LTM module in the time dimension to enhance the capture of short-term transient action features. Let the downsampled temporal features be represented as $H \in \mathbb{R}^{T' \times d}$, where $T' = T/4$. The input features are projected into query, key, and value matrices through linear mappings:

$$Q = H W_Q, \quad K = H W_K, \quad V = H W_V$$

where $W_Q$, $W_K$, and $W_V$ are the projection matrices. Subsequently, scaled dot-product attention is used to calculate temporal weights and aggregate features:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $\sqrt{d_k}$ is the scaling factor. Multi-head attention enhances the ability to represent diverse instantaneous action patterns by modeling the temporal dependencies of different subspaces in parallel. Residual connections and layer normalization are introduced internally to stabilize the training process, and channel mixing is performed through a feedforward network (FFN) to improve the non-linear expressiveness of features. A temporal mask is used to mask padding positions, ensuring that attention weights are allocated only to valid time steps.

③ The semantic-aware supervision (SAS) module, which is used during the model training phase to apply regularization constraints to the shared feature representation of the model using the hierarchical semantic structure of pig behavior. On the basis of the shared global temporal representation $h$, a hierarchical multi-head classification structure is constructed, including a fine-grained master classifier and multiple auxiliary classifiers corresponding to different semantic levels of behavior. Each classifier uses a linear mapping for prediction, and its output takes the following form:

$$\hat{y}^{(k)} = W^{(k)} z^{(k)} + b^{(k)}$$

where $z^{(k)}$ denotes the input features of the $k$-th classifier head. The final fine-grained master classifier head acts directly on the shared representation $h$, while each auxiliary classification head acts on the auxiliary representation $h_{aux}$ modulated by the progressive gradient refeed. If auxiliary semantic supervision were applied directly to the shared representation $h$, then in the early stages of training, coarse-grained semantic constraints might impose excessive restrictions on fine-grained discriminative features that are not yet fully developed, thereby weakening the model's ability to model fine-grained behavioral differences. To avoid this problem, this invention introduces a progressive gradient backfeeding mechanism to structurally regulate the constraint strength of auxiliary semantic supervision. The auxiliary representation is defined as:

$$h_{aux} = \alpha\, h + (1 - \alpha)\, \mathrm{stopgrad}(h)$$

where $\mathrm{stopgrad}(\cdot)$ denotes an operator that blocks gradient propagation, and $\alpha$ is the refeedback coefficient, which is gradually increased during training. This design allows the auxiliary semantic supervision to primarily stabilize the representation structure in the early stages of training. As training progresses, the semantic constraints gradually strengthen, guiding the shared representation towards a semantically consistent structure without compromising fine-grained discriminability. The overall training objective employs a weighted multi-task cross-entropy loss:

$$\mathcal{L} = \mathcal{L}_{main} + \sum_{k} \lambda_k\, \mathcal{L}_{aux}^{(k)}$$

where $\mathcal{L}_{main}$ is the final fine-grained classification loss, in which class weighting is introduced to alleviate the class imbalance problem; $\mathcal{L}_{aux}^{(k)}$ denotes the cross-entropy loss of the $k$-th auxiliary semantic task; and $\lambda_k$ is the corresponding weighting coefficient.

[0041] Example 2. The animal behavior detection system based on the method described in Example 1 includes: ① a wearable smart sensing ear tag for collecting six-axis inertial sensor data from pigs; ② an edge computing gateway deployed at the pigsty for receiving data uploaded by the ear tag, executing the method described in Example 1, and outputting structured behavior recognition results; and ③ a cloud management platform for receiving and storing the behavior recognition results uploaded by the edge computing gateway, and performing statistical analysis, visualization, and historical data backtracking.

[0042] In this detection system, the ear tag sensing layer is responsible only for the continuous acquisition and encapsulation of raw motion data. A six-axis IMU sensor synchronously acquires three-axis acceleration and three-axis angular velocity data at a preset frequency, then caches and uploads them as timestamped data packets. This avoids computational load on the sensing end and ensures low power consumption and long-term stable operation of the ear tag. The edge computing layer undertakes the core tasks of data preprocessing and real-time behavior inference. First, it filters and denoises the uploaded inertial data and divides it into fixed time windows. Then, the MSTANet model deployed on the edge gateway performs inference on the data of each window, directly outputting the identified behavior category and key statistical information. Finally, only these results are uploaded to the cloud. The cloud management layer no longer undertakes real-time computing tasks and is mainly responsible for centralized storage of behavior data, statistical analysis across time scales, visualization, and historical data backtracking. This edge-cloud collaborative system significantly reduces the communication bandwidth and equipment energy consumption caused by continuous transmission of high-frequency raw time-series data without sacrificing behavior recognition accuracy, and significantly improves the system's robustness in the complex network environment of pig farms.
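The fixed-window segmentation step performed at the edge gateway can be sketched as follows. This is an illustration only: the function name and defaults are hypothetical, the 10 Hz sampling rate is taken from the system description elsewhere in this document, and the 25-second window with 50% overlap is one arbitrary choice inside the claimed ranges (20-30 seconds, 40-60% overlap).

```python
def sliding_windows(samples, sample_rate_hz=10, window_s=25, overlap=0.5):
    """Split a sequence of IMU frames into fixed-length overlapping windows.

    Trailing frames that do not fill a complete window are dropped.
    """
    size = int(window_s * sample_rate_hz)
    step = max(1, int(round(size * (1.0 - overlap))))
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, step)]


# 60 s of dummy six-axis frames (ax, ay, az, gx, gy, gz) at 10 Hz.
frames = [(0.0,) * 6 for _ in range(600)]
windows = sliding_windows(frames)
print(len(windows), len(windows[0]))  # prints "3 250"
```

Each returned window would then be passed to the deployed model as one input sample; only the predicted label and summary statistics need to leave the gateway.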

[0043] Example 3 (Comparative Example) This invention selected four classic machine learning algorithms—K-Nearest Neighbors (KNN), Logistic Regression, Support Vector Machine (SVM), and Random Forest (RF)—as comparison methods, and the results are shown in Table 2.

[0044] Table 2. Accuracy Comparison between Traditional Machine Learning Methods and MSTANet

[0045] Experimental results show that although traditional machine learning methods have certain advantages in terms of computational overhead, their performance is highly dependent on manual feature construction, making it difficult to effectively characterize the complex nonlinear temporal behavior patterns in multidimensional inertial signals. Therefore, under the task setting of this invention, which uses simplified features and is geared towards end-to-end fine-grained semantic discrimination, their recognition performance is significantly limited. In contrast, MSTANet does not require manual feature engineering and can adaptively mine discriminative features directly from raw inertial sensor data, making it better suited to the practical needs of large-scale livestock farming scenarios for both accuracy and automation of behavior recognition.

[0046] MSTANet was compared with several typical deep temporal models, including CNN+BiLSTM, ResNet and Vision Transformer (ViT), and the results are shown in Table 3.

[0047] Table 3 Performance Comparison of Typical Deep Temporal Models and MSTANet

[0048] It can be seen that although the aforementioned models are significantly higher than MSTANet in terms of parameter scale and computational complexity, their recognition accuracy does not surpass that of MSTANet. Among them, CNN+BiLSTM relies on recurrent structures for temporal modeling, resulting in high FLOPs, but it struggles to simultaneously model short-term and long-term behaviors under fixed time windows. While ResNet and ViT possess strong feature representation capabilities, their structural designs are primarily geared towards two-dimensional data or long-sequence scenarios, failing to fully leverage their advantages under multi-dimensional inertial temporal signals. In contrast, MSTANet is designed around edge-deployment constraints, offering an effective solution to the challenge of unified modeling of short-term and long-term behaviors under fixed time windows, thus achieving higher recognition accuracy while significantly reducing model complexity.

[0049] MSTANet was compared with several typical lightweight network architectures, including the MobileNet and ShuffleNet series models. The results are shown in Table 4.

[0050] Table 4 Performance Comparison of MSTANet and Mainstream Lightweight Networks

[0051] As can be seen, although some lightweight networks have advantages over classic deep networks in terms of parameter size or computational overhead, their recognition performance is lower than that of MSTANet, and they consume more resources than MSTANet.

[0052] The results above demonstrate that the proposed lightweight multi-scale semantic-aware temporal network achieves a good balance between recognition accuracy and computational efficiency in real-world pig farm scenarios.

[0053] This invention designs a large-scale temporal modeling module (LTM) that effectively captures global motion patterns and long-term behavioral rhythms while reducing sequence length and computational cost. This provides a reliable feature representation foundation for behaviors with relatively stable temporal characteristics, such as eating and lying down. Meanwhile, the introduction of a small-scale temporal refinement module (STR) further expands the model's temporal modeling capabilities, adaptively enhancing informative temporal segments related to short-term, highly discriminative behaviors. An attention-based module selectively reallocates temporal importance, compensating for the weakening of key features during downsampling, thereby improving sensitivity to non-periodic and localized motion patterns. Furthermore, the semantically aware supervision mechanism (SAS) introduced into the model further enhances the discriminative ability between fine-grained behaviors. This mechanism effectively alleviates confusion between highly similar semantics in pig behavior and improves classification accuracy with almost no increase in computational cost. This mechanism does not introduce additional feature extraction branches but instead utilizes the inherent semantic hierarchy of pig behavior to impose structured constraints on the shared feature space. From the perspective of representation learning, hierarchical supervision plays a role in semantic regularization, which can guide the organization of the latent space towards a semantically consistent structure.
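The temporal reweighting attributed to the STR module can be illustrated with plain scaled dot-product self-attention over the downsampled time steps. This is a simplified sketch, not the patented module: it uses a single head and identity projections (Q = K = V = x), whereas the claims describe multi-head attention with learned linear mappings, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with identity projections.

    x is a list of T feature vectors, each of length d.
    Returns (output, weights), where weights[t] is the distribution over
    time steps that step t attends to.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    weights = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        weights.append(softmax(scores))
    output = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return output, weights


# Three time steps with two features each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(features)
```

Each row of `attn` sums to 1, so every output step is a weighted average of all time steps, which is how informative short-term segments can be emphasized after downsampling.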

[0054] In the constructed edge-cloud collaborative pig behavior monitoring system, the data gateway receives inertial motion data from multiple pigs in a one-to-many, time-division multiplexing manner. The ear tag side employs a batch upload mechanism based on storage sectors, with each sector containing 340 frames of six-axis inertial data samples, serving as the basic data unit for gateway reception and forwarding. Under this system architecture, pig behavior data can be processed in two typical modes: cloud-based centralized computing and edge-side inference. In cloud computing mode, the data gateway uploads the complete raw inertial data to the cloud management platform, where data storage and behavior recognition are performed. At a sampling frequency of 10 Hz, the raw uploaded data volume for a single data sector is 4094 bytes. In contrast, in edge inference mode, the behavior recognition task is completed on the data gateway side, with only the structured recognition results and necessary statistical information uploaded to the cloud. In this case, the data volume uploaded in a single session is only 17 bytes, representing approximately 0.4% of the data upload scale in cloud computing mode.
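The figures quoted above can be checked with a few lines of arithmetic. The constants come from the text; the assumption that one result is uploaded per sector, and the helper name `daily_upload_bytes`, are illustrative only.

```python
FRAMES_PER_SECTOR = 340          # six-axis frames per storage sector (from the text)
SAMPLE_RATE_HZ = 10              # sampling frequency (from the text)
RAW_BYTES_PER_SECTOR = 4094      # cloud computing mode upload size (from the text)
RESULT_BYTES_PER_UPLOAD = 17     # edge inference mode upload size (from the text)

# One sector covers 340 / 10 = 34 seconds of motion data.
sector_duration_s = FRAMES_PER_SECTOR / SAMPLE_RATE_HZ

# Fraction of the raw upload volume that the edge result represents.
upload_ratio = RESULT_BYTES_PER_UPLOAD / RAW_BYTES_PER_SECTOR


def daily_upload_bytes(bytes_per_sector, n_tags):
    """Bytes uploaded per day, assuming one upload per sector per ear tag."""
    sectors_per_day = 24 * 3600 / sector_duration_s
    return bytes_per_sector * sectors_per_day * n_tags


print(f"{sector_duration_s} s per sector, ratio {upload_ratio:.2%}")
```

The ratio evaluates to about 0.42%, consistent with the "approximately 0.4%" stated above, and it is independent of the number of ear tags because both modes scale linearly with it.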

[0055] In large-scale farms where numerous ear tags are deployed simultaneously and long-term continuous monitoring is required, continuously uploading high-frequency raw time-series data will significantly consume communication bandwidth, easily leading to network congestion, transmission delays, and even data packet loss. By performing inference at the gateway side and only uploading compact behavioral results, the communication load can be significantly reduced without sacrificing recognition accuracy, which is particularly critical for farming environments with limited bandwidth or unstable network conditions.

Claims

1. An animal behavior recognition method based on a multi-scale semantic-aware temporal network, characterized in that the method comprises the following steps: (1) collecting raw motion data of animals through inertial sensors and preprocessing the raw motion data to obtain a cleaned inertial time series signal; (2) segmenting the inertial time series signal using a sliding time window of fixed length to construct model input samples; (3) inputting the model input samples into a pre-trained multi-scale semantic-aware temporal network MSTANet model, the MSTANet model outputting animal behavior recognition results; the MSTANet model comprises: ① a large-scale temporal modeling module LTM, used to downsample the input sequence and expand the receptive field to capture long-term stable behavioral features; ② a small-scale temporal refinement module STR, used to adaptively reweight the features output by the LTM module in the time dimension to enhance the capture of short-term transient action features; ③ a semantic-aware supervision module SAS, used to apply regularization constraints to the shared feature representation of the model during the model training phase by utilizing the hierarchical semantic structure of animal behavior.

2. The animal behavior recognition method based on a multi-scale semantic-aware temporal network as described in claim 1, characterized in that the preprocessing in step (1) involves sequentially performing median filtering and Butterworth low-pass filtering on the triaxial acceleration signals, or on the triaxial acceleration and triaxial angular velocity signals, to remove noise, and performing high-pass filtering on the acceleration signal to separate the gravity component, thereby obtaining the body-coordinate-system acceleration signal that reflects the dynamic motion of the body.

3. The animal behavior recognition method based on a multi-scale semantic-aware temporal network as described in claim 1, characterized in that the fixed-length sliding time window in step (2) has a length of 20-30 seconds and an overlap ratio of 40-60% between adjacent windows.

4. The animal behavior recognition method based on a multi-scale semantic-aware temporal network as described in claim 1, characterized in that the large-scale temporal modeling module LTM described in step (3) adopts a lightweight convolutional network structure that includes depthwise separable one-dimensional convolution and grouped pointwise convolution GLB, and compresses the time dimension through two levels of strided temporal downsampling.

5. The animal behavior recognition method based on a multi-scale semantic-aware temporal network as described in claim 4, characterized in that the number of groups G in the grouped pointwise convolution GLB is 1-8.

6. The animal behavior recognition method based on a multi-scale semantic-aware temporal network as described in claim 1, characterized in that the small-scale temporal refinement module STR described in step (3) adopts a multi-head self-attention mechanism, whose calculation process is expressed as follows: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^{T}/\sqrt{d_k}\right)V$, where Q, K, and V are the query, key, and value matrices obtained by linear mapping of the input features, respectively, and $\sqrt{d_k}$ is the scaling factor.

7. The animal behavior recognition method based on a multi-scale semantic-aware temporal network as described in claim 1, characterized in that, in step (3), the semantic-aware supervision module SAS constructs a fine-grained main classification head and at least one auxiliary classification head corresponding to a coarse-grained semantic level during the training phase; the input feature h_aux of the auxiliary classification head is calculated through a progressive gradient backfeeding mechanism from the shared feature representation h, where α is the refeedback coefficient that gradually increases from 0 to 1 as training progresses, and stopgrad(·) is an operator that blocks gradient propagation.

8. The animal behavior recognition method based on a multi-scale semantic-aware temporal network as described in claim 1, characterized in that the MSTANet model employs a weighted multi-task cross-entropy loss that combines the final fine-grained classification loss, in which class weighting is introduced to alleviate the class imbalance problem, with the cross-entropy loss of each k-th auxiliary semantic task, weighted by its corresponding coefficient λ_k.

9. An animal behavior detection system based on the method of any one of claims 1 to 8, characterized in that the animal behavior detection system comprises: ① a wearable or implantable intelligent sensing terminal containing inertial sensors, used to collect inertial sensor data of animals; ② an edge computing gateway, either deployed at the animal breeding facility and used to receive data uploaded by the intelligent sensing terminal, execute the method described in any one of claims 1-8, and output structured behavior recognition results, or deployed on a management platform and used to execute the method described in any one of claims 1-8 on the data collected and transmitted to the management platform, and output structured behavior recognition results; ③ a cloud management platform, used to receive and store the behavior recognition results uploaded by the edge computing gateway, and to perform statistical analysis, visualization, and historical data backtracking.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed, performs the method as described in any one of claims 1-8.