Multi-modal based pet living area behavior detection and analysis method and system

By using multimodal sensors, collaborative sensing networks, and machine learning technology, the problem of detection accuracy and early warning in complex environments for pet health monitoring devices has been solved, achieving highly accurate pet behavior recognition and health risk early warning.

CN122245828A · Pending · Publication Date: 2026-06-19 · CENTRAL SOUTH UNIVERSITY OF FORESTRY AND TECHNOLOGY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CENTRAL SOUTH UNIVERSITY OF FORESTRY AND TECHNOLOGY
Filing Date
2026-05-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing pet health monitoring devices suffer from poor data accuracy, high error rates, and high false alarm rates, making it difficult to meet the needs of pet health management. Furthermore, their detection performance is unsatisfactory in complex home environments, and they cannot accurately identify pet behavior and health risks.

Method used

By deploying a variety of sensors with complementary physical principles to build a collaborative sensing network, multimodal heterogeneous data is acquired, spatiotemporal synchronization, feature extraction and multi-level fusion are performed, and personalized behavioral baseline models are established by combining machine learning to identify abnormal behavior patterns and provide health risk warnings.

Benefits of technology

It achieves highly robust and accurate pet behavior detection and health early warning, overcomes the failure of single sensors in complex environments, improves the accuracy of abnormal behavior identification and the timeliness of early warning, and provides personalized health management support for pets.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method and system for detecting and analyzing pet behavior in their living areas based on multimodal sensing. The method specifically includes: constructing a collaborative sensing network using multiple sensors deployed in the pet's movement area to acquire multimodal heterogeneous raw sensor data; performing spatiotemporal synchronization and spatial coordinate system calibration on the raw sensor data, and then performing feature extraction and multi-level fusion to obtain a joint feature vector characterizing the pet's state and behavior; establishing a personalized behavioral baseline model for each pet, comparing the joint feature vector with the personalized behavioral baseline model to identify abnormal behavioral patterns deviating from the norm; matching the identified abnormal behavioral patterns with a preset abnormal pattern classifier to output proactive early warning information targeting specific health risks. This invention achieves intelligent analysis throughout the entire process, from raw signal acquisition, feature extraction, multimodal fusion to behavior recognition and health early warning.

Description

Technical Field

[0001] This invention relates to the field of pet health monitoring technology, and in particular to a method and system for detecting and analyzing pet living area behavior based on multimodality.

Background Technology

[0002] With the booming development of the pet economy and the widespread adoption of smart home devices, real-time pet health monitoring has become a focal point of great concern for pet owners.

[0003] However, current pet health monitoring devices on the market have many problems in practical applications and fail to meet the actual needs of pet owners. Most existing health monitoring devices are externally worn, which not only causes discomfort to pets and affects their normal activities, but also yields poor data accuracy, with measurement errors as high as 0.5 kg and false alarm rates exceeding 30%. Many pet owners regard these devices as "pseudo-technology" and believe they cannot provide a reliable basis for pet health management.

[0004] At the academic research level, research on animal behavior recognition and multimodal analysis is undergoing a profound evolution from traditional methods to deep learning. In recent years, a series of important research results have been achieved in this field. In 2024, Sadeghi E. et al. published research in the journal *Lecture Notes in Networks and Systems*, proposing the RayPet system. This system utilizes frequency-modulated continuous wave (FMCW) millimeter-wave radar technology to identify pet activities and postures. By deeply analyzing micro-Doppler spectrograms, it accurately captures the subtle movements of pets and, combined with machine learning algorithms, achieves an 89% recognition accuracy rate, effectively solving the discomfort caused to pets by traditional contact devices. In 2025, the research team of Professor Wei Pengfei from the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, published a paper in the journal *eLife*, proposing an Anti-Drift Posture Tracking (ADPT) technology based on the Transformer architecture. This technology cleverly integrates the advantages of CNN and Transformer, successfully solving the keypoint drift problem in multi-animal interaction and long-term tracking, with an overall detection accuracy improvement of more than 8.6% compared to mainstream methods such as DeepLabCut. In the same year, Tianyang Xu et al. proposed the Keypoint Interactive Transformer (KIT) in the *International Journal of Computer Vision* for general mammalian pose estimation, achieving an accuracy of 77.9 AP on the AP10K dataset, significantly outperforming the HRFormer method. These research findings provided new ideas and technical support for pet behavior recognition and health monitoring, driving technological progress in this field.

[0005] In the application field, smart pet devices are rapidly developing towards multimodal and scenario-based applications. In the domestic market, professional brands such as Petkit and Homan focus on the research and development of products like smart litter boxes and automatic feeders. By integrating modules such as weighing and infrared sensing, they collect basic data, providing convenience for daily health monitoring of pets. Ecosystem brands like Xiaomi, relying on their vast smart home systems, actively promote data linkage between devices, further enhancing the intelligence level of pet health management. In the international market, UCloudlink's PetCam pet camera, launched at CES 2026, achieves multi-sensory interaction of "hearing, speaking, and seeing" through global network technology, marking a transformation of smart pet devices from "functional tools" to "emotional companions." Furthermore, smart pet cameras launched by AI technology companies such as Kuaitong Technology, based on a multimodal deep learning framework, use the YOLOv8 algorithm combined with optical flow and temporal models for dynamic tracking, achieving a leap from "seeing" to "understanding" pet body language, providing pet owners with deeper and more comprehensive information about their pets' behavior.

[0006] Despite this, current pet movement area detection technology still faces numerous challenges, including abnormal detection results and unsatisfactory performance of smart devices in complex home environments. A deeper analysis reveals that the root cause lies in the contradiction between the extreme diversity of pet morphology and behavior and the high cost of data annotation, making it difficult to establish universal feature representations in models. Specifically, pet faces lack stable key anchor points, and breed differences are significant, causing traditional key-point-based detection algorithms to drop in accuracy by more than 40%. Simultaneously, high-frequency texture features from fur coverage are often misinterpreted as noise, resulting in the loss of focus reference information and further impacting detection accuracy. Pet movement exhibits non-periodic characteristics; head twisting angular velocities can reach 120° / s, far exceeding those of humans, leading to a 68% failure rate in tracking focus during rapid pet movement using traditional focusing systems. Furthermore, actions such as jumping and rolling cause sudden changes in shooting distance, easily resulting in exposure imbalances such as "black face, white paws," severely affecting image quality. Dark fur absorbs light 1.8 times more than human skin, easily leading to overall underexposure, while white fur, due to its high reflectivity, causes highlight blowout. This extreme contrast renders traditional metering algorithms based on grayscale averages completely ineffective, failing to accurately capture pet image information. Furthermore, integrating heterogeneous data remains a core technical challenge. For example, the heart rate data sampling frequency of smart collars is 1Hz, while video behavior recognition can reach 30fps. Alignment errors in the temporal dimension distort the correlation analysis between behavior and physiological state. 
Algorithm bias is particularly prominent in rare behavior recognition. With common behavior samples accounting for over 85% of the training dataset, the model's accuracy in recognizing rare behaviors is less than 60%, failing to meet the demands for comprehensive and accurate detection in practical applications.

Summary of the Invention

[0007] The purpose of this invention is to provide a method and system for detecting and analyzing pet living area behavior based on multimodal analysis. By integrating multimodal time-series information from sensors based on different physical principles and combining machine learning and data fusion technologies, it achieves intelligent analysis of the entire process from raw signal acquisition, feature extraction, multimodal fusion to behavior recognition and health warning, thereby solving at least one of the aforementioned problems in the prior art.

[0008] In a first aspect, the present invention provides a multimodal method for detecting and analyzing pet behavior in living areas, the method specifically comprising:
constructing a collaborative sensing network from multiple sensors deployed in the pet's movement area to acquire multimodal heterogeneous raw sensor data;
performing spatiotemporal synchronization and spatial coordinate system calibration on the raw sensor data, and performing feature extraction and multi-level fusion to obtain a joint feature vector that characterizes the pet's state and behavior;
establishing a personalized behavioral baseline model for each pet, and comparing the joint feature vector with the personalized behavioral baseline model to identify abnormal behavioral patterns that deviate from the norm;
matching the identified abnormal behavior patterns with a preset abnormal pattern classifier to perform arthritis risk warning analysis, urinary disease risk warning analysis, and stress or pain warning analysis, and outputting proactive warning information for specific health risks.

[0009] In a second aspect, the present invention provides a multimodal pet living area behavior detection and analysis system, the system specifically comprising:
a data acquisition module, used to build a collaborative sensing network through multiple sensors deployed in the pet's movement area to acquire multimodal heterogeneous raw sensor data;
a data fusion module, used to perform spatiotemporal synchronization and spatial coordinate system calibration on the raw sensor data, and to perform feature extraction and multi-level fusion to obtain a joint feature vector that characterizes the pet's state and behavior;
an anomaly detection module, used to build a personalized behavior baseline model for each pet, compare the joint feature vector with the personalized behavior baseline model, and identify abnormal behavior patterns that deviate from the norm;
an early warning analysis module, used to match the identified abnormal behavior patterns with the preset abnormal pattern classifier, to perform early warning analysis for arthritis risk, urinary disease risk, and stress or pain, respectively, and to output proactive early warning information for specific health risks.

[0010] Compared with the prior art, the present invention has at least one of the following technical effects:
1. This invention integrates multimodal time-series information from sensors based on different physical principles, and combines machine learning and data fusion technologies to achieve intelligent analysis of the entire process from raw signal acquisition, feature extraction, multimodal fusion to behavior recognition and health warning.
2. This invention aims to overcome the shortcomings of single sensors in high-speed movement, hair obstruction, and extreme lighting by integrating information from multiple modal sensors such as vision, thermal imaging, and millimeter-wave radar, thereby ensuring continuous and stable perception of the cat's core living area.
3. This invention solves the challenges of spatiotemporal alignment and deep correlation analysis of multi-source data, enabling accurate identification of abnormal behavior patterns. This leads to the construction of a highly robust and accurate non-disruptive algorithm system, providing core technical support for establishing personalized health baselines for pets and proactively warning of early disease risks.
4. This invention utilizes the Transformer encoder model to deeply fuse high-dimensional joint feature vectors, which can effectively learn the relationships between features and accurately represent the instantaneous state and behavior of pets.
5. This invention establishes a personalized behavioral baseline model and identifies abnormal behavior patterns, which can accurately determine abnormalities for each pet and improve the accuracy of abnormal behavior identification.
6. This invention obtains a personalized behavior baseline model through unsupervised training of a variational autoencoder, which can effectively learn the distribution of normal behavior and provide a reliable model for abnormal behavior identification.
7. This invention calculates statistical thresholds based on the reconstruction error distribution, which can scientifically distinguish between normal and abnormal behavior, thus improving the accuracy and reliability of abnormal behavior identification.
8. This invention matches abnormal behavior patterns with a preset classifier to perform various health risk early warning analyses, which can comprehensively assess the pet's health status and output early warning information in a timely manner.
9. This invention inputs different risk characteristics into corresponding classifiers and outputs early warning information in parallel, which can efficiently and accurately provide early warnings for different health risks, improving the timeliness and pertinence of early warnings.

Attached Figure Description

[0011] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
Figure 1 is a flowchart of a multimodal pet living area behavior detection and analysis method according to an embodiment of the present invention.
Figure 2 is a schematic diagram of the structure of a multimodal pet living area behavior detection and analysis system provided in an embodiment of the present invention.

Detailed Implementation

[0012] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.

[0013] The following is an explanation of algorithm-related terminology: FMCW: Frequency-Modulated Continuous Wave; CNN: Convolutional Neural Network; IMU: Inertial Measurement Unit; LSTM: Long Short-Term Memory; VAE: Variational Autoencoder; NTP: Network Time Protocol; KL divergence: Kullback-Leibler divergence, a method for measuring the difference between two probability distributions.

[0014] In this embodiment of the application, the process is executed by a terminal device. This terminal device includes, but is not limited to, devices capable of executing the methods disclosed in this application, such as servers, computers, smartphones, and tablets. Figure 1 shows a flowchart of a multimodal pet living area behavior detection and analysis method according to an embodiment of the present invention, which comprises the following steps.
S101 constructs a collaborative sensing network by deploying multiple sensors in the pet's movement area to acquire multimodal heterogeneous raw sensor data.

[0015] In this embodiment, to address the issue of single sensors easily failing in complex home environments, the present invention deploys a group of sensor nodes with complementary physical principles around the target monitoring area, forming a collaborative sensing network. The specific configuration and functions are as follows:
Visual sensors: Deployed in front of or diagonally above the area to acquire high-resolution color images / video streams. Their primary function is to capture the pet's appearance, posture, fine movements, and texture information, providing rich visual semantics.
Thermal imaging sensor: Deployed in the same location as the visual sensor. Its core function is to identify the thermal outline of living organisms. It can provide stable presence detection unaffected by visible light in completely dark, bright light, or partially occluded (such as by hair or lightweight fabric) scenarios, effectively compensating for the shortcomings of vision under extreme lighting or occlusion.
Millimeter-wave radar sensors (FMCW radar): Deployed on ceilings or side walls. They work by transmitting and receiving frequency-modulated continuous waves to sense the distance, speed, and minute movements of targets. Their advantages include the ability to penetrate thin non-metallic obstructions, immunity to lighting conditions, and no motion blur during high-speed motion, making them particularly suitable for capturing the fast, non-linear movements of pets.
Inertial measurement unit (IMU) / pressure sensor: The IMU can be built into a pet wearable device to directly measure the pet's triaxial acceleration and angular velocity, from which its motion posture is calculated. Pressure / deformation sensors can be integrated into contact surfaces such as the bottom of the cat bed and litter box to sense changes in the pet's contact, weight distribution, dwell state, and mechanical behavior patterns.
Audio sensor: Used to collect ambient sounds and help identify sounds related to specific behaviors (such as litter-scratching sounds, eating sounds, and calls), providing contextual information for behavior recognition.

[0016] The core of this embodiment lies in the complementary advantages and redundant design of heterogeneous sensors. When the performance of any sensor degrades due to environmental challenges (such as visual overexposure caused by strong light, visual failure caused by darkness, or partial radar echo obstruction by hair), other sensors can provide redundant or complementary information, ensuring the robustness and continuity of perception at the system level and overcoming the fragility of traditional single-sensor solutions.

[0017] This embodiment systematically solves the failure problem of existing single sensing technologies in complex scenarios such as the diversity of pet morphology, high-speed non-periodic movement, hair occlusion, and extreme lighting by deploying a multimodal sensor collaborative sensing network with complementary physical principles. The heterogeneous combination of vision, thermal imaging, millimeter-wave radar, IMU / pressure, and audio sensors achieves information redundancy and complementary advantages. Specifically, when vision fails due to darkness, strong light, or occlusion, thermal imaging and millimeter-wave radar can provide stable detection of living beings unaffected by visible light interference, ensuring continuous monitoring; millimeter-wave radar does not produce blurring for high-speed movement, and combined with event camera and IMU data, it can accurately capture high-speed head movements of up to 120° / second, solving the high failure rate problem of traditional visual tracking; at the same time, this solution effectively overcomes the predicament of traditional photometric algorithms failing due to pet hair color, ensuring the accuracy and stability of perception at the system level.

[0018] S102 performs spatiotemporal synchronization and spatial coordinate system calibration on the raw sensor data, and performs feature extraction and multi-level fusion to obtain a joint feature vector for characterizing the pet's state and behavior.

[0019] In this embodiment, to solve the problems of "difficult alignment and difficult fusion", the multi-source heterogeneous raw data is transformed into unified, high-quality joint features that can be used for high-level analysis.

[0020] First, the data needs to be synchronized and aligned in time and space: 1) Time synchronization: All sensor data streams are timestamped using high-precision hardware triggering or Network Time Protocol (NTP). For data with different sampling rates (e.g., 1Hz heart rate and 30fps video), interpolation or resampling techniques are used to unify them to the same time reference, ensuring that behavioral events can be accurately correlated in multimodal data. 2) Spatial calibration: Through multi-sensor joint calibration, coordinate transformation relationships are established between visual, thermal imaging, and radar data, mapping all targets observed by all sensors to the same world coordinate system. This is a prerequisite for achieving multi-view information fusion.
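The two alignment steps above can be illustrated with a minimal sketch. It assumes linear interpolation for resampling and a known rigid transform (rotation matrix plus translation) as the output of joint calibration; the function names, sample values, and extrinsic parameters are hypothetical and are not taken from the patent.

```python
from bisect import bisect_left

def resample_linear(src_t, src_v, dst_t):
    """Linearly interpolate samples (src_t, src_v) onto timestamps dst_t.
    Timestamps outside the source range are clamped to the nearest sample."""
    out = []
    for t in dst_t:
        i = bisect_left(src_t, t)
        if i == 0:
            out.append(src_v[0])
        elif i == len(src_t):
            out.append(src_v[-1])
        else:
            t0, t1 = src_t[i - 1], src_t[i]
            w = (t - t0) / (t1 - t0)
            out.append(src_v[i - 1] + w * (src_v[i] - src_v[i - 1]))
    return out

def to_world(point, rotation, translation):
    """Map a 3-D point from a sensor frame to the world frame: R * p + t."""
    return [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]

# Hypothetical 1 Hz heart-rate samples, resampled onto 30 fps frame timestamps.
hr_t = [0.0, 1.0, 2.0]
hr_v = [120.0, 126.0, 123.0]
frame_t = [k / 30.0 for k in range(61)]
hr_per_frame = resample_linear(hr_t, hr_v, frame_t)

# Hypothetical radar extrinsics: 90 degree rotation about z plus an offset.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 0.0, 2.5]
world_point = to_world([2.0, 0.0, 0.0], R, t)
```

After this step every video frame carries a heart-rate value on the same time base, and a radar detection can be compared with a camera detection in one coordinate system. A real deployment would also need to handle clock drift and sensor latency, which this sketch ignores.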

[0021] Next, multimodal feature extraction is performed on the spatiotemporally synchronized and aligned data. For example, for visual data, pre-trained convolutional neural networks (CNN, YOLOv8 for object detection, HRNet or Keypoint Interactive Transformer for pose estimation) are used to extract features such as appearance, pose, and bounding boxes. For thermal imaging data, heat source contours and temperature distribution statistical features are extracted. For millimeter-wave radar data, the raw intermediate frequency signal is processed to generate range-Doppler maps or micro-Doppler spectra, and macroscopic motion velocity and micro-motion features of the target are extracted. For IMU data, temporal / frequency domain features such as motion energy, attitude angle, and action frequency are extracted. For pressure data, pressure center trajectory, dwell time, and pressure change patterns are extracted.
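As an illustration of the hand-crafted features named above for pressure and IMU data, the following sketch computes a pressure center, a motion energy, and an action frequency. The exact feature definitions are assumptions made for this example (the patent does not give formulas), and the naive DFT is used only to keep the code dependency-free.

```python
import math

def center_of_pressure(grid):
    """Pressure-weighted centroid (row, col) of a 2-D pressure grid,
    or None when no pressure is registered."""
    total = sum(sum(row) for row in grid)
    if total == 0:
        return None
    r = sum(i * v for i, row in enumerate(grid) for v in row) / total
    c = sum(j * v for row in grid for j, v in enumerate(row)) / total
    return (r, c)

def motion_energy(accel):
    """Mean squared magnitude of triaxial acceleration samples (ax, ay, az)."""
    return sum(ax * ax + ay * ay + az * az for ax, ay, az in accel) / len(accel)

def dominant_frequency(signal, fs):
    """Frequency in Hz of the strongest non-DC bin of a naive DFT."""
    n = len(signal)
    mean = sum(signal) / n
    best_k, best_power = 0, -1.0
    for k in range(1, n // 2 + 1):
        re = sum((x - mean) * math.cos(2 * math.pi * k * i / n)
                 for i, x in enumerate(signal))
        im = sum((x - mean) * math.sin(2 * math.pi * k * i / n)
                 for i, x in enumerate(signal))
        power = re * re + im * im
        if power > best_power:
            best_k, best_power = k, power
    return best_k * fs / n

# Hypothetical inputs: a 2x2 pressure mat and a 5 Hz tremor sampled at 50 Hz.
cop = center_of_pressure([[0.0, 0.0], [0.0, 4.0]])
energy = motion_energy([(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)])
tremor = [math.sin(2 * math.pi * 5 * i / 50) for i in range(50)]
freq = dominant_frequency(tremor, fs=50)
```

Tracking `center_of_pressure` over successive frames gives the pressure center trajectory, and the time during which it is not `None` gives a simple dwell-time estimate.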

[0022] Then, multi-level feature fusion is performed on the extracted multimodal features, including feature-level fusion and decision-level fusion. In the feature-level fusion stage, the aligned feature vectors extracted from each modality are concatenated or weighted and fused using deep learning models (such as Transformer) with attention mechanisms to form a unified joint feature vector containing multi-dimensional information. This method can uncover deep correlations between features from different modalities. In the decision-level fusion stage, each modality makes an initial judgment independently, and then the initial decisions of each modality are combined through rules or models to arrive at the final conclusion.
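The two fusion levels can be sketched in a few lines. This is a deliberately simplified stand-in, not the Transformer described above: feature-level fusion is reduced to a single scaled dot-product attention over per-modality embeddings, and decision-level fusion to a reliability-weighted average of per-modality class probabilities. The query vector, the reliability weights, and all values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def feature_level_fusion(embeddings, query):
    """Weight each modality embedding by the softmax of its scaled dot
    product with a query vector, then sum into one joint feature vector."""
    d = len(query)
    names = sorted(embeddings)
    scores = [
        sum(q * x for q, x in zip(query, embeddings[n])) / math.sqrt(d)
        for n in names
    ]
    weights = softmax(scores)
    fused = [
        sum(w * embeddings[n][i] for w, n in zip(weights, names))
        for i in range(d)
    ]
    return fused, dict(zip(names, weights))

def decision_level_fusion(probs, reliability):
    """Combine per-modality class probabilities by a reliability-weighted mean."""
    total = sum(reliability[n] for n in probs)
    classes = list(next(iter(probs.values())))
    return {
        c: sum(reliability[n] * probs[n][c] for n in probs) / total
        for c in classes
    }

# Hypothetical aligned embeddings (same dimension per modality).
fused, weights = feature_level_fusion(
    {"vision": [1.0, 0.0], "thermal": [0.0, 1.0]}, query=[0.0, 0.0]
)

# Hypothetical per-modality decisions with equal reliability.
decision = decision_level_fusion(
    {"vision": {"rest": 0.2, "jump": 0.8}, "radar": {"rest": 0.4, "jump": 0.6}},
    reliability={"vision": 1.0, "radar": 1.0},
)
```

Lowering a modality's reliability weight (for example, vision in darkness) is one simple way to express the redundancy idea: the remaining modalities then dominate the fused decision.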

[0023] The core of the multimodal feature extraction and multi-level feature fusion step lies in systematically solving the problems of spatiotemporal alignment and deep fusion of multi-source heterogeneous data. Accurate spatiotemporal alignment is the foundation for meaningful cross-modal correlation analysis. Multi-level fusion strategies can uncover deep correlations between different modalities. For example, by jointly modeling "specific postures recognized by vision," "local body temperature abnormalities shown by thermal imaging," and "specific frequency jitter detected by IMU," we can identify complex, clinically significant behavioral patterns that are difficult to discover with single modality or simple decision fusion, which is key to achieving the goal of "deep behavioral understanding."

[0024] This embodiment addresses the core challenges of "difficult alignment and fusion" of multi-source heterogeneous data through precise spatiotemporal synchronization, coordinate calibration, and feature-level / decision-level fusion. The system can accurately correlate information from different dimensions (such as posture, body temperature, micro-movements, sound, and pressure) in space and time, and utilizes deep learning models for deep feature fusion and joint modeling. This enables the system not only to recognize single actions but also to uncover complex behavioral patterns and intentions inherent in multimodal features. For example, it can correlate stiff posture, localized abnormal body temperature, and tremors at specific frequencies to identify complex behaviors with potential health implications, achieving a fundamental improvement in understanding pet behavior from simply "seeing" the surface to "reading" the underlying physiological and psychological states.

[0025] S103 establishes a personalized behavioral baseline model for each pet, compares the joint feature vector with the personalized behavioral baseline model, and identifies abnormal behavioral patterns that deviate from the norm.

[0026] In this embodiment, a long-term, personalized behavioral and physiological data profile is established for each cat. Using the fused features output by the preceding step, unsupervised or self-supervised learning algorithms (such as variational autoencoders (VAE) and contrastive learning) are employed to learn the distribution of the pet's behavioral patterns under healthy and normal conditions, thereby constructing a "personalized behavioral baseline." This baseline encompasses a series of personalized parameters, including entry and exit frequency, posture habits, movement patterns, and toileting behavior sequences.

[0027] Temporal models (such as Long Short-Term Memory networks LSTM and Transformers) are used to model continuous joint feature sequences to identify complex behavioral units. Real-time acquired and fused data is compared with established individual baselines. Abnormal behavioral sequences deviating from the individual's normal pattern are identified by calculating reconstruction error, distribution differences (such as KL divergence), or by applying specialized anomaly detection algorithms.
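One of the comparison measures mentioned above, KL divergence, can be illustrated on a simple discrete case. The sketch assumes the baseline and the current observation are both summarized as histograms over the same bins (here, hypothetical litter-box visit counts per six-hour block); the binning and the values are illustrative assumptions.

```python
import math

def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for two discrete distributions over the same bins.
    eps guards against division by zero when q has empty bins."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical visit counts per 6-hour block: long-term baseline vs. today.
baseline = normalize([2, 3, 3, 2])
today = normalize([8, 1, 1, 0])

score_same = kl_divergence(baseline, baseline)
score_today = kl_divergence(today, baseline)
```

A score near zero means today's distribution matches the individual baseline; a larger score indicates a shift in behavior. How large a score should count as abnormal is a calibration choice that the sketch leaves open.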

[0028] This embodiment innovatively introduces the concept of "personalized behavior baseline," establishing a normal behavior model for each pet through unsupervised learning. Combined with time-series models and anomaly detection algorithms, it achieves a shift from generalized judgment to personalized, accurate identification. This effectively overcomes the problem of low recognition rates for rare behaviors in general models due to sample bias.

[0029] S104 matches the identified abnormal behavior patterns with a preset abnormal pattern classifier to perform arthritis risk warning analysis, urinary disease risk warning analysis, and stress or pain warning analysis, and outputs proactive warning information for specific health risks.

[0030] In this embodiment, a knowledge base or classifier of "abnormal patterns" related to known health risks is defined or trained. When a detected abnormal pattern matches a preset rule, the system triggers an alert.

[0031] To predict arthritis risk, an arthritis risk classifier is trained. The training steps include: 1. Input parameters: Visual-posture features: Coordinate sequences (x, y, confidence) of 17 body key points extracted from the Keypoint Interactive Transformer (KIT) model, with particular attention to the relative angles, range of motion, and smoothness of hind limb joints (hip, knee, ankle); IMU motion features: Triaxial acceleration and gyroscope data from wearable devices, extracting peak impact, gait cycle symmetry index, and distribution of kinetic energy in the vertical / horizontal directions; Pressure distribution features: Pressure sensor array data from smart cat beds / floor mats, extracting pressure center trajectory, average pressure in the hind limb region, and pressure distribution asymmetry index (comparing weight-bearing on the left and right hind limbs).
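Two of the input features listed above can be computed directly from keypoints and pressure readings. The sketch below assumes a joint angle is measured at the middle keypoint of a hip-knee-ankle triple and that asymmetry is expressed as |L - R| / (L + R); both definitions are assumptions for illustration, since the patent names the features without giving formulas.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at keypoint b between segments b->a and b->c.
    Returns None if either segment has zero length."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return None
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))

def asymmetry_index(left, right):
    """|L - R| / (L + R): 0 means equal loading, 1 means fully one-sided."""
    total = left + right
    return abs(left - right) / total if total > 0 else 0.0

# Hypothetical (x, y) keypoints for hip, knee, ankle of one hind limb.
knee_angle = joint_angle((1.0, 0.0), (0.0, 0.0), (0.0, 1.0))

# Hypothetical mean pressure under the left and right hind limbs.
hind_asymmetry = asymmetry_index(3.0, 1.0)
```

Keypoints with low confidence would normally be filtered out before computing angles; that step is omitted here.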

[0032] 2. Model training process:
Data preparation: Collect normal activity data of target cats in a healthy state over several weeks; collect multimodal behavioral data segments of cats that have been diagnosed with arthritis by a veterinarian (or are clearly symptomatic) and label them (normal / early risk / obvious risk).
Phase 1 training (unsupervised): The VAE is trained using the target cat's baseline health data; the loss function is reconstruction loss (MSE) + KL divergence (which regularizes the latent variable distribution); after training, the reconstruction error of all healthy samples is calculated, and a threshold is set (e.g., µ + 2σ).
Phase 2 training (supervised): The classifier is trained using a labeled dataset; the input is a multimodal feature sequence within a fixed time window (e.g., 10 seconds); the loss function is weighted cross-entropy loss (to address class imbalance); the Adam optimizer is used; validation is performed using k-fold cross-validation to ensure the model's generalization ability.
Inference: Real-time data, after feature extraction and fusion, is simultaneously input into the trained VAE and the classifier.
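The loss terms and the statistical threshold described above can be written out as a small numeric sketch. It assumes the common VAE setup with a diagonal Gaussian posterior and a standard normal prior, for which the KL term has the closed form used below; it also assumes σ is the population standard deviation. Network training itself is not shown, and all sample values are hypothetical.

```python
import math
import statistics

def vae_loss(x, x_hat, mu, log_var):
    """MSE reconstruction loss plus KL(N(mu, sigma^2) || N(0, 1)),
    with the KL term summed over latent dimensions."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    kl = -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
    return mse + kl

def personalized_threshold(errors, k=2.0):
    """mu + k * sigma over reconstruction errors of healthy samples."""
    return statistics.mean(errors) + k * statistics.pstdev(errors)

def weighted_cross_entropy(probs, label, class_weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    return -class_weights[label] * math.log(probs[label])

# Perfect reconstruction with a posterior equal to the prior gives zero loss.
zero_loss = vae_loss([1.0, 2.0], [1.0, 2.0], mu=[0.0, 0.0], log_var=[0.0, 0.0])

# Hypothetical reconstruction errors measured on one cat's healthy data.
healthy_errors = [0.10, 0.12, 0.11, 0.09, 0.13]
threshold = personalized_threshold(healthy_errors)

# Rare classes get larger weights so their errors count more (hypothetical).
weights = {"normal": 1.0, "early_risk": 4.0, "obvious_risk": 8.0}
loss = weighted_cross_entropy(
    {"normal": 0.3, "early_risk": 0.5, "obvious_risk": 0.2}, "early_risk", weights
)
```

Because the threshold is derived from each cat's own healthy error distribution, the same reconstruction error can be normal for one animal and abnormal for another.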

[0033] The following conditions must be met simultaneously to trigger an alert: a) The VAE reconstruction error consistently exceeds the personalized threshold; b) The risk probability output by the classifier exceeds a preset threshold (e.g., P(risk) > 0.7). Once the conditions are met, the system generates an alert: "Suspected arthritis risk behavior pattern detected: jump height decreased by 20% from the baseline, and landing posture asymmetry increased. Please observe carefully."
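The two-condition trigger can be expressed as a small gate function. The sketch interprets "consistently exceeds" as "exceeds in each of the last N windows", where N (`min_consecutive`) is a hypothetical parameter that the patent does not specify; the 0.7 probability threshold is the example value given above.

```python
def should_alert(recon_errors, threshold, risk_prob,
                 min_consecutive=3, prob_threshold=0.7):
    """Return True only when (a) the last min_consecutive reconstruction
    errors all exceed the personalized threshold and (b) the classifier
    risk probability exceeds prob_threshold."""
    recent = recon_errors[-min_consecutive:]
    sustained = (
        len(recent) == min_consecutive
        and all(e > threshold for e in recent)
    )
    return sustained and risk_prob > prob_threshold

# Hypothetical per-window reconstruction errors and classifier outputs.
alert_a = should_alert([0.10, 0.20, 0.21, 0.22], threshold=0.14, risk_prob=0.82)
alert_b = should_alert([0.10, 0.20, 0.12, 0.22], threshold=0.14, risk_prob=0.82)
alert_c = should_alert([0.20, 0.21, 0.22], threshold=0.14, risk_prob=0.55)
```

Requiring both conditions reduces false alarms: a one-off deviation from the baseline or a confident classifier output alone does not raise a warning.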

[0034] For the risk prediction of urinary tract diseases, a urinary tract disease risk classifier is trained. The training steps include: 1. Input parameters: Pressure and weight sequence: data from pressure sensors at the bottom of the litter box, with core inputs including total weight change, valid excretion event indicators (sudden weight drop and rebound consistent with excretion patterns), and number of invalid contacts; Visual sequence: YOLOv8 detected entry and exit events from the litter box, and KIT posture estimation of the in-litter posture (such as prolonged squatting without excretion); Audio features: specific sound events identified by audio models (such as VGGish), such as "painful cries" and "frequent scratching sounds"; Temporal and frequency features: frequency of entering the litter box, duration of a single stay, and number of invalid visits within 24 hours.

[0035] 2. Model training process: Data annotation: A large number of "toilet behavior segments" were annotated and marked as "normal", "suspicious" (frequent entry and exit but no excretion) and "abnormal" (painful expression but no excretion); LSTM and scorer training: This ensures the model outputs significantly higher scores for "abnormal / suspicious" segments than for "normal" segments; Loss function: Contrastive Loss or Triplet Loss encourages the model to widen the gap between normal and abnormal patterns in the feature space; Rule base construction: Collaborating with veterinary experts to define quantitative rules for "urinary tract disease risk," such as: "Entering the litter box more than 5 times within 2 hours, and more than 60% of visits without recording effective changes in excretory weight," or "Squatting in the litter box for more than 3 minutes at a time, and posture analysis showing continuous abdominal tension." Threshold calibration: Using the validation set, adjust the threshold for anomaly scores and the frequency / duration threshold in the rules to achieve a balance between false positive and false negative rates; Reasoning and Early Warning: The system analyzes toilet-related event flows in real time. When an abnormal pattern is output by the LSTM model and the logical conditions of the rule engine are met, a high-level early warning is triggered: "A highly suspected lower urinary tract disease behavior pattern has been detected: 7 invalid visits in the past 2 hours, accompanied by painful postures. It is recommended to seek medical attention promptly."
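The two example rules quoted above, and the requirement that both the LSTM model and the rule engine agree before a high-level warning is raised, can be sketched as follows. The `Visit` record and its field names are hypothetical, and applying the squatting rule only to visits inside the same two-hour window is an assumption made here.

```python
from dataclasses import dataclass

@dataclass
class Visit:
    start_s: float                  # visit start time, seconds
    duration_s: float               # time spent in the litter box
    excreted: bool                  # True if a valid excretion weight change was recorded
    abdominal_tension: bool = False # from posture analysis

def urinary_rule_hit(visits, now_s, window_s=2 * 3600, max_visits=5,
                     invalid_ratio=0.6, max_squat_s=180):
    """Evaluate the two example rules over visits in the last window_s seconds."""
    recent = [v for v in visits if now_s - window_s <= v.start_s <= now_s]
    invalid = sum(1 for v in recent if not v.excreted)
    # Rule 1: more than max_visits entries, over invalid_ratio without excretion.
    frequent = len(recent) > max_visits and invalid / len(recent) > invalid_ratio
    # Rule 2: a single squat longer than max_squat_s with abdominal tension.
    long_squat = any(
        v.duration_s > max_squat_s and v.abdominal_tension for v in recent
    )
    return frequent or long_squat

def high_level_warning(lstm_abnormal, rule_hit):
    """Both the learned model and the rule engine must agree."""
    return lstm_abnormal and rule_hit

# Hypothetical event stream: seven invalid visits within two hours.
visits = [Visit(start_s=600.0 * i, duration_s=40.0, excreted=False) for i in range(7)]
rule = urinary_rule_hit(visits, now_s=4000.0)
warning = high_level_warning(lstm_abnormal=True, rule_hit=rule)
```

Keeping the thresholds as parameters matches the threshold calibration step: they can be tuned on a validation set without changing the rule logic.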

[0036] For stress or pain warnings, a stress or pain classifier is trained. The training steps include: 1. Input parameters: Thermal imaging features: the local temperature of specific body parts (such as joints and abdomen) and its difference from the average body temperature; Behavioral and postural features: hiding time, frequency and duration of grooming behavior, activity level (IMU data), and degree of curling posture (calculated from KIT key points); Audio features: vocal frequency, tone features (such as high-frequency groaning), and snoring patterns output by the audio event detection model (abnormal snoring may be related to pain); Environmental context: possible external stressors such as visitor records for the day and environmental noise levels.

[0037] 2. Training process: Data preparation and annotation: High-quality labeled data is required, specifying the time periods when cats experienced stress (e.g., after a stranger's visit) or pain (e.g., during post-operative recovery). Data collection is extremely challenging; Self-supervised pre-training: A graph contrastive learning method is used on a large amount of unlabeled data. Positive and negative sample pairs are generated through data augmentation (such as randomly masking certain modalities and shuffling the time order), and the GNN encoder is trained to learn a discriminative graph representation that can distinguish between normal behavior graphs and abnormal behavior graphs; Supervised fine-tuning: On labeled data, a pre-trained GNN encoder and a classifier head are fine-tuned using graph-level labels (normal / stress / pain); Loss function: cross-entropy loss; Inference and Early Warning: Real-time data is segmented into sliding time windows, with each window constructing a graph, which is then input into a trained GNN model. When the model outputs a probability of "pain" or "stress" exceeding a threshold, and the confidence level of key associated features is high, an early warning is generated: "Suspected signs of pain detected: Persistent high temperature in the left hind limb area over the past 3 hours, accompanied by abnormally high frequency of grooming behavior, and a 70% decrease in activity level; attention is advised."

[0038] This embodiment defines an "abnormal pattern" classifier based on multimodal associations, enabling the system to proactively identify behavioral sequences that deviate from the baseline and are associated with specific disease risks, thereby triggering accurate health risk warnings. This marks a shift in pet health monitoring from "passive data recording" to "proactive health intervention," which not only improves pet welfare and alleviates owner anxiety but also provides core technological support for the smart pet products industry to upgrade towards high-value-added, precise management services, yielding significant social and economic benefits.

[0039] Combining steps S101 to S104 above, taking the monitoring and urinary health early warning system for the litter box area in the pet exercise area as an example, the system collects data collaboratively by deploying multimodal sensors and uses the Network Time Protocol (NTP) for time synchronization. The system performs spatiotemporal alignment on the collected video, heatmap, radar signals, pressure, and audio data, extracts features using models such as YOLOv8, KIT, FFT, and VAE, and then performs deep fusion via a Transformer encoder. Based on the personalized behavioral baseline established during the learning period, the system uses time-series models such as LSTM to monitor behavioral sequences in real time. By matching these sequences with a pre-set "abnormal pattern - health risk" knowledge base (such as urinary tract disease risk rules: frequent entry, prolonged stay, no effective excretion), the system ultimately achieves a complete closed loop from multimodal perception to proactive health early warning.

[0040] Specifically, firstly, a high-definition camera and a thermal imaging sensor (co-located) are deployed in the target area of the litter box to collect RGB video streams and infrared thermal images to obtain visual appearance, posture, and thermal outline information of living beings; secondly, a frequency-modulated continuous wave millimeter-wave radar is deployed directly above the litter box to collect the radar's raw intermediate frequency signal to penetrate non-metallic obstructions and sense the target's distance, speed, and micro-movements; thirdly, a high-precision array-type pressure sensor is integrated at the bottom of the litter box to collect time-series data on pressure distribution and total weight changes to sense contact, residence, and excretion events; and fourthly, a directional microphone is deployed on the side of the litter box to collect environmental audio waveform data to identify sounds related to specific behaviors.

[0041] All sensors are connected to the edge computing gateway via wired or wireless networks. The gateway uses Network Time Protocol (NTP) or hardware trigger signals to assign a unified, high-precision timestamp to all the multi-source heterogeneous data streams (video frames, thermal imaging frames, radar signals, pressure readings, and audio streams) collected by the sensors, achieving millisecond-level time synchronization. This is a prerequisite for subsequent cross-modal data correlation analysis.

[0042] During system initialization, the transformation matrix from each sensor coordinate system to a unified world coordinate system with the geometric center of the litter box as the origin is calculated using the checkerboard calibration method (for visual / thermal imaging) and the reference target calibration method (for millimeter-wave radar). During runtime, these matrices are used to transform target position information from different sensors (such as radar point clouds, thermal centroids, and visual detection box centers) to the same spatial reference system in real time, achieving spatial alignment.
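The runtime use of the calibrated matrices can be illustrated with homogeneous 4×4 transforms. This is a sketch under stated assumptions: the helper names are hypothetical, the rotation is restricted to yaw for brevity, and the mounting pose in the usage example is invented; a real calibration would estimate a full rotation and translation from the checkerboard or reference target.

```python
import numpy as np


def make_transform(yaw_rad, translation):
    """Build a 4x4 sensor-to-world transform (rotation about z, then translation)."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def to_world(T_sensor_to_world, points_sensor):
    """Map an (N, 3) array of sensor-frame points into the world frame."""
    pts = np.asarray(points_sensor, dtype=float)
    homo = np.hstack([pts, np.ones((pts.shape[0], 1))])  # append w = 1
    return (T_sensor_to_world @ homo.T).T[:, :3]


# Hypothetical radar mounted 1 m above the litter box centre, rotated 90 degrees.
T_radar = make_transform(np.pi / 2, [0.0, 0.0, 1.0])
radar_point_world = to_world(T_radar, [[1.0, 0.0, 0.0]])
```

Applying the same `to_world` call with each sensor's own matrix places radar detections, thermal centroids, and visual box centers in the common litter-box-centered frame.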

[0043] For the spatiotemporally aligned raw data, different models are used in parallel to extract high-level features. Specifically, this includes: using a YOLOv8 object detection model to extract the bounding box coordinates of the cat from the RGB video stream; using a Keypoint Interactive Transformer pose estimation model to extract the coordinate sequence of 17 key points on the cat's body (pose features); using image processing algorithms (such as contour extraction and threshold segmentation) to extract the heat source contour, area, and average temperature from the thermal image; using digital signal processing algorithms (such as Fast Fourier Transform FFT) to process the radar intermediate frequency signal, generate the range-Doppler spectrum, and extract the target range, radial velocity, and micro-Doppler spectral features from it; using time-domain / frequency-domain analysis to extract the weight value, pressure center trajectory, and dwell time from the pressure data; and using a pre-trained audio event detection model (such as VGGish) to analyze the audio waveform, identify and output specific sound event labels (such as "scratching sound" and "meow sound") and their timestamps.

[0044] The aligned feature vectors (pose, heat source, radar motion, pressure, and sound events) extracted in the above steps are concatenated along the time dimension to form a high-dimensional joint feature vector. This vector is then input into a Transformer encoder model based on an attention mechanism. This model can automatically learn the importance of different modal features in different contexts (e.g., automatically increasing the weight of radar and thermal imaging features when vision is occluded), achieving deep, adaptive feature-level fusion, and ultimately outputting a unified fused feature representation that comprehensively characterizes the cat's state and behavior in the current scene.

[0045] During the initial 1-2 week "learning period" after system deployment, the system collects all "fusion feature sequence" data generated by the cat in a healthy state within the litter box area. Subsequently, unsupervised learning methods, such as variational autoencoders, are used to model these normal temporal fusion feature data, learning their statistical distribution in the latent space, thereby establishing a personalized "normal behavior baseline model" for the cat. This baseline includes personalized parameters such as typical entry and exit frequencies, posture habits, dwell time range, and weight change patterns.

[0046] Once the routine monitoring period begins, the system processes newly generated fused feature data in real time. Using time-series models, such as Long Short-Term Memory networks, it models continuous fused feature sequences, predicts features for the "next moment," and calculates the reconstruction error between the actual and predicted features. When the reconstruction error consistently exceeds a threshold set based on the baseline model, it is considered a behavioral anomaly. For example, by comparing real-time data with the baseline, the system identifies abnormal patterns such as "abnormally high entry frequency," "abnormally prolonged dwell time," and "significantly increased weight detected by pressure sensors after multiple contacts but without recording excretion characteristics."
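The "consistently exceeds a threshold" criterion can be sketched as a persistence check on the error sequence. The text does not specify how the threshold is derived from the baseline or how many consecutive exceedances are required, so the mean-plus-k-standard-deviations rule and the `min_consecutive` parameter below are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def calibrate_threshold(baseline_errors, k=3.0):
    """Assumed rule: threshold = mean + k * std of errors seen in the learning period."""
    return mean(baseline_errors) + k * stdev(baseline_errors)


def persistent_anomaly(errors, threshold, min_consecutive=3):
    """Return True once the error stays above the threshold for enough steps in a row."""
    run = 0
    for e in errors:
        run = run + 1 if e > threshold else 0  # reset the run on any normal step
        if run >= min_consecutive:
            return True
    return False


baseline = [0.10, 0.12, 0.11, 0.09, 0.13]
threshold = calibrate_threshold(baseline)
isolated_spikes = persistent_anomaly([0.1, 0.5, 0.1, 0.5, 0.1], threshold)
sustained_drift = persistent_anomaly([0.1, 0.5, 0.6, 0.7], threshold)
```

Requiring a run of exceedances rather than a single one keeps isolated sensor glitches from being reported as behavioral anomalies.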

[0047] The system has a built-in knowledge base of "abnormal patterns - health risks" association rules. When a combination of multimodal abnormal patterns detected in real time (such as frequent entry, prolonged stay, and no effective excretion) highly matches the preset "suspected lower urinary tract disease risk" rule in this knowledge base, the system will trigger an early warning logic. Finally, the system automatically generates a health warning message containing a specific description of the abnormality and suggestions, and pushes it to the user's mobile app and other terminals via the application programming interface (API), completing a closed loop from multimodal perception to proactive health care.

[0048] In one possible implementation, the training steps of the attention-based Transformer encoder model include: 1. Input and output of the Transformer encoder model: Input: A tensor of shape (T, D), where: T (time steps): the length of the fusion time window. For example, with 5 sampling points per second, a 10-second window has T = 50; D (total feature dimension): the length of the concatenated feature vector from all modalities. Specifically, it can be broken down into: Pose features: 17 keypoints × 3 values per keypoint (x-coordinate, y-coordinate, confidence level) = 51 dimensions; Thermal imaging features: heat source outline, area, average temperature, etc., assumed to be 10 dimensions; Radar motion features: range, velocity, micro-Doppler spectrum features, etc., assumed to be 20 dimensions; Pressure features: weight, pressure center coordinates (x, y), dwell state, etc., assumed to be 5 dimensions; Sound event features: audio embedding vectors extracted using models such as VGGish, assumed to be 128 dimensions; Total: D = 51 + 10 + 20 + 5 + 128 = 214 dimensions under these assumptions (the actual dimensions can be adjusted); Before input, a learnable positional encoding is added to the feature vector at each time step to inject the temporal information of the sequence; Output: A tensor of shape (T, D_model), or a pooled vector of shape (D_model,), where D_model (the dimension of the hidden layers) is typically set to be greater than or equal to the input feature dimension D, for example, 256 dimensions. The output features incorporate contextual information from all time steps and all modalities, making them more discriminative than the original concatenated features.
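The dimension bookkeeping above can be checked with a short sketch. It assumes that each modality has already been resampled to the same T time steps and that the per-modality vectors are joined along the feature axis at each time step, which is the reading consistent with the (T, D) input shape; the dictionary keys and the random placeholder data are illustrative only.

```python
import numpy as np

# Per-modality feature widths taken from the example figures in the text.
FEATURE_DIMS = {
    "pose": 17 * 3,  # 17 keypoints x (x, y, confidence)
    "thermal": 10,
    "radar": 20,
    "pressure": 5,
    "audio": 128,
}


def build_window(features):
    """Join per-modality arrays of shape (T, d_m) into one (T, D) array."""
    return np.concatenate([features[name] for name in FEATURE_DIMS], axis=1)


T = 5 * 10  # 5 samples per second over a 10-second window
rng = np.random.default_rng(0)
window = build_window({name: rng.normal(size=(T, d)) for name, d in FEATURE_DIMS.items()})
```

With these widths the window has shape (50, 214); a linear projection to D_model (for example 256) would typically precede the encoder, although the text does not spell out that step.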

[0049] 2. Training process of the Transformer encoder model: The Transformer encoder is not trained in isolation, but rather as part of the overall end-to-end "behavior recognition and warning" system. Its training objective is to generate optimal feature representations for downstream tasks such as arthritis risk classification and abnormal behavior detection.

[0050] Training objective (loss function): The model itself does not have an independent loss function. Its training is driven by the loss function of the downstream task. For example, for anomaly detection, it might be based on contrastive loss or reconstruction loss (such as VAE); for health risk classification, it would be the standard cross-entropy loss. The parameters of the Transformer encoder are updated with backpropagation of gradients to learn how to fuse features so that the final output fused features minimize the loss of the downstream task.

[0051] Training data and process: Data preparation: Collect a large amount of multimodal sensor data with time-series annotations. Perform spatiotemporal alignment and feature extraction on the data, and construct input sequence samples with (T,D) dimensions.

[0052] Joint training: a. Input the multimodal feature sequence into the Transformer encoder; b. Pass the encoder's output (pooled vector) to a task-specific head network, such as a fully connected classifier or a regressor; c. Calculate the task loss and backpropagate to update the parameters of the entire network, including the Transformer encoder and the task head network; Pre-training: First, the Transformer encoder is pre-trained on a large-scale unlabeled multimodal time series dataset through self-supervised learning, so that it learns general time series and cross-modal representations, and then fine-tuned for downstream tasks. Attention mask: can be used to handle missing values that may exist in a sequence (such as missing data from a certain sensor at a certain moment).
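How an attention mask can exclude missing time steps is sketched below with plain NumPy. This is a single-head, projection-free illustration written for this passage, not the model's actual code: the function name and the boolean `valid` vector (False where a sample is missing) are assumptions.

```python
import numpy as np


def masked_attention(q, k, v, valid):
    """Scaled dot-product attention that ignores keys whose `valid` flag is False."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (T, T) similarity scores
    scores = np.where(valid[None, :], scores, -1e9)  # masked keys get a very low score
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 time steps, 8 features
valid = np.array([True, True, False, True])   # the third time step is missing
fused, attn_weights = masked_attention(x, x, x, valid)
```

Every row of `attn_weights` assigns zero weight to the missing step, so the fused output is built only from time steps that carry real sensor data.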

[0053] In some embodiments, step S102 above, which involves performing spatiotemporal synchronization and spatial coordinate system calibration on the raw sensor data, and extracting features and performing multi-level fusion to obtain a joint feature vector characterizing the pet's state and behavior, specifically includes: By applying a unified timestamp reference to the raw sensor data, interpolating or resampling data at different sampling rates, and performing joint spatial calibration on each sensor to establish coordinate transformation relationships, aligned multi-source heterogeneous data is obtained. Based on the aligned multi-source heterogeneous data, each data stream is processed through a corresponding feature extraction model to obtain posture and appearance features, heat source distribution features, target motion and micro-motion features, mechanical behavior features, and sound event features. The posture and appearance features, heat source distribution features, target motion and micro-motion features, mechanical behavior features, and sound event features are concatenated along the time dimension to form a high-dimensional joint feature vector. The high-dimensional joint feature vector is input into the attention-based Transformer encoder model for deep feature-level fusion, and the output is a joint feature vector that represents the pet's instantaneous state and behavior in the current scene.

[0054] In this embodiment, spatiotemporal synchronization and spatial coordinate system calibration are performed on the raw sensor data. For time synchronization, a unified timestamp reference is applied to all raw sensor data. Since different sensors may have inconsistent sampling rates during data acquisition—for example, some sensors may have higher sampling frequencies than others—this can lead to data not being directly aligned in the time dimension. Therefore, interpolation or resampling methods are used for data with different sampling rates. Interpolation inserts new data points between known data points to increase data density; resampling changes the sampling frequency of the data to match the time intervals of other sensor data. These two methods ensure that the data acquired by different sensors are accurately correlated in time.
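Interpolating a lower-rate stream onto a common timeline can be sketched with `numpy.interp`. The sampling rates and weight values below are invented for the example, and linear interpolation is only one possible choice; the text does not name a specific method.

```python
import numpy as np


def resample_to(common_t, sensor_t, sensor_values):
    """Linearly interpolate one sensor stream onto the shared timestamps."""
    return np.interp(common_t, sensor_t, sensor_values)


# Hypothetical pressure readings at 2 Hz, aligned to a 5 Hz fusion timeline.
pressure_t = [0.0, 0.5, 1.0]           # seconds, already on the unified clock
pressure_kg = [4.0, 4.2, 4.4]
common_t = np.linspace(0.0, 1.0, 6)    # 0.0, 0.2, ..., 1.0
pressure_aligned = resample_to(common_t, pressure_t, pressure_kg)
```

For streams sampled faster than the fusion rate, averaging or low-pass filtering before decimation is a common alternative to point interpolation, since it avoids aliasing.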

[0055] In terms of spatial coordinate system calibration, joint spatial calibration is performed on each sensor to establish coordinate transformation relationships. Different sensors may be located in different spatial positions, and their own coordinate systems may also differ. To accurately integrate the data from each sensor into a unified spatial framework, it is necessary to determine the coordinate transformation relationships between them. For example, by setting specific calibration objects in the pet's movement area, the rotation, translation, and other parameters between the different sensor coordinate systems are calculated using the measurement data of each sensor on the calibration objects, thereby establishing a unified coordinate transformation model. After the above time synchronization and spatial coordinate system calibration processing, aligned multi-source heterogeneous data are obtained. These data achieve accurate correspondence in both time and spatial dimensions, providing a reliable foundation for subsequent feature extraction.

[0056] Based on the aligned multi-source heterogeneous data, each data stream is processed using a corresponding feature extraction model to obtain different types of features.

[0057] For posture and appearance features, image processing and computer vision technologies are used to analyze pet image data and extract information such as the pet's body posture, limb movements, and appearance. For example, by recognizing the pet's body outline and joint positions, it can be determined whether the pet is standing, sitting, or lying down, as well as its fur condition, body shape changes, and other appearance features.

[0058] Heat source distribution characteristics are obtained using thermal imaging sensors. These sensors capture the temperature distribution on the pet's body surface and, by analyzing the thermal image data, extract the heat source distribution characteristics of different parts of the pet's body. For example, it can determine which parts of the pet's body have higher temperatures and whether there are any abnormalities such as localized overheating. These characteristics are of great significance for monitoring the pet's health.

[0059] The target's motion and micro-motion features are primarily obtained through the analysis of video data or motion sensor data. For video data, motion detection algorithms are used to track the pet's movement trajectory within video frames, calculating parameters such as the pet's speed and acceleration to obtain the pet's macroscopic motion features. Simultaneously, for subtle movements, such as minute vibrations caused by the pet's breathing and heartbeat, high-precision motion sensors are used to collect and analyze data, extracting micro-motion features.

[0060] Mechanical behavioral characteristics are acquired using mechanical sensors. Deploying mechanical sensors, such as pressure sensors, in the pet's movement area allows these sensors to detect changes in the pressure the pet exerts on the ground as it moves within the area. Analysis of this pressure data allows for the extraction of the pet's mechanical behavioral characteristics, such as gait and weight distribution. These characteristics help determine the pet's mobility and health status.

[0061] Sound event features are obtained by analyzing sound data collected by audio sensors. Using sound recognition technology, various sounds emitted by pets, such as barks and breathing sounds, are identified, and parameters such as frequency, intensity, and duration are analyzed to extract sound event features. Different sound events may reflect different emotions or health conditions in pets; for example, a pet's painful cries may indicate physical discomfort.

[0062] In the steps described above, posture and appearance features, heat source distribution features, target motion and micro-motion features, mechanical behavior features, and sound event features are acquired. To integrate these different types of features, they are concatenated along the time dimension. Since these features are acquired within the same time frame, concatenation along the time dimension allows them to be combined into a high-dimensional feature vector. This high-dimensional joint feature vector contains information about the pet from multiple aspects, providing a more comprehensive reflection of the pet's state and behavior. For example, at a given point in time, the high-dimensional joint feature vector contains not only the pet's posture information but also information on heat source distribution, motion state, mechanical behavior, and sound features at that moment, providing a rich data foundation for subsequent deep feature fusion.

[0063] The high-dimensional joint feature vector is input into a Transformer encoder model based on an attention mechanism for deep feature-level fusion. The Transformer encoder model is a powerful deep learning model capable of automatically learning important features and relationships between features in data. Based on the attention mechanism, the model can focus on the features in the high-dimensional joint feature vector that play a crucial role in representing the pet's state and behavior, while ignoring some irrelevant or secondary features. Through this deep feature-level fusion, the model can uncover deeper information hidden in the high-dimensional joint feature vector, further extracting and integrating the pet's state and behavioral features. Finally, the Transformer encoder model outputs a joint feature vector representing the pet's instantaneous state and behavior in the current scene. This joint feature vector accurately and comprehensively reflects the pet's state and behavior at the current moment, providing important evidence for subsequent abnormal behavior pattern recognition and health risk warning analysis.

[0064] Furthermore, the aligned multi-source heterogeneous data is processed through corresponding feature extraction models to obtain posture and appearance features, heat source distribution features, target motion and micro-motion features, mechanical behavior features, and sound event features, specifically including: Based on aligned visual data, the bounding box coordinates of the pet are extracted through a pre-trained convolutional neural network object detection model, and the coordinate sequence of multiple key points of the pet's body is extracted through a pose estimation model as pose and appearance features. Based on aligned thermal imaging data, heat source contours are extracted and temperature distribution statistics are performed using image processing algorithms to obtain heat source area, average temperature and local temperature difference features, which are used as heat source distribution features. Based on aligned millimeter-wave radar data, a range-Doppler spectrum is generated through fast Fourier transform, and the target's range, radial velocity, and micro-Doppler spectral features are extracted from the range-Doppler spectrum as target motion and micro-motion characteristics. Based on aligned IMU data and pressure data, weight change value, pressure center trajectory, residence time and pressure distribution asymmetry index are extracted through time domain and frequency domain analysis as mechanical behavior characteristics. Based on aligned audio waveform data, a pre-trained audio event detection model is used to identify and output sound event labels and their timestamps related to specific behaviors, which serve as sound event features.

[0065] In this embodiment, the visual data typically contains rich information about the pet's appearance and posture. To accurately extract the desired features from this data, a pre-trained convolutional neural network object detection model is employed. This model, trained on a large amount of image data, possesses powerful object recognition capabilities. When applied to aligned visual data, the model can quickly and accurately locate the pet's position in the image and output the pet's bounding box coordinates. The bounding box coordinates clearly define the pet's extent in the image, providing a foundation for subsequent feature extraction. After obtaining the bounding box coordinates, a pose estimation model is further used. The pose estimation model is specifically designed to identify the key points of an object's body. After inputting the visual data into this model, it extracts multiple key point coordinate sequences from the pet's body. These key points cover important parts of the pet's body, such as the head and limb joints. Through these key point coordinate sequences, the pet's posture can be clearly depicted, such as standing, sitting, and lying down, while also reflecting changes in the pet's physical appearance, such as the degree of body curvature. Therefore, using the bounding box coordinates and key point coordinate sequences together as pose and appearance features can comprehensively and accurately characterize the pet's state at the visual level.

[0066] Aligned thermal imaging data records the temperature distribution on the pet's body surface. To extract valuable heat source distribution features from this data, image processing algorithms are employed. First, heat source contours are extracted. By setting an appropriate temperature threshold, areas with temperatures above the threshold are identified as heat sources, and their contours are delineated. The heat source contours visually represent the size and shape of the heated areas on the pet's body. Next, temperature distribution statistics are performed, calculating features such as the area of the heat source, average temperature, and local temperature difference. The heat source area reflects the size of the heated area, the average temperature reflects the overall temperature level of that area, and the local temperature difference reveals temperature differences between different parts of the pet's body. These features are crucial for monitoring the pet's health; for example, abnormally high local temperatures may indicate inflammation or infection. Therefore, using heat source area, average temperature, and local temperature difference features as heat source distribution characteristics can effectively characterize the pet's state at the thermal imaging level.
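The threshold segmentation and temperature statistics can be sketched on a small synthetic frame. Several points are assumptions made for illustration: the function and key names, the use of a boolean mask in place of an explicit contour, the pixel-to-area factor, and the definition of local temperature difference as the hottest pixel minus the region mean.

```python
import numpy as np


def heat_source_features(thermal_c, threshold_c, pixel_area_cm2=1.0):
    """Segment pixels above the threshold and summarise the heat source region."""
    mask = thermal_c > threshold_c  # stands in for the extracted contour
    if not mask.any():
        return {"area_cm2": 0.0, "mean_temp_c": None, "local_diff_c": None}
    hot = thermal_c[mask]
    return {
        "area_cm2": float(mask.sum() * pixel_area_cm2),
        "mean_temp_c": float(hot.mean()),
        # Assumed definition: hottest pixel relative to the region average.
        "local_diff_c": float(hot.max() - hot.mean()),
    }


# Synthetic 4x4 frame: 22 C background, a 2x2 warm body with one hotter pixel.
frame = np.full((4, 4), 22.0)
frame[1:3, 1:3] = 38.0
frame[1, 1] = 40.0
features = heat_source_features(frame, threshold_c=30.0)
```

A persistently large `local_diff_c` over one body region is the kind of signal the text associates with possible inflammation.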

[0067] Aligned millimeter-wave radar data contains rich information about pet movement and micro-motions. Millimeter-wave radar senses the position and motion of target objects by transmitting and receiving millimeter-wave signals. To extract target motion and micro-motion features from this data, the millimeter-wave radar data is first processed using a Fast Fourier Transform (FFT) to generate a range-Doppler spectrum. The range-Doppler spectrum is a two-dimensional image where the horizontal axis represents the target's radial velocity and the vertical axis represents the distance between the target and the radar. Key motion information about the target can be extracted from the range-Doppler spectrum, including the target's distance, radial velocity, and micro-Doppler spectral features. The target's distance reflects the spatial relationship between the pet and the radar, while the radial velocity represents the pet's speed of movement towards or away from the radar. Micro-Doppler spectral features are frequency changes caused by the pet's minute body movements, such as breathing, heartbeat, and limb micro-movements. These minute motion features can provide detailed information about the pet's physiological state and behavioral patterns; for example, changes in breathing frequency may reflect the pet's emotional state or health condition. Therefore, by using the target's range, radial velocity, and micro-Doppler spectrum characteristics as the target's motion and micro-motion characteristics, the movement state of the pet at the millimeter-wave radar level can be comprehensively characterized.
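Generating a range-Doppler map from the intermediate frequency signal can be sketched with two FFT passes. The sketch assumes the samples are arranged as a complex array of shape (chirps, samples per chirp), omits windowing and calibration, and uses a synthetic single-target signal; the array layout and function names are choices made for this example rather than details from the text.

```python
import numpy as np


def range_doppler_map(if_cube):
    """Range FFT over samples within a chirp, then Doppler FFT across chirps."""
    range_fft = np.fft.fft(if_cube, axis=1)
    doppler_fft = np.fft.fft(range_fft, axis=0)
    # Shift so that zero Doppler sits in the middle row.
    return np.abs(np.fft.fftshift(doppler_fft, axes=0))


def strongest_target(rd_map):
    """Return (range_bin, signed_doppler_bin) of the highest peak."""
    doppler_idx, range_idx = np.unravel_index(np.argmax(rd_map), rd_map.shape)
    return int(range_idx), int(doppler_idx) - rd_map.shape[0] // 2


# Synthetic target at range bin 10, Doppler bin 3 (32 chirps x 64 samples).
num_chirps, num_samples = 32, 64
m = np.arange(num_chirps)[:, None]
n = np.arange(num_samples)[None, :]
if_cube = np.exp(2j * np.pi * (10 * n / num_samples + 3 * m / num_chirps))
peak = strongest_target(range_doppler_map(if_cube))
```

Converting bin indices to metres and metres per second depends on the chirp bandwidth, chirp period, and carrier frequency of the specific radar, which the text does not give.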

[0068] Aligned IMU (Inertial Measurement Unit) and pressure data record the mechanical information generated by the pet during movement. IMU data includes parameters such as acceleration and angular velocity, reflecting the pet's acceleration and rotational state; pressure data records the pressure distribution between the pet and the contact surface. To extract mechanical behavior features from this data, time-domain and frequency-domain analysis methods are employed. In the time-domain analysis, features such as weight change, pressure center trajectory, and dwell time are extracted by processing the IMU and pressure data. Weight change reflects the dynamic changes in the pet's body weight during movement, the pressure center trajectory describes the change in the center position of the pressure distribution over time, and dwell time indicates the length of time the pet remains at a certain position. In the frequency-domain analysis, a Fourier transform is performed on the pressure data to extract the pressure distribution asymmetry index. The pressure distribution asymmetry index reflects the degree of difference in pressure distribution on both sides of the pet's body. For example, when a pet exhibits abnormal behavior such as limping, the pressure distribution on both sides of the body changes significantly, leading to an increase in the pressure distribution asymmetry index. Therefore, by using weight change, pressure center trajectory, dwell time, and pressure distribution asymmetry index as mechanical behavior characteristics, the behavior of pets at the mechanical level can be accurately characterized.
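The pressure-derived quantities can be sketched for a single frame of an array-type pressure sensor. The formulas are assumptions: the pressure center is taken as the pressure-weighted centroid, dwell time as the time the total weight exceeds a small threshold, and the asymmetry index as a simple spatial left/right ratio, which is a simplified stand-in for the frequency-domain index described in the text.

```python
import numpy as np


def pressure_center(grid):
    """Pressure-weighted centroid (row, col) of one sensor frame, or None if unloaded."""
    total = grid.sum()
    if total <= 0:
        return None
    rows, cols = np.indices(grid.shape)
    return float((rows * grid).sum() / total), float((cols * grid).sum() / total)


def asymmetry_index(grid):
    """Assumed definition: |left - right| / (left + right) over the two halves."""
    half = grid.shape[1] // 2
    left, right = grid[:, :half].sum(), grid[:, -half:].sum()
    total = left + right
    return 0.0 if total == 0 else float(abs(left - right) / total)


def dwell_time_s(total_weight, sample_rate_hz, min_weight=0.5):
    """Seconds during which the total weight exceeds a presence threshold."""
    loaded = np.count_nonzero(np.asarray(total_weight) > min_weight)
    return float(loaded / sample_rate_hz)


frame = np.array([[0.0, 0.0, 0.0, 0.0],
                  [3.0, 0.0, 0.0, 1.0]])
center = pressure_center(frame)
asym = asymmetry_index(frame)
dwell = dwell_time_s([0.0, 4.0, 4.0, 4.0, 0.0], sample_rate_hz=5)
```

Tracking `center` frame by frame yields the pressure center trajectory, and a sustained rise in `asym` is the kind of change the text links to limping.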

[0069] Aligned audio waveform data records various sounds emitted by pets. To extract sound event features related to pet behavior from this data, a pre-trained audio event detection model is used. This model, trained on a large amount of audio data, is capable of recognizing different types of sound events. After the audio waveform data is input into the pre-trained audio event detection model, the model automatically identifies and outputs sound event labels and their timestamps related to specific behaviors. The sound event labels clearly identify the type of sound emitted by the pet, such as barking, breathing, or teeth grinding; the timestamps record the specific time the sound event occurred. These sound event labels and timestamps can provide important clues for analyzing pet behavior and health status. For example, a pet's painful cries may indicate physical discomfort. Therefore, using sound event labels and their timestamps related to specific behaviors as sound event features can effectively characterize the pet's behavioral state at the sound level.

[0070] Furthermore, the step of inputting the high-dimensional joint feature vector into the attention-based Transformer encoder model for deep feature-level fusion, and outputting a joint feature vector representing the pet's instantaneous state and behavior in the current scene, specifically includes: The high-dimensional joint feature vector is input into the multi-head self-attention module of the Transformer encoder model for computation. The dependency relationships and importance weights between features at different time steps and in different modalities are adaptively learned through the attention mechanism to obtain the self-attention output features. A residual connection is applied between the self-attention output features and the high-dimensional joint feature input tensor, and the normalized first intermediate representation features are then obtained through layer normalization. The first intermediate representation features are input into the feedforward neural network of the Transformer encoder model for nonlinear transformation to extract deeper cross-modal interaction information and obtain the second intermediate representation features after nonlinear mapping. A residual connection is applied between the second intermediate representation features and the first intermediate representation features, and the result is then processed by layer normalization and a pooling layer to obtain a joint feature vector that represents the pet's instantaneous state and behavior in the current scene.

[0071] In this embodiment, the high-dimensional joint feature vector obtained after the preceding steps is input into the multi-head self-attention module of the Transformer encoder model. The multi-head self-attention module is one of the core components of the Transformer model, possessing powerful feature association learning capabilities. Within this module, leveraging the attention mechanism, the model adaptively learns the dependencies and importance weights between features at different time steps and across different modalities. Specifically, the model analyzes the degree of association between each feature in the high-dimensional joint feature vector at different times and modalities, assigning corresponding weights to each feature based on this association. Through this calculation, the model can highlight the features that are more critical to representing the pet's state and behavior, thus obtaining the self-attention output features. This process automatically focuses on important information while ignoring some irrelevant details, enabling the model to more accurately capture the key features of the pet's state and behavior.

[0072] After obtaining the self-attention output features, a residual connection is made between them and the tensor corresponding to the initially input high-dimensional joint feature vector. Residual connections are an effective technique for mitigating the vanishing gradient problem during deep neural network training. By adding the self-attention output features to the original input features, the model retains the original information and avoids excessive information loss as features pass through multiple network layers. After the residual connection, layer normalization is performed on the resulting features. Layer normalization standardizes the input data at each layer, making the data distribution at each layer more stable, which helps accelerate training convergence and improves the model's performance and stability. After this series of operations, the normalized first intermediate representation features are obtained, which integrate the original features and the feature information learned through self-attention, laying the foundation for further feature processing.

[0073] Next, the first intermediate representation features are input into the feedforward neural network of the Transformer encoder model. The feedforward neural network consists of multiple fully connected layers and is capable of performing nonlinear transformations on the input features. When fusing pet state and behavioral features, complex interactions exist between features from different modalities, and simple linear transformations are insufficient to fully uncover this latent cross-modal information. Through its nonlinear activation function, the feedforward neural network can extract deeper cross-modal interaction information. For example, pet posture features in the visual modality may be related to pressure distribution features in the mechanical modality; the feedforward neural network can discover this relationship through nonlinear transformations and integrate it into the feature representation. After the nonlinear mapping by the feedforward neural network, the second intermediate representation features are obtained, which contain richer and more abstract pet state and behavioral information.

[0074] After obtaining the second intermediate representation features, a residual connection is made between them and the first intermediate representation features. This step also aims to preserve previously learned feature information and prevent information loss in deep networks. After the residual connection, layer normalization is performed again to further stabilize the feature distribution and improve the model's training performance. Finally, the layer-normalized features are input into the pooling layer. The pooling layer reduces the dimensionality of the features, decreasing computational cost while retaining the main information. Through pooling, the model removes redundant information and keeps the most representative features, thus obtaining a joint feature vector representing the pet's instantaneous state and behavior in the current scene. This joint feature vector integrates feature information from different modalities and time steps, reflecting the pet's state and behavior in the current scene and supporting subsequent abnormal behavior identification and health risk warning.
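The encoder steps described above (multi-head self-attention, residual connection, layer normalization, feedforward network, second residual connection and layer normalization, pooling) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the weights are random rather than trained, biases and learned normalization gains are omitted, ReLU is assumed as the activation, mean pooling over time steps is assumed as the pooling layer, and all names and dimensions are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def self_attention(x, p, num_heads):
    """Multi-head scaled dot-product self-attention over the time steps in x."""
    d_head = len(x[0]) // num_heads
    q, k, v = matmul(x, p["wq"]), matmul(x, p["wk"]), matmul(x, p["wv"])
    heads = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        qh = [r[lo:hi] for r in q]
        kh = [r[lo:hi] for r in k]
        vh = [r[lo:hi] for r in v]
        scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_head) for kj in kh]
                  for qi in qh]
        weights = [softmax(r) for r in scores]  # importance weights per time step
        heads.append(matmul(weights, vh))
    concat = [[val for head in heads for val in head[t]] for t in range(len(x))]
    return matmul(concat, p["wo"])


def encoder_layer(x, p, num_heads=2):
    attn = self_attention(x, p, num_heads)
    first = layer_norm(add(x, attn))                 # first intermediate representation
    hidden = [[max(0.0, v) for v in row] for row in matmul(first, p["w1"])]  # ReLU (assumed)
    second = matmul(hidden, p["w2"])                 # second intermediate representation
    normed = layer_norm(add(first, second))
    steps = len(normed)
    return [sum(col) / steps for col in zip(*normed)]  # mean pooling over time (assumed)


def rand_matrix(rng, rows, cols):
    scale = 1.0 / math.sqrt(rows)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_ff, steps = 4, 8, 3  # toy sizes, chosen only for the example
params = {name: rand_matrix(rng, d_model, d_model) for name in ("wq", "wk", "wv", "wo")}
params["w1"] = rand_matrix(rng, d_model, d_ff)
params["w2"] = rand_matrix(rng, d_ff, d_model)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(steps)]
joint_vector = encoder_layer(tokens, params)
```

Here `tokens` stands in for the high-dimensional joint feature input (one row per time step), and `joint_vector` is the pooled output with one value per model dimension.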

[0075] In some embodiments, step S103 above, which involves establishing a personalized behavior baseline model for each pet and comparing the joint feature vector with the personalized behavior baseline model to identify abnormal behavior patterns that deviate from the norm, specifically includes: Collect multiple sets of historical joint feature vectors of the target pet in a healthy state to construct a training dataset to represent the individual's normal behavior; The training dataset is input into the variational autoencoder for unsupervised training. By minimizing the joint loss function consisting of reconstruction loss and KL divergence, the probability distribution of normal behavior in the latent space is learned, and a personalized behavior baseline model is obtained. The statistical threshold of the error is calculated based on the reconstruction error distribution of the training dataset in the trained personalized behavior baseline model. The joint feature vector output in real time is input into the personalized behavior baseline model to calculate the real-time reconstruction error; The real-time reconstruction error is compared with a statistical threshold. When the real-time reconstruction error exceeds the statistical threshold multiple times, the current behavior pattern is identified and marked as an abnormal behavior pattern that deviates from the normal behavior pattern of the target pet individual.

[0076] In this embodiment, within the pet's daily activity area, a multi-sensor collaborative sensing network deployed in the early stages continuously collects various behavioral data of the pet during its healthy period. This data is then processed into joint feature vectors using the aforementioned method. These joint feature vectors encompass various behavioral information of the pet at different times and in different scenarios, such as the pet's posture, activity frequency, and location. Multiple sets of such joint feature vectors are then organized and labeled to construct a training dataset specifically designed to characterize the normal behavior of this individual pet. This training dataset records various behavioral characteristic patterns of the target pet in its healthy state.

[0077] The constructed training dataset is input into the variational autoencoder for unsupervised training. A variational autoencoder is a powerful deep learning model that automatically learns latent features and distribution patterns in data. During training, the model continuously adjusts its parameters, optimizing by minimizing a joint loss function consisting of reconstruction loss and KL divergence. Reconstruction loss measures the difference between the encoded and decoded input data and the original input data, while KL divergence measures the difference between the latent space distribution learned by the model and the prior distribution. Through this optimization of the joint loss function, the model gradually learns the probability distribution of the target pet's normal behavior in the latent space, ultimately obtaining a personalized behavioral baseline model specifically for that pet. This model can determine whether the behavior conforms to normal behavior based on the input feature vector.

[0078] The statistical threshold for the error is calculated based on the distribution of the reconstruction errors that the trained personalized behavior baseline model produces on the training dataset. The trained model encodes and decodes each set of joint feature vectors in the training dataset, resulting in corresponding reconstruction errors. These reconstruction errors reflect the model's ability to reconstruct the pet's healthy behavior characteristics; different feature vectors produce reconstruction errors of different magnitudes. By statistically analyzing these reconstruction errors, for example by calculating their mean and standard deviation, and combining this with a chosen statistical rule and empirical judgment, a suitable statistical threshold for the error is determined.

[0079] During daily pet behavior monitoring, the joint feature vector produced from the data of the collaborative sensing network is acquired in real time and input into the trained personalized behavior baseline model. The model encodes and decodes the input joint feature vector, which likewise yields a reconstruction error. This real-time reconstruction error reflects the degree of difference between the pet's current behavioral characteristics and the healthy behavioral characteristics learned by the model. A small real-time reconstruction error indicates that the pet's current behavior is similar to its normal behavior in a healthy state; conversely, a large real-time reconstruction error may suggest abnormal behavior.

[0080] The calculated real-time reconstruction error is compared with a previously determined statistical threshold. Because pet behavior can be somewhat random and subject to short-term fluctuations, to avoid misjudgment, abnormal behavior is not determined solely by a single instance of the real-time reconstruction error exceeding the threshold. Instead, the current behavioral pattern is identified and marked as an abnormal behavior pattern deviating from the target pet's normal behavior only when the real-time reconstruction error exceeds the statistical threshold multiple times consecutively. For example, three or five consecutive instances of the real-time reconstruction error exceeding the threshold can be set as the criteria for determining abnormality. This approach allows for more accurate capture of behavioral patterns that truly deviate from the pet's healthy normality, providing a reliable basis for subsequent health risk warnings.
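The consecutive-exceedance rule described above can be sketched as a simple streak counter. The function name, the default of three consecutive readings, and the sample error values are illustrative assumptions.

```python
def detect_abnormal(errors, threshold, required_consecutive=3):
    """Return the indices at which the reconstruction error has exceeded the
    threshold for at least `required_consecutive` readings in a row."""
    flagged = []
    streak = 0
    for index, error in enumerate(errors):
        # A reading at or below the threshold resets the streak.
        streak = streak + 1 if error > threshold else 0
        if streak >= required_consecutive:
            flagged.append(index)
    return flagged


# Made-up real-time reconstruction errors; one isolated spike at index 1
# is ignored, while the run at indices 3 to 5 triggers a flag.
errors = [0.2, 0.9, 0.3, 0.8, 0.9, 1.1, 0.4]
flagged = detect_abnormal(errors, threshold=0.7)
```

Resetting the streak on any in-range reading is what suppresses short-term fluctuations; a larger `required_consecutive` trades detection latency for fewer false alarms.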

[0081] Furthermore, the step of inputting the training dataset into the variational autoencoder for unsupervised training, and learning the probability distribution of normal behavior in the latent space by minimizing the joint loss function composed of reconstruction loss and KL divergence, to obtain a personalized behavior baseline model, specifically includes: Each joint feature vector in the training dataset is used as an input sample and sequentially fed into the encoder network of the variational autoencoder for forward propagation. Through multiple nonlinear transformations, the input sample is mapped to the mean vector and log-variance vector of the latent variable distribution. Based on the mean vector and the log-variance vector, the reparameterization technique is used to perform random sampling to obtain low-dimensional latent variables. The low-dimensional latent variables are input into the decoder network of the variational autoencoder for reconstruction, and through multiple nonlinear transformations, a reconstructed feature vector with the same dimension as the input sample is output. The mean squared error between the input sample and the reconstructed feature vector is calculated and used as the reconstruction loss. The KL divergence between the Gaussian distribution defined by the mean vector and the log-variance vector and the standard normal distribution is calculated and used as the regularization loss. The reconstruction loss is weighted and summed with the KL divergence to form a joint loss function. The network parameters of the encoder network and decoder network are iteratively updated through the backpropagation algorithm to minimize the value of the joint loss function. The trained encoder network and decoder network constitute a personalized behavior baseline model. The encoder network is used to map the input fused feature vector to the latent space, and the decoder network is used to reconstruct the feature vector corresponding to the normal behavior of the target pet individual from the latent space.

[0082] In this embodiment, each joint feature vector in the training dataset is treated as an independent input sample. These input samples are sequentially fed into the encoder network of a variational autoencoder. Once the input samples enter the encoder network, they undergo a series of complex calculations and transformations. During this process, the network gradually extracts key information from the samples and maps it to the mean vector and log-variance vector of the latent variable distribution. These two vectors contain the latent feature information of the input samples in the latent space; the mean vector represents the central position of the latent variable distribution, while the log-variance vector reflects the dispersion of the latent variable distribution.

[0083] After obtaining the mean vector and log-variance vector, random sampling using the reparameterization technique is needed to obtain the low-dimensional latent variable. The reparameterization technique is a clever method that allows random sampling while preserving gradient propagation. Specifically, based on the Gaussian distribution defined by the mean and log-variance vectors, a random noise is introduced and combined with the mean and log-variance vectors in a specific way to obtain the low-dimensional latent variable. This low-dimensional latent variable is a representation of the input sample in the latent space; it contains the core feature information of the input sample, and because it is obtained through random sampling, it also possesses a certain degree of randomness and diversity.
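The "specific way" in which the noise is combined is not spelled out in the text; the standard reparameterization for a diagonal Gaussian, assumed here, is z = mu + exp(0.5 * log_var) * eps with eps drawn from a standard normal distribution. The function name and example values below are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, where sigma = exp(0.5 * log_var), eps ~ N(0, 1).

    The randomness lives entirely in eps, so z remains a differentiable
    function of mu and log_var.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


rng = random.Random(42)

# With a very negative log-variance, sigma is close to zero and z is close to mu.
z_tight = reparameterize([1.0, -2.0], [-100.0, -100.0], rng)

# With log_var = 0 (sigma = 1), samples scatter around mu = 1.0.
samples = [reparameterize([1.0], [0.0], rng)[0] for _ in range(5000)]
sample_mean = sum(samples) / len(samples)
```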

[0084] The low-dimensional latent variables obtained through sampling are input into the decoder network of the variational autoencoder. After entering the decoder network, the low-dimensional latent variables undergo a series of inverse calculations and transformations to gradually reconstruct a reconstructed feature vector with the same dimension as the input sample. This reconstructed feature vector is the feature representation of the input sample as understood by the decoder network based on the latent variables, and it attempts to be as close as possible to the original input sample.

[0085] To measure the difference between the feature vector reconstructed by the decoder network and the original input sample, the mean squared error (MSE) between them is calculated and used as the reconstruction loss. MSE is a commonly used metric for the difference between two vectors; it is the average of the squared differences between corresponding elements. A smaller reconstruction loss indicates that the feature vector reconstructed by the decoder network is closer to the original input sample, meaning the model has a stronger ability to extract and reconstruct features from the input sample.

[0086] In addition to the reconstruction loss, the KL divergence between the Gaussian distribution defined by the mean vector and the log-variance vector and the standard normal distribution needs to be calculated and used as the regularization loss. KL divergence is a metric used to measure the difference between two probability distributions; it reflects the degree of deviation between the Gaussian distribution defined by the mean vector and the log-variance vector and the standard normal distribution. In variational autoencoders, introducing KL divergence as a regularization term can prevent the model from overfitting the training data and encourage the model to learn a more generalizable latent space distribution.

[0087] The calculated reconstruction loss is weighted and summed with the KL divergence to form the joint loss function. This joint loss function takes into account both the model's reconstruction capability and the regularity of the latent space distribution. Then, using the backpropagation algorithm, the gradient of the joint loss function with respect to each network parameter is calculated by propagating backward through the network. Based on this gradient information, the network parameters of the encoder and decoder networks are iteratively updated using an optimization algorithm (such as stochastic gradient descent). During the iteration process, the network parameters are continuously adjusted so that the value of the joint loss function gradually decreases until it reaches a relatively stable minimum. Through this optimization process, the model learns network parameters that allow it to better encode and decode input samples.
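The two loss terms and their weighted sum can be written out as below. The closed-form KL divergence between a diagonal Gaussian N(mu, exp(log_var)) and the standard normal distribution is -0.5 * sum(1 + log_var - mu^2 - exp(log_var)). The weighting is not specified in the text, so a single hypothetical weight `beta` on the KL term is assumed here.

```python
import math


def mse(x, x_hat):
    """Mean of the squared element-wise differences (reconstruction loss)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def joint_loss(x, x_hat, mu, log_var, beta=1.0):
    """Weighted sum of reconstruction loss and KL regularization (beta is assumed)."""
    return mse(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)


# Toy values: MSE is (0 + 4) / 2 = 2.0 and KL is 0.5, so the loss is 2.0 + 0.5 * 0.5.
loss = joint_loss([1.0, 2.0], [1.0, 4.0], mu=[1.0], log_var=[0.0], beta=0.5)
```

In a real training loop these quantities would be computed by an automatic-differentiation framework so that gradients can flow back to the encoder and decoder parameters.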

[0088] Once training is complete, a personalized behavior baseline model is constructed using the trained encoder and decoder networks. This model has specific functions: the encoder network maps the input fused feature vector (i.e., the joint feature vector acquired in real time) to the latent space, extracting its core feature information; the decoder network reconstructs the feature vector corresponding to the target pet's normal behavior from the latent space. In this way, the personalized behavior baseline model can learn the probability distribution of the target pet's normal behavior in a healthy state in the latent space, providing an accurate reference standard for subsequent identification of abnormal behavior patterns.

[0089] Furthermore, the calculation of the statistical threshold of the error based on the reconstruction error distribution in the trained personalized behavior baseline model using the training dataset specifically includes: Each joint feature vector of the training dataset is sequentially input into the trained personalized behavior baseline model. The reconstruction error value between each joint feature vector and its corresponding reconstructed feature vector is calculated through forward propagation to obtain an error set containing the reconstruction errors of all normal samples. Statistical analysis is performed on the error set: the arithmetic mean of all reconstruction error values is calculated as the mean statistic, and the standard deviation of all reconstruction error values relative to the mean statistic is calculated as the standard deviation statistic. Based on the mean statistic and the standard deviation statistic, combined with the preset confidence interval multiple, a statistical threshold for the reconstruction error used to distinguish between normal and abnormal conditions is calculated.

[0090] In this embodiment, each joint feature vector in the training dataset is sequentially input into the already trained personalized behavior baseline model. Within the personalized behavior baseline model, the joint feature vectors undergo a forward propagation process. The encoder portion of the model encodes the input joint feature vectors, mapping them to the latent space and extracting key feature information. Then, the decoder portion uses the information in the latent space to reconstruct a reconstructed feature vector with the same dimension as the input joint feature vector. Subsequently, the reconstruction error value between each input joint feature vector and its corresponding reconstructed feature vector is calculated. This reconstruction error value reflects the model's reconstruction effect on the joint feature vectors; the smaller the error value, the more accurate the model's reconstruction of normal behavior. This process is repeated for all joint feature vectors in the training dataset, ultimately obtaining an error set containing the reconstruction errors of all normal samples. This error set records the errors generated when the model reconstructs behavioral features of the target pet in a healthy state.

[0091] After obtaining the error set, in-depth statistical analysis is required. First, the arithmetic mean of all reconstruction error values is calculated and used as the mean statistic. This mean statistic represents the average level of error when the model reconstructs behavioral characteristics of the target pet in a healthy state, reflecting the central tendency of errors under normal conditions. Then, the standard deviation statistic of all reconstruction error values relative to the mean statistic is calculated. The standard deviation statistic measures the dispersion of reconstruction error values relative to the mean, reflecting the fluctuation of errors. By calculating the mean and standard deviation statistics, a more comprehensive understanding of the reconstruction error distribution of the target pet in a healthy state can be obtained, providing important reference for subsequently determining statistical thresholds.

[0092] Based on the previously calculated mean and standard deviation statistics, and combined with a preset confidence interval multiple, a statistical threshold for the reconstruction error used to distinguish between normal and abnormal behavior is calculated. The confidence interval multiple is a parameter set according to actual needs and experience, reflecting the tolerance for the error range. For example, if a higher confidence level is desired before behavior is judged abnormal, a larger confidence interval multiple can be set. Specifically, an upper limit is obtained by adding the product of the confidence interval multiple and the standard deviation statistic to the mean statistic; a lower limit can likewise be obtained by subtracting that product from the mean statistic (in practical applications, the upper limit is usually more important because the goal is to identify abnormal behavior that exceeds the normal range). This upper limit is the calculated statistical threshold. When the subsequently calculated real-time reconstruction error exceeds this statistical threshold, it may indicate that the pet's behavior is abnormal. The statistical threshold determined in this way takes into account the error distribution of the target pet in a healthy state, enabling a more accurate distinction between normal and abnormal behavior and providing a reliable judgment standard for pet health monitoring.
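The upper-limit calculation described above amounts to threshold = mean + k * standard deviation, where k is the confidence interval multiple. A minimal sketch follows; the function name, the choice of the population standard deviation, and the sample errors are assumptions, since the text does not say whether the population or sample form is used.

```python
import statistics


def reconstruction_threshold(errors, multiple=3.0):
    """Upper limit: mean statistic plus `multiple` times the standard deviation statistic."""
    mean = statistics.fmean(errors)
    std = statistics.pstdev(errors, mu=mean)  # population form (assumed)
    return mean + multiple * std


# Made-up reconstruction errors of healthy-state samples: mean 3, std sqrt(2).
training_errors = [1.0, 2.0, 3.0, 4.0, 5.0]
threshold = reconstruction_threshold(training_errors, multiple=2.0)
```

A larger `multiple` widens the band treated as normal, which lowers the false alarm rate at the cost of sensitivity.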

[0093] In some embodiments, in step S104 above, the process of matching the identified abnormal behavior patterns with a preset abnormal pattern classifier to perform arthritis risk warning analysis, urinary disease risk warning analysis, and stress or pain warning analysis, and outputting proactive warning information for specific health risks, specifically includes: Based on the abnormal behavior patterns, arthritis risk characteristics, urinary tract disease risk characteristics, and stress or pain characteristics are determined. The arthritis risk characteristics include the hind limb key point coordinate sequence, the landing impact peak, the center of pressure trajectory, and the pressure distribution asymmetry index; the urinary tract disease risk characteristics include the total weight change sequence, effective excretion event markers, litter box entry and exit events, litter box posture, and specific sound event labels; and the stress or pain characteristics include the difference between local temperature and average body temperature, hiding time, grooming behavior frequency, degree of posture curling, and vocal frequency and tone characteristics. The arthritis risk characteristics, urinary tract disease risk characteristics, and stress or pain characteristics are respectively input into the corresponding analysis modules of the preset abnormal pattern classifier, and arthritis risk warning information, urinary tract disease risk warning information, and stress or pain warning information are output in parallel.

[0094] In this embodiment, after identifying abnormal behavioral patterns, it is necessary to determine the associated arthritis risk characteristics, urinary tract disease risk characteristics, and stress or pain characteristics based on these patterns.

[0095] Regarding the risk characteristics of arthritis, considering that the movement and force distribution of the hind limbs will change when a pet has arthritis, the coordinate sequence of key points of the hind limbs is selected to record the positional changes of each key point of the hind limb during movement. This can intuitively reflect the movement trajectory and posture of the hind limbs. The peak impact on landing reflects the magnitude of the impact force that the pet experiences when the hind limbs land while walking or running. Arthritis may cause abnormal impact force when the pet lands. The center of pressure trajectory reflects the movement path of the pet's center of gravity during movement, while the pressure distribution asymmetry index can measure the degree of balance of pressure on the pet's limbs. Both of these are very helpful in judging the uneven force distribution on the limbs caused by arthritis.

[0096] Regarding urinary tract disease risk characteristics, the total weight change sequence reflects fluctuations in the pet's overall body weight; urinary tract diseases may cause changes in the pet's weight due to, for example, abnormal water metabolism. Effective excretion event markers indicate whether the pet has completed normal excretion behavior; urinary tract diseases may affect the pet's excretion function. Litter box entry and exit events record the number of times and the times at which the pet enters and exits the litter box, which helps to determine whether the pet's excretion habits have changed. Litter box posture describes the pet's posture during excretion; some urinary tract diseases may cause abnormal excretion postures. Specific sound event labels mark the particular sounds made by the pet during excretion; some urinary tract diseases may be accompanied by characteristic sounds.

[0097] Among the characteristics of stress or pain, the difference between local temperature and average body temperature reflects the temperature changes in a pet's local area. When a pet is stressed or in pain, local blood circulation may change, leading to abnormal temperature. Hiding time reflects how long a pet seeks a hiding place when faced with external stimuli or physical discomfort. Stress or pain may make a pet more inclined to hide. Grooming behavior frequency records the number of times a pet grooms its fur. Stress or pain may affect a pet's normal grooming behavior. Postural curling describes the degree to which a pet curls its body. When feeling uncomfortable, a pet may curl its body. Vocal frequency and tone characteristics reflect a pet's vocalization. When stressed or in pain, a pet's vocalizations may change in frequency and tone.

[0098] After identifying the aforementioned risk characteristics, the arthritis risk characteristics are input into a pre-defined abnormal pattern classifier module specifically designed for arthritis risk analysis. This module analyzes and judges the input arthritis risk characteristics based on pre-set rules and models. For example, it compares the hind limb key point coordinate sequences with normal conditions and analyzes whether the impact peak exceeds the normal range. Ultimately, it outputs arthritis risk warning information in parallel, informing pet owners that their pet may have arthritis and the approximate degree of risk. Simultaneously, the urinary tract disease risk characteristics are input into the corresponding urinary tract disease risk analysis module within the abnormal pattern classifier. This module comprehensively analyzes features such as total weight change sequences and effective excretion event markers to determine whether the pet exhibits symptoms related to urinary tract diseases. It then outputs urinary tract disease risk warning information, allowing pet owners to understand potential problems with their pet's urinary system.

[0099] In addition, stress or pain characteristics are input into the module responsible for stress or pain analysis within the abnormal pattern classifier. This module assesses whether the pet is under stress or experiencing pain based on characteristics such as the difference between local and average body temperature, and the duration of hiding, and outputs stress or pain warning information to remind pet owners to pay attention to their pet's psychological and physical condition. In this way, comprehensive and accurate early warning of pet health risks can be provided, offering strong support for pet health management.

[0100] Furthermore, the steps of inputting arthritis risk characteristics, urinary tract disease risk characteristics, and stress or pain characteristics into the corresponding analysis modules of a preset abnormal pattern classifier, and outputting arthritis risk warning information, urinary tract disease risk warning information, and stress or pain warning information in parallel, specifically include: The arthritis risk features are input into a pre-trained arthritis risk classifier for forward computation, and the output includes arthritis risk warning information including arthritis risk probability value, jump height reduction ratio, landing posture asymmetry index and hindlimb weight-bearing difference description. The risk features of urinary tract diseases are input into a pre-trained urinary tract disease risk classifier for time-series feature modeling, and the output includes urinary tract disease risk warning information including abnormal toilet behavior pattern scores, invalid accesses per unit time, longest squatting time per session, and pain posture confidence description. Stress or pain features are constructed into a spatiotemporal graph structure containing node features and edge connections in chronological order. This data is then input into a pre-trained graph neural network model for graph-level classification calculation. The output includes stress or pain probability values, location of local high-temperature areas, increase in frequency of abnormal grooming behavior, and percentage decrease in activity level as stress or pain warning information.

[0101] In this embodiment, arthritis risk features are input into a pre-trained arthritis risk classifier, which performs forward computation on these features. Forward computation means that the classifier applies a series of calculations to the input features based on previously learned patterns and rules. For example, for the hind limb key point coordinate sequence, the classifier analyzes the positional changes at different time points to determine whether there are any abnormal movement trajectories; for the landing impact peak, it compares the value with the normal range. After these calculations, the classifier outputs arthritis risk warning information containing several key items. Among them, the arthritis risk probability value indicates the likelihood of the pet developing arthritis, with a higher value indicating a greater risk; the jump height reduction ratio reflects how much the pet's jump height has decreased, and an abnormally large ratio may indicate problems with the hind limb joints; the landing posture asymmetry index measures the difference in posture between the left and right hind limbs when the pet lands, and a large difference may indicate uneven stress on the joints; the hind limb weight-bearing difference description details the different loads the pet's hind limbs bear when standing or moving, helping to further determine the degree to which arthritis affects the hind limbs. With this information, pet owners can understand their pet's arthritis risk status.

[0102] For urinary tract disease risk features, these are input into a pre-trained urinary tract disease risk classifier. This classifier processes the input features using temporal feature modeling. Temporal feature modeling takes into account how these features change over time, such as fluctuations in total weight at different time points and the temporal distribution of litter box entry and exit events. By analyzing these temporal features, the classifier can identify abnormal patterns in the pet's urinary system. Finally, the classifier outputs a urinary tract disease risk warning, including an abnormality score for toilet behavior patterns, the number of invalid visits per unit time, the longest single squatting time, and a confidence level for painful postures. The toilet behavior pattern abnormality score reflects the degree to which a pet's excretion behavior deviates from the normal pattern; a higher score indicates a more pronounced abnormality. The number of invalid visits per unit time indicates the number of times a pet enters the litter box but fails to excrete effectively within a given period; an excessive number of invalid visits may suggest a urinary system problem. The longest single squatting time records the longest time a pet has squatted in the litter box; an excessively long squatting time may indicate difficulty excreting. The pain posture confidence score indicates how likely it is that the pet exhibits a painful posture during excretion; a higher confidence score indicates a greater likelihood of a painful posture. This information can help pet owners detect potential urinary system diseases in their pets in a timely manner.

[0103] When processing stress or pain features, these features are first constructed into a spatiotemporal graph structure containing node features and edge connections in chronological order. Node features can be understood as various stress or pain-related characteristics of the pet at each time point, such as local temperature and grooming behavior frequency; edge connections represent the correlation between features at different time points. The constructed spatiotemporal graph structure data is input into a pre-trained graph neural network model, which performs graph-level classification calculations on the nodes and edges in the graph. It analyzes the interrelationships between node features and the structural features of the entire graph to determine whether the pet is experiencing stress or pain. Finally, the model outputs stress or pain warning information including stress or pain probability values, the location of local high-temperature areas, the increase in the frequency of abnormal grooming behavior, and the percentage decrease in activity level. The stress or pain probability value indicates the likelihood that the pet is under stress or in pain; the location of the localized high-temperature area indicates the specific location of abnormally elevated body temperature; the increase in the frequency of abnormal grooming reflects the extent to which the pet's grooming frequency has increased compared to normal, and excessive abnormal grooming may be related to stress or pain; the percentage decrease in activity level indicates the proportion of the pet's activity level reduced relative to normal, and a significant decrease in activity level may be a sign of discomfort in the pet. With this information, pet owners can take timely measures to alleviate their pet's stress or pain.

[0104] Furthermore, the step of inputting the arthritis risk features into a pre-trained arthritis risk classifier for forward computation, and outputting arthritis risk warning information including the arthritis risk probability value, jump height reduction ratio, landing posture asymmetry index, and hind limb weight-bearing difference description, specifically includes: The arthritis risk features are input into the pre-trained arthritis risk classifier for forward computation. Through the nonlinear mapping of a multi-layer fully connected network, the arthritis risk probability value, that is, the probability that the current behavior pattern belongs to the arthritis risk category, is output. When the arthritis risk probability value exceeds the preset arthritis risk probability threshold, the jump height reduction ratio, landing posture asymmetry index, and hind limb weight-bearing difference coefficient are calculated based on the arthritis risk features. Combined with the normal range reference values stored in the personalized behavioral baseline model, the arthritis risk warning information is generated.

[0105] In this embodiment, the acquired arthritis risk features, such as the hind limb keypoint coordinate sequence, impact peak, pressure center trajectory, and pressure distribution asymmetry index, are input into a pre-trained arthritis risk classifier. This classifier internally constructs a multi-layer fully connected network. When the arthritis risk feature data enters the network, it undergoes a non-linear mapping process through these multi-layer fully connected networks. Each layer of the fully connected network performs specific transformations and processing on the input data, introducing non-linear factors through activation functions, enabling the network to learn complex patterns and relationships within the data. After multiple layers of such processing, the network ultimately outputs a numerical value representing the probability that the current input behavior pattern belongs to the arthritis risk category. This probability value typically ranges from 0 to 1; the closer the value is to 1, the greater the likelihood that the pet's current behavior pattern belongs to the arthritis risk category; the closer the value is to 0, the lower the likelihood.
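The forward computation described above can be sketched as follows. This is a minimal illustration only: the layer sizes, the toy weights, the ReLU hidden activation, and the sigmoid output are assumptions chosen for the example, and the function name `arthritis_risk_probability` is hypothetical. The patent text specifies only a multi-layer fully connected network with activation functions and an output between 0 and 1.

```python
import math


def relu(values):
    """Element-wise ReLU activation (assumed; the text only says 'activation functions')."""
    return [max(0.0, v) for v in values]


def sigmoid(x):
    """Squash a logit into the (0, 1) range so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(weights, bias, inputs):
    """One fully connected layer: each row of weights yields one output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def arthritis_risk_probability(features, layers):
    """Forward pass: ReLU on hidden layers, sigmoid on the single output unit."""
    hidden = list(features)
    for weights, bias in layers[:-1]:
        hidden = relu(dense(weights, bias, hidden))
    weights, bias = layers[-1]
    logit = dense(weights, bias, hidden)[0]
    return sigmoid(logit)


# Toy, untrained weights for a 3-feature input (illustrative values only).
toy_layers = [
    ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1]),  # hidden layer, 2 units
    ([[1.2, -0.7]], [-0.05]),                             # output layer, 1 unit
]
probability = arthritis_risk_probability([0.6, 0.2, 0.9], toy_layers)
print(round(probability, 4))
```

In a trained classifier the weights would come from supervised training on labelled behavior data; here they only demonstrate the shape of the computation.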

[0106] After obtaining the arthritis risk probability value, the system compares it with a preset arthritis risk probability threshold. This preset threshold, determined through extensive experiments and data analysis, serves as a critical value for judging whether a pet has an arthritis risk. If the arthritis risk probability value exceeds the preset threshold, it indicates a high probability that the pet has arthritis, and the system will then further calculate relevant indicators.

[0107] Specifically, the system calculates the percentage decrease in jump height based on the input arthritis risk characteristics. By comparing the pet's jump height data in its normal state and current state, the system analyzes the changes in jump height to determine the percentage decrease. For example, by recording the pet's average jump height under normal conditions over a period of time, and then recording the average jump height under current conditions, the ratio of the difference between the two to the normal average height is the percentage decrease in jump height.
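The ratio described in this paragraph is a simple relative difference of two averages. A minimal sketch, assuming jump heights are recorded in centimetres and that the function name and sample values are hypothetical:

```python
from statistics import mean


def jump_height_reduction_ratio(baseline_heights_cm, current_heights_cm):
    """(normal average - current average) / normal average."""
    baseline = mean(baseline_heights_cm)
    current = mean(current_heights_cm)
    if baseline <= 0:
        raise ValueError("baseline average jump height must be positive")
    return (baseline - current) / baseline


# Illustrative values: normal jumps average 80 cm, current jumps average 60 cm.
ratio = jump_height_reduction_ratio([80, 82, 78], [60, 62, 58])
print(ratio)
```

A positive ratio means the pet currently jumps lower than its own baseline; a value near zero means no measurable change.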

[0108] Simultaneously, the system calculates a landing posture asymmetry index. Using data such as the coordinate sequence of key hind limb points, it analyzes the postural differences between the left and right hind limbs upon landing, quantifying the degree of this difference through a specific algorithm to obtain the landing posture asymmetry index. This index directly reflects the force balance of the pet's hind limbs upon landing; a higher index indicates more pronounced asymmetry, potentially suggesting joint problems.
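The text does not disclose the specific algorithm used for the asymmetry index. One common way to quantify left-right difference is a normalized symmetry index, sketched below as an assumption rather than as the patented calculation. The inputs are hypothetical per-landing measurements for the left and right hind limb, for example a joint flexion angle derived from the key point coordinates.

```python
def landing_asymmetry_index(left_values, right_values):
    """Mean of |L - R| / (0.5 * (|L| + |R|)) over paired landings.

    Returns 0.0 for perfectly symmetric landings; larger values mean
    more pronounced left-right asymmetry.
    """
    if not left_values or len(left_values) != len(right_values):
        raise ValueError("need equally sized, non-empty left/right sequences")
    total = 0.0
    for left, right in zip(left_values, right_values):
        scale = 0.5 * (abs(left) + abs(right))
        # Skip the division when both measurements are zero.
        total += abs(left - right) / scale if scale > 0 else 0.0
    return total / len(left_values)


symmetric = landing_asymmetry_index([10.0, 10.0], [10.0, 10.0])
asymmetric = landing_asymmetry_index([12.0], [8.0])
print(symmetric, asymmetric)
```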

[0109] In addition, the system also calculates the hind limb weight-bearing difference coefficient. Combining data such as the pressure center trajectory and pressure distribution asymmetry index, it analyzes the differences in weight bearing on the pet's hind limbs during standing or movement, and calculates the hind limb weight-bearing difference coefficient. This coefficient reflects whether the force distribution on the pet's hind limb joints is uniform, and is of significant reference value for assessing the impact of arthritis on the hind limbs.

[0110] After calculating the above indicators, the system combines them with the normal range reference values stored in the personalized behavioral baseline model established for each pet. The personalized behavioral baseline model records the normal range of various behavioral indicators for a pet in a healthy state. By comparing and analyzing the calculated jump height reduction ratio, landing posture asymmetry index, and hind limb weight-bearing difference coefficient with the normal range reference values, the system can more accurately determine the specific risk of arthritis in pets. Finally, based on these analysis results, the system generates arthritis risk warning information that includes arthritis risk probability values, jump height reduction ratio, landing posture asymmetry index, and hind limb weight-bearing difference descriptions, and promptly provides feedback to the pet owner so that they can take appropriate measures.
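The gating and baseline comparison described in paragraphs [0106] to [0110] can be sketched as below. The threshold value, the indicator names, and the baseline ranges are all hypothetical placeholders; in the described system they would come from the preset threshold and the pet's personalized behavioral baseline model.

```python
def build_arthritis_warning(probability, indicators, baseline_ranges,
                            probability_threshold=0.7):
    """Return a warning dict when the probability exceeds the threshold, else None.

    indicators:       {name: current value}
    baseline_ranges:  {name: (low, high)} normal range for this individual pet
    """
    if probability <= probability_threshold:
        return None  # below the preset threshold: no further calculation
    out_of_range = [name for name, value in indicators.items()
                    if not (baseline_ranges[name][0] <= value <= baseline_ranges[name][1])]
    return {
        "arthritis_risk_probability": probability,
        "indicators": dict(indicators),
        "out_of_range": sorted(out_of_range),
    }


# Illustrative values only.
current = {
    "jump_height_reduction_ratio": 0.25,
    "landing_asymmetry_index": 0.10,
    "hindlimb_load_difference": 0.30,
}
normal = {
    "jump_height_reduction_ratio": (0.0, 0.10),
    "landing_asymmetry_index": (0.0, 0.15),
    "hindlimb_load_difference": (0.0, 0.12),
}
warning = build_arthritis_warning(0.82, current, normal)
print(warning["out_of_range"])
```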

[0111] Furthermore, the step of inputting urinary tract disease risk features into a pre-trained urinary tract disease risk classifier for temporal feature modeling, and outputting urinary tract disease risk warning information including abnormal toilet behavior pattern scores, invalid accesses per unit time, longest single squatting time, and pain posture confidence descriptions, specifically including: The risk features of urinary tract diseases are input into a pre-trained urinary tract disease risk classifier for temporal feature modeling. The information flow is controlled by forget gate, input gate and output gate. The hidden layer state sequence is extracted and the hidden state of the last time step or the pooling result of the hidden state of all time steps is input into the fully connected layer. The output is a toilet behavior pattern abnormality score used to characterize the degree of abnormality of the current toilet behavior pattern. Based on the risk characteristics of urinary tract diseases, determine the number of invalid visits, the longest single squatting duration, and the confidence score of painful posture for the target pet within a unit time window, and compare them logically with the rule thresholds stored in the preset urinary tract disease quantitative rule base to obtain the logical comparison results. When the abnormal score of toilet behavior pattern exceeds the preset abnormal score threshold for urinary tract diseases, and the logical comparison result shows that at least one rule in the urinary tract disease quantitative rule base is satisfied, a urinary tract disease risk warning information is generated.

[0112] In this embodiment, collected urinary tract disease risk features, such as total weight change sequences, effective excretion event markers, litter box entry and exit events, in-box posture, and specific sound event labels, are input into a pre-trained urinary tract disease risk classifier. This classifier employs an architecture with a special information flow control mechanism for temporal feature modeling, where the forget gate, input gate, and output gate play crucial roles. The forget gate determines which information to discard from previous states, the input gate controls the inflow of new information, and the output gate determines which information to output. Through the coordinated work of these three gates, the flow of information within the model can be precisely controlled, thereby effectively extracting the hidden layer state sequence. The hidden layer state sequence contains key information about urinary tract disease risk features at different time points. Then, the hidden state of the last time step or the pooling result of the hidden states of all time steps is input into a fully connected layer. The fully connected layer comprehensively analyzes and processes these inputs, ultimately outputting a numerical value, which is the toilet behavior pattern abnormality score used to characterize the degree of abnormality in the current toilet behavior pattern. The score is typically bounded within a preset range. The higher the score, the greater the deviation of the pet's current toilet behavior pattern from the normal state, and the higher the possibility of urinary tract disease.
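The forget, input, and output gates described here match the standard LSTM cell, so the following sketch uses the textbook LSTM update. It is an illustration under assumptions: the parameter layout, the helper `make_params`, the constant demonstration weights, and the sigmoid applied to the final score are not taken from the patent text.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _affine(weights, bias, vector):
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step on the concatenated [input, previous hidden state]."""
    z = list(x) + list(h_prev)
    forget = [_sigmoid(a) for a in _affine(params["Wf"], params["bf"], z)]
    inp = [_sigmoid(a) for a in _affine(params["Wi"], params["bi"], z)]
    out = [_sigmoid(a) for a in _affine(params["Wo"], params["bo"], z)]
    cand = [math.tanh(a) for a in _affine(params["Wg"], params["bg"], z)]
    # Forget gate scales the old cell state; input gate scales the candidate.
    cell = [f * c + i * g for f, c, i, g in zip(forget, c_prev, inp, cand)]
    hidden = [o * math.tanh(c) for o, c in zip(out, cell)]
    return hidden, cell


def toilet_abnormality_score(sequence, params, w_out, b_out, pooling="last"):
    """Run the sequence, pool hidden states, apply a fully connected output unit."""
    if not sequence:
        raise ValueError("sequence must contain at least one time step")
    size = len(params["bf"])
    hidden, cell = [0.0] * size, [0.0] * size
    states = []
    for x in sequence:
        hidden, cell = lstm_step(x, hidden, cell, params)
        states.append(hidden)
    if pooling == "last":
        summary = states[-1]                      # hidden state of the last step
    else:
        summary = [sum(col) / len(states) for col in zip(*states)]  # mean pooling
    return _sigmoid(sum(w * s for w, s in zip(w_out, summary)) + b_out)


def make_params(hidden_size, input_size, value=0.1):
    """Hypothetical helper: fills every gate weight with one constant for the demo."""
    width = input_size + hidden_size
    params = {}
    for gate in ("f", "i", "o", "g"):
        params["W" + gate] = [[value] * width for _ in range(hidden_size)]
        params["b" + gate] = [0.0] * hidden_size
    return params


# Two features per time step, e.g. weight change and an excretion marker.
demo_sequence = [[0.2, 1.0], [0.1, 0.0], [0.4, 0.0]]
score_last = toilet_abnormality_score(demo_sequence, make_params(2, 2), [0.5, 0.5], 0.0)
score_mean = toilet_abnormality_score(demo_sequence, make_params(2, 2), [0.5, 0.5], 0.0,
                                      pooling="mean")
print(round(score_last, 4), round(score_mean, 4))
```

The `pooling` argument mirrors the two options named in the text: the last hidden state, or a pooled result over all time steps (mean pooling is assumed here).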

[0113] While obtaining an abnormal score for toilet behavior patterns, the system further determines relevant indicators for the target pet within a unit time window based on the input urinary tract disease risk characteristics. Specifically, it counts the number of invalid visits by the target pet within the unit time window, i.e., the number of times the pet enters the litter box but fails to complete effective excretion; it records the longest single squatting duration, i.e., the longest time the pet squats in the litter box at any one time; and it assesses the pet's painful posture during excretion, deriving a painful posture confidence score, which reflects the likelihood of the pet exhibiting a painful posture. Then, these indicators are logically compared one by one with the rule thresholds stored in a pre-set urinary tract disease quantitative rule base. The urinary tract disease quantitative rule base is derived from extensive experiments and data analysis, and includes various indicator thresholds and judgment rules related to urinary tract diseases. Through logical comparison, it is possible to determine whether the target pet's various indicators exceed the normal range, thereby obtaining the logical comparison results.

[0114] After completing the above steps, the system will comprehensively assess the abnormal toilet behavior pattern score and the logical comparison results. If the abnormal toilet behavior pattern score exceeds a preset threshold for abnormal urinary tract diseases—a scientifically set threshold used to distinguish between normal and abnormal toilet behavior patterns—and the logical comparison results show that at least one rule in the urinary tract disease quantitative rule base is met, this indicates that the target pet has a high risk of urinary tract disease. At this point, the system will generate detailed urinary tract disease risk warning information based on specific information such as the abnormal toilet behavior pattern score, the number of invalid visits per unit time, the longest single squatting time, and the confidence level of painful postures. This warning information will clearly indicate the potential urinary tract disease risk in the pet, providing a basis for pet owners to take timely and appropriate measures to detect and treat urinary tract diseases as early as possible.
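The combined decision in paragraphs [0113] and [0114] requires both a high abnormality score and at least one satisfied rule. A minimal sketch follows; the rule names and all threshold values are hypothetical stand-ins for the contents of the urinary tract disease quantitative rule base.

```python
def urinary_warning(score, observed, rules, score_threshold=0.6):
    """Warn only if the score exceeds its threshold AND at least one rule fires.

    observed: {indicator: measured value within the unit time window}
    rules:    {indicator: rule threshold from the quantitative rule base}
    """
    triggered = sorted(name for name, limit in rules.items()
                       if observed[name] > limit)
    if score > score_threshold and triggered:
        return {"abnormality_score": score,
                "triggered_rules": triggered,
                "observed": dict(observed)}
    return None


# Illustrative thresholds only.
rule_base = {
    "invalid_visits": 3,              # visits without effective excretion
    "longest_squat_seconds": 120.0,
    "pain_posture_confidence": 0.8,
}
measurements = {
    "invalid_visits": 5,
    "longest_squat_seconds": 90.0,
    "pain_posture_confidence": 0.4,
}
alert = urinary_warning(0.75, measurements, rule_base)
print(alert["triggered_rules"])
```

Requiring both conditions means a high score alone, or a single rule hit alone, does not raise a warning, which is one way to keep false alarms down.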

[0115] Furthermore, the stress or pain features are constructed into a spatiotemporal graph structure data containing node features and edge connections in chronological order, and input into a pre-trained graph neural network model for graph-level classification calculation. The output includes stress or pain warning information containing stress or pain probability values, location of local high-temperature areas, increase in frequency of abnormal grooming behavior, and percentage decrease in activity level. Specifically, this includes: The stress or pain characteristics are segmented according to the preset time window length, and the data within each time window is further divided into multiple consecutive time segments. A graph node is constructed for each time segment to form a graph node set. Based on the temporal order of time segments within the same time window, a time edge is constructed to connect adjacent time segment nodes. The time edge is used to represent the evolutionary dependency of behavioral features in the time dimension. Based on the physiological correlation and statistical correlation between stress or pain features of different modalities within the same time segment, a spatial edge is constructed to connect the corresponding sub-nodes of different modalities within the same time segment. The spatial edge is used to characterize the feature coupling relationship of stress or pain features of different modalities at the same time point. The set of edges, which together consist of temporal and spatial edges, is combined with the set of graph nodes to form a spatiotemporal graph structure data. The spatiotemporal graph structure data is input into a pre-trained graph neural network model for graph-level classification calculation. 
The graph convolutional layers perform message passing and feature aggregation along the temporal and spatial edges, update node features layer by layer to fuse neighborhood information, and output the stress or pain probability value. Key correlation features are extracted from the spatiotemporal graph structure data to determine the location of local high-temperature areas, the increase in the frequency of abnormal grooming behavior, and the percentage decrease in activity level. When the stress or pain probability value exceeds the preset stress or pain probability threshold, and the location of the local high-temperature area, the increase in the frequency of abnormal grooming behavior, and the percentage decrease in activity level are all higher than their respective confidence thresholds, stress or pain warning information is generated.

[0116] In this embodiment, the collected stress or pain characteristic data is segmented according to a pre-set time window length. For example, if the time window length is set to 10 minutes, the continuous stress or pain characteristic data is divided into units of 10 minutes each. Then, the data within each time window is further subdivided into multiple consecutive time segments, such as dividing the 10-minute time window into 10 one-minute time segments. Next, a graph node is constructed for each time segment. When constructing the graph node, stress or pain characteristics within that time segment, such as the difference between local and average body temperature, hiding time, frequency of grooming behavior, curling posture, vocalization frequency, and tone characteristics, are used as the node's feature information. In this way, a set of graph nodes is formed, with each node carrying stress or pain-related characteristic information of the pet within a specific time segment.
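The windowing in this paragraph can be sketched as follows, using the 10-minute window and one-minute segments from the example. Averaging the samples inside each segment to obtain the node feature is an assumption; the text only says the features within the segment are used as the node's feature information. The function and feature names are hypothetical.

```python
from collections import defaultdict


def build_segment_nodes(samples, window_s=600.0, segment_s=60.0):
    """Group timestamped samples into (window index, segment index) nodes.

    samples: iterable of (timestamp in seconds, {feature name: value})
    Returns {(window, segment): {feature name: mean value in that segment}}.
    """
    buckets = defaultdict(list)
    for timestamp, features in samples:
        window = int(timestamp // window_s)
        segment = int((timestamp % window_s) // segment_s)
        buckets[(window, segment)].append(features)
    nodes = {}
    for key, rows in buckets.items():
        nodes[key] = {name: sum(row[name] for row in rows) / len(rows)
                      for name in rows[0]}
    return nodes


readings = [
    (0.0, {"grooming_per_min": 1.0}),
    (30.0, {"grooming_per_min": 3.0}),
    (65.0, {"grooming_per_min": 2.0}),
    (605.0, {"grooming_per_min": 4.0}),
]
segment_nodes = build_segment_nodes(readings)
print(sorted(segment_nodes))
```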

[0117] The edge set includes temporal edges and spatial edges. Temporal edges are constructed by connecting nodes of adjacent time segments based on their chronological order within the same time window. For example, within a 10-minute time window, nodes of the first minute are connected to nodes of the second minute, which in turn are connected to nodes of the third minute, and so on. These temporal edges can characterize the evolutionary dependencies of pet behavioral traits over time, reflecting changes in a pet's stress or pain state over time.

[0118] For the construction of spatial edges, based on the physiological correlation and statistical correlation between stress or pain characteristics of different modalities within the same time segment, the child nodes corresponding to different modalities within the same time segment are connected. For example, within a time segment, local temperature characteristics and grooming behavior frequency characteristics may have certain physiological connections, so their corresponding child nodes are connected. Spatial edges can characterize the feature coupling relationship of stress or pain characteristics of different modalities at the same time point, helping to more comprehensively understand the stress or pain state of pets at that time point. Combining the edge set formed by the time edges and spatial edges with the previously constructed graph node set forms a complete spatiotemporal graph structure data.
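The node and edge sets described in paragraphs [0116] to [0118] can be sketched as plain Python data. Two points are assumptions for illustration: each time segment is modelled as one sub-node per modality, and temporal edges link the same modality across adjacent segments. The list of correlated modality pairs is a hypothetical input standing in for the physiological and statistical correlation analysis.

```python
from itertools import combinations


def build_spatiotemporal_graph(num_segments, modalities, correlated_pairs=None):
    """Return (nodes, temporal_edges, spatial_edges) for one time window."""
    nodes = [(seg, mod) for seg in range(num_segments) for mod in modalities]
    # Temporal edges: same modality, adjacent time segments.
    temporal_edges = [((seg, mod), (seg + 1, mod))
                      for seg in range(num_segments - 1) for mod in modalities]
    # Spatial edges: correlated modalities inside the same time segment.
    if correlated_pairs is None:
        correlated_pairs = list(combinations(modalities, 2))
    spatial_edges = [((seg, a), (seg, b))
                     for seg in range(num_segments) for a, b in correlated_pairs]
    return nodes, temporal_edges, spatial_edges


# A 10-minute window split into 10 one-minute segments, three modalities.
graph_nodes, time_edges, space_edges = build_spatiotemporal_graph(
    10,
    ["local_temperature", "grooming", "vocalization"],
    correlated_pairs=[("local_temperature", "grooming")],
)
print(len(graph_nodes), len(time_edges), len(space_edges))
```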

[0119] The constructed spatiotemporal graph structure data is input into a pre-trained graph neural network model. In this model, graph convolutional layers perform message passing and feature aggregation operations along temporal and spatial edges. Specifically, each node receives information from neighboring nodes (connected by temporal and spatial edges) and fuses this information with its own features. By updating node features layer by layer, neighborhood information is fully integrated, ensuring that each node's features include not only its own information but also information from other related nodes. After multiple layers of this operation, the model ultimately outputs a stress or pain probability value, reflecting the likelihood that the pet is currently in a state of stress or pain.
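The message passing and graph-level readout described here can be illustrated with a deliberately simplified layer. As an assumption, each layer below only averages a node's feature vector with those of its neighbors and has no learnable weights, whereas a trained graph convolutional layer would also apply learned transformations. The readout weights are toy values and the function names are hypothetical.

```python
import math


def message_passing_layer(features, edges):
    """Replace each node's features with the mean over itself and its neighbors."""
    neighbors = {node: [node] for node in features}
    for a, b in edges:            # edges are treated as undirected
        neighbors[a].append(b)
        neighbors[b].append(a)
    updated = {}
    for node, group in neighbors.items():
        dim = len(features[node])
        updated[node] = [sum(features[n][k] for n in group) / len(group)
                         for k in range(dim)]
    return updated


def graph_probability(features, edges, weights, bias, num_layers=2):
    """Stack layers, mean-pool all nodes, then apply a sigmoid output unit."""
    hidden = features
    for _ in range(num_layers):
        hidden = message_passing_layer(hidden, edges)
    dim = len(weights)
    pooled = [sum(vec[k] for vec in hidden.values()) / len(hidden)
              for k in range(dim)]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Three nodes in a chain a - b - c with one feature each.
toy_features = {"a": [1.0], "b": [0.0], "c": [0.0]}
toy_edges = [("a", "b"), ("b", "c")]
after_one_layer = message_passing_layer(toy_features, toy_edges)
stress_probability = graph_probability(toy_features, toy_edges, [2.0], -0.5)
print(after_one_layer, round(stress_probability, 4))
```

After one layer, node `b` already carries information from both `a` and `c`, which is the neighborhood fusion the paragraph describes.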

[0120] Key correlation features were further extracted from the spatiotemporal graph structure data. For the location of local high-temperature areas, by analyzing the temperature difference between the local temperature and the average body temperature in the nodes, areas with temperatures significantly higher than the average body temperature were identified, and their locations on the pet's body were determined. For the increase in the frequency of abnormal grooming behavior, the changes in the frequency of grooming behavior in different time segments were compared, and the increase in the frequency of abnormal grooming behavior relative to the normal state was calculated. For the percentage decrease in activity level, based on the pet's activity-related characteristics recorded in the nodes, such as hiding time and curled-up posture, the changes in the pet's activity level were comprehensively assessed, and the percentage decrease in activity level was calculated.

[0121] When the stress or pain probability value output by the graph neural network model exceeds a preset stress or pain probability threshold, it indicates a high probability that the pet is in a state of stress or pain. Simultaneously, the system checks whether the location of the local high-temperature area, the increase in the frequency of abnormal grooming behavior, and the percentage decrease in activity level are all above their respective confidence thresholds. If both conditions are met, it indicates that the pet is very likely in a state of stress or pain. At this point, the system will generate a stress or pain warning message to remind the pet owner to pay attention to the pet's health.
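The two-part condition in this paragraph can be sketched as below. Unlike a rule base where any single rule suffices, every supporting indicator must exceed its own threshold here. The indicator names and all threshold values are hypothetical, and representing each indicator by a single comparable number is an assumption.

```python
def stress_or_pain_warning(probability, indicators, probability_threshold,
                           indicator_thresholds):
    """Warn only if the probability AND every indicator exceed their thresholds."""
    if probability <= probability_threshold:
        return None
    all_exceeded = all(indicators[name] > limit
                       for name, limit in indicator_thresholds.items())
    if not all_exceeded:
        return None
    return {"stress_or_pain_probability": probability,
            "indicators": dict(indicators)}


# Illustrative values only.
limits = {
    "hot_area_confidence": 0.6,
    "grooming_frequency_increase": 0.3,
    "activity_decrease_percent": 20.0,
}
observed_now = {
    "hot_area_confidence": 0.9,
    "grooming_frequency_increase": 0.5,
    "activity_decrease_percent": 35.0,
}
message = stress_or_pain_warning(0.85, observed_now, 0.7, limits)
print(message is not None)
```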

[0122] Referring to Figure 2, an embodiment of the present invention provides a multimodal pet living area behavior detection and analysis system 2. The system specifically comprises: a data acquisition module 201, used to build a collaborative sensing network through multiple sensors deployed in the pet's living area to acquire multimodal heterogeneous raw sensor data; a data fusion module 202, used to perform spatiotemporal synchronization and spatial coordinate system calibration on the raw sensor data, and to perform feature extraction and multi-level fusion to obtain a joint feature vector characterizing the pet's state and behavior; an anomaly detection module 203, used to establish a personalized behavioral baseline model for each pet, compare the joint feature vector with the personalized behavioral baseline model, and identify abnormal behavior patterns that deviate from the norm; and an early warning analysis module 204, used to match the identified abnormal behavior patterns with a preset abnormal pattern classifier, perform early warning analysis for arthritis risk, urinary tract disease risk, and stress or pain, respectively, and output proactive early warning information for specific health risks.

[0123] It can be understood that the content of the embodiment of the multimodal pet living area behavior detection and analysis method shown in Figure 1 applies to the embodiment of the multimodal pet living area behavior detection and analysis system. The specific functions implemented by the system embodiment are the same as those of the method embodiment shown in Figure 1, and the beneficial effects achieved are the same as those achieved by the method embodiment shown in Figure 1.

[0124] It should be noted that the information interaction and execution process between the above systems are based on the same concept as the method embodiments of the present invention. For details on their specific functions and technical effects, please refer to the method embodiments section, which will not be repeated here.

Claims

1. A multimodal-based method for detecting and analyzing behavior in a pet living area, characterized in that the method specifically includes: building a collaborative sensing network by deploying multiple sensors in the pet's living area to acquire multimodal heterogeneous raw sensor data; performing spatiotemporal synchronization and spatial coordinate system calibration on the raw sensor data, and performing feature extraction and multi-level fusion to obtain a joint feature vector that characterizes the pet's state and behavior; establishing a personalized behavioral baseline model for each pet, comparing the joint feature vector with the personalized behavioral baseline model, and identifying abnormal behavior patterns that deviate from the norm; and matching the identified abnormal behavior patterns with a preset abnormal pattern classifier to perform arthritis risk warning analysis, urinary tract disease risk warning analysis, and stress or pain warning analysis, respectively, and outputting proactive warning information for specific health risks.

2. The method according to claim 1, characterized in that the step of performing spatiotemporal synchronization and spatial coordinate system calibration on the raw sensor data, and performing feature extraction and multi-level fusion to obtain a joint feature vector characterizing the pet's state and behavior, specifically includes: applying a unified timestamp reference to the raw sensor data, interpolating or resampling data acquired at different sampling rates, and performing joint spatial calibration on each sensor to establish coordinate transformation relationships, thereby obtaining aligned multi-source heterogeneous data; based on the aligned multi-source heterogeneous data, processing each type of data through a corresponding feature extraction model to obtain posture and appearance features, heat source distribution features, target motion and micro-motion features, mechanical behavior features, and sound event features; concatenating the posture and appearance features, heat source distribution features, target motion and micro-motion features, mechanical behavior features, and sound event features in the time dimension to form a high-dimensional joint feature vector; and inputting the high-dimensional joint feature vector into an attention-based Transformer encoder model for deep feature-level fusion, and outputting a joint feature vector that represents the pet's instantaneous state and behavior in the current scene.

3. The method according to claim 2, characterized in that, The alignment-based multi-source heterogeneous data is processed through corresponding feature extraction models to obtain posture and appearance features, heat source distribution features, target motion and micro-motion features, mechanical behavior features, and sound event features, specifically including: Based on aligned visual data, the bounding box coordinates of the pet are extracted through a pre-trained convolutional neural network object detection model, and the coordinate sequence of multiple key points of the pet's body is extracted through a pose estimation model as pose and appearance features. Based on aligned thermal imaging data, heat source contours are extracted and temperature distribution statistics are performed using image processing algorithms to obtain heat source area, average temperature and local temperature difference features, which are used as heat source distribution features. Based on aligned millimeter-wave radar data, a range-Doppler spectrum is generated through fast Fourier transform, and the target's range, radial velocity, and micro-Doppler spectral features are extracted from the range-Doppler spectrum as target motion and micro-motion characteristics. Based on aligned IMU data and pressure data, weight change value, pressure center trajectory, residence time and pressure distribution asymmetry index are extracted through time domain and frequency domain analysis as mechanical behavior characteristics. Based on aligned audio waveform data, a pre-trained audio event detection model is used to identify and output sound event labels and their timestamps related to specific behaviors, which serve as sound event features.

4. The method according to claim 2, characterized in that the step of inputting the high-dimensional joint feature vector into an attention-based Transformer encoder model for deep feature-level fusion, and outputting a joint feature vector representing the pet's instantaneous state and behavior in the current scene, specifically includes: inputting the high-dimensional joint feature vector into the multi-head self-attention module of the Transformer encoder model for computation, and adaptively learning, through the attention mechanism, the dependency relationships and importance weights between features at different time steps and of different modalities to obtain self-attention output features; applying a residual connection between the self-attention output features and the high-dimensional joint feature input tensor, and then obtaining a normalized first intermediate representation feature through layer normalization; inputting the first intermediate representation feature into the feedforward neural network of the Transformer encoder model for nonlinear transformation to extract deeper cross-modal interaction information and obtain a second intermediate representation feature after nonlinear mapping; and applying a residual connection between the second intermediate representation feature and the first intermediate representation feature, followed by layer normalization and a pooling layer, to obtain a joint feature vector that represents the pet's instantaneous state and behavior in the current scene.

5. The method according to claim 1, characterized in that, The process involves establishing a personalized behavioral baseline model for each pet, comparing the joint feature vector with the personalized behavioral baseline model, and identifying abnormal behavioral patterns that deviate from the norm. Specifically, this includes: Collect multiple sets of historical joint feature vectors of the target pet in a healthy state to construct a training dataset to represent the individual's normal behavior; The training dataset is input into the variational autoencoder for unsupervised training. By minimizing the joint loss function consisting of reconstruction loss and KL divergence, the probability distribution of normal behavior in the latent space is learned, and a personalized behavior baseline model is obtained. The statistical threshold of the error is calculated based on the reconstruction error distribution of the training dataset in the trained personalized behavior baseline model. The joint feature vector output in real time is input into the personalized behavior baseline model to calculate the real-time reconstruction error; The real-time reconstruction error is compared with a statistical threshold. When the real-time reconstruction error exceeds the statistical threshold multiple times, the current behavior pattern is identified and marked as an abnormal behavior pattern that deviates from the normal behavior pattern of the target pet individual.

6. The method according to claim 5, characterized in that inputting the training dataset into the variational autoencoder for unsupervised training, and learning the probability distribution of normal behavior in the latent space by minimizing the joint loss function consisting of the reconstruction loss and the KL divergence to obtain the personalized behavior baseline model, specifically includes: feeding each joint feature vector in the training dataset, as an input sample, sequentially into the encoder network of the variational autoencoder for forward propagation, where multiple nonlinear transformations map the input sample to the mean vector and log-variance vector of the latent variable distribution; performing random sampling with the reparameterization technique, based on the mean vector and the log-variance vector, to obtain low-dimensional latent variables; inputting the low-dimensional latent variables into the decoder network of the variational autoencoder for reconstruction, where multiple nonlinear transformations output a reconstructed feature vector with the same dimension as the input sample; calculating the mean squared error between the input sample and the reconstructed feature vector as the reconstruction loss; calculating the KL divergence between the Gaussian distribution defined by the mean vector and the log-variance vector and the standard normal distribution as the regularization loss; forming the joint loss function as the weighted sum of the reconstruction loss and the KL divergence; and iteratively updating the network parameters of the encoder network and the decoder network through the backpropagation algorithm to minimize the joint loss function, the trained encoder network and decoder network constituting the personalized behavior baseline model.
The encoder network is used to map the input fused feature vector to the latent space, and the decoder network is used to reconstruct the feature vector corresponding to the normal behavior of the target pet individual from the latent space.
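
The forward pass and joint loss described in claim 6 can be sketched in NumPy as follows. The sketch evaluates the loss once and does not perform the backpropagation updates, which would normally be delegated to an automatic differentiation framework. The layer sizes, the tanh activations, the weight `beta` on the KL term, and all function and parameter names are assumptions not given in the claim.

```python
import numpy as np

def init_vae(d_in, d_hidden, d_latent, seed=0):
    # Hypothetical random weights so that the forward pass can be executed.
    rng = np.random.default_rng(seed)

    def w(rows, cols):
        return rng.standard_normal((rows, cols)) / np.sqrt(rows)

    return {
        "enc_w": w(d_in, d_hidden), "enc_b": np.zeros(d_hidden),
        "mu_w": w(d_hidden, d_latent), "mu_b": np.zeros(d_latent),
        "lv_w": w(d_hidden, d_latent), "lv_b": np.zeros(d_latent),
        "dec_w1": w(d_latent, d_hidden), "dec_b1": np.zeros(d_hidden),
        "dec_w2": w(d_hidden, d_in), "dec_b2": np.zeros(d_in),
    }

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I), so sampling stays differentiable in mu and log_var.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per sample.
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

def vae_loss(x, p, rng, beta=1.0):
    # Encoder: input sample -> mean vector and log-variance vector.
    h = np.tanh(x @ p["enc_w"] + p["enc_b"])
    mu = h @ p["mu_w"] + p["mu_b"]
    log_var = h @ p["lv_w"] + p["lv_b"]
    # Reparameterized sampling of the low-dimensional latent variables.
    z = reparameterize(mu, log_var, rng)
    # Decoder: latent variables -> reconstruction with the input dimension.
    x_hat = np.tanh(z @ p["dec_w1"] + p["dec_b1"]) @ p["dec_w2"] + p["dec_b2"]
    recon = np.mean((x - x_hat) ** 2, axis=-1)        # mean squared error per sample
    kl = kl_to_standard_normal(mu, log_var)           # regularization loss per sample
    return float(np.mean(recon + beta * kl)), x_hat   # weighted sum, averaged over the batch

rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 8))                   # four joint feature vectors of width 8
loss, x_hat = vae_loss(batch, init_vae(8, 16, 2), rng)
```

The same per-sample `recon` quantity is what the trained model would report as the reconstruction error when scoring new joint feature vectors.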

7. The method according to claim 5, characterized in that calculating the statistical threshold for the reconstruction error from the distribution of reconstruction errors that the trained personalized behavior baseline model produces on the training dataset specifically includes: inputting each joint feature vector of the training dataset sequentially into the trained personalized behavior baseline model, and calculating through forward propagation the reconstruction error value between each joint feature vector and its corresponding reconstructed feature vector, to obtain an error set containing the reconstruction errors of all normal samples; performing statistical analysis on the error set, calculating the arithmetic mean of all reconstruction error values as the mean statistic, and calculating the standard deviation of all reconstruction error values about the mean statistic as the standard deviation statistic; and calculating, from the mean statistic, the standard deviation statistic, and a preset confidence interval multiple, the statistical threshold for the reconstruction error used to distinguish normal from abnormal conditions.
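
One common reading of claim 7 is a threshold of the form mean plus a multiple of the standard deviation. The sketch below assumes that form; the claim does not fix the formula, the multiple `k`, or whether the population or sample standard deviation is meant, so those choices and the function name are illustrative only.

```python
import statistics

def reconstruction_error_threshold(errors, k=3.0):
    """Return mean + k * std over the reconstruction errors of normal samples."""
    mean = statistics.fmean(errors)             # mean statistic
    std = statistics.pstdev(errors, mu=mean)    # population standard deviation about the mean
    return mean + k * std                       # k is the assumed confidence interval multiple

# Toy error set: mean is 2.0 and population standard deviation is 1.0.
threshold = reconstruction_error_threshold([1.0, 3.0], k=2.0)
```

With the toy error set above, the mean is 2.0 and the standard deviation is 1.0, so a multiple of 2 gives a threshold of 4.0.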

8. The method according to claim 5, characterized in that matching the identified abnormal behavior patterns against a preset abnormal pattern classifier to perform arthritis risk warning analysis, urinary tract disease risk warning analysis, and stress or pain warning analysis, respectively, and outputting proactive warning information for specific health risks, specifically includes: identifying, based on the abnormal behavior patterns, arthritis risk characteristics, urinary tract disease risk characteristics, and stress or pain characteristics, wherein the arthritis risk characteristics include the hind limb key point coordinate sequence, the landing impact peak, the pressure center trajectory, and the pressure distribution asymmetry index; the urinary tract disease risk characteristics include the total weight change sequence, effective excretion event markers, pelvic entry and exit events, pelvic posture, and specific sound event labels; and the stress or pain characteristics include the difference between local temperature and average body temperature, hiding time, grooming behavior frequency, degree of posture curling, and vocal frequency and tone characteristics; and inputting the arthritis risk characteristics, the urinary tract disease risk characteristics, and the stress or pain characteristics into the corresponding analysis modules of the preset abnormal pattern classifier, and outputting arthritis risk warning information, urinary tract disease risk warning information, and stress or pain warning information in parallel.

9. The method according to claim 8, characterized in that inputting the arthritis risk characteristics, the urinary tract disease risk characteristics, and the stress or pain characteristics into the corresponding analysis modules of the preset abnormal pattern classifier, and outputting arthritis risk warning information, urinary tract disease risk warning information, and stress or pain warning information in parallel, specifically includes: inputting the arthritis risk characteristics into a pre-trained arthritis risk classifier for forward computation, and outputting arthritis risk warning information that includes an arthritis risk probability value, a jump height reduction ratio, a landing posture asymmetry index, and a hind limb weight-bearing difference description; inputting the urinary tract disease risk characteristics into a pre-trained urinary tract disease risk classifier for time-series feature modeling, and outputting urinary tract disease risk warning information that includes an abnormal toilet behavior pattern score, the number of invalid accesses per unit time, the longest squatting time per session, and a pain posture confidence description; and constructing the stress or pain characteristics, in chronological order, into a spatiotemporal graph structure containing node features and edge connections, inputting the spatiotemporal graph structure into a pre-trained graph neural network model for graph-level classification, and outputting stress or pain warning information that includes a stress or pain probability value, the location of local high-temperature areas, the increase in the frequency of abnormal grooming behavior, and the percentage decrease in activity level.
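
The spatiotemporal graph construction mentioned at the end of claim 9 can be sketched as below. The claim does not define the graph topology, so this sketch assumes one node per characteristic per time step, spatial edges between all characteristics within a time step, and temporal edges linking the same characteristic across consecutive time steps. The characteristic names and the function name are hypothetical, and the graph neural network classifier itself is not shown.

```python
from itertools import combinations

def build_spatiotemporal_graph(frames):
    """frames: chronological list of dicts mapping characteristic name -> value.

    Returns (node_features, edges), where edges are pairs of node indices.
    """
    node_index = {}      # (time step, characteristic name) -> node id
    node_features = []   # one single-value feature list per node
    edges = []
    for t, frame in enumerate(frames):
        names = sorted(frame)
        for name in names:
            node_index[(t, name)] = len(node_features)
            node_features.append([frame[name]])
        # Spatial edges: connect every pair of characteristics in the same time step.
        for a, b in combinations(names, 2):
            edges.append((node_index[(t, a)], node_index[(t, b)]))
        # Temporal edges: connect each characteristic to itself in the previous time step.
        if t > 0:
            for name in names:
                if (t - 1, name) in node_index:
                    edges.append((node_index[(t - 1, name)], node_index[(t, name)]))
    return node_features, edges

frames = [
    {"temp_diff": 0.4, "hiding_time": 120.0, "grooming_freq": 3.0},
    {"temp_diff": 1.1, "hiding_time": 300.0, "grooming_freq": 7.0},
]
node_features, edges = build_spatiotemporal_graph(frames)
```

The resulting node feature list and edge list are the kind of input a graph-level classifier would consume, after conversion to whatever tensor format the chosen graph library expects.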

10. A multimodal pet living area behavior detection and analysis system, characterized in that the system specifically includes: a data acquisition module, used to build a collaborative sensing network through multiple sensors deployed in the pet's exercise area to acquire multimodal heterogeneous raw sensor data; a data fusion module, used to perform spatiotemporal synchronization and spatial coordinate system calibration on the raw sensor data, and to perform feature extraction and multi-level fusion to obtain a joint feature vector that characterizes the pet's state and behavior; an anomaly detection module, used to build a personalized behavior baseline model for each pet, compare the joint feature vector with the personalized behavior baseline model, and identify abnormal behavior patterns that deviate from the norm; and an early warning analysis module, used to match the identified abnormal behavior patterns against a preset abnormal pattern classifier, to perform early warning analysis for arthritis risk, urinary tract disease risk, and stress or pain, respectively, and to output proactive early warning information for specific health risks.