Pet tracking method and pet robot

By extracting feature vectors through multi-level filtering and recognition networks, and combining motion state initialization and image background drift correction, the tracking drift problems caused by identity confusion and high-speed rotation in pet tracking are solved, thus achieving stability and accuracy in pet tracking.

CN122244096A · Pending · Publication Date: 2026-06-19 · GUANGDONG XINBAO ELECTRICAL APPLIANCES HLDG CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: GUANGDONG XINBAO ELECTRICAL APPLIANCES HLDG CO LTD
Filing Date: 2026-03-11
Publication Date: 2026-06-19


IPC: G06T7/246; G06V20/40; G06V10/80; G06T3/02

AI Tagging

Application Domain

Image analysis; Geometric image transformation

Technology Topics

Feature vector; Computer graphics (images)


Abstract

This application provides a pet tracking method and a pet robot. The method includes: performing preliminary multi-level target screening on the acquired video stream and extracting normalized feature vectors based on a recognition network; performing zero-speed initialization of the tracking system's motion state and generating a constant global identity reference template based on the target position locked during initialization; determining the image background drift caused by gimbal rotation and correcting the tracking center based on that drift to resist visual jitter caused by gimbal self-motion; adaptively fusing the motor information of the motor driving the rotation with the visual information of the video stream; and establishing a dynamically updated dynamic template and, during target search, weighting and synthesizing the dynamic template and the global identity reference template according to a preset ratio to generate a search template. This improves pet recognition accuracy and reduces visual jitter caused by gimbal self-motion.


Description

Technical Field

[0001] This application relates to the field of pet tracking, and in particular to a pet tracking method and a pet robot.

Background Technology

[0002] Current mainstream tracking models based on Siamese networks work by manually or automatically specifying the target pet's position in the initial frame, typically with a detection box. After image preprocessing, the Siamese network extracts the target's appearance features as a template and then performs template matching in subsequent frames to achieve continuous tracking. However, this approach has the following drawbacks. The algorithm relies solely on shallow appearance features such as color and shape. When multiple pets of the same breed and similar appearance appear simultaneously or move along overlapping paths, the tracker is prone to identity confusion, mistaking target A for target B. If the target is briefly occluded and then reappears, the algorithm may also associate it with another similar-looking target, causing tracking "drift." In tracking systems with rotating motors, the tracking algorithm outputs the target position, and the control system then drives the motor to rotate the camera; this is an open-loop or delayed closed-loop system. The prediction model of the tracking algorithm does not consider the camera's own motion, so when the motor rotates at high speed, significant motion blur and prediction errors arise, causing the tracker to lag behind the target or lose it completely.

Summary of the Invention

[0003] The purpose of this application is to provide a pet tracking method.

[0004] The embodiments of this application adopt the following technical solution: a pet tracking method, comprising:
performing preliminary multi-level target screening on the acquired video stream, and extracting normalized feature vectors based on a recognition network;
performing zero-speed initialization of the tracking system's motion state, and generating a constant global identity reference template based on the target position locked during initialization;
determining the amount of image background drift caused by gimbal rotation, and correcting the tracking center based on that amount to counteract the visual jitter caused by gimbal self-motion;
adaptively fusing the motor information of the motor that drives the rotation with the visual information of the video stream;
establishing a dynamically updated dynamic template and, when performing a target search, weighting and combining the dynamic template with the global identity reference template according to a preset ratio to generate a search template.

[0005] In some embodiments, the preliminary multi-level target screening of the acquired video stream and the extraction of normalized feature vectors based on the recognition network include:
capturing the video stream using a camera mounted on a gimbal;
performing first-level screening with a first target retrieval network to identify all potential pet categories and their body bounding boxes in the scene;
based on the results of the first-level screening, performing second-level screening with a second target retrieval network to determine a facial image, where the screening accuracy of the second target retrieval network is higher than that of the first target retrieval network;
processing the facial images output by the second-level screening with a re-identification network to extract normalized feature vectors.

[0006] In some embodiments, performing zero-speed initialization of the tracking system's motion state and generating a constant global identity reference template based on the target position locked during initialization includes:
constructing the initial state vector of the Kalman filter based on the initial visual detection box, and setting the initial motion state of the target to zero velocity and zero acceleration on the pixel plane;
cropping a region from the image centered on the initially locked target location, and extracting features within that region to generate a constant global identity reference template, thereby suppressing semantic drift of the tracking model caused by changes in pet pose and lighting.

[0007] In some embodiments, determining the amount of image background drift caused by gimbal rotation, and correcting the tracking center based on that amount to counteract visual jitter caused by gimbal self-motion, includes:
reading the angular velocity feedback of the gimbal motor in real time, the feedback being a control input vector containing the angular velocities in the pitch and yaw directions;
determining a mathematical model describing the camera's own motion based on the focal length of the camera capturing the video stream and the frame interval of the video;
establishing the mapping relationship between the camera's physical control variables and the image pixel space;
vector-superimposing the estimated target velocity from the previous moment with the background drift calculated in the previous step to determine the physical predicted position of the target pet in the image in the current frame.

[0008] In some embodiments, the method further includes: performing fused geometric preprocessing on the image pixel space.

[0009] In some embodiments, the fused geometric preprocessing of the image pixel space includes:
establishing an inverse affine transformation model from the target tensor coordinate system to the source image coordinate system;
incorporating boundary padding into the image pixel reading process;
traversing the pixels of the image, driven by the logical coordinates of the target tensor.

[0010] In some embodiments, the method further includes:
performing feature extraction based on a Siamese network, where the Siamese network includes a template branch and a search branch that share the same convolutional neural network weights and receive different inputs;
using a cross-correlation operator to fuse template features and search features at each spatial location, generating a response feature map containing target similarity information;
performing classification and regression on the fused feature map to obtain the observation vector.

[0011] In some embodiments, the adaptive fusion of the motor information of the motor driving the rotation with the visual information of the video stream includes:
using the feedback from the motor as input to the Kalman filter calculation, in order to make a prior prediction of the target state and compensate for the state deviation caused by the camera's own motion;
adaptively fusing the motor information of the motor that drives the rotation with the visual information of the video stream based on visual confidence.

[0012] In some embodiments, establishing a dynamically updated dynamic template and, when performing a target search, weighting and synthesizing the dynamic template with the global identity reference template according to a preset ratio to generate a search template includes:
establishing a dynamically updated dynamic template, in which a linear weighted moving average strategy fuses the features of the current frame into the dynamic template only when the visual confidence is higher than a preset high threshold;
monitoring the duration of target loss in real time and triggering a tiered recovery mechanism based on that duration.

[0013] A pet robot that performs pet identification using any of the methods described in the above embodiments.

[0014] The beneficial effects of the embodiments of this application are as follows: this application combines identity recognition with a motor motion compensation algorithm. Through a cascaded recognition process, it locks onto the pet's unique identity before tracking starts, and it actively compensates for camera displacement by feeding back motor motion information, which keeps tracking stable at high rotation speeds.

Attached Figure Description

[0015] To more clearly illustrate the technical solutions in the embodiments or related technologies of this application, the drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a flowchart of the pet identification method in this application.

Detailed Implementation

[0017] Various embodiments and features of this application are described herein with reference to the accompanying drawings.

[0018] It should be understood that various modifications can be made to the embodiments described herein. Therefore, the above description should not be considered as limiting, but merely as an example of embodiments. Other modifications within the scope and spirit of this application will be apparent to those skilled in the art.

[0019] The accompanying drawings, which are included in and form part of this specification, illustrate embodiments of the present application and, together with the general description of the present application given above and the detailed description of the embodiments given below, serve to explain the principles of the present application.

[0020] These and other features of this application will become apparent from the following description of preferred forms of embodiments given as non-limiting examples, with reference to the accompanying drawings.

[0021] It should also be understood that although this application has been described with reference to some specific examples, those skilled in the art can certainly implement many other equivalent forms of this application.

[0022] The above and other aspects, features and advantages of this application will become more apparent when taken in conjunction with the accompanying drawings and in view of the following detailed description.

[0023] Specific embodiments of this application are described hereinafter with reference to the accompanying drawings; however, it should be understood that the disclosed embodiments are merely examples of this application, which can be implemented in various ways. Well-known and/or repeated functions and structures are not described in detail, to avoid obscuring the application with unnecessary or redundant detail. Therefore, the specific structural and functional details disclosed herein are not intended to be limiting, but merely serve as a basis for the claims and as a representative basis for teaching those skilled in the art to employ this application in virtually any appropriately detailed structure.

[0024] This specification may use the phrases “in one embodiment,” “in another embodiment,” “in yet another embodiment,” or “in other embodiments,” all of which may refer to one or more of the same or different embodiments according to this application.

[0025] To solve the problems described in the background technology, and with reference to Figure 1, this application provides a pet tracking method comprising the following steps: S1, performing preliminary multi-level target screening on the acquired video stream and extracting normalized feature vectors based on the recognition network.

[0026] For example, a video stream can be acquired via a camera and subjected to initial target screening. Multi-level screening, including but not limited to two-level screening, can be performed here to obtain more accurate pet facial image data. The pet's facial image data is then passed through re-identification to extract a normalized high-dimensional feature vector.

[0027] In some embodiments, S1 may include: S11, capturing the video stream using a camera mounted on a gimbal.

[0028] During the initialization phase, the system can activate the wide-angle camera to capture video streams, thereby obtaining video data of the pet over a wider range.

[0029] S12 uses a first-target retrieval network for primary filtering to identify all potential pet categories and their body bounding boxes in the scene.

[0030] A cascaded filtering strategy is employed for initial target screening. In the first-level coarse screening, a lightweight, general-purpose target detection network (the first target retrieval network, for example the YOLO-Nano series) quickly identifies all potential pet categories (such as cats or dogs) and their body bounding boxes in the scene. The first-level screening can output the region of interest (ROI).

[0031] S13, based on the first-level screening results, a second target retrieval network is used for second-level screening to determine a facial image, wherein the screening accuracy of the second target retrieval network is higher than that of the first target retrieval network.

[0032] Next, for the region of interest (ROI) output by the first-level detection, the system runs a more precise pet-face-specific detector for second-level fine screening. If a valid face is not detected in this step (e.g., the pet is facing away from the camera or its face is severely obscured), the current frame is automatically discarded, and the system continues to process subsequent frames until a high-confidence facial image is successfully captured.

[0033] S14, the facial images output from the secondary screening are processed by a re-identification network to extract normalized feature vectors.

[0034] This step anchors the target identity through re-identification (Re-ID). The facial image regions that passed the second-level fine detection are cropped, scaled to a standard size, and input into a lightweight re-identification (Re-ID) network to extract a normalized high-dimensional feature vector. The key step is to compare this feature vector, using cosine similarity, with the pet baseline features pre-registered by the user. Only when the calculated similarity exceeds a preset threshold does the system determine that the target pet has been successfully located and record the target's current center coordinates and size as the initial observation state of the entire tracking system.
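The identity gate described in this step can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names and the 0.7 threshold are assumptions, and the feature vectors stand in for the output of whatever Re-ID network is used.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Cosine similarity; for unit-length vectors this is just the dot product."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def is_registered_pet(candidate, registered, threshold=0.7):
    """Accept the candidate only if its similarity exceeds the preset threshold.

    The 0.7 default is an illustrative assumption, not a value from the text.
    """
    return cosine_similarity(candidate, registered) > threshold
```

Because both vectors are normalized first, the comparison depends only on the direction of the features, not on their magnitude.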

[0035] S2, the tracking system performs zero-speed initialization of motion state and generates a constant global identity reference template based on the target position locked during initialization.

[0036] For example, when tracking a pet, the system adopts a dual-modal hybrid tracking approach. Because the target's motion state is unknown at the initial moment, the system sets its velocity and acceleration on the pixel plane to zero by default.

[0037] The system can process the images in the video stream, extract the features of the images, and generate a global identity baseline template, centered on the initially locked target location. This template remains unchanged and will not be updated during the pet recognition process, so as to establish a stable identity verification baseline and suppress the semantic drift of the tracking model that may be caused by long-term changes in pet posture, lighting, etc.

[0038] In some embodiments, step S2 may include: S21, construct the initial state vector of the Kalman filter based on the initial visual detection box, and set the initial motion state of the target to have zero velocity and acceleration on the pixel plane.

[0039] The system constructs the initial state vector of the Kalman filter based on the initial visual detection bounding boxes. Considering that the target's motion state is unknown at the initial moment, the system defaults to setting its velocity and acceleration on the pixel plane to zero. At the same time, the corresponding error covariance matrix is also initialized to prepare for subsequent iterative convergence.
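A minimal sketch of the zero-speed initialization, under stated assumptions: the state layout `[cx, cy, w, h, vx, vy, ax, ay]`, the box format, and the two variance values are all hypothetical choices made for illustration, since the text does not specify them.

```python
def init_kalman_state(box, pos_var=10.0, motion_var=1000.0):
    """Build the initial state and covariance from a detection box.

    box is (x_min, y_min, width, height) in pixels (assumed format).
    Velocity and acceleration start at zero because the motion is unknown;
    their variances are set large so later measurements can correct them quickly.
    """
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    # Assumed layout: [cx, cy, w, h, vx, vy, ax, ay]
    state = [cx, cy, float(w), float(h), 0.0, 0.0, 0.0, 0.0]
    diag = [pos_var] * 4 + [motion_var] * 4
    n = len(state)
    cov = [[diag[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return state, cov
```

Giving the unobserved velocity and acceleration terms a much larger initial variance than the observed box terms is a common way to let the filter converge during the first few frames.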

[0040] S22, taking the initially locked target location as the center, extracts a region from the image and extracts features within the region to generate a constant global identity baseline template, in order to suppress semantic drift of the tracking model caused by changes in pet posture and lighting.

[0041] Centered on the initially locked target location, the system extracts a region from the image and its features to generate a global identity baseline template. This baseline template remains constant and unupdated throughout the entire tracking process. This is done to establish a stable identity verification baseline, suppressing semantic drift in the tracking model that may be caused by long-term changes in pet posture, lighting, etc.

[0042] S3, determine the amount of image background drift caused by gimbal rotation, and correct the tracking center based on the amount of image background drift to resist visual jitter caused by gimbal self-motion.

[0043] For example, this step can utilize the physical information of the hardware (gimbal motor) to assist the vision algorithm in solving the problem of target loss that may be caused by the rapid rotation of the gimbal.

[0044] In some embodiments, step S3 may include: S31, real-time reading of the angular velocity feedback of the gimbal motor, the angular velocity feedback including control input vectors for the angular velocities in both pitch and yaw directions.

[0045] The system uses motor kinematics feedforward modeling to read the angular velocity feedback of the gimbal motor in real time. This feedback is a control input vector that includes the angular velocities in both pitch and yaw directions.

[0046] S32, Based on the focal length of the camera acquiring the video stream and the frame interval time of the video, determine a mathematical model describing the camera's own motion.

[0047] By combining the camera's focal length and the video's frame interval, the system can establish a mathematical model describing the camera's own motion.

[0048] S33 establishes the mapping relationship between the camera's physical control variables and the image pixel space.

[0049] This step enables a drift mapping from physical space to pixel space.

[0050] To compensate for camera motion, the system directly calculates the image background drift caused by gimbal rotation. Unlike optical flow methods that rely entirely on image information, this method directly establishes a mapping relationship between physical control quantities and image pixel space. The calculation formula is as follows:

$$\Delta \mathbf{p}_{bg} = -f \cdot \boldsymbol{\omega} \cdot \Delta t$$

[0051] where $f$ is the camera focal length measured in pixels, $\Delta t$ is the frame interval time, and $\boldsymbol{\omega}$ is the gimbal angular velocity vector in radians per second. The negative sign in the formula indicates that the pixel drift direction of the image background is opposite to the physical motion direction of the gimbal. By directly correcting the visual coordinates using hardware feedback, the system can effectively achieve line-of-sight stabilization. This method assumes that the eccentricity error between the camera's optical center and the gimbal's rotation axis can be ignored.

[0052] S34, the estimated target velocity from the previous moment is vector-superimposed with the background drift calculated in the previous step to determine the physical predicted position of the target pet in the image in the current frame.

[0053] This step enables predictive tracking center correction: the system vector-superimposes the target velocity estimate from the previous moment with the background drift calculated in the previous step. In this way, the physical predicted position of the target in the image within the current frame can be calculated. This predicted position will be directly used as the central reference for subsequent image processing, thus effectively resisting visual jitter caused by the gimbal's own motion.
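The drift mapping and the predictive center correction can be sketched as below. The small-angle drift model follows the description in [0051]; the mapping of yaw to the horizontal image axis and pitch to the vertical axis, the velocity unit (pixels per frame), and all function names are assumptions made for illustration.

```python
def background_drift(focal_px, omega, dt):
    """Pixel drift of the background over one frame.

    omega = (yaw_rate, pitch_rate) in rad/s. Assumes yaw maps to image x and
    pitch maps to image y, and that per-frame rotation is small enough for
    the linear model drift = -f * omega * dt to hold.
    """
    return (-focal_px * omega[0] * dt, -focal_px * omega[1] * dt)

def predict_center(prev_center, velocity_px_per_frame, drift):
    """Superimpose the previous velocity estimate and the background drift."""
    return (
        prev_center[0] + velocity_px_per_frame[0] + drift[0],
        prev_center[1] + velocity_px_per_frame[1] + drift[1],
    )

# Example: 800 px focal length, 0.5 rad/s yaw, 30 fps video.
drift = background_drift(800.0, (0.5, 0.0), 1.0 / 30.0)
center = predict_center((320.0, 240.0), (2.0, -1.0), drift)
```

In this example the gimbal yaw alone shifts the background by roughly 13 pixels per frame, which is why ignoring the camera's own motion quickly pushes the target out of a fixed search window.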

[0054] S6 adaptively fuses the motor information of the motor that drives the rotation with the visual information of the video stream.

[0055] In some embodiments, S6 may include: S61, taking the feedback from the motor as input and performing Kalman filtering calculations to make prior predictions about the target state and compensate for state deviations caused by the camera's own motion.

[0056] In the prediction phase of standard Kalman filtering, the system incorporates the feedback from the motor as an external control input into the calculation to make prior predictions about the target state, thereby effectively compensating for state deviations caused by the camera's own motion.
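A single-axis sketch of a Kalman prediction step with the motor feedback as control input. The two-element state `[position, velocity]`, the control matrix `B = [-f*dt, 0]`, and the process noise `q` are assumptions chosen to keep the example small; the text does not give the actual state layout or matrices.

```python
def predict_axis(x, P, omega, focal_px, dt, q=1.0):
    """Prior prediction for one image axis.

    x = [position_px, velocity_px_per_s], P = 2x2 covariance (list of lists).
    Model: x' = F x + B u with F = [[1, dt], [0, 1]], u = omega (rad/s),
    B = [-focal_px * dt, 0], so the gimbal rotation shifts the predicted
    position opposite to the direction of rotation.
    """
    pos, vel = x
    x_pred = [pos + vel * dt - focal_px * omega * dt, vel]
    # P' = F P F^T + Q, expanded for the 2x2 case with Q = q * I.
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    P_pred = [
        [p00 + dt * (p01 + p10) + dt * dt * p11 + q, p01 + dt * p11],
        [p10 + dt * p11, p11 + q],
    ]
    return x_pred, P_pred
```

The control term only changes the predicted mean, not the covariance, so the filter's uncertainty grows exactly as it would without motor feedback while the prediction itself is shifted to where the target should appear after the camera has moved.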

[0057] S62 adaptively fuses motor information of the motor driving the rotation with visual information from the video stream based on visual confidence.

[0058] To flexibly handle situations such as visual occlusion or image blur, the system constructs an adaptive observation noise covariance matrix. It uses the peak value of the classification response output by the neural network, normalized to the range 0 to 1, to quantify the confidence of the current visual measurement, and dynamically adjusts the system's trust in visual measurements accordingly.

[0059] The adaptive matrix is derived from a predefined benchmark observation noise covariance matrix that represents ideal observation conditions.

[0060] When the visual confidence is high, the observation noise covariance approaches zero and the filter performs a strong correction, so that the system closely follows the visual measurement. Conversely, when the visual confidence is low, the observation noise covariance increases and the filter performs only a weak correction, so that the system relies more on its own inertia and on the motor's kinematic prediction, effectively entering a pure kinematic prediction tracking mode. To ensure the numerical stability of the filtering process, practical implementations need to set a non-zero lower bound on the observation noise covariance, or add a very small positive definite matrix to the formula as a regularization term. Otherwise, at extremely high confidence the matrix becomes singular and the calculation fails.
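The qualitative behavior described above can be sketched with an assumed scaling law. The form `R = R0 * (1 - c) / c + eps` used here is an illustrative assumption only; the text does not reproduce the actual relationship, and the clamp `c_min` and regularizer `eps` are likewise hypothetical values.

```python
def adaptive_noise(r0_diag, confidence, c_min=0.05, eps=1e-6):
    """Diagonal of an adaptive observation noise covariance.

    Assumed form: R = R0 * (1 - c) / c + eps. High confidence drives R toward
    the small floor eps (strong correction); low confidence inflates R (weak
    correction). c is clamped to [c_min, 1] and eps keeps R strictly positive,
    so the matrix never becomes singular.
    """
    c = min(max(confidence, c_min), 1.0)
    scale = (1.0 - c) / c
    return [r * scale + eps for r in r0_diag]

def scalar_gain(p, r):
    """Kalman gain for a directly observed scalar state: K = P / (P + R)."""
    return p / (p + r)

high = adaptive_noise([4.0], confidence=1.0)  # near-zero noise, gain close to 1
low = adaptive_noise([4.0], confidence=0.1)   # large noise, gain close to 0
```

With a prior variance of 1.0, the gain is close to 1 at full confidence and falls below 0.1 at a confidence of 0.1, which matches the strong-correction and weak-correction regimes in the text.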

[0061] S7. Establish a dynamically updated dynamic template. When performing a target search, the dynamic template and the global identity baseline template are weighted and synthesized according to a preset ratio to generate a search template.

[0062] In some embodiments, step S7 may include: S71, establishing a dynamically updated dynamic template, wherein a linear weighted moving average strategy is adopted to fuse the features of the current frame into the dynamic template only when the visual confidence is higher than the preset high threshold.

[0063] The system establishes a feature pool containing two templates and applies a different update strategy to each. The static template, which is the previously built global identity baseline, remains unchanged throughout the entire tracking session and serves as the final line of defense for identity verification. The dynamic template changes over time: only when the visual confidence exceeds the preset high threshold does the system use a linear weighted moving average strategy to fuse the features of the current frame into the dynamic template, adapting to short-term changes such as pet pose and lighting. During target search, the system weights and synthesizes the features of the static and dynamic templates according to a preset ratio to generate the final search template.
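The two-template scheme can be sketched as follows. Templates are shown as flat feature lists for brevity; the threshold, update rate, and static weight are illustrative assumptions, not values from the text.

```python
def update_dynamic_template(dynamic, current, confidence,
                            high_threshold=0.8, rate=0.1):
    """Fuse current-frame features into the dynamic template.

    The update is a linear weighted moving average and is applied only when
    the visual confidence exceeds the high threshold; otherwise the template
    is returned unchanged.
    """
    if confidence <= high_threshold:
        return list(dynamic)
    return [(1.0 - rate) * d + rate * c for d, c in zip(dynamic, current)]

def build_search_template(static, dynamic, static_weight=0.6):
    """Blend the constant identity template with the dynamic one at a preset ratio."""
    return [static_weight * s + (1.0 - static_weight) * d
            for s, d in zip(static, dynamic)]
```

Gating the update on confidence keeps occluded or blurred frames out of the dynamic template, while the fixed static share in the blend bounds how far the search template can move away from the verified identity.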

[0064] S72 monitors the duration of target loss in real time and triggers a graded recovery mechanism based on the duration.

[0065] The system monitors the duration of target loss in real time and triggers a tiered recovery mechanism based on the duration. If the target loss time is short (less than a preset short-term threshold), the system determines it as temporary occlusion. In this case, it pauses updating the dynamic template, relies solely on Kalman filtering inertial prediction to control the gimbal, and actively expands the visual search range of the next frame (e.g., by reducing the scaling factor to cover a larger area). If the target loss time is long (reaching or exceeding the threshold), the system determines that the target has been completely lost. In this case, it stops motion prediction and initiates a global re-detection mechanism, returning to the initial steps to re-detect all potential targets in the entire frame and compare them one by one with the stored static templates for similarity. Once a match is successful, the system immediately resets the filter state, thus achieving closed-loop re-entry of tracking.
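The tiered recovery logic can be sketched as a small decision function. The frame-count threshold and the field names in the returned dictionary are hypothetical; the text only specifies the behavior of each tier.

```python
def recovery_action(lost_frames, short_term_threshold=15):
    """Choose the tracker behavior from how long the target has been lost.

    lost_frames counts consecutive frames without a confident detection.
    The threshold of 15 frames is an illustrative assumption.
    """
    if lost_frames == 0:
        # Target visible: normal tracking, template updates allowed.
        return {"mode": "track", "update_template": True,
                "expand_search": False, "global_redetect": False}
    if lost_frames < short_term_threshold:
        # Temporary occlusion: freeze the dynamic template, coast on the
        # Kalman prediction, and widen the search region for the next frame.
        return {"mode": "coast", "update_template": False,
                "expand_search": True, "global_redetect": False}
    # Target completely lost: stop motion prediction and re-detect globally.
    return {"mode": "redetect", "update_template": False,
            "expand_search": False, "global_redetect": True}
```

A caller would reset `lost_frames` to zero and reinitialize the filter state as soon as global re-detection finds a match against the static template.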

[0066] In some embodiments, the method further includes: S4, performing fused geometric preprocessing on the image pixel space.

[0067] This step can be executed after S3. It performs fused processing of the pixel space to address a core pain point of embedded devices: limited memory bandwidth. The method abandons the traditional multi-step serial "crop-pad-scale" pipeline and replaces it with a single fused geometric transformation. Note that "zero-copy generation" here specifically means generating the input tensors required by the neural network from the source image without allocating a separate image buffer for intermediate results such as cropping and scaling, thereby minimizing memory I/O at the application level.

[0068] In some embodiments, S4 may include: S41, establishing an inverse affine transformation model from the target tensor coordinate system to the source image coordinate system.

[0069] First, the system establishes an inverse affine transformation model from the target tensor coordinate system to the source image coordinate system. The physical predicted position obtained in step S34 is taken as the prediction center. Based on the input tensor size required by the neural network, the prediction center, and the scaling factor, the inverse mapping matrix of formula (4-1) can be constructed.

[0070] This matrix actually defines a virtual sampling path. For any pixel in the target output tensor, the system can precisely backtrack to its unique floating-point coordinates in the source image using this formula. This mechanism allows the system to completely avoid the step of physically cropping the entire region of interest (ROI).

[0071] S42 incorporates boundary padding into the image pixel reading process.

[0072] To seamlessly handle potential padding issues within the single-step operation, the system defines the effective physical domain $\Omega$ of the source image. A piecewise sampling function is then constructed to integrate the boundary padding logic into the pixel reading process. The logic is as follows:

[0073]

$$T(u, v) = \begin{cases} \mathrm{Interp}\big(I, x_s, y_s\big), & \mathcal{N}(x_s, y_s) \subseteq \Omega \\ C_{pad}, & \text{otherwise} \end{cases} \tag{4-2}$$

[0074] Here, $T(u, v)$ is the pixel value of the target input tensor, $I$ is the source image, $(x_s, y_s)$ are the source coordinates obtained from formula (4-1), and $\mathcal{N}(x_s, y_s)$ is the set of neighboring pixels required to perform the interpolation operation (such as bilinear interpolation). If $\mathcal{N}(x_s, y_s)$ lies completely within the effective image area $\Omega$, the interpolation is calculated normally; otherwise, the pixel is assigned the preset fill constant $C_{pad}$ (e.g., mean grayscale).

[0075] S43 iterates through the pixels of the image using the logical coordinates of the target tensor as the loop driver.

[0076] Based on the above model, the system needs only a single pixel-level traversal to complete the task. The traversal is driven by the logical coordinates of the target tensor rather than by the source image coordinates. At each step, the system calculates the source coordinates in real time using formula (4-1), and then performs the validity check and sampling using the logic of formula (4-2). The calculated pixel values are written directly to the corresponding address in the neural network input buffer. This compresses the traditional "read-write-read-write" sequence, which requires multiple memory reads and writes, into a single "read (partial)-write (final)" pass, reducing processing latency by avoiding the allocation, reading, and writing of independent intermediate image buffers.
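A pure-Python sketch of the single-pass traversal on a grayscale image stored as a list of rows. The inverse mapping used here, `x_s = (u - W/2) / s + c_x`, is an assumed form consistent with the description (a smaller scaling factor covers a larger source area); the pixel-center convention and all names are illustrative assumptions.

```python
import math

def fused_crop_resize(src, center, scale, out_w, out_h, pad_value=0.0):
    """Generate the network input in one pass: no cropped or padded intermediate image.

    src is a list of rows of floats; center = (cx, cy) in source pixels;
    scale maps source pixels to tensor pixels (assumed convention).
    """
    src_h, src_w = len(src), len(src[0])
    cx, cy = center
    out = [[pad_value] * out_w for _ in range(out_h)]
    for v in range(out_h):          # loop is driven by target tensor coordinates
        for u in range(out_w):
            # Assumed inverse affine mapping from tensor to source coordinates.
            xs = (u - out_w / 2.0) / scale + cx
            ys = (v - out_h / 2.0) / scale + cy
            x0, y0 = math.floor(xs), math.floor(ys)
            x1, y1 = x0 + 1, y0 + 1
            # Boundary padding folded into the read: if the bilinear
            # neighborhood is not fully inside the image, keep pad_value.
            if x0 < 0 or y0 < 0 or x1 > src_w - 1 or y1 > src_h - 1:
                continue
            ax, ay = xs - x0, ys - y0
            top = (1 - ax) * src[y0][x0] + ax * src[y0][x1]
            bottom = (1 - ax) * src[y1][x0] + ax * src[y1][x1]
            out[v][u] = (1 - ay) * top + ay * bottom
    return out
```

Each output element is computed from at most four source reads and written once, so the only buffers involved are the source image and the final tensor. A production version on an embedded target would typically implement the same loop in C or with SIMD intrinsics.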

[0077] In some embodiments, the method further includes the following steps, which may be performed after step S4: S51, feature extraction is performed based on Siamese networks, where the Siamese network includes a template branch and a search branch. The two branches share the same convolutional neural network weights and receive different inputs respectively.

[0078] The network consists of a template branch and a search branch. The two branches share the same set of convolutional neural network weights (such as lightweight backbone networks like MobileNetV3 and ShuffleNet), but they receive different inputs: the template branch receives the baseline / dynamic hybrid template maintained in subsequent steps, while the search branch receives the search region of the current frame generated in the previous steps.

[0079] S52 uses a cross-correlation operator to fuse template features and search features in spatial location to generate a response feature map containing target similarity information.

[0080] To enable real-time operation on low-computing-power chips, this invention employs lightweight cross-correlation mechanisms, such as pointwise correlation or depthwise correlation operators, in the feature fusion stage. These operators avoid full-dimensional matrix multiplication and instead perform channel-level fusion of template features and search features at spatial locations, thereby generating a response feature map that includes target similarity information.
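As a non-limiting illustration, a depthwise correlation operator can be sketched as follows. The function name `depthwise_xcorr`, the tensor layout (channels first), and the use of NumPy are assumptions for the example; the text does not specify an implementation.

```python
import numpy as np

def depthwise_xcorr(search, template):
    """Depthwise cross-correlation.

    search   : (C, Hs, Ws) features from the search branch
    template : (C, Ht, Wt) features from the template branch
    returns  : (C, Hs-Ht+1, Ws-Wt+1) response map, one channel per input channel
    """
    c, hs, ws = search.shape
    _, ht, wt = template.shape
    out = np.zeros((c, hs - ht + 1, ws - wt + 1), dtype=np.float32)
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = search[:, i:i + ht, j:j + wt]
            # Each channel is correlated only with its own template channel,
            # so no cross-channel matrix multiplication is needed.
            out[:, i, j] = (window * template).sum(axis=(1, 2))
    return out
```

The response peaks where the search features most resemble the template, and the cost grows linearly with the number of channels rather than quadratically.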

[0081] S53: The fused feature map is subjected to classification and regression processing to obtain the observation vector.

[0082] The fused feature maps are fed into the parallel task heads of classification and regression. At the network output, a coordinate post-processing module uses the previously recorded prediction center and scaling factor to directly restore the absolute pixel coordinates of the original image from the relative normalized offset output by the regression branch through linear mapping, ultimately outputting an accurate observation vector.
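As a non-limiting illustration, the linear mapping from the normalized regression output back to source-image pixels might look like the sketch below. The normalization convention (offsets and sizes expressed as fractions of the network input side), the function name `restore_box`, and its parameters are assumptions; the text only states that the recorded prediction center and scaling factor are used.

```python
def restore_box(offset, pred_center, scale, input_size):
    """Map a normalized (dx, dy, w, h) regression output back to source pixels.

    offset      : (dx, dy, w, h), assumed to be fractions of the network input side
    pred_center : (cx, cy) centre of the search region in source pixels
    scale       : source pixels per network-input pixel
    input_size  : side length of the square network input
    """
    dx, dy, w, h = offset
    k = scale * input_size   # number of source pixels spanned by the search region
    return (pred_center[0] + dx * k, pred_center[1] + dy * k, w * k, h * k)
```

Because the mapping is linear, it reuses the same center and scale that were recorded when the search region was sampled, and no additional image operation is required.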

[0083] This application does not start tracking blindly, but first ensures the uniqueness of the target through a cascaded identification process. Specifically, the system first roughly locates the pet's body in the image, and then precisely locates the more identifiable face within that area. Next, the system extracts the facial features and compares them with pre-registered pet identity profiles. Only when the correct pet is confirmed does the system start the tracker, using the target's current location, appearance, and other information as the anchor for all subsequent tracking. This mechanism of confirming identity before initiating tracking avoids tracking the wrong pet when multiple similar pets are present.
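As a non-limiting illustration, the identity check that gates the start of tracking can be sketched as a comparison of normalized feature vectors. The use of cosine similarity, the threshold value, and the names `match_identity` and `registry` are assumptions for the example.

```python
import numpy as np

def l2_normalize(v):
    v = np.asarray(v, dtype=np.float32)
    return v / (np.linalg.norm(v) + 1e-12)

def match_identity(face_feature, registry, threshold=0.6):
    """Return the registered pet id most similar to the feature, or None.

    registry  : dict mapping pet id -> registered feature vector
    threshold : minimum cosine similarity to accept a match (assumed value)
    """
    q = l2_normalize(face_feature)
    best_id, best_sim = None, threshold
    for pet_id, ref in registry.items():
        sim = float(np.dot(q, l2_normalize(ref)))   # cosine similarity
        if sim >= best_sim:
            best_id, best_sim = pet_id, sim
    return best_id
```

The tracker would be started only when `match_identity` returns the expected pet id; a `None` result means no registered pet was confirmed.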

[0084] Secondly, to address the problem of the camera losing its target during rapid rotation, this application also discloses a closed-loop tracking method that deeply integrates camera motion with the tracking algorithm. Its ingenuity lies in treating the camera's own motion not as a disturbance but as useful information. The system reads, in real time, motion data such as the angular velocity of the motor that drives the camera's rotation (i.e., the gimbal), and builds a mathematical model to predict the image displacement caused by this motion. It then feeds the predicted displacement to the tracking algorithm to counteract the image shake caused by the camera's own motion. In this way, the tracker remains stable and accurate even when the camera rotates at high speed to follow a pet.
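As a non-limiting illustration, the image displacement caused by gimbal rotation can be approximated with a small-angle pinhole model, in which a rotation of ω·Δt shifts the background by roughly f·ω·Δt pixels near the image center. The function name `predict_center`, the sign convention (a positive yaw or pitch rate moves the scene toward negative image coordinates), and the units are assumptions for the example.

```python
def predict_center(prev_center, prev_velocity, omega, focal_px, dt):
    """Predict the target centre in the current frame.

    prev_center   : (u, v) target centre in the previous frame, in pixels
    prev_velocity : (vu, vv) estimated target velocity, in pixels per second
    omega         : (yaw_rate, pitch_rate) gimbal feedback, in rad/s
    focal_px      : camera focal length expressed in pixels
    dt            : frame interval, in seconds
    """
    # Background drift caused by the camera's own rotation (small-angle model)
    drift_u = -focal_px * omega[0] * dt
    drift_v = -focal_px * omega[1] * dt
    # Superimpose the target's own motion and the background drift
    u = prev_center[0] + prev_velocity[0] * dt + drift_u
    v = prev_center[1] + prev_velocity[1] * dt + drift_v
    return (u, v)
```

For example, with a focal length of 800 pixels, a yaw rate of 0.5 rad/s and a 40 ms frame interval, the background shifts by about 16 pixels per frame, which is the correction applied to the tracking center.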

[0085] Furthermore, one of the core technologies of this invention is an image preprocessing method optimized for embedded devices (such as pet robots). Traditional methods typically require multiple steps such as cropping, scaling, and padding to prepare the input image needed by the neural network. These steps generate many temporary image buffers, which consume valuable memory and time. This invention bypasses that process through a reverse address mapping technique. It directly traverses every pixel of the small image to be generated and uses a mathematical formula to calculate, in reverse, where that point should be sampled in the source image. In this way, cropping, scaling, and boundary padding are completed in one step without any intermediate buffer, so that the image data is fed directly into the neural network and processing speed is greatly improved.

[0086] To enable rapid retrieval of a target after it has been inadvertently lost, this technology employs an efficient local re-identification strategy. When tracking fails, the system does not immediately initiate a time-consuming and resource-intensive global search. Instead, it utilizes a dynamically maintained memory pool containing the last few clear visual features of the target before its loss. Using these features, the system performs targeted and rapid identity matching within a small local area predicted based on the target's historical trajectory. Through this "prioritizing key areas" strategy, the system can quickly recapture the target with minimal computational overhead. Only when this local search also fails will a global search be initiated as a last resort.
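As a non-limiting illustration, the local-first recovery order can be sketched as follows. The class `MemoryPool`, its capacity, the cosine-similarity matching, the threshold, and the candidate format are assumptions for the example; the text does not specify them.

```python
from collections import deque
import numpy as np

class MemoryPool:
    """Keeps the last few clear target features (capacity is an assumed value)."""
    def __init__(self, capacity=5):
        self.features = deque(maxlen=capacity)   # oldest entries are dropped first

    def add(self, feature):
        f = np.asarray(feature, dtype=np.float32)
        self.features.append(f / (np.linalg.norm(f) + 1e-12))

    def best_similarity(self, feature):
        f = np.asarray(feature, dtype=np.float32)
        f = f / (np.linalg.norm(f) + 1e-12)
        return max((float(np.dot(f, m)) for m in self.features), default=0.0)

def recover(pool, local_candidates, global_search, threshold=0.7):
    """Match candidates from the predicted local area first; fall back to a
    global search only if no local candidate is similar enough."""
    best = max(local_candidates,
               key=lambda c: pool.best_similarity(c["feature"]),
               default=None)
    if best is not None and pool.best_similarity(best["feature"]) >= threshold:
        return "local", best["box"]
    return "global", global_search()
```

Here `global_search` stands for whatever full-frame detection routine the system provides; it is invoked only after the inexpensive local match has failed.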

[0087] Finally, to ensure that all the above algorithms can run smoothly on resource-constrained embedded AI chips, this technology also incorporates systematic hardware optimizations in the design of the neural network model. It intentionally selects activation functions that are particularly friendly to fixed-point operations (such as ReLU6) and avoids complex calculations that can easily lead to significant precision loss when converting from floating-point to fixed-point numbers. This hardware-centric approach from the design level ensures that the entire tracking algorithm can be deployed with extremely high efficiency and very low power consumption on various embedded devices.
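As a non-limiting illustration of why a bounded activation such as ReLU6 suits fixed-point arithmetic: its output always lies in [0, 6], so a single fixed scale maps it onto 8-bit integers without clipping. The 8-bit width and the function name `quantize_u8` are assumptions for the example.

```python
import numpy as np

def relu6(x):
    """ReLU6: clamp the input to the range [0, 6]."""
    return np.minimum(np.maximum(x, 0.0), 6.0)

def quantize_u8(x, scale=6.0 / 255.0):
    """Map ReLU6 outputs in [0, 6] to 8-bit integers using a fixed scale."""
    return np.round(relu6(x) / scale).astype(np.uint8)
```

With an unbounded activation, the scale would have to be calibrated per layer from observed data, and rare large values would either be clipped or would reduce the resolution available to typical values.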

[0088] This application also provides a pet robot that performs pet identification using any of the methods described in the above embodiments.

[0089] The pet robot may be, but is not limited to, a cat litter box, pet feeding equipment, or the like.

[0090] The foregoing has described in detail several embodiments of this application, but this application is not limited to these specific embodiments. Those skilled in the art can make various variations and modifications based on the concept of this application, and all such variations and modifications should fall within the scope of protection claimed in this application.

Claims

1. A pet tracking method, characterized in that it comprises: performing preliminary multi-level target screening on the acquired video stream, and extracting normalized feature vectors based on a recognition network; performing zero-speed initialization of the motion state of the tracking system, and generating a constant global identity reference template based on the target position locked during initialization; determining the amount of image background drift caused by gimbal rotation, and correcting the tracking center based on the amount of image background drift to counteract the visual jitter caused by gimbal self-motion; adaptively fusing the motor information of the motor that drives the rotation with the visual information of the video stream; and establishing a dynamically updated dynamic template and, when performing a target search, weighting and combining the dynamic template with the global identity reference template according to a preset ratio to generate a search template.

2. The method according to claim 1, characterized in that performing preliminary multi-level target screening on the acquired video stream and extracting normalized feature vectors based on the recognition network comprises: capturing the video stream using a camera mounted on a gimbal; performing first-level screening using a first target retrieval network to identify all potential pet categories and their body bounding boxes in the scene; performing, based on the results of the first-level screening, second-level screening using a second target retrieval network to determine a facial image, wherein the screening accuracy of the second target retrieval network is higher than that of the first target retrieval network; and processing the facial image output by the second-level screening using a re-identification network to extract normalized feature vectors.

3. The method according to claim 1, characterized in that performing zero-speed initialization of the motion state of the tracking system and generating a constant global identity reference template based on the target position locked during initialization comprises: constructing the initial state vector of a Kalman filter based on the initial visual detection box, and setting the initial motion state of the target to zero velocity and zero acceleration on the pixel plane; and cropping a region from the image centered on the initially locked target position and extracting features within the region to generate the constant global identity reference template, thereby suppressing semantic drift of the tracking model caused by changes in pet pose and lighting.

4. The method according to claim 1, characterized in that determining the amount of image background drift caused by gimbal rotation and correcting the tracking center based on the amount of image background drift to counteract the visual jitter caused by gimbal self-motion comprises: reading the angular velocity feedback of the gimbal motor in real time, the angular velocity feedback including a control input vector of the angular velocities in the pitch and yaw directions; determining, based on the focal length of the camera capturing the video stream and the frame interval of the video, a mathematical model describing the camera's own motion, and establishing the mapping relationship between the camera's physical control variables and the image pixel space; and vector-superimposing the target velocity estimated at the previous moment with the background drift calculated in the previous step to determine the physically predicted position of the target pet in the image in the current frame.

5. The method according to claim 1, characterized in that the method further comprises: performing fused geometric preprocessing on the image pixel space.

6. The method according to claim 5, characterized in that the fused geometric preprocessing of the image pixel space comprises: establishing an inverse affine transformation model from the target tensor coordinate system to the source image coordinate system; incorporating boundary padding into the image pixel reading process; and traversing the pixels of the image, driven by the logical coordinates of the target tensor.

7. The method according to claim 1, characterized in that the method further comprises: performing feature extraction based on a Siamese network, wherein the Siamese network includes a template branch and a search branch, and the two branches share the same convolutional neural network weights and receive different inputs; using a cross-correlation operator to fuse the template features and the search features at spatial locations to generate a response feature map containing target similarity information; and performing classification and regression on the fused feature map to obtain the observation vector.

8. The method according to claim 1, characterized in that adaptively fusing the motor information of the motor that drives the rotation with the visual information of the video stream comprises: performing Kalman filtering with the feedback from the motor as input, so as to make a prior prediction of the target state and compensate for the state deviation caused by the camera's own motion; and adaptively fusing, based on visual confidence, the motor information of the motor that drives the rotation with the visual information of the video stream.

9. The method according to claim 1, characterized in that establishing a dynamically updated dynamic template and, when performing a target search, weighting and combining the dynamic template with the global identity reference template according to a preset ratio to generate a search template comprises: establishing the dynamically updated dynamic template, wherein a linear weighted moving average strategy is used to fuse the features of the current frame into the dynamic template only when the visual confidence is higher than a preset high threshold; and monitoring the duration of target loss in real time and triggering a tiered recovery mechanism based on that duration.

10. A pet robot, characterized in that it performs pet identification using the method according to any one of claims 1 to 9.