Methods, apparatuses, and systems for tracking a target in a sequence of video frames

By switching the operating mode of the target tracking filter based on motion vectors from the video encoder, the invention addresses the poor response of prior-art target tracking models to changes in speed and direction, and achieves state estimation that is both smooth and fast to respond.

CN122244088A · Pending · Publication Date: 2026-06-19 · AXIS

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
AXIS
Filing Date
2025-12-12
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing target tracking filters struggle to provide both smooth state estimation and fast response when faced with sudden changes in target velocity and orientation, and common motion models often perform poorly in certain situations.

Method used

One approach involves a tracking filter operating in two modes: a first mode estimates the target's position and velocity, while a second mode estimates the target's position, velocity, and acceleration. The mode switch is triggered when the velocity estimated by the filter differs from the velocity indicated by motion vectors from the video encoder by more than a threshold. Switching modes changes the state-space model, and with it the motion model, used by the filter.

Benefits of technology

It achieves rapid response when the target's velocity and direction change suddenly, while providing smooth state estimation under steady conditions, thus improving the accuracy and robustness of target tracking.

✦ Generated by Eureka AI based on patent content.


Abstract

A method, apparatus, and system are provided for tracking a target in a sequence of video frames using a tracking filter that can operate in two modes. In a first mode, the tracking filter estimates a state vector describing the position and velocity of the target in the video frame. In a second mode, the tracking filter estimates a state vector describing the position, velocity, and acceleration of the target in the video frame. The method includes switching from the first mode to the second mode. The switching is performed in response to a velocity difference between the target velocity estimated by the tracking filter and the target velocity indicated by a motion vector from a video encoder being greater than a threshold.

Description

Technical Field

[0001] This invention relates to the field of target tracking in video. Specifically, this invention relates to methods, apparatus, and systems for tracking targets in a sequence of video frames.

Background Technology

[0002] Tracking targets in video is a common task in computer vision. For example, in surveillance applications, it is of interest to track people in a monitored scene through video. This can be used to count people, or to issue alerts when someone enters a restricted area or lingers in a certain area of the scene for too long.

[0003] Target tracking typically involves using a tracking filter that estimates the state of the target, such as its position, velocity, and acceleration. The tracking filter is a recursive algorithm that evolves a state vector describing the target's state as the target is detected in a video. To this end, the tracking filter uses a motion model to predict the target's state vector at future time points and a measurement model to update the predicted state vector as the target is detected in video frames at those future time points.

[0004] The performance of a tracking filter is particularly dependent on the motion model being used. Common choices are constant velocity models, where the tracking filter tracks the target's position and velocity under the assumption of constant velocity, and constant acceleration models, where the tracking filter additionally tracks the target's acceleration under the assumption of constant acceleration. Each of these motion models has its advantages and disadvantages. On the one hand, constant velocity models perform well and provide smooth state estimates when the target moves at approximately constant velocity. However, they struggle to follow the target when it suddenly changes its velocity or direction, potentially leading to loss of the target's track. On the other hand, constant acceleration models are better suited to sudden changes in the target's velocity and direction. However, they tend to provide noisier state estimates because they are also more susceptible to variations caused by noise in the detections. Accordingly, it can be difficult to choose which motion model to use in the tracking filter. Therefore, there is room for improvement.

Summary of the Invention

[0005] In view of the above, the object of the present invention is to alleviate the above problems and to provide a tracking filter that can adapt to sudden changes in the velocity and orientation of the target while still providing smooth state estimation.

[0006] According to the first aspect, the above objective is achieved by a method for tracking a target in a sequence of video frames, the method comprising:

[0007] tracking the target in the sequence of video frames by operating a tracking filter in a first mode, in which the tracking filter estimates a state vector describing the position and velocity of the target in the video frames,

[0008] wherein the tracking filter estimates the state vector at a current time point based on the state vector at an earlier time point and a detection of the target in the current video frame corresponding to the current time point;

[0009] receiving, from a video encoder that encodes the sequence of video frames, one or more motion vectors that indicate the velocity in the target detection region of the current video frame corresponding to the detection of the target;

[0010] determining the velocity difference between the velocity in the state vector at the earlier time point and the velocity in the target detection region of the current video frame, as indicated by the one or more motion vectors; and

[0011] in response to the velocity difference being greater than a velocity threshold, switching, at the current time point, the tracking filter from operating in the first mode to operating in a second mode, in which the tracking filter estimates a state vector describing the position, velocity, and acceleration of the target in the video frames.

[0012] Therefore, the proposed tracking filter can operate in two modes. In the first mode, the tracking filter estimates a state vector describing the target's position and velocity. By omitting acceleration states in the state vector in the first mode, the tracking filter becomes less responsive to sudden changes in the target's velocity and orientation. This is advantageous when the target does not undergo large velocity or orientation changes, as the tracking filter also responds less to changes caused by noise in the detection, thus providing a smoother state estimate. In the second mode, the state vector additionally includes the target's acceleration. Acceleration states make the tracking filter more responsive to sudden changes in the target's velocity and orientation, and are therefore advantageous when the target undergoes sudden velocity or orientation changes.

[0013] The inventors have recognized that by switching between operating the tracking filter in a first mode and operating it in a second mode, the advantages of both modes can be achieved while minimizing their disadvantages. The idea is to use the first mode when the target does not undergo large velocity changes to achieve the advantage of smooth state estimation, and to switch to the second mode when a sudden change in the target velocity is detected to achieve the advantage of being more responsive to velocity changes.

[0014] To detect sudden changes in target velocity, secondary motion information in the form of motion vectors from a video encoder that encodes the video sequence in which the target is tracked is used. These motion vectors provide an indication of the target's current velocity. By comparing the velocity indicated by the motion vectors from the video encoder with the velocity estimate from the tracking filter, it is possible to obtain an early indication when the target's velocity suddenly begins to change. When this occurs, the tracking filter is switched from a first mode to a second mode.
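The switching criterion described above can be sketched as follows. This is a minimal illustration only: it assumes velocities are 2D tuples in pixels per second and that the velocity difference is measured with the Euclidean norm. The function names and the choice of norm are assumptions for illustration and are not taken from the patent text.

```python
import math


def velocity_difference(v_filter, v_encoder):
    """Euclidean norm of the difference between two 2D velocities."""
    return math.hypot(v_encoder[0] - v_filter[0], v_encoder[1] - v_filter[1])


def should_switch_to_second_mode(v_filter, v_encoder, velocity_threshold):
    """True when the encoder-indicated velocity deviates from the filter's
    velocity estimate by more than the threshold."""
    return velocity_difference(v_filter, v_encoder) > velocity_threshold
```

With a threshold of 5 pixels per second, for example, a filter velocity of (10, 0) and an encoder velocity of (0, 10) would trigger the switch, whereas an encoder velocity of (10, 0.5) would not.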

[0015] A tracking filter is an algorithm that uses a series of noisy measurements observed over time to generate estimates of unknown variables. Tracking filters can also be called statistical motion filters. For example, a tracking filter can be a Kalman filter, an extended Kalman filter, or a statistical particle filter. Unknown variables can include the target's position, velocity, and possibly acceleration. These unknown variables are also called states and can be arranged in a state vector.

[0016] The tracking filter can operate in both a first mode and a second mode. The difference between the two modes is that the tracking filter uses different state-space models to model the state vector. In the first mode, the state-space model models the target's position and velocity, but not its acceleration. Therefore, in the first mode, the state vector includes position and velocity states, but no acceleration state. In the second mode, the state-space model models the target's position, velocity, and acceleration. Therefore, in the second mode, the state vector includes position, velocity, and acceleration states.

[0017] Position, velocity, and acceleration typically refer to the position, velocity, and acceleration of an object in the image plane of a video frame. Position refers to the pixel position of the object in the video frame. Velocity refers to the pixel velocity of the object in the video frame, i.e., the change in pixel position per unit of time. Acceleration refers to the pixel acceleration of the object in the video frame, i.e., the change in pixel velocity per unit of time. However, in embodiments, tracking is preferably performed in a two-dimensional or three-dimensional coordinate system of the scene depicted by a sequence of video frames (e.g., in a two-dimensional top-down map of the scene or a three-dimensional map of the scene). In such embodiments, position, velocity, and acceleration refer to position, velocity, and acceleration in the two-dimensional or three-dimensional coordinate system of the scene.

[0018] Switching from operating the tracking filter in the first mode to operating the tracking filter in the second mode means that when estimating the state vector of the target, the tracking filter switches from using the state-space model of the first mode to using the state-space model of the second mode.

[0019] Switching from operating the tracking filter in the first mode to operating it in the second mode at the current time point may include expanding the state vector at the current time point using the acceleration state describing the target's acceleration. Therefore, at the current time point when a velocity difference is detected, the acceleration state is added to the state vector.

[0020] When an acceleration term is added to the state vector at the current time point, it must be initialized, i.e., set to its initial value. In one embodiment, the acceleration state is initialized to zero at the current time point. In this case, due to the state evolution of the tracking filter, the acceleration state will adapt to the target's acceleration when detection is performed in future frames.

[0021] In another embodiment, the acceleration state is initialized at the current time point to the acceleration corresponding to the velocity difference. The velocity difference provides an indication of how the target's velocity has changed between an earlier time point and the current time point. Therefore, dividing the velocity difference by the time difference between the current time point and an earlier time point provides an estimate of the target's acceleration. By making an initial estimate of acceleration in this way and using it to initialize the acceleration state, the tracking filter will adapt to the target's acceleration more quickly than if the acceleration state were simply initialized to zero.

[0022] In another embodiment, the acceleration state is initialized such that the state vector from an earlier time point, together with the acceleration state, produces a predicted state vector for the current time point, which matches the detection of the target in the current video frame. This is an alternative way to initially estimate the acceleration state, and compared to simply initializing the acceleration state to zero, this approach allows the tracking filter to adapt to the target's acceleration more quickly.

[0023] In each of these embodiments, the extended state vector is used when estimating the state vector after the current time point. Therefore, the switch from the first mode to the second mode occurs at the current time point through the extension of the state vector, and the tracking filter then operates in the second mode at least after the current time point by estimating the extended state vector.
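The three initialization options for the acceleration state can be sketched as below. This is an illustrative sketch under stated assumptions: the state vector is assumed to be a flat tuple (px, py, vx, vy), the detection is assumed to be a 2D position, and the third option assumes the constant acceleration prediction p + v·dt + a·dt²/2 for the position. All function names are hypothetical. Extending the filter's covariance matrix, which a Kalman filter would also require, is omitted.

```python
def extend_state(state, accel=(0.0, 0.0)):
    """Append an acceleration state to a (px, py, vx, vy) state vector.
    The default corresponds to initializing the acceleration to zero."""
    px, py, vx, vy = state
    return (px, py, vx, vy, accel[0], accel[1])


def accel_from_velocity_difference(v_filter, v_encoder, dt):
    """Acceleration estimate: velocity difference divided by the time
    between the earlier and the current time point."""
    return ((v_encoder[0] - v_filter[0]) / dt,
            (v_encoder[1] - v_filter[1]) / dt)


def accel_matching_detection(state, detection, dt):
    """Acceleration for which the constant acceleration position prediction
    p + v*dt + 0.5*a*dt^2 lands exactly on the detected position."""
    px, py, vx, vy = state
    zx, zy = detection
    return (2.0 * (zx - px - vx * dt) / dt ** 2,
            2.0 * (zy - py - vy * dt) / dt ** 2)
```

For instance, `extend_state(state, accel_from_velocity_difference(v_filter, v_encoder, dt))` would produce the extended state vector of the second option.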

[0024] The tracking filter uses a motion model to predict the state vector from one point in time to the subsequent point in time. This motion model is a constant velocity motion model in the first mode and a constant acceleration motion model in the second mode. Therefore, switching from the first mode to the second mode upon detecting a sudden change in velocity can involve switching the motion model. By using a constant velocity motion model in the first mode, the tracking filter is less sensitive to noise in the detections and provides a smooth state vector estimate. By using a constant acceleration motion model in the second mode, the tracking filter is more responsive to velocity changes.

[0025] Furthermore, in each of the first and second modes, when the target is detected in the video frame corresponding to a subsequent time point, the tracking filter updates the predicted state vector for that time point based on the detection. Therefore, at time points where no detection of the target is available, the estimated state vector equals the state vector predicted using the motion model of the currently applied mode, whereas at time points where a detection is available, the estimated state vector also takes the detection into account. In this way, when the target deviates from the motion model, the tracking filter adapts its estimated state vector to the target's actual motion.
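The predict-then-update behavior can be illustrated with a small Kalman filter step for a single image axis in the first mode. This is a sketch under assumptions not stated in the text: the state is (position, velocity), only the position is measured, the process noise covariance is simplified to a diagonal matrix with a single value `q`, and the update uses the standard Kalman gain equations. The function name `step` and its parameters are hypothetical.

```python
def step(x, P, dt, q, r, z=None):
    """One filter step for state x = (p, v) with 2x2 covariance P.
    z is the measured position, or None when no detection is available."""
    p, v = x
    # Predict with the constant velocity motion model.
    p, v = p + dt * v, v
    P = [[P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + q,
          P[0][1] + dt * P[1][1]],
         [P[1][0] + dt * P[1][1],
          P[1][1] + q]]
    if z is None:
        # No detection: the estimate equals the prediction.
        return (p, v), P
    # Update with the detection (scalar position measurement).
    s = P[0][0] + r                    # innovation variance
    k0, k1 = P[0][0] / s, P[1][0] / s  # Kalman gain
    y = z - p                          # innovation
    p, v = p + k0 * y, v + k1 * y
    P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
         [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
    return (p, v), P
```

Calling `step` without `z` returns the pure prediction, while passing a detection pulls both the position and the velocity estimate toward the measurement.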

[0026] Video encoders typically determine motion vectors for pixel blocks (e.g., 8×8 or 16×16 pixel blocks) within a video frame. The target detection region in the current video frame may cover more than one pixel block and is then associated with multiple motion vectors. In this case, the velocity indicated by the one or more motion vectors in the target detection region of the current video frame can be the velocity indicated by a representative motion vector of the one or more motion vectors. For example, the representative motion vector could be the average motion vector.

[0027] The one or more motion vectors may correspond to the displacement between the current video frame and a reference frame used by the video encoder during encoding, and the method may further include: calculating the velocity indicated by the one or more motion vectors by dividing the displacement by the time distance between the current video frame and the reference frame. In this way, the displacement given by the motion vectors is converted into a velocity, which can be used to determine the velocity difference relative to the velocity in the state vector at the earlier time point.
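Averaging the motion vectors in the target detection region and converting the result to a velocity might look as follows. The data layout is an assumption made for illustration: motion vectors are given as a dictionary from the top-left pixel of each block to a (dx, dy) displacement in pixels, already oriented in the direction of motion, and the bounding box is (x0, y0, x1, y1). The function names are hypothetical.

```python
def block_overlaps(block_xy, block_size, box):
    """True if the square block overlaps the bounding box (x0, y0, x1, y1)."""
    bx, by = block_xy
    x0, y0, x1, y1 = box
    return (bx < x1 and bx + block_size > x0 and
            by < y1 and by + block_size > y0)


def region_velocity(motion_vectors, box, block_size, frame_distance_s):
    """Average the displacements of all blocks overlapping the box and
    divide by the time distance to the reference frame (in seconds)."""
    selected = [mv for xy, mv in motion_vectors.items()
                if block_overlaps(xy, block_size, box)]
    if not selected:
        return None  # no motion vectors available for this region
    mean_dx = sum(mv[0] for mv in selected) / len(selected)
    mean_dy = sum(mv[1] for mv in selected) / len(selected)
    return (mean_dx / frame_distance_s, mean_dy / frame_distance_s)
```

The average is only one possible representative; a median would be a more outlier-resistant alternative, at the cost of a slightly more involved computation.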

[0028] The method may further include: monitoring the acceleration in the estimated state vector while the tracking filter is operating in the second mode, and switching the tracking filter from operating in the second mode to operating in the first mode in response to the acceleration in the estimated state vector having stabilized below an acceleration threshold. Thus, when the acceleration has stabilized at a low level (an indication that the target's velocity, to which the tracking filter needs to adapt, is not currently changing significantly), the tracking filter can switch back to the first mode. Therefore, when the target's acceleration is low, the tracking filter is controlled to operate in the first mode to obtain the benefit of smooth state estimation, and when the target's acceleration is high, the tracking filter is controlled to operate in the second mode to obtain the benefit of responsiveness to velocity changes.

[0029] When the acceleration in the state vector has been below the acceleration threshold for a predetermined time period, the acceleration can be determined to have stabilized at a level below the acceleration threshold. By requiring that the acceleration has been below the acceleration threshold for a predetermined time period, the tracking filter can be prevented from switching back and forth between the first mode and the second mode when the acceleration is temporarily below the acceleration threshold.
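The hold-time requirement can be sketched as a small monitor that reports when switching back to the first mode is allowed. The class name `SwitchBackMonitor`, the use of the Euclidean magnitude of a 2D acceleration, and timestamps in seconds are assumptions for illustration.

```python
import math


class SwitchBackMonitor:
    """Tracks how long the estimated acceleration has stayed below a threshold."""

    def __init__(self, accel_threshold, hold_time_s):
        self.accel_threshold = accel_threshold
        self.hold_time_s = hold_time_s
        self._below_since = None  # time at which acceleration last dropped below

    def update(self, accel, t):
        """Return True once the acceleration magnitude has been below the
        threshold continuously for at least hold_time_s."""
        magnitude = math.hypot(accel[0], accel[1])
        if magnitude >= self.accel_threshold:
            self._below_since = None  # reset: acceleration is high again
            return False
        if self._below_since is None:
            self._below_since = t
        return (t - self._below_since) >= self.hold_time_s
```

A brief dip below the threshold resets as soon as the acceleration rises again, which is what prevents the filter from toggling between modes.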

[0030] Switching from operating the tracking filter in the second mode to operating it in the first mode involves removing the acceleration states describing the target's acceleration from the state vector. It may further include switching from using a constant acceleration motion model to using a constant velocity motion model.

[0031] According to the second aspect, the aforementioned objective is achieved by means of an apparatus for tracking a target in a sequence of video frames. This apparatus includes circuitry configured to perform the steps of the method of the first aspect.

[0032] According to the third aspect, the above objective is achieved by a system. The system includes: a video encoder configured to encode a sequence of video frames and generate motion vectors indicating velocity in different regions of the video frames; a target detector configured to detect targets in the sequence of video frames; and an apparatus for tracking targets in the sequence of video frames according to the second aspect, wherein the apparatus receives motion vectors from the video encoder and target detections from the target detector.

[0033] According to the fourth aspect, the above objective is achieved by a non-transitory computer-readable medium including computer program code, which, when executed by a processing-capable device, causes the device to implement the method of the first aspect.

[0034] The second, third, and fourth aspects can generally have the same features and advantages as the first aspect. It should be further noted that, unless otherwise explicitly stated, the present invention relates to all possible combinations of features.

Attached Figure Description

[0035] The above and additional objects, features and advantages of the invention will be better understood from the following illustrative and non-limiting detailed description of embodiments of the invention with reference to the accompanying drawings, in which the same reference numerals will be used for similar elements, in which:

[0036] Figure 1 shows a scene captured by a camera.

[0037] Figure 2 schematically illustrates a system for tracking a target in a sequence of video frames according to an embodiment.

[0038] Figure 3 schematically illustrates the performance of the tracking filter when operating only in the first mode and only in the second mode.

[0039] Figure 4 is a flowchart of a method for tracking a target in a sequence of video frames according to an embodiment.

[0040] Figure 5 illustrates the tracking filter estimating the state vector at consecutive time points while switching between two operating modes, according to an embodiment.

[0041] Figure 6 schematically depicts the motion vectors in a video frame where a target is detected in the target detection region.

[0042] Figure 7 schematically illustrates the acceleration estimated by the tracking filter as a function of time.

[0043] Figure 8 schematically depicts the performance of the tracking filter when switching between a first mode and a second mode according to an embodiment.

[0044] Figure 9 shows an apparatus for tracking a target in a sequence of video frames according to an embodiment.

Detailed Implementation

[0045] In the following description, the invention will now be described more fully with reference to the accompanying drawings, in which embodiments of the invention are illustrated.

[0046] Figure 1 shows a camera 100 monitoring an exemplary scene 102 containing a moving target 104 that should be tracked over time. The target 104 may be, for example, a person or a vehicle. The camera 100 is a video camera that captures a sequence of video frames depicting scene 102.

[0047] The sequence of video frames captured by camera 100 can be used by the system 200 shown in Figure 2 to track target 104. System 200 may be implemented entirely within camera 100, partially within camera 100, or may be provided separately from camera 100. System 200 includes a video encoder 202, an apparatus 204 for tracking a target in a sequence of video frames, and a target detector 206. Apparatus 204 will be referred to hereinafter as tracker 204.

[0048] A sequence of video frames 210 is provided as input to system 200. Video encoder 202 encodes the sequence of video frames 210 using motion compensation. As known in the art, motion compensation is a technique for predicting a video frame in a video sequence relative to a reference video frame, which typically corresponds to another video frame in the sequence (e.g., the preceding video frame, the following video frame, or both). When the reference frame is a preceding video frame in the video sequence, the encoded video frame is called a P-frame, and when the reference frames include a following video frame, the encoded video frame is called a B-frame. For example, when a video frame captured at a given time point is encoded as a P-frame, the reference frame may be the directly preceding video frame in the video sequence, captured at the preceding time point. Motion compensation is implemented by most video coding standards, such as H.26x, AV1, and VP9. To perform motion compensation on the video frame to be encoded, the video encoder 202 determines motion vectors for pixel blocks in the video frame by performing a motion vector search in the reference video frame. More specifically, for a pixel block to be encoded, the video encoder 202 searches within a search window in the reference video frame for the best-matching pixel block according to some criterion. For example, the best-matching pixel block can be the one that minimizes the sum of absolute differences relative to the block to be encoded. The motion vector of the pixel block is then the vector from the pixel block in the video frame to the best-matching block in the reference frame found during the motion vector search. The motion vector can therefore be viewed as the displacement of the image content in the pixel block of the video frame to be encoded relative to the reference frame. In other words, the motion vector indicates the velocity in the pixel block of the video frame; that is, it provides a measure of how much and in which direction the image content in the pixel block has moved during the time interval between the video frame and the reference frame.
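The motion vector search described above can be illustrated with an exhaustive block-matching search that minimizes the sum of absolute differences (SAD). This is a simplified sketch, not how a production encoder works: frames are assumed to be lists of rows of grayscale integers, the search window is a square of ±`search_range` pixels, and real encoders typically use faster search strategies and sub-pixel refinement. All function names are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, x, y, size):
    """Extract the size x size block whose top-left pixel is (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]


def motion_vector(current, reference, x, y, size, search_range):
    """Return (dx, dy) from the block at (x, y) in the current frame to the
    best-matching block in the reference frame."""
    height, width = len(reference), len(reference[0])
    target = get_block(current, x, y, size)
    best = (0, 0)
    best_cost = sad(target, get_block(reference, x, y, size))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = x + dx, y + dy
            # Skip candidate blocks that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(target, get_block(reference, rx, ry, size))
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best
```

Because the returned vector points from the current block to its match in the reference frame, the image content itself has moved in the opposite direction when the reference frame is the earlier one.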

[0049] The sequence of video frames 210 is also provided as input to the target detector 206, which detects targets in the video frames. This includes identifying and locating specific targets in the video frames by determining the region (referred to herein as the target detection region) where the target is located. The target detection region can, for example, be given as a bounding box around the detected target. For this purpose, the target detector 206 can implement existing object detection algorithms. One object detection approach uses a convolutional neural network (CNN) to analyze spatial features across an image and identify patterns associated with different object categories. Object detection models are generally divided into two categories: single-stage detectors and two-stage detectors. Single-stage detectors, such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector), predict bounding boxes and class labels directly in a single pass through the network, making them efficient for real-time applications. Two-stage detectors, such as Faster R-CNN (Region-Based Convolutional Neural Network), divide detection into two steps: generating region proposals and then classifying these proposals, which tends to give higher accuracy but slower processing. In practice, the image is first passed through a CNN that generates feature maps. For single-stage models like YOLO, the network directly outputs the bounding box coordinates, class label, and confidence score for each detected object. For two-stage models like Faster R-CNN, a region proposal network generates initial bounding boxes, which are then further processed by a classification network to refine the predictions. Training an object detection model requires a large dataset annotated with bounding boxes and class labels for each object. The model learns to minimize a multi-part loss function that combines a classification loss (for correct labeling) and a localization loss (for accurate bounding box prediction). During training, data augmentation techniques such as scaling, cropping, and flipping are typically used to improve the model's robustness and generalization ability.

[0050] The target detector 206 outputs a target detection 216 indicating the location of a target detected in a video frame and preferably also indicating its category label. For example, the target detection 216 may correspond to a set of target detection regions (e.g., bounding boxes) that indicate the detected targets and their locations within the video frame and have associated category labels and confidence scores.

[0051] The motion vectors 214 generated by video encoder 202 during the encoding process and the target detections 216 from target detector 206 are provided as inputs to tracker 204. Tracker 204 tracks the targets detected by target detector 206 over time in the sequence of video frames 210. For this purpose, tracker 204 uses a tracking filter, also known as a statistical motion filter. For example, it can use a Kalman filter, an extended Kalman filter, or a statistical particle filter. The tracking filter is an algorithm that uses a series of noisy measurements observed over time (in this case, the target detections 216 provided by target detector 206) to generate and output estimates of unknown variables. The unknown variables may include the target's position, velocity, and possibly acceleration. These unknown variables are also called states and may be arranged in a state vector.

[0052] The evolution of states in the state vector across consecutive time points is modeled using a state-space model that includes both a motion model and a measurement model. The motion model describes how the state vector evolves from a time point $t_{k-1}$ to the subsequent time point $t_k$. That is, it models the dynamics of the target. For a linear motion model, this can be expressed as:

[0053] $x_k = F_k x_{k-1} + w_k$ (Equation 1)

[0054] where $x_k$ is the state at time $t_k$, $F_k$ is the state transition matrix applied to the previous state vector $x_{k-1}$, and $w_k$ is process noise. If a Kalman filter is used, the process noise is assumed to follow a zero-mean multivariate Gaussian distribution with covariance matrix $Q_k$, where the covariance matrix $Q_k$ depends on the time interval between the current time point $t_k$ and the previous time point $t_{k-1}$. Process noise is sometimes also called system noise.

[0055] The measurement model models how the measurement $z_k$ at time $t_k$ is related to the state vector $x_k$ at time $t_k$. In the case of a linear measurement model, this can be expressed as:

[0056] $z_k = H_k x_k + v_k$ (Equation 2)

[0057] where $H_k$ is the matrix relating the state vector $x_k$ at time $t_k$ to the measurement $z_k$, and $v_k$ is additive measurement noise. If a Kalman filter is used, the measurement noise is assumed to follow a zero-mean multivariate Gaussian distribution with covariance matrix $R_k$.

[0058] Tracker 204 has a tracking filter that can operate in two modes. The tracking filter operates in one mode at a time but, as will be explained, it can switch between the modes. The state-space model of the tracking filter differs between the two modes: in the first mode a first state-space model is used, and in the second mode a second state-space model is used. The first state-space model models the target's position and velocity. The state vector in the first mode therefore describes the target's position and velocity, but not its acceleration. Because the first state-space model does not model acceleration, the target is assumed to have no acceleration and therefore to move at a constant velocity. The motion model of the first state-space model can therefore be called a constant velocity model. That is, the state transition matrix of the first state-space model is designed such that the velocity terms of the state vector remain unchanged when the matrix is applied to the state vector. Such a state transition matrix can readily be constructed from Newton's equations of motion under the assumption of constant velocity.

[0059] The second state-space model, used in the second mode, models the target's position, velocity, and acceleration. The state vector in the second mode therefore describes the target's position, velocity, and acceleration. In the second state-space model, the target is assumed to move with constant acceleration, so its motion model can be called a constant acceleration model. That is, the state transition matrix of the second state-space model is designed such that the acceleration terms of the state vector remain unchanged when the matrix is applied to the state vector. Such a state transition matrix can readily be constructed from Newton's equations of motion under the assumption of constant acceleration.
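For a single image axis, the two state transition matrices follow directly from Newton's equations of motion. The sketch below assumes a state ordering of (position, velocity) in the first mode and (position, velocity, acceleration) in the second mode. The function names and the plain list-of-lists matrix representation are illustrative choices, not taken from the source.

```python
def cv_transition(dt):
    """Constant velocity model for state (p, v): p' = p + v*dt, v' = v."""
    return [[1.0, dt],
            [0.0, 1.0]]


def ca_transition(dt):
    """Constant acceleration model for state (p, v, a):
    p' = p + v*dt + 0.5*a*dt^2, v' = v + a*dt, a' = a."""
    return [[1.0, dt, 0.5 * dt * dt],
            [0.0, 1.0, dt],
            [0.0, 0.0, 1.0]]


def predict(F, x):
    """Apply the state transition matrix F to the state vector x."""
    return [sum(f * xi for f, xi in zip(row, x)) for row in F]
```

Applying `cv_transition` leaves the velocity entry untouched, and applying `ca_transition` leaves the acceleration entry untouched, which is the property each model is designed to have.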

[0060] Each of these two models has its own advantages and disadvantages, as illustrated in Figure 3. The target's true but unknown trajectory 302 in the image plane is shown as a solid line, and the noisy measurements 304 of the target's position are indicated by crosses. In this example, it is assumed that the target initially moves at a constant speed, but then suddenly changes its direction of motion. Dashed curve 306 illustrates the tracking result if the tracking filter operates only in the first mode, and dotted curve 308 illustrates the tracking result if it operates only in the second mode. As can be seen, the first operating mode works well as long as the target moves at a constant speed. However, when the target suddenly changes its direction, the tracking filter responds very slowly and struggles to follow the target's true motion. In contrast, the second operating mode works less well when the target moves at a constant speed because it adapts more readily to fluctuations in the noisy measurements. However, this adaptability becomes an advantage when the target changes its direction, and the filter then responds much better to changes in speed.

[0061] To take advantage of the strengths of both modes and mitigate their respective weaknesses, the tracking filter of tracker 204 is configured to dynamically switch between the two operating modes. The switching decision is made using the motion vectors 214 from video encoder 202. This will now be described in more detail with reference to Figure 4, which illustrates a method for tracking a target in a sequence of video frames. Further reference will be made to the exemplary scene of Figure 1 and the system of Figure 2.

[0062] In step S02, tracker 204 tracks target 104 in a sequence of video frames 210 by operating a tracking filter in a first mode. In the first mode, tracker 204 estimates a state vector describing the position and velocity of target 104 in video frames 210. In the first mode, the tracking filter uses the first state-space model described above to model the state vector and its evolution over time.

[0063] The tracking filter operates at a certain time rate to estimate the state vector at consecutive time points. This is further illustrated by the timeline in Figure 5, which shows the consecutive time points at which the tracking filter estimates the state vector. Each of these time points corresponds to a video frame in the sequence of video frames 210. The time rate of the tracking filter can correspond to the frame rate of the sequence of video frames 210 (e.g., 30 or 60 frames per second), or to the rate at which the target detector 206 operates to detect targets in the sequence of video frames 210 (e.g., 10 frames per second). At some of the consecutive time points, a target detection 216 is received from target detector 206. As previously described, a target detection may include a target detection region, such as a bounding box, within which the target is located in the video frame. As shown in Figure 5, a detection 216 of the tracked target 104 need not be received at every time point. For example, the target 104 may be occluded from the view of camera 100 at some time points and therefore cannot be detected by the target detector 206.

[0064] To estimate the state vector, the tracking filter operates recursively: the estimate of the state vector at one time point is used to estimate the state vector at the consecutive time point. More specifically, the tracking filter uses the motion model of the state-space model to predict the state vector from one time point to the consecutive time point. When operating in the first mode, the tracking filter uses the constant velocity motion model of the first state-space model to perform the prediction. The prediction may include applying the state transition matrix to the state vector of the earlier time point:

[0065] (Equation 3)

[0066] This yields the predicted state at the consecutive time point. If no detection of the tracked target is received from the target detector 206 at the consecutive time point, the prediction of the state vector becomes the estimate of the state vector at that time point. However, if a target detection is received at the consecutive time point (that is, the target is detected in the corresponding video frame), the tracking filter updates the predicted state vector of the consecutive time point in view of the detection of the target in the consecutive video frame. This means that the estimate of the state vector is obtained by combining, or weighting together, the predicted state vector and the detection. When operating in the first mode, the first state-space model is used when updating the state vector. In particular, it defines how the prediction and the detection are combined or weighted together. For example, when using a Kalman filter, the tracking filter updates the state vector according to the following formula:

[0067] (Equation 4)

[0068] where the gain of the filter depends, in a manner known per se, on the state transition matrix, the observation matrix, and the covariance matrices. Therefore, at a time point when a target detection is received, the tracking filter estimates the state vector of the current time point from the state vector of an earlier time point and the detection of the target in the corresponding current video frame.
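The predict and update steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the patent's implementation: it handles one image axis with state [position, velocity], measures position only, and uses a process noise term `q` added to the diagonal and a measurement noise variance `r`. All names and noise choices are hypothetical.

```python
def predict_constant_velocity(x, P, dt, q):
    """Predict state x = [p, v] and covariance P with F = [[1, dt], [0, 1]]."""
    p, v = x
    x_pred = [p + v * dt, v]
    # P_pred = F P F^T + q*I (the diagonal process noise is an assumption).
    P_pred = [
        [P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + q,
         P[0][1] + dt * P[1][1]],
        [P[1][0] + dt * P[1][1], P[1][1] + q],
    ]
    return x_pred, P_pred


def update_with_detection(x_pred, P_pred, z, r):
    """Combine the prediction with a position detection z (observation matrix H = [1, 0])."""
    innovation = z - x_pred[0]
    s = P_pred[0][0] + r                       # innovation variance
    k0, k1 = P_pred[0][0] / s, P_pred[1][0] / s  # Kalman gain
    x_new = [x_pred[0] + k0 * innovation, x_pred[1] + k1 * innovation]
    # P_new = (I - K H) P_pred
    P_new = [
        [(1 - k0) * P_pred[0][0], (1 - k0) * P_pred[0][1]],
        [P_pred[1][0] - k1 * P_pred[0][0], P_pred[1][1] - k1 * P_pred[0][1]],
    ]
    return x_new, P_new


def step(x, P, dt, q, r, detection=None):
    """One filter step; without a detection the prediction becomes the estimate."""
    x_pred, P_pred = predict_constant_velocity(x, P, dt, q)
    if detection is None:
        return x_pred, P_pred
    return update_with_detection(x_pred, P_pred, detection, r)
```

Calling `step` without a detection returns the prediction unchanged, which mirrors the case where the target is not detected at a time point.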

[0069] In the case of tracking several targets and / or when the target detector 206 detects multiple targets in a video frame, the multiple target detections are matched with trajectories to determine which target detection should be associated with which trajectory and used to update the trajectory's state vector. As is known in the art, this association problem can be solved by applying the Hungarian algorithm or another similar algorithm, which finds the optimal association by minimizing the global association cost used to associate the target detection with the trajectory.
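To make the association problem concrete, the sketch below finds the assignment with the lowest total cost by exhaustive search. This is a brute-force stand-in chosen for brevity, not the Hungarian algorithm itself; it gives the same optimum for small inputs but scales factorially. The Euclidean distance cost and the function name are assumptions, and the sketch expects at least as many detections as tracks.

```python
from itertools import permutations
from math import hypot


def associate(track_positions, detections):
    """Return, for each track, the index of the detection assigned to it.

    Minimizes the summed Euclidean distance between predicted track positions
    and detection centers. Requires len(track_positions) <= len(detections).
    """
    best_cost = float("inf")
    best_assignment = ()
    for candidate in permutations(range(len(detections)), len(track_positions)):
        cost = sum(
            hypot(track[0] - detections[j][0], track[1] - detections[j][1])
            for track, j in zip(track_positions, candidate)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, candidate
    return list(best_assignment)
```

In practice a polynomial-time solver would replace the exhaustive loop once the number of tracks grows beyond a handful.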

[0070] Returning to the flowchart of Figure 4, in step S04, the tracker 204 further receives motion vectors 214 from the video encoder 202, which encodes the sequence of video frames 210. As shown in Figure 5, for example, motion vectors can be received for each, or nearly each, of the consecutive time points. In this way, when a target detection is received from the target detector 206, the tracker 204 typically also has motion vectors available for that time point.

[0071] Now assume that tracker 204 is about to estimate the state vector of the current time point. At the current time point, it receives as input from target detector 206 a target detection region in the current video frame corresponding to the current time point. It further receives the motion vectors of the current video frame from the video encoder 202. This is illustrated in Figure 6, which schematically shows the motion vectors 602 of the current video frame 600. For each pixel block used when encoding the current video frame 600, there exists a motion vector 602. The target detection region 604 in which the target is detected is also shown in the current video frame 600. As can be seen, in this case there is more than one pixel block in the target detection region 604 and therefore more than one motion vector, although in other cases there might be only one motion vector in the target detection region 604. As previously stated, each motion vector indicates the velocity within its pixel block. Therefore, the one or more motion vectors 602 in the target detection region 604 indicate the velocity in the target detection region 604 in the current video frame 600. When there is more than one motion vector in the target detection region 604, a representative motion vector can be formed from the motion vectors in the target detection region 604, for example, by calculating the average of the multiple motion vectors. The velocity indicated by the representative motion vector can then be used as a measure of the velocity in the target detection region 604 in the video frame 600.
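A minimal sketch of forming a representative motion vector by averaging is given below. The data layout is an assumption made for illustration: each motion vector is a tuple of block center coordinates and displacement, and the target detection region is an axis-aligned bounding box. Selecting blocks by their center point is also an assumption; the patent does not specify how blocks are attributed to the region.

```python
def representative_motion_vector(motion_vectors, region):
    """Average the motion vectors whose block centers fall inside the region.

    motion_vectors: iterable of (center_x, center_y, dx, dy), one per pixel block.
    region: bounding box (x_min, y_min, x_max, y_max) of the target detection.
    Returns (dx, dy), or None if no block center lies inside the region.
    """
    x_min, y_min, x_max, y_max = region
    inside = [
        (dx, dy)
        for (cx, cy, dx, dy) in motion_vectors
        if x_min <= cx <= x_max and y_min <= cy <= y_max
    ]
    if not inside:
        return None
    count = len(inside)
    return (sum(dx for dx, _ in inside) / count,
            sum(dy for _, dy in inside) / count)
```

Other representatives, such as a median, would be less sensitive to outlier blocks that belong to the background rather than the target.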

[0072] At a time point when tracker 204 receives both the detection 216 of target 104 from target detector 206 and the motion vectors 214 from video encoder 202, tracker 204 decides whether to remain in the first mode or switch to the second mode. In the example of Figure 5, tracker 204 will therefore make this decision at the current time point. To make the decision, tracker 204 determines, in step S06, the velocity difference between the velocity in the state vector of an earlier time point and the velocity indicated by the one or more motion vectors 602 in the target detection region 604 of the current video frame 600. The velocity difference thus measures the deviation between the latest velocity estimated by the tracking filter and the velocity indicated by the motion vectors from the video encoder 202.

[0073] The velocity indicated by the one or more motion vectors 602 can be calculated via a rescaling operation and, in some cases, a direction reversal. More specifically, each of the one or more motion vectors 602 corresponds to a displacement between the current video frame and a reference frame used in the encoding by the video encoder 202. Therefore, calculating the velocity indicated by the one or more motion vectors 602 may include dividing the displacement by the time distance between the current video frame and the reference frame. For example, if the reference frame is the directly preceding video frame in the sequence of video frames and the frame rate is 30 frames per second, the time distance is 1/30 of a second. When there are several motion vectors in the target detection region 604, the displacement of a representative motion vector can be used in the calculation. Furthermore, note that the motion vectors point from a block in the current video frame to a block in the reference frame. Therefore, when the reference frame is a preceding video frame in the video sequence, the motion vectors point back to a block in the preceding frame. In that case, calculating the velocity indicated by the one or more motion vectors 602 may further include reversing the direction of the motion vectors 602. However, when the reference frame is a later video frame in the sequence, it is not necessary to reverse the direction of the motion vectors.
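The rescaling and direction reversal can be sketched as below. The function name and parameters are hypothetical; `frame_gap` is the number of frames between the current frame and the reference frame, and the result is in pixels per second if the displacement is in pixels.

```python
def velocity_from_motion_vector(displacement, frame_gap, frame_rate,
                                reference_is_earlier=True):
    """Convert a motion vector displacement into a velocity.

    displacement: (dx, dy) pointing from the current block to the reference block.
    frame_gap: number of frames between the current frame and the reference frame.
    frame_rate: frames per second of the video sequence.
    reference_is_earlier: True if the reference frame precedes the current frame,
        in which case the vector points backwards in time and is reversed.
    """
    scale = frame_rate / frame_gap  # equals 1 / (time distance in seconds)
    sign = -1.0 if reference_is_earlier else 1.0
    return (sign * displacement[0] * scale, sign * displacement[1] * scale)
```

For a reference frame directly preceding the current frame at 30 frames per second, a displacement of (-2, 1) pixels corresponds to a velocity of (60, -30) pixels per second.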

[0074] When the video encoder 202 operates at a higher frame rate than the tracker 204, motion vectors from several frames can be accumulated to determine the one or more motion vectors in the target detection region of the current video frame corresponding to the current time point. Specifically, the motion vectors of all video frames encoded since the previous time point can be accumulated (i.e., summed together). In this case, the velocity indicated by the one or more motion vectors can be calculated by dividing the accumulated motion vectors by the time difference between the previous time point and the current time point.
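The accumulation can be illustrated with a short sketch. It assumes, for simplicity, that one representative displacement per encoded frame is available and that each displacement has already been oriented forward in time; both assumptions and the function name are illustrative only.

```python
def accumulated_velocity(per_frame_displacements, t_previous, t_current):
    """Sum per-frame displacements and divide by the tracker's time step.

    per_frame_displacements: list of (dx, dy), one for each video frame encoded
        since the previous tracker time point, oriented forward in time.
    t_previous, t_current: tracker time points in seconds.
    """
    total_dx = sum(d[0] for d in per_frame_displacements)
    total_dy = sum(d[1] for d in per_frame_displacements)
    dt = t_current - t_previous
    return (total_dx / dt, total_dy / dt)
```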

[0075] Next, at the decision point in step S07, the velocity difference determined at the current time point is compared with a velocity threshold. If the velocity difference is equal to or less than the velocity threshold, tracker 204 decides to keep operating the tracking filter in the first mode and therefore returns to step S02. If the velocity difference is instead greater than the velocity threshold, tracker 204 proceeds to step S08.
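The comparison in step S07 can be sketched as follows. Measuring the velocity difference as the Euclidean norm of the difference vector is an assumption made for this example; the text does not fix a particular norm. The function names are hypothetical.

```python
from math import hypot


def velocity_difference(filter_velocity, encoder_velocity):
    """Magnitude of the difference between two 2D velocities (assumed Euclidean)."""
    return hypot(encoder_velocity[0] - filter_velocity[0],
                 encoder_velocity[1] - filter_velocity[1])


def should_switch_to_second_mode(filter_velocity, encoder_velocity, velocity_threshold):
    """True only when the difference is strictly greater than the threshold."""
    return velocity_difference(filter_velocity, encoder_velocity) > velocity_threshold
```

A difference exactly equal to the threshold keeps the filter in the first mode, matching the "equal to or less than" branch described above.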

[0076] To set an appropriate velocity threshold, the disclosed tracking method can be applied, with different values of the velocity threshold, to test video sequences representing the types of situations for which the method is intended to be used. The tracking performance can be evaluated for each velocity threshold, for example, by using a target tracking metric such as the higher order tracking accuracy (HOTA) measure. A suitable velocity threshold can then be selected as the one that leads to the best tracking performance.

[0077] In the example of Figure 5, the tracking filter begins operating in the first mode at a starting time point. At a later time point, it receives target detection 216 from target detector 206 and motion vectors 214 from video encoder 202, and determines the velocity difference. If, at that time point, the velocity difference is equal to or below the velocity threshold, the tracking filter is kept in the first mode, and the method returns to step S02. This process is then repeated for consecutive time points until a time point is reached at which the determined velocity difference is instead above the velocity threshold. At that time point, the method instead proceeds to step S08.

[0078] In step S08, tracker 204 switches from operating the tracking filter in the first mode to operating the tracking filter in the second mode. In the second mode, the tracking filter estimates a state vector describing the position, velocity, and acceleration of the target in the video frame. This switch occurs at the current time point.

[0079] The switch from the first mode to the second mode involves switching the state-space model used by the tracking filter from a first state-space model that models the target's position and velocity to a second state-space model that additionally models the target's acceleration. Therefore, this switch involves switching from a state vector that includes both position and velocity states to a state vector that additionally includes acceleration states. It also includes switching the motion model and the measurement model.

[0080] The switch can be implemented by extending the state vector of the current time point with an acceleration state that describes the target's acceleration. When extending the state vector, the acceleration state can be added while the position and velocity estimates in the state vector are maintained. When the acceleration state is added, it must be set to some initial value. There are several options for how to set the initial value. One option is simply to initialize the acceleration state to zero. Another option is to initialize the acceleration state to the acceleration corresponding to the velocity difference. The velocity difference is a measure of how much the target's velocity has changed between the earlier time point and the current time point. Therefore, dividing the velocity difference by the time distance between these time points provides an estimate of the acceleration, which can be used to initialize the acceleration state. Yet another option is to initialize the acceleration state such that the state vector of the earlier time point, together with the acceleration state, generates a predicted state vector of the current time point that matches the detection of the target in the current video frame. In this case, the predicted state vector is obtained by applying the motion model of the second state-space model to the state vector of the earlier time point extended by the acceleration state. The acceleration state can be found by setting up and solving a system of equations. More specifically, by applying the state transition matrix of the second state-space model to the state vector of the earlier time point extended by the unknown acceleration state, an expression for the predicted position of the target in the current video frame can be derived. By setting this expression equal to the position of the target in the current video frame according to the target detection, a system of equations is obtained, which can be solved for the unknown acceleration state. In a variant, the predicted state vector obtained from the state vector of the earlier time point together with the acceleration state is not required to match the target detection in the current video frame exactly. Instead, a certain deviation can be allowed to account for the uncertainty of the target detection. For example, the predicted state vector can be made to match another target position, which lies between the target detection in the current video frame and the predicted position of the target generated from the state vector of the earlier time point without the acceleration state. The more certain the target detection is, the closer this target position is to the target detection, and vice versa.
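The initialization options can be illustrated for a single image axis. This is a sketch under stated assumptions: the state is [position, velocity], the second-mode prediction of position is p + v*dt + a*dt^2/2, and the `weight` parameter (between 0 and 1) is a hypothetical way of expressing how much the detection is trusted in the variant that allows a deviation.

```python
def init_acceleration_zero():
    """Option 1: start the acceleration state at zero."""
    return 0.0


def init_acceleration_from_velocity_difference(filter_velocity, encoder_velocity, dt):
    """Option 2: velocity change divided by the time distance between the time points."""
    return (encoder_velocity - filter_velocity) / dt


def init_acceleration_from_detection(p, v, detected_position, dt, weight=1.0):
    """Option 3: choose a so that p + v*dt + a*dt^2/2 hits a target position.

    With weight = 1.0 the prediction matches the detection exactly. A smaller
    weight moves the target position towards the prediction made without
    acceleration, reflecting a less certain detection (assumed parameterization).
    """
    predicted_without_acc = p + v * dt
    target = predicted_without_acc + weight * (detected_position - predicted_without_acc)
    return 2.0 * (target - predicted_without_acc) / (dt * dt)


def extend_state(state_pv, acceleration):
    """Append the acceleration state while keeping position and velocity unchanged."""
    return [state_pv[0], state_pv[1], acceleration]
```

For example, with p = 0, v = 1, dt = 1 and a detection at position 3, the third option gives a = 4, since 0 + 1 + 4/2 = 3.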

[0081] As mentioned earlier, the tracking filter first predicts the state vector of the current time point from the state vector of an earlier time point, and then updates the predicted state vector based on the detection of the target in the current video frame in order to estimate the state vector of the current time point. In some embodiments, the switch from the first mode to the second mode occurs after the prediction, but before the update of the predicted state vector. In such embodiments, the first state-space model is used for the prediction for the current time point, and the second state-space model, including the extended state vector, is used for updating the predicted state vector of the current time point. In this way, the second mode begins to influence the state vector estimation already at the current time point, allowing the tracking filter to immediately begin adapting to sudden velocity changes. In other embodiments, the switch from the first mode to the second mode occurs after the predicted state vector has been updated in view of the detection of the target at the current time point. In this case, the first state-space model is used for both the prediction and the update for the current time point, and the second state-space model, including the extended state vector, is then used when estimating the state vector after the current time point.

[0082] After the switch at the current time point, the tracking filter operates in the second mode. In the second mode, the tracking filter uses the second state-space model to perform predictions of the state vector between consecutive time points and updates the state vector based on target detections, as previously described. When operating in the second mode, tracker 204 monitors the acceleration in the estimated state vector in step S10. The acceleration can be monitored at each consecutive time point to determine when to switch back to the first mode. For example, the magnitude of the acceleration can be monitored. Figure 7 schematically illustrates the magnitude of the acceleration in the estimated state vector over time when the tracking filter operates in the second mode. The acceleration starts at an initial value and then increases, after which it decreases below the acceleration threshold T_acc at a certain time and remains below the threshold T_acc until a later time.

[0083] At each consecutive time point, tracker 204 checks in step S11 whether the acceleration has stabilized below an acceleration threshold. For example, when the acceleration in the state vector has been below the acceleration threshold for a predetermined time period, tracker 204 can determine that the acceleration has stabilized below the acceleration threshold. In the example of Figure 7, the acceleration decreases below the acceleration threshold T_acc at a certain time. However, it is not determined to have stabilized at a level below the threshold T_acc until it has remained below the acceleration threshold T_acc for the predetermined time period T.
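The stabilization check of step S11 can be sketched as a scan over the acceleration history. The function name and the representation of the history as parallel lists of timestamps and acceleration magnitudes are assumptions made for illustration.

```python
def has_stabilized(timestamps, acc_magnitudes, acc_threshold, period):
    """Check whether the acceleration has stayed below the threshold long enough.

    timestamps: increasing time points (seconds) since entering the second mode.
    acc_magnitudes: acceleration magnitude from the estimated state vector at each time.
    Returns True if, up to the latest time point, the magnitude has been below
    acc_threshold continuously for at least `period` seconds.
    """
    below_since = None
    for t, a in zip(timestamps, acc_magnitudes):
        if a < acc_threshold:
            if below_since is None:
                below_since = t  # start of the current run below the threshold
        else:
            below_since = None   # run interrupted, start over
    return below_since is not None and timestamps[-1] - below_since >= period
```

Resetting the run whenever the magnitude rises above the threshold prevents a brief dip from triggering a premature switch back to the first mode.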

[0084] If it is determined at a certain time point that the acceleration has not yet stabilized, tracker 204 continues to operate the tracking filter in the second mode and to monitor the acceleration, i.e., it returns to step S10. Conversely, if it is determined at a certain time point that the acceleration has stabilized, tracker 204 proceeds to step S12, in which tracker 204 switches from operating the tracking filter in the second mode to operating the tracking filter in the first mode, after which tracker 204 returns to step S02 of the method. In the example of Figures 5 and 7, the tracking filter therefore switches from the second mode to the first mode at the time point at which the acceleration is determined to have stabilized.

[0085] The switch from the second mode to the first mode involves switching from the second state-space model to the first state-space model. This includes removing the acceleration states describing the target's acceleration from the state vector. However, the current estimates of position and velocity in the state vector can remain unchanged.

[0086] Similar to the velocity threshold, suitable values for the acceleration threshold and/or the time period T can be set by applying the disclosed method to test video sequences with different acceleration thresholds and/or time periods and evaluating the tracking performance using a target tracking metric. Suitable values of the acceleration threshold and/or the time period T can then be selected as the one or more values that produce the best tracking performance.

[0087] Figure 8 schematically illustrates the effect of the invention in the example scenario of Figure 3. As previously mentioned, solid line 302 illustrates the target's true trajectory, and crosses 304 illustrate the detected positions of the target. Dashed line 802 represents the trajectory provided by tracker 204. Tracker 204 begins by operating the tracking filter in the first mode. In the first mode, it provides a smooth estimate of the target's position. At a certain time point, it finds that the velocity difference between the velocity estimated at a previous time point and the velocity indicated by the one or more motion vectors in the target detection region of the current video frame exceeds the velocity threshold. As a result, it switches the tracking filter from operating in the first mode to operating in the second mode. This makes the tracking filter more responsive to changes in the target's velocity (in this case, a change in direction). The tracking filter then operates in the second mode until a time point at which the acceleration is determined to have stabilized at a low level, and at that time point the tracking filter switches back to operating in the first mode.

[0088] In the system shown in Figure 2, the video encoder 202, the tracker 204 (i.e., the apparatus for tracking a target in a sequence of video frames), and the target detector 206 may include circuitry configured to perform their respective functions.

[0089] Figure 9 shows the device 204 in more detail. It includes circuitry 902, such as processing circuitry. It may further include a memory 904.

[0090] The circuitry may take the form of a processor, such as a microprocessor or central processing unit, that, in conjunction with computer code instructions stored on a (non-transitory) computer-readable medium such as memory 904, causes device 204 to implement any of the methods disclosed herein. Memory 904 may be non-volatile memory. Examples of non-volatile memory include read-only memory, flash memory, ferroelectric RAM, magnetic computer storage devices, and optical discs.

[0091] In another implementation, the circuitry is dedicated and specifically designed to implement any of the methods described herein. The circuitry may be in the form of one or more application-specific integrated circuits (ASICs) or one or more field-programmable gate arrays (FPGAs).

[0092] It should be understood that those skilled in the art can modify the above embodiments in various ways while still utilizing the advantages of the invention as shown in the above embodiments. Therefore, the invention should not be limited to the illustrated embodiments, but should be defined only by the appended claims. Furthermore, as those skilled in the art will understand, the illustrated embodiments can be combined.

Claims

1. A method for tracking a target in a sequence of video frames, comprising: tracking (S02) a target (104) in a sequence of video frames (210) by operating a tracking filter in a first mode, in which the tracking filter estimates a state vector describing the position and velocity of the target in the video frames, wherein the tracking filter estimates the state vector of a current time point based on the state vector of an earlier time point and a detection (216) of the target in a current video frame corresponding to the current time point; receiving (S04) one or more motion vectors (214) from a video encoder (202) that encodes the sequence of video frames, the one or more motion vectors (214) indicating a velocity in a target detection region (604) of the current video frame corresponding to the detection of the target; determining (S06) a velocity difference between the velocity in the state vector of the earlier time point and the velocity indicated by the one or more motion vectors in the target detection region of the current video frame; and in response to the velocity difference being greater than a velocity threshold, switching (S08), at the current time point, from operating the tracking filter in the first mode to operating the tracking filter in a second mode, in which the tracking filter estimates a state vector describing the position, velocity, and acceleration of the target in the video frames.

2. The method according to claim 1, wherein switching from operating the tracking filter in the first mode to operating the tracking filter in the second mode at the current time point includes extending the state vector of the current time point with an acceleration state describing the acceleration of the target.

3. The method according to claim 2, wherein the acceleration state is initialized at the current time point to the acceleration corresponding to the velocity difference.

4. The method according to claim 2, wherein the acceleration state is initialized such that the state vector of the earlier time point, together with the acceleration state, generates a predicted state vector of the current time point that matches the detection of the target in the current video frame.

5. The method according to claim 2, wherein the extended state vector is used when estimating the state vector after the current time point.

6. The method according to claim 1, wherein the tracking filter uses a motion model to predict the state vector from one time point to a consecutive time point, wherein the motion model is a constant velocity motion model in the first mode and a constant acceleration motion model in the second mode.

7. The method according to claim 6, wherein, in each of the first mode and the second mode, if the target is detected in a consecutive video frame corresponding to the consecutive time point, the tracking filter updates the predicted state vector of the consecutive time point in view of the detection of the target in the consecutive video frame.

8. The method according to claim 1, wherein the velocity indicated by the one or more motion vectors in the target detection region of the current video frame is the velocity indicated by a representative motion vector of the one or more motion vectors.

9. The method according to claim 1, wherein the one or more motion vectors correspond to a displacement between the current video frame and a reference frame used by the video encoder in the encoding, and the method further includes calculating the velocity indicated by the one or more motion vectors by dividing the displacement by the time distance between the current video frame and the reference frame.

10. The method according to claim 1, further comprising: when operating the tracking filter in the second mode, monitoring the acceleration in the estimated state vector; and in response to the acceleration in the estimated state vector having stabilized at a level below an acceleration threshold (T_acc), switching from operating the tracking filter in the second mode to operating the tracking filter in the first mode.

11. The method according to claim 10, wherein the acceleration is determined to have stabilized at a level below the acceleration threshold when the acceleration in the state vector has been below the acceleration threshold for a predetermined time period.

12. The method according to claim 10, wherein switching from operating the tracking filter in the second mode to operating the tracking filter in the first mode includes removing the acceleration state describing the acceleration of the target from the state vector.

13. An apparatus for tracking a target in a sequence of video frames, comprising circuitry configured to: track a target in a sequence of video frames (210) by operating a tracking filter in a first mode, in which the tracking filter estimates a state vector describing the position and velocity of the target in the video frames, wherein the tracking filter estimates the state vector of a current time point based on the state vector of an earlier time point and a detection (216) of the target in a current video frame corresponding to the current time point; receive one or more motion vectors (214) from a video encoder (202) that encodes the sequence of video frames, the one or more motion vectors (214) indicating a velocity in a target detection region of the current video frame corresponding to the detection of the target; determine a velocity difference between the velocity in the state vector of the earlier time point and the velocity indicated by the one or more motion vectors in the target detection region of the current video frame; and in response to the velocity difference being greater than a velocity threshold, switch, at the current time point, from operating the tracking filter in the first mode to operating the tracking filter in a second mode, in which the tracking filter estimates a state vector describing the position, velocity, and acceleration of the target in the video frames.

14. A system for tracking a target in a sequence of video frames, comprising: a video encoder (202) configured to encode a sequence of video frames and generate motion vectors (214) indicating the velocity in different regions of the video frames; a target detector (206) configured to detect targets in the sequence of video frames; and the apparatus (204) for tracking a target in a sequence of video frames according to claim 13, wherein the apparatus receives the motion vectors from the video encoder and receives target detections (216) from the target detector.

15. A non-transitory computer-readable medium comprising computer program code, which, when executed by a processing-capable device, causes the device to perform the method according to any one of claims 1 to 12.