Target tracking method, electronic device, storage medium, and program product

By combining convolutional neural networks and regression networks, target tracking is performed using template features of the current frame image. This solves the problems of occlusion and deformation in short-term tracking, achieves high-precision long-term target tracking, and reduces storage and computational overhead.

CN116433722B · Active · Publication Date: 2026-06-26 · CHENGDU CK TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHENGDU CK TECH
Filing Date
2023-03-10
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing short-term single-target tracking cannot solve the problem of target occlusion or deformation, while long-term single-target tracking has a large storage and computational overhead due to the need to maintain a large number of template images.

Method used

Convolutional neural networks are used for feature extraction and convolution operations, combined with regression networks for target recognition. The template features of the current frame image are used to convolve the search features to determine the feature response map, thereby achieving long-term stable tracking of the target, avoiding occlusion or disappearance problems, and reducing dependence on historical template images.

Benefits of technology

It improves the accuracy of target tracking, reduces storage and computational overhead, and achieves long-term stable target tracking.



Abstract

The application provides a target tracking method, an electronic device, a storage medium and a program product, and relates to the technical field of target tracking. The method comprises the following steps: acquiring a video frame sequence, at least part of images in the video frame sequence comprising a target to be tracked; performing feature extraction on a current frame image by using a search network to obtain search features of the current frame image, the search features representing a position of the target to be tracked in the current frame image, and the search network being a convolutional neural network; performing convolution on the search features based on current template features corresponding to the current frame image to obtain a feature response map, the current template features representing a predicted target position of the target to be tracked in the current frame image, and the feature extraction network used for extracting the current template features and the search network being different neural networks; and inputting the feature response map into a regression network to obtain a tracking result, the tracking result comprising position information of the target to be tracked in the current frame image. The application can realize long-term and stable tracking of a target.

Description

Technical Field

[0001] This application relates to the field of target tracking technology, and more specifically, to a target tracking method, electronic device, storage medium, and program product.

Background Technology

[0002] Target tracking is one of the fundamental tasks in the field of deep learning. Specifically, it involves stably acquiring the existence and geometric information of a specified target at each time step from a set of temporal images.

[0003] Existing target tracking methods are mainly divided into Short-Term Single Object Tracking (ST-SOT) and Long-Term Single Object Tracking (LT-SOT).

[0004] However, short-term single-target tracking cannot solve the problems of target occlusion, target deformation, or long-term tracking. Long-term single-target tracking requires the maintenance of a large number of template images, resulting in large storage and computational overhead.

Summary of the Invention

[0005] The purpose of this application is to address the shortcomings of the prior art by providing a target tracking method, electronic device, storage medium, and program product to achieve long-term and stable target tracking.

[0006] To achieve the above objectives, the technical solutions adopted in the embodiments of this application are as follows:

[0007] In a first aspect, embodiments of this application provide a target tracking method, the method comprising:

[0008] Acquire a video frame sequence, wherein at least a portion of the images in the video frame sequence include the target to be tracked;

[0009] A search network is used to extract features from the current frame image to obtain search features of the current frame image. The search features represent the position of the target to be tracked in the current frame image. The search network is a convolutional neural network.

[0010] The search features are convolved based on the current template features corresponding to the current frame image to obtain a feature response map. The current template features represent the predicted target position of the target to be tracked in the current frame image. The feature extraction network used to extract the current template features is a different convolutional neural network from the search network.

[0011] The feature response map is input into a regression network to obtain a tracking result, which includes the position information of the target to be tracked in the current frame image.

[0012] Optionally, before convolving the search features based on the current template features corresponding to the current frame image to obtain the feature response map, the method further includes:

[0013] The current frame image labeled with the location information of the target to be tracked is used to extract features to determine the current template features; or, the previous frame image labeled with the location information of the target to be tracked is used to extract features to determine the current template features.

[0014] Optionally, the step of extracting features from the current frame image labeled with the location information of the target to be tracked, and determining the current template features, includes:

[0015] If the current frame image is the first frame image in the video frame sequence, a target detection algorithm is used to detect the target in the first frame image, and the position information of the target to be tracked is marked in the first frame image;

[0016] The initialization network is used to extract features from the first frame image labeled with the location information of the target to be tracked, so as to obtain the current template features.

[0017] Optionally, the step of extracting features from the current frame image labeled with the location information of the target to be tracked, and determining the current template features, includes:

[0018] If the current frame image is a frame image other than the first frame image in the video frame sequence, a target detection algorithm is used to detect the target in the current frame image, and the position information of the target to be tracked is marked in the current frame image;

[0019] A verification network is used to extract features from the current frame image labeled with the location information of the target to be tracked, to obtain the current template features.

[0020] Optionally, after using a verification network to extract features from the current frame image labeled with the location information of the target to be tracked, and obtaining the current template features, the method further includes:

[0021] Based on the current template features of the current frame image and the current template features of the first frame image, calculate the first visibility rate of the target to be tracked in the current frame image;

[0022] An update network is used to extract features from the current frame image labeled with the first visibility rate and the location information of the target to be tracked, and to update the current template features of the current frame image. The update network comprises a convolutional neural network and a recurrent neural network.

[0023] Optionally, the step of extracting features from the previous frame image labeled with the location information of the target to be tracked to determine the current template features includes:

[0024] Based on the tracking result of the previous frame image, the position information of the target to be tracked is marked in the previous frame image. The tracking result also includes: the second visibility rate of the target to be tracked in the previous frame image.

[0025] An update network is used to extract features from the previous frame image labeled with the second visibility rate and the location information of the target to be tracked, to obtain the current template features. The update network comprises a convolutional neural network and a recurrent neural network.

[0026] Optionally, the target neural network model includes: the search network, the feature extraction network of the current template features, and the regression network; the target neural network model is trained through the following steps:

[0027] Obtain a sequence of sample video frames, wherein at least a portion of the sample images in the video frame sequence include the target to be tracked, and each sample image is pre-annotated with the actual location information and the true visibility value of the target to be tracked;

[0028] Based on the sample images of each frame, the initial neural network model is used to output the sample location information and sample visibility rate of the target to be tracked in each sample image;

[0029] Based on the actual location information, true visibility value, sample location information, and sample visibility rate corresponding to each frame of sample image, a loss function for each frame of sample image is constructed.

[0030] The total loss function of the sample video frame sequence is obtained based on the loss function of each sample image frame;

[0031] The parameters of the initial neural network model are updated based on the total loss function until the model converges, thus obtaining the target neural network model.
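The aggregation of per-frame losses into a total loss described in the training steps above can be sketched as follows. This is a minimal illustration only: the source does not state the form of the per-frame loss or how the per-frame losses are combined, so the L1 box term, the squared-error visibility term, the `vis_weight` parameter, and the plain sum over frames are all assumptions, as are the function names.

```python
def frame_loss(true_box, true_vis, pred_box, pred_vis, vis_weight=1.0):
    """Loss for one sample frame (assumed form, not taken from the patent).

    true_box / pred_box: (x_min, y_min, x_max, y_max)
    true_vis / pred_vis: visibility values in [0, 1]
    """
    # Assumed position term: L1 distance between predicted and annotated boxes.
    box_term = sum(abs(p - q) for p, q in zip(pred_box, true_box))
    # Assumed visibility term: squared error on the visibility rate.
    vis_term = (pred_vis - true_vis) ** 2
    return box_term + vis_weight * vis_term


def sequence_loss(samples):
    """Total loss of a sample video frame sequence (assumed: sum over frames)."""
    return sum(frame_loss(*sample) for sample in samples)


# Two annotated frames: (true_box, true_vis, pred_box, pred_vis).
samples = [
    ((0, 0, 10, 10), 1.0, (1, 0, 10, 12), 0.5),
    ((5, 5, 15, 15), 0.0, (5, 5, 15, 15), 0.0),
]
total = sequence_loss(samples)
```

In a real training run the total loss would be passed to an optimizer that updates the model parameters until convergence; that part is omitted here.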

[0032] In a second aspect, embodiments of this application also provide a target tracking device, the device comprising:

[0033] A video frame acquisition module is used to acquire a video frame sequence, wherein at least a portion of the images in the video frame sequence include the target to be tracked;

[0034] The search feature extraction module is used to extract features from the current frame image using a search network to obtain the search features of the current frame image. The search features represent the position of the target to be tracked in the current frame image. The search network is a convolutional neural network.

[0035] The feature convolution module is used to convolve the search features based on the current template features corresponding to the current frame image to obtain a feature response map. The current template features represent the predicted target position of the target to be tracked in the current frame image. The feature extraction network used to extract the current template features is a different convolutional neural network from the search network.

[0036] The target recognition module is used to input the feature response map into the regression network to obtain the tracking result, which includes the position information of the target to be tracked in the current frame image.

[0037] Optionally, prior to the feature convolution module, the apparatus further includes:

[0038] The current template feature acquisition module is used to extract features from the current frame image labeled with the location information of the target to be tracked, and determine the current template features; or, to extract features from the previous frame image labeled with the location information of the target to be tracked, and determine the current template features.

[0039] Optionally, the current template feature acquisition module includes:

[0040] The target location annotation unit is used to perform target detection on the first frame image using a target detection algorithm if the current frame image is the first frame image in the video frame sequence, and to annotate the location information of the target to be tracked in the first frame image.

[0041] The current template feature acquisition unit is used to extract features from the first frame image labeled with the location information of the target to be tracked using an initialization network, so as to obtain the current template features.

[0042] Optionally, the target location annotation unit is further configured to, if the current frame image is a frame image other than the first frame image in the video frame sequence, perform target detection on the current frame image using a target detection algorithm, and annotate the location information of the target to be tracked in the current frame image;

[0043] The current template feature acquisition unit is further configured to use a verification network to extract features from the current frame image labeled with the location information of the target to be tracked, thereby obtaining the current template features.

[0044] Optionally, after the current template feature acquisition unit, the device further includes:

[0045] The visibility calculation unit is used to calculate the first visibility rate of the target to be tracked in the current frame image based on the current template features of the current frame image and the current template features of the first frame image;

[0046] The current template feature update unit is used to extract features, using an update network, from the current frame image labeled with the first visibility rate and the location information of the target to be tracked, and to update the current template features of the current frame image. The update network comprises a convolutional neural network and a recurrent neural network.

[0047] Optionally, the target location annotation unit is further configured to annotate the location information of the target to be tracked in the previous frame image based on the tracking result of the previous frame image, wherein the tracking result further includes: the second visibility rate of the target to be tracked in the previous frame image;

[0048] The current template feature acquisition unit is further configured to use an update network to extract features from the previous frame image labeled with the second visibility rate and the location information of the target to be tracked, thereby obtaining the current template features. The update network comprises a convolutional neural network and a recurrent neural network.

[0049] Optionally, the target neural network model includes: the search network, the feature extraction network of the current template features, and the regression network; the target neural network model is trained through the following modules:

[0050] The sample video frame acquisition module is used to acquire a sample video frame sequence, wherein at least some sample images in the video frame sequence include the target to be tracked, and each sample image is pre-annotated with the actual location information and the true visibility value of the target to be tracked;

[0051] The sample target recognition module is used to output the sample location information and sample visibility rate of the target to be tracked in each frame of sample images using an initial neural network model;

[0052] The loss function construction module is used to construct the loss function of each frame of sample image based on the actual location information, the true visibility value, the sample location information, and the sample visibility rate corresponding to each frame of sample image;

[0053] The loss function summarization module is used to obtain the total loss function of the sample video frame sequence based on the loss function of each sample image frame;

[0054] The model update module is used to update the parameters of the initial neural network model based on the total loss function until the model converges to obtain the target neural network model.

[0055] In a third aspect, embodiments of this application also provide an electronic device, including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the method described in any of the first aspects.

[0056] In a fourth aspect, embodiments of this application also provide a computer-readable storage medium having a computer program / instructions stored thereon, wherein the computer program / instructions, when executed by a processor, implement the method described in any of the first aspects.

[0057] In a fifth aspect, embodiments of this application also provide a computer program product, including a computer program / instructions, which, when executed by a processor, implement the method described in any of the first aspects.

[0058] The beneficial effects of this application are:

[0059] This application provides a target tracking method, electronic device, storage medium, and program product. It utilizes the current template features corresponding to the current frame image to perform convolution on the search features of the current frame image to determine a feature response map. Based on the feature response map, it tracks and identifies the target, determines the position information of the target to be tracked in the current frame image, and tracks the target by combining the current template features. This can avoid the problem of being unable to continue tracking due to the target being occluded or disappearing during long-term tracking, thus improving the target tracking accuracy. Moreover, it eliminates the need to maintain a large set of historical template images, reducing storage and computational overhead.

Attached Figure Description

[0060] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0061] Figure 1 is the first flowchart of the target tracking method provided in the embodiments of this application;

[0062] Figure 2 is a schematic diagram of the search network provided in an embodiment of this application;

[0063] Figure 3 is a schematic diagram of the regression network provided in an embodiment of this application;

[0064] Figure 4 is a schematic diagram of the target to be tracked and the occluder in an embodiment of this application;

[0065] Figure 5 is the second flowchart of the target tracking method provided in the embodiments of this application;

[0066] Figure 6 is a schematic diagram of the initialization network provided in an embodiment of this application;

[0067] Figure 7 is the third flowchart of the target tracking method provided in the embodiments of this application;

[0068] Figure 8 is a schematic diagram of the verification network provided in an embodiment of this application;

[0069] Figure 9 is the fourth flowchart of the target tracking method provided in the embodiments of this application;

[0070] Figure 10 is a schematic diagram of the update network provided in an embodiment of this application;

[0071] Figure 11 is the fifth flowchart of the target tracking method provided in the embodiments of this application;

[0072] Figure 12 is a flowchart illustrating the training steps of the target neural network model provided in an embodiment of this application;

[0073] Figure 13 is a schematic diagram of the target tracking device provided in the embodiments of this application;

[0074] Figure 14 is a schematic diagram of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0075] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are some embodiments of this application, but not all embodiments.

[0076] Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0077] Furthermore, the terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Additionally, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0078] It should be noted that, where there is no conflict, the features in the embodiments of this application can be combined with each other.

[0079] Existing target tracking methods are mainly divided into short-term single-target tracking and long-term single-target tracking. Short-term single-target tracking mainly includes: filtered tracking and Siamese Convolutional Neural Network (CNN). Filtered tracking has poor accuracy, and Siamese CNN has poor tracking performance when the target is occluded or deformed.

[0080] To address the problems encountered in short-term single-target tracking, a long-term single-target tracking method is proposed. This method requires using template images to track targets in the current frame. The maintenance methods for template images include: modifying template images on the fly, using a combination of fixed initialization template images and dynamic template images, and creating a historical template image set. However, modifying template images on the fly may lead to target tracking loss as the template images gradually deviate from the target specified in the original image. Using a combination of fixed initialization template images and dynamic template images can only record two states of the tracked target, and may fail to track the target when it undergoes severe deformation. While a historical template image set can cover various possibilities of the target, it requires calculating response values for all images in the set, resulting in significant storage and computational overhead.

[0081] Based on the problems existing in the prior art, this application aims to provide a target tracking method, electronic device, storage medium, and program product. It utilizes the current template features corresponding to the current frame image to convolve the search features of the current frame image to determine a feature response map. The target is then tracked and identified based on the feature response map, determining the position information of the target in the current frame image. By combining the current template features with the target tracking, the problem of being unable to continue tracking due to the target being occluded or disappearing during long-term tracking can be avoided, thus improving target tracking accuracy. Furthermore, it eliminates the need to maintain a large set of historical template images, reducing storage and computational overhead.

[0082] Please refer to Figure 1, which is the first flowchart of the target tracking method provided in the embodiments of this application. As shown in Figure 1, the method may include:

[0083] S10: Acquire a video frame sequence, in which at least some images include the target to be tracked.

[0084] In this embodiment, the video frame sequence is a set of images that records the motion of the target to be tracked in time order. Each frame may contain the complete target to be tracked, only part of it, or no target at all. A partial target is one that is partly hidden by an occluder; the absence of the target means that the target to be tracked has disappeared from the frame. Tracking the target therefore requires locating it in each frame of the video frame sequence. The current frame image can be any frame image in the video frame sequence.

[0085] S20: The search network is used to extract features from the current frame image to obtain the search features of the current frame image. The search features represent the position of the target to be tracked in the current frame image. The search network is a convolutional neural network.

[0086] In this embodiment, a pre-trained search network is used to extract features from the current frame image, converting the current frame image into a current frame feature image. Each pixel in the current frame feature image is used to record the search features of each pixel in the current frame image.

[0087] For example, please refer to Figure 2, which is a schematic diagram of the search network provided in an embodiment of this application. As shown in Figure 2, the search network is a convolutional neural network, which can be composed of multiple convolutional layers (Conv). The search network converts the current frame image $x$ of size $(3, H_x, W_x)$ into the search feature $f(x)$ of size $(64, H_{f(x)}, W_{f(x)})$.
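As a rough illustration of how a stack of convolutional layers can map a 3-channel image to a 64-channel search feature, the sketch below propagates a shape through several layers using the standard convolution output-size formula. The source gives only the input channel count (3) and the output channel count (64); the number of layers, kernel sizes, strides, padding, and intermediate channel counts below are hypothetical.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Standard output-size formula for one convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def search_feature_shape(height, width, layers):
    """Propagate an image of shape (3, H, W) through a list of conv layers.

    Each layer is (out_channels, kernel, stride, padding); these values are
    hypothetical and not taken from the patent.
    """
    channels = 3
    for out_channels, kernel, stride, padding in layers:
        height = conv_out_size(height, kernel, stride, padding)
        width = conv_out_size(width, kernel, stride, padding)
        channels = out_channels
    return channels, height, width


# Hypothetical three-layer search network ending in 64 channels.
LAYERS = [(16, 3, 2, 1), (32, 3, 2, 1), (64, 3, 1, 1)]
shape = search_feature_shape(256, 256, LAYERS)  # (channels, H_f, W_f)
```

With this assumed configuration, the two stride-2 layers each halve the spatial size, and the last layer keeps it unchanged while producing 64 channels.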

[0088] S30: Convolve the search features based on the current template features corresponding to the current frame image to obtain the feature response map. The current template features represent the predicted target position of the target to be tracked in the current frame image. The feature extraction network used to extract the current template features and the search network are different neural networks.

[0089] In this embodiment, the current template image corresponding to the current frame image is determined. The current template image is a frame image that is pre-annotated with the position information of the target to be tracked. A feature extraction network is used to extract features from the current template image, so that the predicted target position of the target to be tracked in the current frame image is determined from the position information of the target to be tracked in the current template image. The feature extraction network outputs the current template features, which represent the predicted target position of the target to be tracked in the current frame image.

[0090] Using the current template feature as the convolution kernel, the search feature is convolved, and the cross-correlation between the current template feature and the search feature is calculated to obtain a feature response map of a preset size. The feature response map is used to indicate the response state of the search feature to the current template feature. The response value of each pixel in the feature response map is the response of the search feature to the current template feature.

[0091] For example, the current template feature $f(z)$ has a size of $(64, H_{f(x)}, W_{f(x)})$. By performing a convolution operation on the search feature $f(x)$ using the current template feature $f(z)$ as the kernel, i.e., $h = f(z) * f(x)$, the feature response map $h$ of size $(1, H_h, W_h)$ can be obtained.
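The cross-correlation of the template feature with the search feature can be sketched in plain Python as follows. This is a toy illustration, not the patented implementation: the function name `cross_correlate`, the single-channel inputs, the small template size, and the absence of padding ("valid" mode) are assumptions chosen to keep the example short.

```python
def cross_correlate(search, template):
    """Slide a (C, Hz, Wz) template over a (C, Hx, Wx) search feature.

    Returns a single-channel response map with (Hx - Hz + 1) rows and
    (Wx - Wz + 1) columns. No padding is applied (an assumption).
    """
    channels = len(search)
    hx, wx = len(search[0]), len(search[0][0])
    hz, wz = len(template[0]), len(template[0][0])
    response = []
    for i in range(hx - hz + 1):
        row = []
        for j in range(wx - wz + 1):
            total = 0.0
            # Sum the element-wise products over all channels and kernel cells.
            for c in range(channels):
                for u in range(hz):
                    for v in range(wz):
                        total += template[c][u][v] * search[c][i + u][j + v]
            row.append(total)
        response.append(row)
    return response


# Toy example: one channel, 3x3 search feature, 2x2 template.
search = [[[0.0, 0.0, 0.0],
           [0.0, 1.0, 2.0],
           [0.0, 3.0, 4.0]]]
template = [[[1.0, 2.0],
             [3.0, 4.0]]]
h = cross_correlate(search, template)
# h == [[4.0, 11.0], [14.0, 30.0]]; the peak sits where the template matches.
```

The largest response appears at the offset where the search window equals the template, which is the property the regression step relies on.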

[0092] S40: Input the feature response map into the regression network to obtain the tracking result, which includes the position information of the target to be tracked in the current frame image.

[0093] In this embodiment, a pre-trained regression network is used to perform target recognition on the feature response map to determine whether the feature response map contains the response information of the target to be tracked. If the feature response map does not contain the response information of the target to be tracked, it is determined that the current frame image does not contain the target to be tracked. If the feature response map contains the response information of the target to be tracked, it is determined that the current frame image contains the target to be tracked. The bounding box of the target to be tracked can be determined based on the response region in the feature response map. The position and the second visibility rate of the target to be tracked in the current frame image are determined based on the parameters of the bounding box. The position information includes the position and the second visibility rate of the target to be tracked in the current frame image.

[0094] For example, please refer to Figure 3, which is a schematic diagram of the regression network provided in the embodiments of this application. As shown in Figure 3, the regression network is an N-dimensional convolutional neural network, which can be composed of multiple N-dimensional convolutional layers (ConvN). The regression network converts the feature response map $h$ of size $(1, H_h, W_h)$ into the target tensor $y[p_{pos}, l, t, r, b]$ of size $(1, H_y, W_y)$, where $p_{pos}$ is the second visibility rate, and $l$, $t$, $r$, and $b$ are the distances from the pixel $(x, y)$ to the left, top, right, and bottom edges of the bounding box, respectively.
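One way to turn the per-pixel outputs $p_{pos}, l, t, r, b$ into a single bounding box is sketched below. The source does not say how a pixel is selected; picking the pixel with the highest $p_{pos}$ is an assumption, as are the function name `decode_box` and the use of feature-map coordinates.

```python
def decode_box(p_pos, l, t, r, b):
    """Convert per-pixel edge distances into one bounding box.

    All five arguments are 2-D lists of equal size, indexed as [y][x].
    Returns (score, (x_min, y_min, x_max, y_max)) in feature-map coordinates.
    Selecting the pixel with the largest p_pos is an assumed rule.
    """
    best_y, best_x = max(
        ((y, x) for y in range(len(p_pos)) for x in range(len(p_pos[0]))),
        key=lambda yx: p_pos[yx[0]][yx[1]],
    )
    box = (
        best_x - l[best_y][best_x],  # left edge
        best_y - t[best_y][best_x],  # top edge
        best_x + r[best_y][best_x],  # right edge
        best_y + b[best_y][best_x],  # bottom edge
    )
    return p_pos[best_y][best_x], box


# Toy 2x3 output maps.
p_pos = [[0.1, 0.2, 0.1],
         [0.3, 0.9, 0.2]]
l = [[1.0] * 3 for _ in range(2)]
t = [[0.5] * 3 for _ in range(2)]
r = [[1.0] * 3 for _ in range(2)]
b = [[0.5] * 3 for _ in range(2)]
score, box = decode_box(p_pos, l, t, r, b)
```

Here the pixel at x = 1, y = 1 has the highest score, so the box extends 1.0 to the left and right and 0.5 above and below that pixel.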

[0095] In some embodiments, the image to be tracked also includes an occluder. The visible area of the target to be tracked is determined based on the bounding box of the target to be tracked and the bounding box of the occluder. A second visibility rate of the target to be tracked is calculated based on the visible area of the target to be tracked and the total area of the target to be tracked. The total area of the target to be tracked can be determined from the first frame image.

[0096] For example, please refer to Figure 4, which is a schematic diagram of the target to be tracked and the occluder in an embodiment of this application. As shown in Figure 4, the visible area of the target to be tracked is S2. The total area of the target to be tracked is the sum of the occluded area S1, the visible area S2, and the off-screen area S3. The total area can be determined from a frame in which the target to be tracked is completely within the screen viewport. Generally, the target to be tracked is completely within the screen viewport in the first frame image.
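A small numeric sketch of the visibility computation follows. The source says only that the second visibility rate is calculated from the visible area and the total area; treating it as the ratio S2 / (S1 + S2 + S3) is an assumption, and the function name and the example areas are hypothetical.

```python
def visibility_rate(occluded_area, visible_area, offscreen_area):
    """Visible fraction of the target, assumed to be S2 / (S1 + S2 + S3)."""
    total_area = occluded_area + visible_area + offscreen_area
    if total_area <= 0:
        raise ValueError("total target area must be positive")
    return visible_area / total_area


# Example: 20 px^2 occluded (S1), 70 px^2 visible (S2), 10 px^2 off-screen (S3).
rate = visibility_rate(20.0, 70.0, 10.0)
```

Under this assumption a fully visible target has a rate of 1 and a fully occluded or fully off-screen target has a rate of 0.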

[0097] The target tracking method provided in the above embodiments utilizes the current template features extracted by the feature extraction network and the search features extracted by the search network to perform convolutional cross-correlation operations to determine the feature response map. Based on the feature response map, the target is tracked and identified to determine the position information of the target in the current frame image. The target is tracked based on the current template features and the search features. Each current template feature is used as a reference for the search features. This avoids the problem that the target cannot be tracked due to occlusion or disappearance during long-term tracking, which is caused by relying solely on the search features. This improves the target tracking accuracy. Furthermore, it eliminates the need to maintain a large set of template images, reducing storage and computational overhead.

[0098] The following examples illustrate possible implementations for obtaining the current template features.

[0099] In one possible implementation, before S30 convolves the search features based on the current template features corresponding to the current frame image to obtain the feature response map, the method may further include:

[0100] Feature extraction is performed on the current frame image labeled with the location information of the target to be tracked to determine the current template features.

[0101] In this embodiment, the current frame image labeled with the location information of the target to be tracked is used as the current template image. A feature extraction network is used to extract features from the current template image to determine the current template features. That is, the current template features extracted by the feature extraction network from the current frame image labeled with the location information of the target to be tracked are used as a reference for searching features. The location information of the target to be tracked labeled in the current frame image can be the location information of the target to be tracked determined by target detection of the current frame image using an external target detection algorithm.

[0102] In another possible implementation, before S30 convolves the search features based on the current template features corresponding to the current frame image to obtain the feature response map, the method may further include:

[0103] Feature extraction is performed on the previous frame image labeled with the location information of the target to be tracked to determine the current template features.

[0104] In this embodiment, since the motion of the target to be tracked in the video frame sequence is continuous in time, the previous frame image labeled with the position information of the target to be tracked can be used as the current template image. A feature extraction network is used to extract features from the current template image to determine the current template features. That is, the current template features extracted by the feature extraction network from the previous frame image labeled with the position information of the target to be tracked are used as a reference for searching features. The position information of the target to be tracked labeled in the previous frame image can be the tracking result output by tracking the previous frame image using the steps S20-S40 described above.

[0105] One possible implementation for obtaining the current template features is described below with reference to Figure 5.

[0106] Please refer to Figure 5, which is a second flowchart of the target tracking method provided in the embodiments of this application. As shown in Figure 5, the process of extracting features from the current frame image labeled with the location information of the target to be tracked, and determining the current template features, may include:

[0107] S31: If the current frame image is the first frame image in the video frame sequence, use the target detection algorithm to detect the target in the first frame image and mark the position information of the target to be tracked in the first frame image.

[0108] S32: Use the initialization network to extract features from the first frame image labeled with the location information of the target to be tracked, and obtain the current template features.

[0109] In this embodiment, the target detection algorithm is an externally provided algorithm for detecting the location of a target. The externally provided target detection algorithm is used to detect the target in the first frame image, determine the location information of the target to be tracked in the first frame image, mark the location information of the target to be tracked in the first frame image, generate an initial template image, use an initialization network to extract features from the initial template image, and output the initial template features as the current template features.

[0110] The initialization network consists of a recurrent neural network and a convolutional neural network. In addition to outputting the initial template features of the first frame image as the current template features, the initialization network also outputs the latent variables of the first frame. The latent variables of the first frame are used to represent the motion state of the target to be tracked in the first frame image. The parameters for calculating the latent variables of the first frame by the initialization network are determined according to the training of the initialization network.

[0111] For example, please refer to Figure 6, which is a schematic diagram of the initialization network provided in an embodiment of this application. As shown in Figure 6, the initialization network consists of alternating CNN and RNN (Recurrent Neural Network) layers. The input to the initialization network is an initial template image $z_{ini}$ scaled to size $(3, H_z, W_z)$, and the initial template features $f(z_{ini})$ have size $(64, H_{f(z)}, W_{f(z)})$. In this embodiment, $H_z = W_z = 127$ and $H_{f(z)} = W_{f(z)} = 15$.

[0112] In one possible implementation, the RNN can be a gated recurrent neural network composed of gated recurrent units (GRUs).
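For readers unfamiliar with gated recurrent units, the following is a minimal sketch of one standard GRU update on plain vectors. It is not the network of this application: the dimensions, the random weights, and the helper names are assumptions, and a network operating on feature maps would typically replace the matrix products with convolutions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, h, params):
    """One GRU update: returns the new hidden state from input x and previous state h."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x + Uz @ h + bz)               # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)               # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h) + bh)    # candidate state
    return (1.0 - z) * h + z * h_cand               # blend old and candidate state


def init_params(input_dim, hidden_dim, rng):
    """Random weights for the three gates; purely illustrative values."""
    params = []
    for _ in range(3):
        params += [
            rng.normal(scale=0.1, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
            np.zeros(hidden_dim),
        ]
    return params


rng = np.random.default_rng(0)
params = init_params(input_dim=8, hidden_dim=4, rng=rng)
h = np.zeros(4)                       # latent variable before the first frame
for _ in range(5):                    # five hypothetical frames
    h = gru_step(rng.normal(size=8), h, params)
print(h)
```

The hidden state `h` plays the role of the latent variable that is carried from one frame to the next.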

[0113] The target tracking method provided in the above embodiments, for target tracking of the first frame image, uses an external target detection algorithm to annotate the position information of the target to be tracked in the first frame image, and uses an initialization network to extract features from the first frame image with the position information of the target to be tracked, so as to obtain the current template features of the first frame image, making the current template features of the first frame image and the search features of the first frame image constitute asymmetric features, so as to ensure that the position of the target to be tracked can be accurately searched from the first frame image, thereby improving the accuracy of target tracking.

[0114] Another possible implementation for obtaining the current template features is described below with reference to Figure 7.

[0115] Please refer to Figure 7, which is a third flowchart of the target tracking method provided in the embodiments of this application. As shown in Figure 7, the process of extracting features from the current frame image labeled with the location information of the target to be tracked, and determining the current template features, may include:

[0116] S33: If the current frame image is a frame image other than the first frame image in the video frame sequence, use the target detection algorithm to perform target detection on the current frame image and mark the position information of the target to be tracked in the current frame image.

[0117] S34: Use a validation network to extract features from the current frame image labeled with the location information of the target to be tracked, and obtain the current template features.

[0118] In this embodiment, an externally provided target detection algorithm is used to detect targets in frames other than the first frame, determining the position information of the target to be tracked in the other frames. The position information of the target to be tracked in the other frames is then labeled in the other frames to generate a priori template image. A validation network is then used to extract features from the priori template image, and the priori template features are output as the current template features. The validation network is composed of a convolutional neural network.

[0119] For example, please refer to Figure 8, which is a schematic diagram of the validation network provided in an embodiment of this application. As shown in Figure 8, the validation network can be composed of multiple convolutional layers (Conv). The validation network converts a prior template image $z_{pri}$ of size $(3, H_z, W_z)$ into prior template features $f(z_{pri})$ of size $(64, H_{f(z)}, W_{f(z)})$.
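To make the size reduction from a 127 x 127 template image to 15 x 15 template features concrete, the sketch below applies the usual output-size formula for convolution and pooling layers. The layer stack shown is purely hypothetical; the application does not disclose kernel sizes, strides, or padding.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical (kernel, stride, padding) stack; not taken from the application.
layers = [(7, 2, 0), (3, 2, 0), (2, 2, 0)]
sizes = [127]
for k, s, p in layers:
    sizes.append(conv_out(sizes[-1], k, s, p))
print(sizes)  # [127, 61, 30, 15]
```

Many other layer combinations produce the same 127 to 15 reduction; this one is shown only to illustrate the arithmetic.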

[0120] The target tracking method provided in the above embodiments, for target tracking in other frame images, uses an external target detection algorithm to annotate the position information of the target to be tracked in other frame images, and uses a verification network to extract features from the other frame images annotated with the position information of the target to be tracked, so as to obtain the current template features of the other frame images, making the current template features of the other frame images and the search features of the other frame images constitute asymmetric features, so as to ensure that the position of the target to be tracked can be accurately searched from other frame images, thereby improving the accuracy of target tracking.

[0121] In one possible implementation, since the target to be tracked is completely within the screen viewport in the first frame image, the external detection algorithm can obtain accurate position information when detecting the target to be tracked in the first frame image. However, as the target to be tracked moves, when the position of the target to be tracked drifts, the position information of the target to be tracked detected by the external detection algorithm in other frames may be inaccurate. Therefore, it is necessary to update the current template features of other frames obtained by using the verification network.

[0122] One possible implementation for updating the current template features of other frame images is described below with reference to Figure 9.

[0123] Please refer to Figure 9, which is a fourth flowchart of the target tracking method provided in the embodiments of this application. As shown in Figure 9, after a validation network is used in S34 above to extract features from the current frame image labeled with the location information of the target to be tracked and to obtain the current template features, the method may further include:

[0124] S35: Calculate the first visibility rate of the target to be tracked in the current frame image based on the current template features of the current frame image and the current template features of the first frame image.

[0125] In this embodiment, after the prior template features $f(z_{pri})$ of other frame images are calculated, the first visibility of the target to be tracked in other frame images can be calculated based on the initial template features $f(z_{ini})$ of the first frame image and the prior template features $f(z_{pri})$ of other frame images.

[0126] For example, the initial template features $f(z_{ini})$ of the first frame image are used as the convolution kernel to convolve the prior template features $f(z_{pri})$, yielding the first visibility $p_{pri}$ of other frame images.
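A minimal sketch of this step follows. When the kernel and the input have the same size, the convolution collapses to a single value, namely the sum of the element-wise products. Dividing by the self-response of the initial features and clipping to [0, 1] is an assumption added here so that the result reads as a rate; the application does not state how the response is normalized.

```python
import numpy as np


def first_visibility(f_ini: np.ndarray, f_pri: np.ndarray) -> float:
    """Correlate f_pri with f_ini as the kernel and normalize (assumed) to [0, 1]."""
    if f_ini.shape != f_pri.shape:
        raise ValueError("this sketch assumes equal-size feature tensors")
    response = float(np.sum(f_ini * f_pri))        # same-size convolution: one value
    self_response = float(np.sum(f_ini * f_ini))   # response of a fully visible target
    if self_response == 0.0:
        return 0.0
    return float(np.clip(response / self_response, 0.0, 1.0))


f_ini = np.ones((64, 15, 15))
f_pri = f_ini.copy()
f_pri[:, :, :5] = 0.0          # pretend the left third of the target is occluded
p_pri = first_visibility(f_ini, f_pri)
print(round(p_pri, 4))  # 0.6667
```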

[0127] S36: Use the update network to extract features from the current frame image together with the first visibility rate and the location information of the target to be tracked, and update the current template features of the current frame image. The update network comprises a convolutional neural network and a recurrent neural network.

[0128] In this embodiment, since the motion of the target to be tracked in the video frame sequence is continuous, the change in the first visibility of the target to be tracked in the image is also continuous. The recurrent neural network of the update network has hidden variables to represent the cumulative motion state of the target to be tracked in all frames before the current frame. The first visibility of the target to be tracked in other frames and other frames labeled with the position information of the target to be tracked are input into the update network. The update network performs feature extraction based on the first visibility of the target to be tracked, the position information of the target to be tracked, and the cumulative motion state of the target to be tracked, and outputs the updated current template features.

[0129] Since the first visibility $p_{pri}$ is a scalar, extracting features from it requires first converting $p_{pri}$ into a tensor. This tensor is then concatenated with the prior template image $z_{pri}$ and input into the update network, which produces the updated template features $f(p_{pri}, z_{pri})$ in place of the prior template features $f(z_{pri})$. The updated template features $f(p_{pri}, z_{pri})$ serve as the updated current template features of other frame images.

[0130] It should be noted that, when the prior template features are updated based on the first visibility rate $p_{pri}$, the updated template features can represent the situation where the target to be tracked is occluded or disappears in other frame images. The search features of the current frame can therefore be convolved with the updated template features to accurately determine the position of the target to be tracked in the current frame image.

[0131] For example, please refer to Figure 10, which is a schematic diagram of the update network provided in an embodiment of this application. As shown in Figure 10, the update network consists of alternating CNNs and RNNs. The first visibility $p_{pri}$ is converted into a tensor of size $(1, H_z, W_z)$ and concatenated with the prior template image $z_{pri}$, yielding a tensor $[p_{pri}, z_{pri}]$ of size $(1+3, H_z, W_z)$. The tensor $[p_{pri}, z_{pri}]$ is input into the update network to obtain the updated template features $f(p_{pri}, z_{pri})$ of size $(64, H_{f(z)}, W_{f(z)})$.
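The construction of the update-network input can be sketched as follows. Filling a constant plane with the scalar visibility is an assumption about how the scalar is converted into a tensor; `build_update_input` is an illustrative name.

```python
import numpy as np


def build_update_input(p: float, z: np.ndarray) -> np.ndarray:
    """Expand scalar visibility p to (1, H, W) and stack it on image z of shape (3, H, W)."""
    _, h, w = z.shape
    p_plane = np.full((1, h, w), p, dtype=z.dtype)   # assumed: constant plane
    return np.concatenate([p_plane, z], axis=0)      # shape (1 + 3, H, W)


z_pri = np.zeros((3, 127, 127), dtype=np.float32)    # placeholder template image
x = build_update_input(0.8, z_pri)
print(x.shape)  # (4, 127, 127)
```

The same construction applies to a posterior template image and a second visibility rate, since only the source of the scalar and the image differs.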

[0132] It should be noted that the weight parameters and latent variable parameters of the recurrent neural network in the update network and the recurrent neural network in the initialization network are shared. The latent variables of the first frame image output by the recurrent neural network in the initialization network participate in the update of the current template features of the second frame image in the update network. The update network outputs the latent variables of the current frame image based on the latent variables of the previous frame image. The latent variables of each frame image in the update network participate in the update of the current template features of the next frame image.

[0133] The target tracking method provided in the above embodiments calculates the first visibility rate of the target in other frame images based on the current template features output by the validation network and the current template features of the first frame image. An update network updates the current template features of other frame images based on the first visibility rate and the position of the target in other frame images. Since the update network includes a convolutional neural network and a recurrent neural network, the position features of the target can be propagated in time and space. The updated current template features and the search features of other frame images constitute asymmetric spatiotemporal features, which ensures that the position of the target can be accurately searched from other frame images and thereby improves the accuracy of target tracking. Furthermore, because the position features are propagated in time and space, loss of tracking caused by occlusion or disappearance of the target can be avoided, ensuring long-term stable target tracking.

[0134] Another possible implementation for obtaining the current template features is described below with reference to Figure 11.

[0135] Please refer to Figure 11, which is a fifth flowchart of the target tracking method provided in the embodiments of this application. As shown in Figure 11, the process of extracting features from the previous frame image labeled with the location information of the target to be tracked, and determining the current template features, may include:

[0136] S37: Based on the tracking results of the previous frame image, mark the position information of the target to be tracked in the previous frame image. The tracking results also include the second visibility rate of the target to be tracked in the previous frame image.

[0137] S38: Use the update network to extract features from the previous frame image together with the second visibility rate and the location information of the target to be tracked, so as to obtain the current template features. The update network comprises a convolutional neural network and a recurrent neural network.

[0138] In this embodiment, since the external target detection algorithm is inefficient and cannot provide a template image with the location information of the target to be tracked for every frame, the target tracking method provided in this application can be used to track the previous frame image and identify the target in it. The location of the target to be tracked is then marked on the previous frame image to generate a posterior template image $z_{pos}$.

[0139] As mentioned above, the tracking result output by the regression network also includes the second visibility rate of the target in the image. The second visibility rate $p_{pos}$ of the previous frame image is converted into a tensor, concatenated with the posterior template image $z_{pos}$, and input into the update network. The resulting posterior template features $f(p_{pos}, z_{pos})$ are used as the current template features of the previous frame image.

[0140] It should be noted that the second visibility $p_{pos}$ of the previous frame image and the posterior template image $z_{pos}$ are input into the update network, and the latent variables input to the update network are the latent variables output for the previous frame.

[0141] The target tracking method provided in the above embodiments, for target tracking in other frame images, uses an update network to extract features from the previous frame image labeled with the location information of the target to be tracked, together with the second visibility rate of the target to be tracked in the previous frame image, to obtain the current template features of the previous frame image. The current template features of the previous frame image and the search features of the current frame image thus constitute asymmetric spatiotemporal features. This ensures that the location of the target to be tracked can be accurately searched from the other frame images, improves the accuracy of target tracking, and ensures long-term stable target tracking.

[0142] The target tracking method provided in the above embodiments is based on a pre-trained target neural network model. The target neural network model consists of a search network, a feature extraction network for the current template features, and a regression network. The feature extraction network for the current template features may include an initialization network, a validation network, and an update network.

[0143] One possible implementation for training the target neural network model is described below with reference to Figure 12.

[0144] Please refer to Figure 12, which is a flowchart of the training steps of the target neural network model provided in an embodiment of this application. As shown in Figure 12, the training steps for the target neural network model may include:

[0145] S51: Obtain a sequence of sample video frames. At least some of the sample images in the video frame sequence include the target to be tracked. The actual location information and the true visibility value of the target to be tracked are pre-annotated in each sample image.

[0146] In this embodiment, the sample video frame sequence is a set of sample images that record the motion of the target to be tracked in a time series. In each sample image of the set of sample images, there may be a complete target to be tracked, a partial target to be tracked, or no target to be tracked. The presence of a partial target to be tracked means that the target to be tracked is occluded by an occluder, and the absence of a target to be tracked means that the target to be tracked has disappeared.

[0147] The actual location information of the target to be tracked is manually labeled in each frame of sample images, and the true visibility value of the target to be tracked in each frame of sample images is calculated based on the actual location information and the complete size information of the target to be tracked.

[0148] S52: Based on the sample images of each frame, the initial neural network model is used to output the sample position information and sample visibility rate of the target to be tracked in each sample image.

[0149] In this embodiment, each frame of sample images is input into the initial search network in the initial neural network model for feature extraction, generating sample search features for each frame of sample images, obtaining the current template features of each frame of sample images, and using the current template features as convolution kernels to convolve the sample search features to obtain a sample feature response map. The sample feature response map is then input into the initial regression network in the initial neural network model. The initial regression network performs target recognition based on the sample feature response map and outputs the sample position information and sample visibility rate of the target to be tracked in each frame of sample images.

[0150] In one possible implementation, if the current frame sample image is the first frame sample image, the initialization network in the initial neural network model can be used to extract features from the first frame sample image labeled with the location information of the target to be tracked, to obtain the sample current template features.

[0151] If the current frame sample image is another frame sample image, the initial validation network in the initial neural network model can be used to extract features from the current frame sample image labeled with the location information of the target to be tracked, to obtain the sample current template features. Furthermore, the initial update network in the initial neural network model can be used to update the current template features of the current frame sample image.

[0152] In some embodiments, the update network in the initial neural network model can be used to extract features based on the sample position information of the target to be tracked in the previous frame sample image and the sample visibility rate of the target to be tracked in the previous frame image to determine the current template features of the sample.

[0153] For example, a convolutional layer can be placed between the initial search network and the initial regression network of the initial neural network model. This convolutional layer uses the current template features of the sample as the convolution kernel to convolve the sample's search features. One input to this convolutional layer is the output of the initial search network, and the other input is the output of the initialization network, the initial validation network, or the initial update network.

[0154] S53: Construct a loss function for each frame of sample images based on the actual location information, true visibility value, sample location information, and sample visibility.

[0155] In this embodiment, the positional deviation between the sample position information and the actual position information is calculated, the visibility deviation between the sample visibility rate and the true visibility rate is calculated, and a loss function for each frame of sample image is constructed based on geometric weights and positional deviation, probability weights and visibility deviation.

[0156] For example, the loss function can be expressed as:

[0157]

[0158] Here, $p_{pri}$ is the first visibility rate, calculated from the sample current template features of other frames output by the initial validation network and the sample current template features of the first frame output by the initialization network, and $p_{pos}$ is the second visibility rate output by the initial regression network. Both are compared with the true visibility value. The tuple $(l, t, r, b)$ is the sample location information output by the initial regression network, and it is compared with the actual location information annotated in the sample image. The actual location information can be calculated from the bounding box of the target annotated in each frame of sample images, using the coordinates of the top-left vertex and the bottom-right vertex of the bounding box. The tuple $(l, t, r, b)$ represents the distances from pixel $(x, y)$ to the left, top, right, and bottom edges of the bounding box.

[0159] For example, the calculation formula can be:

[0160]

[0161]

[0162] Here, s is the scale change rate, an example value of which can be... After the initial regression network determines the bounding box of the target to be tracked, $(l, t, r, b)$ can also be determined according to the above calculation formula.
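The distances $(l, t, r, b)$ from a pixel to the four edges of a bounding box can be sketched as follows. The sketch assumes a hypothetical (x0, y0, x1, y1) box convention with the top-left vertex first, and it deliberately leaves out the scale change rate s, whose exact role in the calculation formula is not reproduced here.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1); top-left, bottom-right


def ltrb(x: float, y: float, box: Box) -> Tuple[float, float, float, float]:
    """Distances from pixel (x, y) to the left, top, right, and bottom edges of box."""
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)


def box_from_ltrb(x: float, y: float, d: Tuple[float, float, float, float]) -> Box:
    """Inverse mapping: recover the bounding box from a pixel and its four distances."""
    l, t, r, b = d
    return (x - l, y - t, x + r, y + b)


box = (30.0, 10.0, 90.0, 70.0)
d = ltrb(50.0, 40.0, box)
print(d)  # (20.0, 30.0, 40.0, 30.0)
```

Because the mapping is invertible, a regression output in this form can be converted back into a bounding box with `box_from_ltrb`.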

[0163] S54: Obtain the total loss function of the sample video frame sequence based on the loss function of each sample image frame.

[0164] In this embodiment, training is performed on each sample image in the sample video frame sequence, and the loss function of each sample image is used to obtain the total loss function of the sample video frame sequence. The total loss function L is:

[0165]
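As an illustration of how a per-frame loss and a sequence-level total loss could be combined, the sketch below uses a weighted sum of a position deviation and a visibility deviation. The mean absolute error, the default weights, and the plain sum over frames are all assumptions made for this sketch; they are not the loss functions of the application.

```python
import numpy as np


def frame_loss(pred_ltrb, true_ltrb, pred_vis, true_vis, w_geo=1.0, w_prob=1.0):
    """Weighted sum of a position deviation and a visibility deviation for one frame."""
    pos_dev = float(np.mean(np.abs(np.asarray(pred_ltrb) - np.asarray(true_ltrb))))
    vis_dev = float(np.mean(np.abs(np.asarray(pred_vis) - np.asarray(true_vis))))
    return w_geo * pos_dev + w_prob * vis_dev   # w_geo, w_prob: assumed weights


def total_loss(frame_losses):
    """Assumed aggregation: plain sum over all frames of the sequence."""
    return float(np.sum(frame_losses))


losses = [
    # Frame 1: small position error, visibility rates (p_pri, p_pos) below the true value.
    frame_loss((20, 30, 40, 30), (22, 30, 38, 30), (0.9, 0.8), 1.0),
    # Frame 2: perfect prediction.
    frame_loss((10, 10, 10, 10), (10, 10, 10, 10), (0.5, 0.5), 0.5),
]
L = total_loss(losses)
print(round(L, 4))  # 1.15
```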

[0166] S55: Update the parameters of the initial neural network model based on the total loss function until the model converges, and obtain the target neural network model.

[0167] In this embodiment, based on the total loss function, an optimization algorithm is used to update the parameters of the initial neural network model. The adjusted neural network model is then used for target tracking again, and the total loss function is recalculated from the output results. If the loss function value meets a preset stopping condition, the target neural network model is obtained. Updating the parameters of the initial neural network model includes updating the weight parameters of the search network, validation network, and regression network, as well as updating the weight parameters and latent variable parameters of the initialization network and the update network. For example, the optimization algorithm can be an Adaptive Moment Estimation (Adam) optimizer.
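The update-until-convergence loop can be illustrated with a small NumPy implementation of the standard Adam update rule applied to a toy quadratic loss. The toy loss, the hyperparameter values, and the stopping threshold are assumptions chosen only to make the sketch runnable; training the actual model would use its own loss and gradients.

```python
import numpy as np


def adam_minimize(grad_fn, loss_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, tol=1e-6, max_steps=5000):
    """Run Adam until loss_fn(theta) <= tol (the stopping condition) or max_steps."""
    theta = np.asarray(theta, dtype=float).copy()
    m = np.zeros_like(theta)            # first-moment estimate
    v = np.zeros_like(theta)            # second-moment estimate
    step = 0
    for step in range(1, max_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** step)     # bias correction
        v_hat = v / (1 - beta2 ** step)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
        if loss_fn(theta) <= tol:
            break
    return theta, step


target = np.array([1.0, -2.0])                       # toy optimum


def loss_fn(theta):
    return float(np.sum((theta - target) ** 2))


def grad_fn(theta):
    return 2.0 * (theta - target)


theta, steps = adam_minimize(grad_fn, loss_fn, np.zeros(2))
print(theta, steps)
```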

[0168] It should be noted that the structures and weights of the CNNs in the search network, validation network, regression network, initialization network, and update network are different, while the weight parameters and latent variable parameters of the RNNs in the initialization network and update network can be shared.

[0169] It should also be noted that, since the external target detection algorithm is inefficient and cannot provide sample template images with the location information of the target to be tracked for every frame of sample images, the validation network is not called when the model is trained using the previous frame image labeled with the location information of the target to be tracked. Instead, the validation network is called at the frame interval of the target detection algorithm, which can be, for example, 30 frames.
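One way to picture this schedule is a small dispatch function that decides, for each frame, which network supplies the current template features. The zero-based frame indexing, the assumption that detections arrive exactly at multiples of the interval, and the returned labels are all hypothetical.

```python
def template_source(frame_index: int, detection_interval: int = 30) -> str:
    """Pick the template-feature path for a frame (zero-based index, assumed schedule)."""
    if frame_index == 0:
        return "initialization"        # first frame: detector + initialization network
    if frame_index % detection_interval == 0:
        return "validation+update"     # detector output available: validate, then update
    return "update"                    # otherwise: previous tracking result + update network


schedule = [template_source(i) for i in (0, 1, 29, 30, 31, 60)]
print(schedule)
```

With a 30-frame interval, the external detector is consulted on only a small fraction of frames, and the remaining frames rely on the tracking result of the previous frame.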

[0170] Based on the embodiments of the target tracking method described above, this application also provides a target tracking device. Please refer to Figure 13, which is a schematic diagram of the target tracking device provided in the embodiments of this application. As shown in Figure 13, the device may include:

[0171] The video frame acquisition module 10 is used to acquire a video frame sequence, wherein at least some of the images in the video frame sequence include the target to be tracked.

[0172] The search feature extraction module 20 is used to extract features from the current frame image using a search network to obtain the search features of the current frame image. The search features represent the position of the target to be tracked in the current frame image. The search network is a convolutional neural network.

[0173] The feature convolution module 30 is used to convolve the search features based on the current template features corresponding to the current frame image to obtain a feature response map. The current template features represent the predicted target position of the target to be tracked in the current frame image. The feature extraction network used to extract the current template features and the search network are different convolutional neural networks.

[0174] The target recognition module 40 is used to input the feature response map into the regression network to obtain the tracking result, which includes the position information of the target to be tracked in the current frame image.

[0175] Optionally, prior to the feature convolution module 30, the device further includes:

[0176] The current template feature acquisition module is used to extract features from the current frame image labeled with the location information of the target to be tracked, and determine the current template features; or, to extract features from the previous frame image labeled with the location information of the target to be tracked, and determine the current template features.

[0177] Optionally, the current template feature acquisition module includes:

[0178] The target location annotation unit is used to perform target detection on the first frame image if the current frame image is the first frame image in the video frame sequence, and to annotate the location information of the target to be tracked in the first frame image by using a target detection algorithm.

[0179] The current template feature acquisition unit is used to extract features from the first frame image labeled with the location information of the target to be tracked using the initialization network, so as to obtain the current template features.

[0180] Optionally, the target location annotation unit is also used to perform target detection on the current frame image using a target detection algorithm if the current frame image is another frame image in the video frame sequence other than the first frame image, and to annotate the location information of the target to be tracked in the current frame image;

[0181] The current template feature acquisition unit is also used to extract features from the current frame image labeled with the location information of the target to be tracked by the verification network to obtain the current template features.

[0182] Optionally, after the current template feature acquisition unit, the device further includes:

[0183] The visibility calculation unit is used to calculate the first visibility of the target to be tracked in the current frame image based on the current template features of the current frame image and the current template features of the first frame image.

[0184] The current template feature update unit is used to extract features, using the update network, from the current frame image together with the first visibility rate and the location information of the target to be tracked, and to update the current template features of the current frame image. The update network comprises a convolutional neural network and a recurrent neural network.

[0185] Optionally, the target location annotation unit is also used to annotate the location information of the target to be tracked in the previous frame image based on the tracking result of the previous frame image. The tracking result also includes: the second visibility rate of the target to be tracked in the previous frame image.

[0186] The current template feature acquisition unit is also used to extract features, using the update network, from the previous frame image together with the second visibility rate and the location information of the target to be tracked, to obtain the current template features. The update network comprises a convolutional neural network and a recurrent neural network.

[0187] Optionally, the target neural network model includes: a search network, a feature extraction network for the current template features, and a regression network; the target neural network model is trained through the following modules:

[0188] The sample video frame acquisition module is used to acquire a sequence of sample video frames. At least some of the sample images in the sample video frame sequence include the target to be tracked. The actual location information and the true visibility value of the target to be tracked are pre-annotated in each sample image.

[0189] The sample target recognition module is used to output the sample location information and sample visibility rate of the target to be tracked in each frame of sample images using an initial neural network model;

[0190] The loss function construction module is used to construct the loss function for each frame of sample image based on the actual location information, the true visibility value, the sample location information, and the sample visibility rate.

[0191] The loss function summary module is used to obtain the total loss function of the sample video frame sequence based on the loss function of each sample image frame;

[0192] The model update module is used to update the parameters of the initial neural network model based on the total loss function until the model converges, thereby obtaining the target neural network model.
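The loss construction described by the above modules can be sketched as follows. This passage does not fix the form of the per-frame loss or how the per-frame losses are combined, so the sketch assumes, for illustration only, a mean absolute error on a four-value box encoding plus a binary cross-entropy on the visibility rate, with the total loss taken as the sum over frames. The names `frame_loss`, `total_loss`, and `vis_weight` are hypothetical.

```python
import math
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # hypothetical encoding: (x, y, width, height)
Frame = Tuple[Box, float, Box, float]    # (true box, true visibility, predicted box, predicted visibility)


def frame_loss(true_box: Box, true_vis: float, pred_box: Box, pred_vis: float,
               vis_weight: float = 1.0) -> float:
    """Per-frame loss: location term plus weighted visibility term (assumed forms)."""
    box_term = sum(abs(t - p) for t, p in zip(true_box, pred_box)) / 4.0
    p = min(max(pred_vis, 1e-7), 1.0 - 1e-7)  # clamp to keep the logarithms finite
    vis_term = -(true_vis * math.log(p) + (1.0 - true_vis) * math.log(1.0 - p))
    return box_term + vis_weight * vis_term


def total_loss(frames: Iterable[Frame]) -> float:
    """Total loss of a sample video frame sequence, assumed here to be the sum over frames."""
    return sum(frame_loss(*frame) for frame in frames)


if __name__ == "__main__":
    sequence = [
        ((10.0, 20.0, 5.0, 5.0), 1.0, (11.0, 19.0, 5.0, 6.0), 0.9),
        ((12.0, 21.0, 5.0, 5.0), 0.0, (12.0, 21.0, 5.0, 5.0), 0.2),
    ]
    print(total_loss(sequence))
```

A training loop would compute this total loss for each sample video frame sequence and update the model parameters from it until convergence.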

[0193] The above-described device is used to execute the method provided in the foregoing embodiments, and its implementation principle and technical effect are similar, so they will not be described again here.

[0194] These modules can be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more microprocessors, or one or more Field Programmable Gate Arrays (FPGAs). Alternatively, when a module is implemented in the form of program code scheduled by a processing element, the processing element can be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. Furthermore, these modules can be integrated together as a system-on-a-chip (SoC).

[0195] Please refer to Figure 14, which is a schematic diagram of the electronic device provided in the embodiments of this application. As shown in Figure 14, the electronic device 100 includes: a memory 101, a processor 102, and a computer program stored on the memory 101. The processor 102 executes the computer program to implement the target tracking method of any of the above embodiments.

[0196] In one possible implementation, this application also provides a computer-readable storage medium having a computer program / instructions stored thereon, which, when executed by a processor, implement the target tracking method of any of the above embodiments.

[0197] In one possible implementation, this application also provides a computer program product, including a computer program / instructions, which, when executed by a processor, implement the target tracking method of any of the above embodiments.

[0198] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division of units is only a logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

[0199] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0200] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in a combination of hardware and software functional units.

[0201] The integrated units implemented as software functional units described above can be stored in a computer-readable storage medium. These software functional units, stored in a storage medium, include several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0202] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A target tracking method, characterized in that the method comprises: acquiring a video frame sequence, wherein at least a portion of the images in the video frame sequence include the target to be tracked; using a search network to extract features from the current frame image to obtain search features of the current frame image, wherein the search features represent the position of the target to be tracked in the current frame image, and the search network is a convolutional neural network; performing convolution on the search features based on the current template features corresponding to the current frame image to obtain a feature response map, wherein the current template features represent the predicted target position of the target to be tracked in the current frame image, and the feature extraction network used to extract the current template features is a different neural network from the search network; and inputting the feature response map into a regression network to obtain a tracking result, wherein the tracking result includes the position information of the target to be tracked in the current frame image; wherein the current template features of the previous frame image are obtained by using an update network to extract features from the previous frame image labeled with the location information of the target to be tracked and the second visibility rate of the target to be tracked in the previous frame image, and the update network is a convolutional neural network and a recurrent neural network.

2. The method as described in claim 1, characterized in that, before convolving the search features based on the current template features corresponding to the current frame image to obtain the feature response map, the method further comprises: performing feature extraction on the current frame image labeled with the location information of the target to be tracked to determine the current template features; or performing feature extraction on the previous frame image labeled with the location information of the target to be tracked to determine the current template features.

3. The method as described in claim 2, characterized in that the step of extracting features from the current frame image labeled with the location information of the target to be tracked and determining the current template features comprises: if the current frame image is the first frame image in the video frame sequence, using a target detection algorithm to detect the target in the first frame image, and marking the position information of the target to be tracked in the first frame image; and using the initialization network to extract features from the first frame image labeled with the location information of the target to be tracked, so as to obtain the current template features.

4. The method as described in claim 2, characterized in that the step of extracting features from the current frame image labeled with the location information of the target to be tracked and determining the current template features comprises: if the current frame image is a frame image other than the first frame image in the video frame sequence, using a target detection algorithm to detect the target in the current frame image, and marking the position information of the target to be tracked in the current frame image; and using a verification network to extract features from the current frame image labeled with the location information of the target to be tracked, so as to obtain the current template features.

5. The method as described in claim 4, characterized in that, after extracting features from the current frame image labeled with the location information of the target to be tracked using the verification network to obtain the current template features, the method further comprises: calculating the first visibility rate of the target to be tracked in the current frame image based on the current template features of the current frame image and the current template features of the first frame image; and using the update network to extract features from the current frame image labeled with the first visibility rate and the location information of the target to be tracked, so as to update the current template features of the current frame image.

6. The method as described in claim 2, characterized in that the step of extracting features from the previous frame image labeled with the location information of the target to be tracked and determining the current template features comprises: marking the position information of the target to be tracked in the previous frame image based on the tracking result of the previous frame image, wherein the tracking result further includes the second visibility rate of the target to be tracked in the previous frame image; and using the update network to extract features from the previous frame image labeled with the second visibility rate and the location information of the target to be tracked, so as to obtain the current template features.

7. The method as described in claim 1, characterized in that the target neural network model includes: the search network, the feature extraction network of the current template features, and the regression network; and the target neural network model is trained through the following steps: obtaining a sequence of sample video frames, wherein at least a portion of the sample images in the sample video frame sequence include the target to be tracked, and each sample image is pre-annotated with the actual location information and the true visibility value of the target to be tracked; using the initial neural network model to output, for each frame of sample image, the sample location information and sample visibility rate of the target to be tracked; constructing a loss function for each frame of sample image based on the actual location information, true visibility value, sample location information, and sample visibility rate corresponding to that frame; obtaining the total loss function of the sample video frame sequence based on the loss function of each frame of sample image; and updating the parameters of the initial neural network model based on the total loss function until the model converges, thereby obtaining the target neural network model.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the method according to any one of claims 1-7.

9. A computer-readable storage medium having a computer program / instructions stored thereon, characterized in that, when the computer program / instructions are executed by a processor, they implement the method described in any one of claims 1-7.

10. A computer program product comprising a computer program / instructions, characterized in that, when the computer program / instructions are executed by a processor, they implement the method described in any one of claims 1-7.