An end-to-end multi-target tracking method based on query bootstrap enhancement and related apparatus
By using query bootstrapping enhancement, the query vector is enhanced with the prediction results of the previous frame image, which simplifies the model structure, resolves the conflict between detection and tracking in end-to-end multi-target tracking, and improves the accuracy and effectiveness of multi-target tracking.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XI AN JIAOTONG UNIV
- Filing Date
- 2024-04-17
- Publication Date
- 2026-06-19
AI Technical Summary
In existing query-based end-to-end multi-target tracking methods, there is a conflict between detection and tracking tasks, which increases model complexity, and existing solutions often introduce additional complexity or hyperparameter tuning.
The query bootstrap enhancement method is adopted. The current frame image is used to extract and encode features through a pre-trained attention-based multi-target tracking model. The query vector is enhanced by the prediction results of the previous frame image. The final classification and regression results are generated by combining feature decoding and the tracking head, which simplifies the model structure.
With only a slight increase in computational complexity and the number of parameters, it improves the accuracy and performance of multi-target tracking, resolves the conflict between detection and tracking, and achieves better target representation capabilities.
Figure CN118247312B_ABST
Description
Technical Field
[0001] This invention belongs to the field of computer vision technology, specifically relating to an end-to-end multi-target tracking method and related apparatus based on query bootstrapping enhancement.
Background Technology
[0002] Multi-object tracking is a highly anticipated research problem in computer vision, with broad application prospects in various systems such as autonomous driving and robot navigation. In recent years, artificial intelligence based on deep learning technology has developed rapidly, achieving breakthroughs in areas such as object detection, image classification, and speech recognition. As one of the hot applications of computer vision, multi-object tracking has also benefited from the continuous innovation of deep neural network technology, resulting in significantly improved tracking accuracy.
[0003] Common query-based end-to-end multi-object tracking methods are generally similar to query-based end-to-end object detection methods. They typically include a feature extractor, a feature encoder, a feature decoder, and a tracking head, and rely on learnable interpreter query vectors to represent different targets. This network design has achieved good results in multi-object tracking tasks. However, because it uses the same feature interpreter for both detection and tracking, conflicts arise between the two tasks. Specifically, the attention of a tracking interpreter query vector sometimes drifts to newly discovered targets. Several solutions to this conflict have emerged, such as introducing additional auxiliary detectors so that the tracking model focuses more on target association (tracking) than on detection, or introducing memory mechanisms so that the model remembers target features over a longer period, thereby improving performance. However, these approaches add significant complexity to the algorithm, especially in preprocessing. For example, introducing additional auxiliary detectors requires generating the detectors' predicted bounding boxes before training, and introducing memory mechanisms requires setting model memory update parameters.
Summary of the Invention
[0004] To address the problems existing in the prior art, the present invention aims to provide an end-to-end multi-target tracking method and related apparatus based on query bootstrapping enhancement. The tracking method of the present invention is a simple end-to-end multi-target tracking method based on query bootstrapping enhancement, which readily yields a better-performing multi-target tracker.
[0005] The technical solution adopted in this invention is as follows:
[0006] An end-to-end multi-target tracking method based on query bootstrapping enhancement includes:
[0007] The current frame image is processed using a pre-trained attention-based multi-object tracking model to obtain the prediction result of the current frame image;
[0008] The process of processing the current frame image using a pre-trained attention-based multi-object tracking model includes:
[0009] Feature images are obtained by extracting features from the current frame image at different resolution scales;
[0010] The feature image is encoded using feature encoding to obtain a feature encoding vector;
[0011] The learnable interpretable query vector of the pre-trained attention-based multi-target tracking model is enhanced by using the feature encoding vector and the prediction result of the pre-trained attention-based multi-target tracking model on the previous frame image to obtain the enhanced query vector; the enhanced query vector is added to the joint query vector of the pre-trained attention-based multi-target tracking model to obtain the final query vector.
[0012] The feature encoding vector and the final query vector are used to obtain the output vector through feature decoding;
[0013] The tracking head is used to interpret the output vector into classification and regression results, and the prediction results of the current frame image are obtained.
[0014] Preferably, the pre-trained attention-based multi-object tracking model is expressed as:
[0015] $F(X^{N\times H\times W};\theta),\ \theta=\{\theta_1,\theta_2,\ldots,\theta_i,\ldots,\theta_n\}$
[0016] Where θ represents the learnable parameters; $\theta_i$ represents the parameters of the i-th layer; n represents the total number of layers in the attention-based multi-object tracking model, i = 1...n; and $X^{N\times H\times W}$ represents the input image, where N, H, and W are the dimension, height, and width of the input image, respectively.
[0017] The loss function of a pre-trained attention-based multi-object tracking model is:
[0018]
[0019] Where the subscripts pre and label denote the predicted value and the true value, respectively; $L_{cls}$, $L_{cardinality}$, and $L_{giou}$ are the classification loss, cardinality loss, and overlap loss, respectively; $l_{pre}$ and $l_{label}$ represent the predicted value and the true value of the category label, respectively; $b_{pre}$ and $b_{label}$ represent the predicted value and the ground truth value of the regression box, respectively; IoU represents the overlap calculation; C represents the minimum bounding box calculation; m represents one of the images in each batch during training; and M represents each batch of images during training.
[0020] Preferably, the process of extracting features from the current frame image at different resolution scales to obtain feature images includes:
[0021] Multiple convolution operations are used to downsample the current frame image and increase the channel dimension, while simultaneously extracting features from the current frame image to obtain a feature image.
[0022] Preferably, the process of encoding the feature image through feature encoding to obtain the feature encoding vector includes:
[0023] After the feature image is flattened, a multi-layer self-attention mechanism is used to abstract and encode the features, and finally a feature encoding vector is generated.
[0024] Preferably, the learnable interpretable query vector of the pre-trained attention-based multi-object tracking model is enhanced by using the feature encoding vector and the prediction result of the pre-trained attention-based multi-object tracking model on the previous frame image. The calculation formula for the enhanced query vector is as follows:
[0025]
[0026] Where θ3 denotes the parameters of the query vector augmentation network F3, whose output is the generated enhanced query vector; $Q^{(n+k)\times dim}$ represents the learnable interpretable query vector of the pre-trained attention-based multi-object tracking model; n represents the number of codes that can be interpreted; k represents the number of targets detected or tracked in the previous frame; dim represents the dimension of the output code; and N′, H′, and W′ represent the dimension, height, and width of the feature image.
[0027] Preferably, the formula for calculating the output vector by feature decoding from the feature encoding vector and the final query vector is as follows:
[0028]
[0029] Where the output vector is produced by F4, the feature decoding network, from the feature encoding vector and the enhanced query vector; θ4 denotes the parameters of F4; n represents the number of codes that can be interpreted; k represents the number of targets detected or tracked in the previous frame; dim represents the dimension of the output encoding; and N′, H′, and W′ represent the dimension, height, and width of the feature image.
[0030] Preferably, the process of interpreting the output vector into classification and regression results using the tracking head includes: using a linear layer to interpret the output vector, and finally generating the classification and regression bounding box predictions required for tracking, as shown in the following formula:
[0031]
[0032]
[0033] Where θ5 and θ6 are the parameters of the classification head and the regression head in the tracking head, respectively; F5 is the classification head and F6 is the regression head of the tracking head, and their outputs are the generated classification and regression bounding box predictions, respectively; both heads take the output vector as input; n represents the number of codes that can be interpreted; k represents the number of targets detected or tracked in the previous frame; and dim represents the dimension of the output code.
[0034] This invention also provides an end-to-end multi-target tracking system based on query bootstrapping enhancement, comprising:
[0035] Multi-object tracking module: Used to process the current frame image using a pre-trained attention-based multi-object tracking model to obtain the prediction result of the current frame image;
[0036] The multi-object tracking module includes:
[0037] Feature extraction module: used to extract features from the current frame image at different resolution scales to obtain feature images;
[0038] Feature encoding module: used to encode the feature image through feature encoding to obtain a feature encoding vector;
[0039] The query bootstrapping enhancement module is used to enhance the learnable interpretable query vector of the pre-trained attention-based multi-target tracking model by utilizing the feature encoding vector and the prediction result of the pre-trained attention-based multi-target tracking model on the previous frame image, thereby obtaining an enhanced query vector; the enhanced query vector is added to the joint query vector of the pre-trained attention-based multi-target tracking model to obtain the final query vector.
[0040] Feature decoding module: used to decode the feature encoding vector and the final query vector to obtain the output vector;
[0041] Tracking head module: Used to interpret the output vector into classification and regression results using the tracking head, and obtain the prediction results of the current frame image.
[0042] The present invention also provides an electronic device, comprising:
[0043] One or more processors;
[0044] A storage device on which one or more programs are stored;
[0045] When the one or more programs are executed by the one or more processors, the one or more processors implement the end-to-end multi-target tracking method based on query bootstrapping enhancement as described above.
[0046] The present invention also provides a storage medium storing a computer program thereon, wherein the computer program, when executed by a processor, implements the end-to-end multi-target tracking method based on query bootstrapping enhancement as described above.
[0047] The present invention has the following beneficial effects:
[0048] The attention-based multi-target tracking model structure of this invention uses a self-attention network as its backbone and employs an innovative query bootstrapping enhancement. This involves using the feature encoding vector and the prediction results of the pre-trained attention-based multi-target tracking model on the previous frame to enhance the learnable interpretable query vector of the pre-trained attention-based multi-target tracking model, resulting in an enhanced query vector. This enhanced query vector is then added to the joint query vector of the pre-trained attention-based multi-target tracking model to obtain the final query vector. This query bootstrapping enhancement improves the interpretability of the query vector in representing the target. Furthermore, the query bootstrapping enhancement proposed in this invention is simple and effective, and does not introduce additional preprocessing operations. Through the innovation of the attention-based multi-target tracking model structure, this invention achieves state-of-the-art results in multi-target tracking with relatively little increase in computational complexity and parameter count.
Attached Figure Description
[0049] Figure 1 is the overall framework of the end-to-end multi-target tracking method based on query bootstrapping enhancement in this invention;
[0050] Figure 2 is a schematic diagram of the query bootstrapping enhancement module of the present invention;
[0051] Figure 3 is a schematic diagram of the proposal box encoding and relocation encoding modules of the present invention.
Detailed Implementation
[0052] Exemplary embodiments of the present invention will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to enable a more thorough understanding of the present disclosure and to fully convey the scope of the disclosure to those skilled in the art. It should be noted that, unless otherwise specified, the embodiments and features described herein can be combined with each other. The present invention will now be described in detail with reference to the accompanying drawings and embodiments.
[0053] The end-to-end multi-target tracking method based on query bootstrapping enhancement provided by this invention has its main innovation in terms of network structure:
[0054] Firstly, for general query-based end-to-end multi-target tracking methods, the final query vector is often a joint query vector obtained by concatenating the learnable interpreted query vector (detection query vector) with the vector generated by query iteration in the previous frame (tracking query vector). Since this joint query vector shares the same feature interpreter, it inevitably leads to conflicts between detection and association (tracking). Based on this, this invention proposes an enhancement method for the interpreted query vector, (1) improving the model's target detection capability, and (2) making the detection and tracking query vectors more distinct, allowing them to focus more on their respective domains. Secondly, the query bootstrapping enhancement proposed in this invention comprises two parts: proposal box encoding and relocation encoding. Proposal box encoding can approximate the detection information of the previous frame as an auxiliary proposal box that requires no additional operation, resulting in better detection capability after the interpreted query vector is enhanced. Relocation encoding can utilize proposal box encoding to obtain high-dimensional abstract features corresponding to each individual interpreted query vector on the feature map, thereby enabling the interpreted query vector to focus more on its corresponding target. Finally, since this invention does not introduce any additional auxiliary information or hyperparameters, it can achieve excellent results even in unfamiliar scenarios through simple training.
[0055] This invention's end-to-end multi-target tracking method based on query bootstrapping enhancement mainly consists of five parts: feature extraction, which extracts visual features of the image; feature encoding, which abstractly encodes the visual features of the image; query bootstrapping enhancement, which uses the prediction results of the previous frame and the feature image of the current frame to enhance the learnable interpretive query vector; feature decoding, which interprets the abstract encoding of the image's visual features using the interpretive query vector; and the tracking head, which interprets the output vector and finally generates the classification and regression box prediction values required for tracking. In this method, the input image is first processed by feature extraction to obtain a feature map; the feature map is then encoded using feature encoding; the learnable interpretive query vector is then updated using query bootstrapping enhancement; and finally, the interpreted output encoded vector is obtained through feature decoding and fed into the tracking head to obtain the final result.
[0056] Referring to Figures 1-3, the end-to-end multi-target tracking method based on query bootstrapping enhancement of this invention comprises the following steps:
[0057] 1) Obtain video sequences from the database, perform image preprocessing, and then divide them into training and testing sets;
[0058] 2) Construct an end-to-end attention-based multi-target tracking model with query bootstrapping enhancement (i.e., the attention-based multi-target tracking model of this invention), and input the training set into the attention-based multi-target tracking model for training;
[0059] 3) Input the test set into the trained attention-based multi-object tracking model for testing, and obtain the performance and objective evaluation of the deep neural network model (i.e., the attention-based multi-object tracking model);
[0060] 4) Input the video sequence to be processed into the attention-based multi-target tracking model that has passed the test, and the output will include the detection box and target identity information.
[0061] The query bootstrapping enhancement method can simply and effectively improve the performance of self-attention-based multi-target tracking algorithms. The specific implementation method of step 2) is as follows:
[0062] 101) The query-bootstrapping-enhanced end-to-end self-attention-based multi-target tracking model is $F(X^{N\times H\times W};\theta)$, $\theta=\{\theta_1,\theta_2,\ldots,\theta_i,\ldots,\theta_n\}$, where θ represents the learnable parameters, $\theta_i$ represents the parameters of the i-th layer, n represents the total number of layers in the multi-scale neural network, $X^{N\times H\times W}$ represents the input image, and N, H, and W are the dimension, height, and width of the input image, respectively. The loss function is:
[0063]
[0064] Where the subscripts pre and label denote the predicted value and the true value, respectively; $L_{cls}$, $L_{cardinality}$, and $L_{giou}$ are the classification loss, cardinality loss, and overlap loss, respectively; $l_{pre}$ and $l_{label}$ represent the predicted value and the true value of the category label, respectively; $b_{pre}$ and $b_{label}$ represent the predicted value and the ground truth value of the regression box, respectively; IoU represents the overlap calculation; C represents the minimum bounding box calculation; ∪ represents the union; m represents one of the images in each batch during training; and M represents each batch of images during training.
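As a hedged illustration of the overlap term only, the sketch below computes a generalized IoU loss for two axis-aligned boxes. It assumes the standard GIoU definition (IoU minus the fraction of the minimum enclosing box not covered by the union); the function name `giou_loss` and the (x1, y1, x2, y2) box format are assumptions for illustration, and the weighting between the classification, cardinality, and overlap terms is not given in the text.

```python
def giou_loss(box_pre, box_label):
    """Generalized IoU loss for two boxes in (x1, y1, x2, y2) format.

    Assumes the standard GIoU definition; not taken from the patent's formula.
    """
    ax1, ay1, ax2, ay2 = box_pre
    bx1, by1, bx2, by2 = box_label

    area_a = max(ax2 - ax1, 0.0) * max(ay2 - ay1, 0.0)
    area_b = max(bx2 - bx1, 0.0) * max(by2 - by1, 0.0)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(min(ax2, bx2) - max(ax1, bx1), 0.0)
    inter_h = max(min(ay2, by2) - max(ay1, by1), 0.0)
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # C: area of the minimum box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    iou = inter / union if union > 0 else 0.0
    giou = iou - (enclose - union) / enclose if enclose > 0 else iou
    return 1.0 - giou


loss_same = giou_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
loss_apart = giou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
```

Unlike a plain IoU loss, this term still varies with distance when the predicted and ground-truth boxes do not overlap, which is why it is commonly used for box regression.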
[0065] 102) Training the query-bootstrapping-enhanced end-to-end self-attention-based multi-target tracking model amounts to finding the optimum of the loss function $L(l_{pre}, l_{label}, b_{pre}, b_{label})$ in order to estimate the optimal value θ′ of the parameters θ in the mapping function F.
[0066] 103) The optimum of the loss function L(x,y) is found in order to estimate the optimal value θ′ of the parameters θ in the mapping function F, specifically:
[0067] $\theta_l^{j+1} = \theta_l^{j} - \eta \dfrac{\partial L(x,y)}{\partial \theta_l^{j}}$
[0068] Where l and j are the index of the neural network layer and the number of iterations, respectively; η is the learning rate; and $\partial L(x,y)/\partial \theta_l^{j}$ is the partial derivative of the loss function L(x,y) with respect to the parameters of the l-th layer at the j-th iteration. After multiple iterations and updates of the parameters in the neural model, the loss function reaches its minimum. At this point, the parameters in the model are the optimal values θ′ of the parameters θ in the mapping function F.
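The iterative update described in step 103) is plain gradient descent. The toy sketch below applies it to a two-parameter quadratic loss; the names `gradient_descent` and `toy_grad` and the loss itself are hypothetical stand-ins, not part of the patented model.

```python
def gradient_descent(grad_fn, theta, lr=0.1, steps=200):
    """Repeatedly move each parameter against its partial derivative."""
    for _ in range(steps):
        grads = grad_fn(theta)
        theta = [t - lr * g for t, g in zip(theta, grads)]
    return theta


def toy_grad(theta):
    # Gradient of the toy loss L = (theta0 - 3)^2 + (theta1 + 1)^2.
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


theta_opt = gradient_descent(toy_grad, [0.0, 0.0])
```

For this convex toy loss the iterates approach the minimizer (3, -1); a real tracking network has a non-convex loss, so the procedure only reaches a local optimum in general.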
[0069] 104) The query-bootstrapping-enhanced end-to-end self-attention-based multi-target tracking model includes feature extraction, feature encoding, query bootstrapping enhancement, feature decoding, and a tracking head. Feature extraction includes multiple convolutional operations to downsample the input current frame image and increase the channel dimension, while simultaneously extracting features from the current frame image to obtain a feature image, as shown in the formula:
[0070]
[0071] Where N, H, and W are the dimension, height, and width of the input current frame image, respectively; the output is the feature image obtained through feature extraction, whose dimension, height, and width are N′, H′, and W′, respectively; and θ1 represents the parameters of the feature extraction process.
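To make the shape change concrete, the following minimal NumPy sketch stacks three stride-2 convolutions so that height and width shrink while the channel dimension grows. The kernel size (2×2, equal to the stride), the channel widths, the ReLU, and the helper name `strided_conv` are all assumptions chosen for brevity; the text does not specify the backbone.

```python
import numpy as np


def strided_conv(x, weight):
    """Convolution whose kernel size equals its stride s.

    x: (C_in, H, W); weight: (C_out, C_in, s, s); returns (C_out, H // s, W // s).
    """
    c_out, c_in, s, _ = weight.shape
    c, h, w = x.shape
    assert c == c_in and h % s == 0 and w % s == 0
    # Split the image into non-overlapping s x s patches.
    patches = x.reshape(c_in, h // s, s, w // s, s)
    # Contract over input channels and the two in-patch axes.
    return np.einsum("ciajb,ocab->oij", patches, weight)


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 32, 32))      # N x H x W input frame
channels = [3, 16, 32, 64]                # hypothetical channel widths
for c_in, c_out in zip(channels[:-1], channels[1:]):
    w = rng.standard_normal((c_out, c_in, 2, 2)) * 0.1
    x = np.maximum(strided_conv(x, w), 0.0)   # downsample, widen, ReLU
feature_image = x                          # N' x H' x W' = 64 x 4 x 4
```

Keeping the intermediate outputs of each stage instead of only the last one would give feature images at several resolution scales, as the method describes.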
[0072] Feature encoding involves flattening the feature image and then using a multi-layer self-attention mechanism to abstractly encode the features, ultimately generating a feature encoding vector. The formula is as follows:
[0073]
[0074] Where θ2 is the parameter in feature encoding, Y1 is the generated feature encoding vector, dim represents the dimension of the output encoding, and F2 is the feature encoding network.
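A minimal sketch of this step, assuming single-head scaled dot-product attention with residual connections and a channel count equal to dim (no input projection, layer normalization, or positional encoding). The names `self_attention` and `encode` are illustrative only.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def encode(feature_image, layers):
    """Flatten a (C, H, W) feature image to (H*W, C) tokens and encode them."""
    c, h, w = feature_image.shape
    tokens = feature_image.reshape(c, h * w).T
    for w_q, w_k, w_v in layers:
        tokens = tokens + self_attention(tokens, w_q, w_k, w_v)  # residual
    return tokens


rng = np.random.default_rng(0)
dim = 8
feature_image = rng.standard_normal((dim, 4, 5))
layers = [tuple(rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
          for _ in range(2)]
y1 = encode(feature_image, layers)   # feature encoding vector, shape (20, 8)
```

Each spatial location becomes one token, so the encoder output has H′·W′ rows of length dim.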
[0075] Query bootstrapping enhancement utilizes the prediction results of the previous frame and the feature image of the current frame to enhance the learnable interpretable query vector, ultimately generating an interpreted query vector. The formula is as follows:
[0076]
[0077] Where θ3 denotes the parameters of the query vector augmentation network F3, whose output is the generated interpretation query vector; $Q^{(n+k)\times dim}$ represents the query vector to be interpreted; n represents the number of codes that can be interpreted; and k represents the number of targets detected or tracked in the previous frame.
[0078] Feature decoding involves using multi-layer self-attention and cross-attention mechanisms to interpret the feature encoding vector, ultimately generating an output encoding vector. The formula is as follows:
[0079]
[0080] Where θ4 denotes the parameters of feature decoding, Y2 is the generated interpretation vector, and F4 is the feature decoding network.
[0081] The tracking head uses a linear layer to interpret the query vector, ultimately generating the classification and regression bounding box predictions needed for tracking. The formula is as follows:
[0082]
[0083]
[0084] Where θ5 and θ6 are the parameters of the classification head and the regression head in the tracking head, respectively; F5 is the classification head and F6 is the regression head of the tracking head, and their outputs are the generated classification and regression bounding box predictions, respectively.
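The tracking head can be sketched as two linear layers applied to every row of the decoder output. The sigmoid that squashes boxes into normalized coordinates, the number of classes, and the name `tracking_head` are assumptions for illustration; the text states only that linear layers are used.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def tracking_head(y2, w_cls, b_cls, w_box, b_box):
    """Map (n + k, dim) output vectors to class logits and bounding boxes."""
    logits = y2 @ w_cls + b_cls            # classification head (F5)
    boxes = sigmoid(y2 @ w_box + b_box)    # regression head (F6), normalized
    return logits, boxes


rng = np.random.default_rng(0)
n_plus_k, dim, num_classes = 6, 16, 2      # hypothetical sizes
y2 = rng.standard_normal((n_plus_k, dim))
logits, boxes = tracking_head(
    y2,
    rng.standard_normal((dim, num_classes)) * 0.1, np.zeros(num_classes),
    rng.standard_normal((dim, 4)) * 0.1, np.zeros(4),
)
```

Each of the n + k query slots yields one class prediction and one box, so a slot that carried a tracked target in the previous frame keeps its identity in the current one.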
[0085] The query bootstrapping enhancement method can simply and effectively improve the performance of self-attention-based multi-target tracking algorithms. The specific implementation method of step 2) above is as follows:
[0086] 201) The convergence speed of the attention-based multi-target tracking model is improved by using the adaptive moment estimation (Adam) gradient optimization method. Given hyperparameters $0 \le \beta_1, \beta_2 \le 1$ and time step t, the momentum term $m_t$, that is, the exponential moving average of the mini-batch stochastic gradient $g_t$, is:
[0087] $m_t = \eta[\beta_1 m_{t-1} + (1-\beta_1) g_t]$
[0088]
[0089] Where η represents the learning rate, and $m_t$ and $v_t$ represent the first and second moments of the gradient, respectively. During the iteration phase, the bias correction formulas for $m_t$ and $v_t$ are:
[0090]
[0091]
[0092] Then, according to the formulas above, each parameter $\mu_t$ is updated:
[0093]
[0094] Where β1, β2, and ε are preset parameters; μ represents the parameters in the model; $\mu_t$ is the value of μ at the t-th step; and $m_t'$ and $v_t'$ are the estimates of the first and second moments of the gradient after bias correction, respectively.
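For orientation, here is one step of adaptive moment estimation in its commonly published form, in which the learning rate is applied at the parameter update rather than inside the first-moment average. That placement, the default hyperparameter values, and the function name `adam_step` are assumptions; the patent's own update formulas are not reproduced in the text.

```python
import math


def adam_step(mu, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of scalar parameters; t starts at 1."""
    # Exponential moving averages of the gradient and its square.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    mu = [p - lr * mh / (math.sqrt(vh) + eps)
          for p, mh, vh in zip(mu, m_hat, v_hat)]
    return mu, m, v


mu, m, v = adam_step([1.0], [2.0], [0.0], [0.0], t=1, lr=0.1)
```

On the first step the corrected moments equal the gradient and its square, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.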
[0095] 202) The query bootstrapping enhancement method includes the following:
[0096] Proposal box encoding:
[0097]
[0098] Where σ represents the Sigmoid activation function, and F is the input to the channel attention module. This represents the dot product operation, and Maxpool represents the max pooling operation. This is the output of the max pooling layer. These are the parameters of the two weight matrices.
[0099] Linear is a linear layer; $Box_{t-1}$ represents the predicted bounding boxes of the previous frame, where n represents the number of codes that can be interpreted, k represents the number of targets detected or tracked in the previous frame, and t represents the current time.
[0100] The relocation encoding is as follows:
[0101]
[0102] Where $f^{7\times 7}$ denotes a convolution operation with a 7×7 kernel, and the result is the output of the spatial attention model.
[0103] The input is the feature image obtained through feature extraction; N′, H′, and W′ represent the dimension, height, and width of the output image; CA represents the cross-attention mechanism; SA represents the self-attention mechanism; and the subscript 4 is shorthand for passing through four identical modules in succession, with the module parameters independent of one another.
[0104] The specific implementation method of step 3) above is as follows:
[0105] 301) The performance and objective metrics of the attention-based multi-scale neural network models obtained through testing include the following:
[0106]
[0107]
[0108]
[0109]
[0110] Where TP represents a target for which both a prediction and a ground truth exist at a given IoU threshold; FP represents a target for which a prediction exists at the given IoU threshold but no ground truth does; and FN represents a target for which no prediction exists at the given IoU threshold but a ground truth does. IDTP, IDFP, and IDFN are the corresponding quantities when the matching criterion is changed from the regression box satisfying a given IoU threshold to the identity information. A(c) is then:
[0111]
[0112] TPA(c), FPA(c), and FNA(c) are the counterparts of TP, FP, and FN for which, in addition to the original matching criterion, the identity information must also be consistent.
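The quantities defined above are the building blocks of identity-based and association-based tracking metrics. The sketch below assumes the commonly used definitions of IDF1, of the association accuracy A(c), and of the HOTA score at a single localization threshold; whether these match the four formulas of the patent cannot be confirmed from the text, and the function names are illustrative.

```python
import math


def idf1(idtp, idfp, idfn):
    """Identity F1 score from identity-level match counts."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


def association_accuracy(tpa, fpa, fna):
    """A(c) for one true-positive match c."""
    denom = tpa + fpa + fna
    return tpa / denom if denom else 0.0


def hota_at_threshold(assoc_scores, fp, fn):
    """HOTA at one IoU threshold; assoc_scores holds A(c) for every TP."""
    denom = len(assoc_scores) + fp + fn
    return math.sqrt(sum(assoc_scores) / denom) if denom else 0.0


score_idf1 = idf1(idtp=8, idfp=2, idfn=2)
score_a = association_accuracy(tpa=6, fpa=2, fna=2)
score_hota = hota_at_threshold([1.0, 1.0, 0.5, 0.5], fp=1, fn=1)
```

Reported HOTA values are usually averaged over a range of IoU thresholds; the function above covers a single threshold only.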
[0113] Example
[0114] Referring to Figures 1-2, the end-to-end multi-target tracking method based on query bootstrapping enhancement in this embodiment mainly includes five steps: feature extraction, feature encoding, query bootstrapping enhancement, feature decoding, and tracking.
[0115] 1) Figure 1 is a schematic diagram of the overall framework of this embodiment. The input to the neural network model that completes the multi-target tracking task is the image $I_{in}$, and the output is the classification and regression result O. During training, five consecutive frames are extracted from a single video segment in the training set as a batch. These five frames are fed into the network sequentially, one frame at a time, and the loss function is calculated for each frame individually. However, gradient backpropagation and parameter updates are performed only after all five frames have been processed. The network learns a function (model) f that satisfies the following relationship:
[0116] $f(I_{in}) = O$
[0117] Specifically, the network first uses a feature extractor to extract four feature images F1, F2, F3, and F4 from the original input image $I_{in}$. Then F1, F2, F3, and F4 are flattened and encoded using feature encoding to obtain the encoded $F_{enc}$. Next, query bootstrapping enhancement is applied to the regression bounding boxes $Box_{t-1}$ obtained from the previous frame and the features F1, F2, F3, and F4 of the current frame to obtain the enhanced query vector $Q_{boost}$. The enhanced query vector and the joint query vector $Q_{joint}$ are then summed to give the final query vector $Q_{new} = Q_{joint} + Q_{boost}$. Finally, the output vector $Q_{out}$ is obtained from $F_{enc}$ and the final query vector $Q_{new}$ through feature decoding, and the tracking head interprets the output vector into the classification and regression results O.
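The training schedule in item 1), where the loss is computed per frame but parameters are updated once per five-frame clip, can be illustrated with a deliberately tiny stand-in model. Everything here is hypothetical: a single scalar weight replaces the network, squared error replaces the tracking loss, and the carry-over of previous-frame predictions is omitted.

```python
def train_on_clip(w, clip, lr=0.001):
    """Process a clip frame by frame, then apply one parameter update.

    clip: list of (x, y) pairs standing in for consecutive frames.
    """
    total_grad = 0.0
    frame_losses = []
    for x, y in clip:                           # one frame at a time
        pred = w * x
        frame_losses.append((pred - y) ** 2)    # loss computed per frame
        total_grad += 2.0 * (pred - y) * x      # gradient accumulated only
    w = w - lr * total_grad                     # single update after the clip
    return w, frame_losses


clip = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
w_new, losses = train_on_clip(0.0, clip)
```

Deferring the update keeps the parameters fixed across the clip, so every frame's loss is evaluated against the same model state.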
[0118] 2) Figure 2 shows the query bootstrapping enhancement module of this embodiment, which comprises two parts, proposal box encoding and relocation encoding, combined using residual connections. The module receives the regression bounding boxes $Box_{t-1}$ obtained from the previous frame and the feature map $FM_t$ extracted from the current frame as input. The regression bounding boxes from the previous frame are first input into the proposal box encoding module to obtain the proposal box encoding $P_t$. Then the proposal box encoding and the current frame feature map are input into the relocation encoding module to obtain the relocation encoding $R_t$. Finally, a residual connection integrates the proposal box encoding and the relocation encoding into the enhanced query vector $Q_{boost}$. The specific process is as follows:
[0119] $P_t = Enc_p(Box_{t-1})$
[0120] $R_t = Enc_r(P_t, FM_t)$
[0121] $Q_{boost} = P_t + R_t$
[0122] 3) Figure 3 shows the proposal box encoding and relocation encoding modules of this embodiment. For proposal box encoding, since the input feature information (regressed bounding boxes) is only 4-dimensional (dim=4), and the encoded features are only 256-dimensional (dim=256), the information content is relatively small. Therefore, this invention adopts a simple single linear layer as the encoding module. The specific process of this module is as follows:
[0123] $P_t = Enc_p(Box_{t-1})$
[0124] $= \mathrm{Linear}(Box_{t-1})$
[0125] For the relocation encoding module, a four-layer separable cross-attention mechanism is used. Each separable cross-attention layer includes both a self-attention mechanism and a cross-attention mechanism. The first layer receives the proposal box encoding $P_t$ and the current frame feature map $FM_t$ as input. For the subsequent layers, the feature map input remains unchanged, while the proposal box encoding is replaced by the output of the previous layer. The specific process of this module is as follows:
[0126]
[0127]
[0128]
[0129]
[0130] Where CA represents cross-attention mechanism and SA represents self-attention mechanism.
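Putting the two parts together, the sketch below encodes the previous frame's boxes with one linear layer, refines the result through four self-attention plus cross-attention layers against the current feature tokens, and adds the residual. Several details are assumptions made for brevity: self-attention is applied before cross-attention, each sub-step has a residual connection, the learned attention projections are omitted, and the feature map is already flattened to tokens of width dim.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys_values):
    """Scaled dot-product attention without learned projections."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values


def proposal_box_encoding(boxes_prev, w, b):
    return boxes_prev @ w + b                 # Linear: 4 -> dim


def relocation_encoding(p_t, fm_t, num_layers=4):
    r = p_t
    for _ in range(num_layers):
        r = r + attention(r, r)               # SA among the query slots
        r = r + attention(r, fm_t)            # CA to the unchanged feature map
    return r


def query_bootstrap_enhancement(boxes_prev, fm_t, w, b):
    p_t = proposal_box_encoding(boxes_prev, w, b)
    r_t = relocation_encoding(p_t, fm_t)
    return p_t + r_t                          # Q_boost = P_t + R_t


rng = np.random.default_rng(0)
n_plus_k, dim, num_tokens = 6, 256, 20        # hypothetical sizes
boxes_prev = rng.uniform(0.0, 1.0, (n_plus_k, 4))
fm_t = rng.standard_normal((num_tokens, dim))
w, b = rng.standard_normal((4, dim)) * 0.1, np.zeros(dim)
q_boost = query_bootstrap_enhancement(boxes_prev, fm_t, w, b)
q_joint = rng.standard_normal((n_plus_k, dim))
q_new = q_joint + q_boost                     # final query vector
```

Because the boxes come from the model's own previous-frame output, no external detector or extra annotation is needed to build the enhancement.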
[0131] It can be seen that the self-attention-based multi-target tracking method provided by this invention utilizes innovative designs suitable for multi-target tracking, such as interpreting query vector bootstrapping enhancement, and effectively solves the problems of detection and association (tracking) conflict and complex training process in existing technologies.
[0132] Furthermore, the query-based bootstrapping enhancement proposed in this invention for multi-object tracking effectively utilizes unsupervised generated suggestion box information for relocalization, thereby significantly alleviating the original problem of detection and association (tracking) conflict and achieving the best performance to date in multi-object tracking tasks. At the same time, unlike other methods, this invention does not require the introduction of additional auxiliary information or hyperparameter tuning. Utilizing the query-based bootstrapping enhancement method can simply and effectively improve the performance of self-attention-based multi-object tracking algorithms while only slightly increasing computational complexity and the number of parameters.
[0133] This invention also provides an end-to-end multi-target tracking system based on query bootstrapping enhancement, used to implement the end-to-end multi-target tracking method based on query bootstrapping enhancement described in any of the above embodiments. The system includes:
[0134] Multi-object tracking module: Used to process the current frame image using a pre-trained attention-based multi-object tracking model to obtain the prediction result of the current frame image;
[0135] The multi-object tracking module includes:
[0136] Feature extraction module: used to extract features from the current frame image at different resolution scales to obtain a feature image;
[0137] Feature encoding module: used to encode the feature image through feature encoding to obtain a feature encoding vector;
[0138] Query bootstrapping enhancement module: used to enhance the learnable interpretable query vector of the pre-trained attention-based multi-target tracking model by using the feature encoding vector and the prediction result of the pre-trained attention-based multi-target tracking model on the previous frame image, thereby obtaining an enhanced query vector; the enhanced query vector is added to the joint query vector of the pre-trained attention-based multi-target tracking model to obtain the final query vector;
[0139] Feature decoding module: used to decode the feature encoding vector and the final query vector to obtain the output vector;
[0140] Tracking head module: Used to interpret the output vector into classification and regression results using the tracking head, and obtain the prediction results of the current frame image.
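The data flow between the five modules listed above can be sketched as a small wiring class. This is an illustration only: every callable is a placeholder, the class and parameter names are hypothetical, and the sketch reads "added to" as element-wise addition of equally shaped arrays, which is an assumption (the text does not rule out concatenation).

```python
from dataclasses import dataclass
from typing import Callable, Tuple
import numpy as np

Array = np.ndarray

@dataclass
class QueryBootstrapTracker:
    """Wires the five modules together; every callable is a placeholder."""
    extract: Callable[[Array], Array]                # frame -> feature image
    encode: Callable[[Array], Array]                 # feature image -> feature encoding
    enhance: Callable[[Array, Array, Array], Array]  # (learnable q, encoding, prev boxes) -> enhanced q
    decode: Callable[[Array, Array], Array]          # (encoding, final q) -> output vector
    head: Callable[[Array], Tuple[Array, Array]]     # output vector -> (class scores, boxes)

    def step(self, frame, learnable_q, joint_q, prev_boxes):
        feature_image = self.extract(frame)
        encoding = self.encode(feature_image)
        enhanced_q = self.enhance(learnable_q, encoding, prev_boxes)
        final_q = joint_q + enhanced_q  # assumption: element-wise addition
        output = self.decode(encoding, final_q)
        return self.head(output)

# Toy stand-ins that only reproduce tensor shapes, not real networks.
D, K = 8, 3  # toy encoding dimension and number of previous-frame targets
tracker = QueryBootstrapTracker(
    extract=lambda frame: np.tile(frame.mean(axis=0)[::4, ::4], (D, 1, 1)),
    encode=lambda fm: fm.reshape(fm.shape[0], -1).T,
    enhance=lambda q, enc, boxes: q + boxes @ np.ones((4, D)),
    decode=lambda enc, q: q + enc.mean(axis=0),
    head=lambda out: (out @ np.ones((D, 2)), out @ np.ones((D, 4))),
)
frame = np.zeros((3, 16, 16))
scores, boxes = tracker.step(frame, np.zeros((K, D)), np.zeros((K, D)),
                             np.full((K, 4), 0.5))
print(scores.shape, boxes.shape)  # (3, 2) (3, 4)
```

Keeping each module behind a plain callable makes the ordering explicit: the previous frame's prediction enters only through the enhancement step, and the decoder sees a single final query.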
[0141] Compared with existing multi-target tracking networks, this invention constructs a neural network model on a query-based end-to-end network. By employing query bootstrapping enhancement, it effectively improves the network's multi-target tracking capability without introducing additional data or hyperparameters.
[0142] The embodiments of the present invention also provide corresponding electronic devices and computer-readable storage media for implementing the solutions provided in the embodiments of the present invention.
[0143] The electronic device includes a storage device and a processor. The storage device is used to store instructions or code, and the processor is used to execute the instructions or code so that the electronic device performs the end-to-end multi-target tracking method based on query bootstrapping enhancement described in any embodiment of this application.
[0144] In practical applications, the computer-readable storage medium can be any combination of one or more computer-readable media. The computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium.
[0145] Although the present invention has been described in detail above with general descriptions and specific embodiments, modifications or improvements can be made to it, which will be obvious to those skilled in the art. Therefore, all such modifications or improvements made without departing from the spirit of the present invention fall within the scope of protection claimed by the present invention.
Claims
1. An end-to-end multi-target tracking method based on query bootstrapping enhancement, characterized in that it comprises: processing the current frame image using a pre-trained attention-based multi-object tracking model to obtain the prediction result of the current frame image; wherein processing the current frame image using the pre-trained attention-based multi-object tracking model comprises: extracting features from the current frame image at different resolution scales to obtain a feature image; encoding the feature image through feature encoding to obtain a feature encoding vector; enhancing the learnable interpretable query vector of the pre-trained attention-based multi-object tracking model using the feature encoding vector and the prediction result of the pre-trained attention-based multi-object tracking model on the previous frame image to obtain an enhanced query vector, and adding the enhanced query vector to the joint query vector of the pre-trained attention-based multi-object tracking model to obtain the final query vector; the calculation formula for the enhanced query vector is as follows: where the quantities defined are, in order: the parameters in feature encoding; the generated enhanced query vector; the learnable interpretable query vector of the pre-trained attention-based multi-object tracking model; the number of codes that can be decoded; the number of targets detected or tracked in the previous frame; the dimension of the output encoding; the query vector enhancement network; the feature image; and the dimension, height, and width of the feature image; obtaining the output vector from the feature encoding vector and the final query vector through feature decoding; the calculation formula for the output vector is as follows: where the quantities defined are, in order: the output vector; the feature decoding network; the feature encoding vector; the enhanced query vector; the parameters in feature encoding; the number of codes that can be decoded; the number of targets detected or tracked in the previous frame; the dimension of the output encoding; and the dimension, height, and width of the feature image; and interpreting the output vector into classification and regression results using the tracking head to obtain the prediction result of the current frame image.
2. The end-to-end multi-target tracking method based on query bootstrapping enhancement according to claim 1, characterized in that the pre-trained attention-based multi-object tracking model is as follows: where the quantities defined are, in order: the learnable parameters; the parameters of the i-th layer; the total number of layers n of the attention-based multi-object tracking model, with i = 1, …, n; the input image; and N, H, W, the dimension, height, and width of the input image, respectively; the loss function of the pre-trained attention-based multi-object tracking model is: where the quantities defined are, in order: the predicted value and the true value, respectively; the classification loss, cardinality loss, and overlap loss, respectively; the predicted and true values of the category labels, respectively; the predicted and true values of the regression box, respectively; the calculation of overlap; the calculation of the minimum bounding box; m, one image in each batch during training; and M, each batch of images during training.
3. The end-to-end multi-target tracking method based on query bootstrapping enhancement according to claim 1, characterized in that the process of extracting features from the current frame image at different resolution scales to obtain a feature image comprises: using multiple convolution operations to downsample the current frame image and increase the channel dimension, while simultaneously extracting features from the current frame image to obtain the feature image.
4. The end-to-end multi-target tracking method based on query bootstrapping enhancement according to claim 1, characterized in that the process of encoding the feature image through feature encoding to obtain the feature encoding vector comprises: flattening the feature image, then using a multi-layer self-attention mechanism to abstract and encode the features, and finally generating the feature encoding vector.
5. The end-to-end multi-target tracking method based on query bootstrapping enhancement according to claim 1, characterized in that the process of interpreting the output vector into classification and regression results using the tracking head comprises: applying a linear layer to the output vector to interpret it, and finally generating the classification and regression bounding box predictions required for tracking, as shown in the following formula: where the quantities defined are, in order: the parameters of the classification head and the regression head in the tracking head, respectively; the generated classification and regression bounding box predictions, respectively; the classification head in the tracking head; the regression head in the tracking head; the output vector; the number of codes that can be decoded; the number of targets detected or tracked in the previous frame; and the dimension of the output encoding.
6. An end-to-end multi-target tracking system based on query bootstrapping enhancement, used to implement the end-to-end multi-target tracking method based on query bootstrapping enhancement as described in any one of claims 1 to 5, characterized in that it comprises: a multi-object tracking module, used to process the current frame image using a pre-trained attention-based multi-object tracking model to obtain the prediction result of the current frame image; wherein the multi-object tracking module comprises: a feature extraction module, used to extract features from the current frame image at different resolution scales to obtain a feature image; a feature encoding module, used to encode the feature image through feature encoding to obtain a feature encoding vector; a query bootstrapping enhancement module, used to enhance the learnable interpretable query vector of the pre-trained attention-based multi-target tracking model by using the feature encoding vector and the prediction result of the pre-trained attention-based multi-target tracking model on the previous frame image, thereby obtaining an enhanced query vector, the enhanced query vector being added to the joint query vector of the pre-trained attention-based multi-target tracking model to obtain the final query vector; a feature decoding module, used to decode the feature encoding vector and the final query vector to obtain the output vector; and a tracking head module, used to interpret the output vector into classification and regression results using the tracking head and to obtain the prediction result of the current frame image.
7. An electronic device, characterized in that it comprises: one or more processors; and a storage device on which one or more programs are stored; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the end-to-end multi-target tracking method based on query bootstrapping enhancement as described in any one of claims 1 to 5.
8. A storage medium, characterized in that it stores a computer program, wherein the computer program, when executed by a processor, implements the end-to-end multi-target tracking method based on query bootstrapping enhancement as described in any one of claims 1 to 5.