Target tracking method, and training method and device of target tracking model

By fusing channel-level features using a dynamic multilayer perceptron and an attention network, the problem of insufficient target tracking accuracy of linear Kalman filters in nonlinear systems is solved, achieving higher tracking accuracy and stability.

CN116612152B · Active · Publication Date: 2026-06-16 · JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD
Filing Date
2023-05-26
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In existing technologies, linear Kalman filters lead to inaccurate target position prediction in nonlinear dynamic systems, especially in crowded scenarios or scenarios where target motion is highly nonlinear, resulting in decreased tracking accuracy.

Method used

By employing a dynamic multilayer perceptron and an attention network, a sequence of second feature vectors is generated, and channel-level features are fused to improve the accuracy of target object detection boxes. Feature extraction and prediction are performed using a dynamic fully connected layer and an attention model.

Benefits of technology

It improves the accuracy and stability of target tracking, especially in cases with complex motion patterns and consistent target appearance, where it outperforms the Kalman filter.

Generated by Eureka AI based on patent content.


Abstract

This disclosure relates to a target tracking method and a training method and apparatus for a target tracking model, and pertains to the field of artificial intelligence technology. The target tracking method includes: acquiring an image sequence; generating a sequence of first feature vectors based on a sequence of detection boxes of target objects in multiple images; generating a sequence of second feature vectors based on the sequence of first feature vectors, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, determining a related element E[k,j] of the element E[i,j], and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors; and determining the detection box of the target object in the target image based on the sequence of second feature vectors. According to this disclosure, the accuracy of target tracking is improved.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence technology, and in particular to target tracking methods, and to training methods and apparatus for target tracking models.

Background Technology

[0002] Object tracking, a key technology in computer vision, has wide applications in autonomous driving, intelligent monitoring, robot navigation, and sports video analysis. Multi-object tracking involves detecting and tracking the trajectories of objects such as pedestrians, cars, and animals in videos to enable subsequent tasks such as trajectory prediction and precise location.

[0003] Kalman filtering is a technique that uses the state equations of a linear system to estimate the system state by observing the system's input and output data. In related technologies, linear Kalman filters are used to model the motion state of a target object.

Summary of the Invention

[0004] According to a first aspect of this disclosure, a target tracking method is provided, comprising: acquiring an image sequence, wherein the image sequence includes a target image and a plurality of images preceding the target image; generating a sequence of first feature vectors based on a sequence of detection boxes of target objects in the plurality of images, wherein the sequence of first feature vectors includes N first feature vectors, each first feature vector including M elements corresponding one-to-one with M channels, where N and M are both positive integers; generating a sequence of second feature vectors based on the sequence of first feature vectors, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, determining a related element E[k,j] of the element E[i,j], wherein the related element E[k,j] is the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j, and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors; and determining the detection box of the target object in the target image based on the sequence of second feature vectors.

[0005] In some embodiments, determining, for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, the related element E[k,j] of the element E[i,j] includes: using a first fully connected layer to calculate an offset δ_j of the element E[i,j] based on the i-th first feature vector, where δ_j is an integer; and calculating k based on the offset δ_j.

[0006] In some embodiments, calculating k based on the offset δ_j includes: determining k based on the sum of i and δ_j.

[0007] In some embodiments, the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors is positively correlated with the value of the element E[k,j].

[0008] In some embodiments, determining the detection box of the target object in the target image based on the sequence of second feature vectors includes: using a second fully connected layer to generate the i-th third feature vector in the sequence of third feature vectors based on the i-th second feature vector in the sequence of second feature vectors; and determining the detection box of the target object in the target image based on the sequence of third feature vectors.

[0009] In some embodiments, determining the detection box of a target object in a target image based on the sequence of third feature vectors includes: generating a sequence of fourth feature vectors based on the sequence of first feature vectors and the sequence of third feature vectors; generating a sequence of fifth feature vectors based on the sequence of detection boxes of target objects in multiple images using an attention network; and determining the detection box of the target object in the target image based on the sequence of fourth feature vectors and the sequence of fifth feature vectors.

[0010] In some embodiments, generating a sequence of fourth feature vectors based on a sequence of first feature vectors and a sequence of third feature vectors includes: generating the i-th fourth feature vector in the sequence of fourth feature vectors based on a weighted sum of the i-th first feature vector and the i-th third feature vector.

[0011] In some embodiments, determining the detection box of the target object in the target image based on the sequence of the fourth feature vector and the sequence of the fifth feature vector includes: determining the detection box of the target object in the target image based on the weighted sum of the sequence of the fourth feature vector and the sequence of the fifth feature vector.

[0012] In some embodiments, generating a sequence of first feature vectors based on a sequence of detection boxes of target objects in multiple images includes: for the i-th image of the multiple images, calculating a first change in the attribute of the detection box of the target object in the i-th image relative to the attribute of the detection box of the target object in the (i-1)-th image; determining the features of the detection box of the i-th image based on the first change in the attribute of the detection box of the target object in the i-th image relative to the attribute of the detection box of the target object in the (i-1)-th image; and generating the i-th first feature vector based on the features of the detection box of the i-th image.

[0013] In some embodiments, the attributes include at least one of the following: the height and width of the detection box, the ratio of the width to the height, and the coordinates of the center point of the detection box.

[0014] In some embodiments, the plurality of images include a previous image of the target image. Determining a detection box of a target object in the target image based on a sequence of second feature vectors includes: determining a second change in the attributes of the detection box of the target object in the target image relative to the attributes of the detection box of the target object in the previous image of the target image based on the sequence of second feature vectors; and determining a detection box of the target object in the target image based on the second change and the detection box of the target object in the previous image of the target image.
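The last step of paragraph [0014] can be sketched as follows. This is only an illustration under assumptions: the box layout `(cx, cy, w, h)`, the function name `apply_change`, and the choice to treat the second change as either a difference or a ratio (mirroring how the first change is described elsewhere in this disclosure) are not taken from the patent text.

```python
import numpy as np

def apply_change(prev_box, change, use_ratio=False):
    """Recover the target-image box from the previous box and a predicted change.

    prev_box, change: length-4 sequences laid out as (cx, cy, w, h) (assumed layout).
    use_ratio=False treats the change as a difference, True as a ratio.
    """
    prev_box = np.asarray(prev_box, dtype=float)
    change = np.asarray(change, dtype=float)
    return prev_box * change if use_ratio else prev_box + change

# Previous box centered at (12, 21), 4 wide and 10 high; predicted change as a difference.
predicted_box = apply_change([12.0, 21.0, 4.0, 10.0], [3.0, 2.0, 1.0, 0.0])
```

Predicting a change rather than absolute coordinates keeps the regression target small and independent of where the object sits in the frame.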

[0015] According to a second aspect of this disclosure, a method for training a target tracking model is provided, comprising: acquiring an image sequence, wherein the image sequence includes a target image and multiple images preceding the target image; using the target tracking model, generating a sequence of first feature vectors based on a sequence of detection boxes of target objects in the multiple images, wherein the sequence of first feature vectors includes N first feature vectors, each first feature vector including M elements corresponding one-to-one with M channels, where N and M are both positive integers; generating a sequence of second feature vectors based on the sequence of first feature vectors, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, determining a related element E[k,j] of the element E[i,j], wherein the related element E[k,j] is the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j, and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors; generating a prediction result of the detection box of the target object in the target image based on the sequence of second feature vectors; and training the target tracking model based on the prediction result.

[0016] According to a third aspect of this disclosure, a target tracking apparatus is provided, comprising: an acquisition unit configured to acquire an image sequence, wherein the image sequence includes a target image and a plurality of images preceding the target image; a first generation unit configured to generate a sequence of first feature vectors based on a sequence of detection boxes of a target object in the plurality of images, wherein the sequence of first feature vectors includes N first feature vectors, each first feature vector including M elements corresponding one-to-one with M channels, where N and M are both positive integers; a second generation unit configured to generate a sequence of second feature vectors based on the sequence of first feature vectors, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, determining a related element E[k,j] of the element E[i,j], wherein the related element E[k,j] is the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j, and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors; and a determining unit configured to determine the detection box of the target object in the target image based on the sequence of second feature vectors.

[0017] According to a fourth aspect of this disclosure, a training apparatus for a target tracking model is provided, comprising: an acquisition unit configured to acquire an image sequence, wherein the image sequence includes a target image and a plurality of images preceding the target image; a first generation unit configured to generate a sequence of first feature vectors based on a sequence of detection boxes of target objects in the plurality of images, wherein the sequence of first feature vectors includes N first feature vectors, each first feature vector including M elements corresponding one-to-one with M channels, where N and M are both positive integers; a second generation unit configured to generate a sequence of second feature vectors based on the sequence of first feature vectors, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, determining a related element E[k,j] of the element E[i,j], wherein the related element E[k,j] is the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j, and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors; a determining unit configured to determine the detection box of the target object in the target image based on the sequence of second feature vectors; and a training unit configured to train the target tracking model based on the prediction results.

[0018] According to a fifth aspect of this disclosure, an electronic device is provided, comprising: a memory; and a processor coupled to the memory, the processor being configured to execute a target tracking method or a target tracking model training method according to any embodiment of this disclosure based on instructions stored in the memory.

[0019] According to a sixth aspect of this disclosure, a computer-readable storage medium is provided that stores computer program instructions thereon, which, when executed by a processor, implement a target tracking method or a target tracking model training method according to any embodiment of this disclosure.

Description of the Drawings

[0020] The accompanying drawings, which form part of this specification, illustrate embodiments of this disclosure and, together with the specification, serve to explain the principles of this disclosure.

[0021] This disclosure will become clearer with reference to the accompanying drawings and the following detailed description, wherein:

[0022] Figure 1 shows a flowchart of a target tracking method according to some embodiments of the present disclosure;

[0023] Figure 2 shows a schematic diagram of the detection boxes and trajectory of a target object according to some embodiments of the present disclosure;

[0024] Figure 3 shows a flowchart of generating a sequence of second feature vectors according to some embodiments of the present disclosure;

[0025] Figure 4 shows a schematic diagram of determining a second feature vector according to some embodiments of the present disclosure;

[0026] Figure 5 shows a schematic diagram of a dynamic multilayer perceptron according to some embodiments of the present disclosure;

[0027] Figure 6 shows a schematic diagram of a dynamic fully connected layer according to some embodiments of the present disclosure;

[0028] Figure 7 shows a schematic diagram of a target tracking model according to some embodiments of the present disclosure;

[0029] Figure 8 shows a schematic diagram of a training method for a target tracking model according to some embodiments of the present disclosure;

[0030] Figure 9 shows a block diagram of a target tracking apparatus according to some embodiments of the present disclosure;

[0031] Figure 10 shows a block diagram of a training apparatus for a target tracking model according to some embodiments of the present disclosure;

[0032] Figure 11 shows a block diagram of an electronic device according to other embodiments of the present disclosure;

[0033] Figure 12 shows a block diagram of a computer system for implementing some embodiments of the present disclosure.

Detailed Description

[0034] Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present disclosure.

[0035] At the same time, it should be understood that, for ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual scale.

[0036] The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit this disclosure or its application or use.

[0037] Techniques, methods, and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques, methods, and equipment should be considered part of the specification.

[0038] In all examples shown and discussed herein, any specific values should be interpreted as merely exemplary and not as limitations. Therefore, other examples of exemplary embodiments may have different values.

[0039] It should be noted that similar labels and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be discussed further in subsequent figures.

[0040] In related technologies, linear Kalman filters are used to model the motion state of a target object. Kalman filters assume the system state is linear and the noise is Gaussian. However, in nonlinear dynamic system models, the system state changes nonlinearly, and its noise distribution is not always Gaussian. These nonlinear factors cause deviations between the Kalman filter's estimates and the true values, leading to inaccurate target object position predictions and consequently decreased tracking accuracy. The tracking performance of classic linear Kalman filters is particularly limited in crowded scenarios with highly nonlinear target motion, such as dense crowds or high-speed movement.
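The limitation described in [0040] can be illustrated with a small numerical sketch. It is not a full Kalman filter: it uses only a constant-velocity extrapolation, the kind of linear motion model such filters commonly rely on for their prediction step, and the trajectories and the function name `constant_velocity_predict` are made up for illustration.

```python
import numpy as np

def constant_velocity_predict(prev, curr):
    """Extrapolate the next position assuming the last displacement repeats."""
    return curr + (curr - prev)

# Straight-line motion: positions at three consecutive time steps.
line = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
line_error = np.linalg.norm(constant_velocity_predict(line[0], line[1]) - line[2])

# Circular motion on the unit circle, turning 30 degrees per time step.
angles = np.deg2rad([0.0, 30.0, 60.0])
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
circle_error = np.linalg.norm(constant_velocity_predict(circle[0], circle[1]) - circle[2])

print(f"linear motion error:   {line_error:.3f}")
print(f"circular motion error: {circle_error:.3f}")
```

The linear model is exact on the straight trajectory but misses the curved one at every step, which is the kind of systematic error a learned motion predictor is meant to reduce.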

[0041] This disclosure provides a target tracking method, a target tracking model training method and apparatus, and a computer-readable medium, which improve the accuracy of target tracking.

[0042] Figure 1 shows a flowchart of a target tracking method according to some embodiments of the present disclosure.

[0043] As shown in Figure 1, the target tracking method includes steps S11-S14. In some embodiments, the target tracking method is performed by a target tracking device.

[0044] In step S11, an image sequence is obtained, wherein the image sequence includes the target image and multiple images preceding the target image.

[0045] In step S12, a sequence of first feature vectors is generated based on the sequence of detection boxes of target objects in multiple images. The sequence of first feature vectors includes N first feature vectors, and each first feature vector includes M elements corresponding one-to-one with M channels, where N and M are both positive integers.

[0046] For example, for each target object, the observations of its detection boxes in the images at the past n_past time steps are converted into feature vectors, thus obtaining the sequence of first feature vectors, where n_past is a positive integer.

[0047] Figure 2 shows a schematic diagram of the detection boxes and trajectory of a target object according to some embodiments of this disclosure.

[0048] As shown in Figure 2, the multiple images preceding the target image are sorted into an image sequence in time order (e.g., by frame number). The detection boxes of the target object in each image are then obtained, resulting in a sequence of detection boxes.

[0049] In target tracking scenarios, the trajectories (tracklets) of one or more target objects can be tracked simultaneously. Given the set of all target object trajectories T = {T_1, T_2, ...}, the trajectory T_l of any target object l is represented by an ordered set of detection boxes (i.e., a sequence of detection boxes), for example:

[0050]

[0051] where each element represents the detection box of trajectory T_l at a given time (e.g., time t1).

[0052] In some embodiments, generating a sequence of first feature vectors based on a sequence of detection boxes of target objects in multiple images includes: for the i-th image of the multiple images, calculating a first change in the attribute of the detection box of the target object in the i-th image relative to the attribute of the detection box of the target object in the (i-1)-th image; determining the features of the detection box in the i-th image based on the first change in the attribute of the detection box of the target object in the i-th image relative to the attribute of the detection box of the target object in the (i-1)-th image; and generating the i-th first feature vector based on the features of the detection box in the i-th image. Wherein, starting from the second image, each image corresponds to one first feature vector.

[0053] In some embodiments, the attributes include at least one of the following: the height and width of the detection box, the ratio of the width to the height, and the coordinates of the center point of the detection box.

[0054] For example, the features (i.e., input representation) of the target object's historical trajectory are shown below:

[0055] X = (…, x_{t-2}, x_{t-1})

[0056] where X ∈ ℝ^{N×M}; that is, the dimension of X is N×M.

[0057] In Figure 2, the input representation consists of the features of the detection boxes of the target objects in the images. Each cell represents the features of one detection box, and the cells in the same row belong to the same target object. Hollow cells indicate that the detection box is missing in that frame, for example because the target object does not exist in that frame or was not detected.

[0058] The features of the target object in the i-th image are:

[0059]

[0060] where (C_x, C_y) are the center coordinates of the detection box of the target object in the i-th image, w, h, and a represent attributes of the detection box, and the δ terms (such as δ_w and δ_h) represent the first changes of the attributes. For example, w, h, and a are the width, height, and width-to-height ratio of the detection box, respectively, and the δ terms are the changes in the center coordinates C_x and C_y, the width w, and the height h relative to the detection box in the image at the previous observation time t-2.

[0061] In some embodiments, the first change in the attribute of the detection box of the target object in the i-th image relative to the attribute of the detection box of the target object in the (i-1)-th image is the difference or ratio between the attribute of the detection box of the target object in the i-th image and the attribute of the detection box of the target object in the (i-1)-th image.
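A minimal sketch of this feature construction is given below. The exact feature layout is not recoverable from this text, so the ordering `(cx, cy, w, h, a, d_cx, d_cy, d_w, d_h)` and the function name `box_features` are assumptions made for illustration.

```python
import numpy as np

def box_features(boxes, use_ratio=False):
    """Build per-image features from a time-ordered sequence of detection boxes.

    boxes: array of shape (T, 4) with rows (cx, cy, w, h).
    Returns shape (T - 1, 9): starting from the second image, one feature row per
    image, laid out as (cx, cy, w, h, a, d_cx, d_cy, d_w, d_h) (assumed layout).
    """
    boxes = np.asarray(boxes, dtype=float)
    curr, prev = boxes[1:], boxes[:-1]
    # First change of each attribute: a difference or a ratio to the previous image.
    # The ratio form needs non-zero previous values.
    change = curr / prev if use_ratio else curr - prev
    aspect = (curr[:, 2] / curr[:, 3])[:, None]  # width-to-height ratio a
    return np.concatenate([curr, aspect, change], axis=1)

boxes = [[10, 20, 4, 8], [12, 21, 4, 10], [15, 23, 5, 10]]
features = box_features(boxes)
```

Because the change is taken relative to the previous image, the first image only serves as a reference and produces no feature row of its own.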

[0062] In some embodiments, an embedding network is used to transform the features of the target object in the i-th image into a higher-dimensional space to obtain the input embedding.

[0063] Then, sinusoidal encoding information is added to the input embedding to obtain the sequence E of first feature vectors, so that the input embedding includes relative position information. In the sequence E of first feature vectors, each first feature vector is a token.
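The embedding and encoding steps of [0062] and [0063] might look roughly like the sketch below. The text does not give the encoding formula or the structure of the embedding network, so this assumes the standard Transformer-style sinusoidal table and uses a single linear layer with random weights as a stand-in; `sinusoidal_encoding`, `embed_tokens`, and all sizes are hypothetical.

```python
import numpy as np

def sinusoidal_encoding(n_positions, dim):
    """Transformer-style sinusoidal position table of shape (n_positions, dim); dim must be even."""
    positions = np.arange(n_positions)[:, None]
    freqs = np.exp(-np.log(10000.0) * (np.arange(0, dim, 2) / dim))
    table = np.zeros((n_positions, dim))
    table[:, 0::2] = np.sin(positions * freqs)  # even channels
    table[:, 1::2] = np.cos(positions * freqs)  # odd channels
    return table

def embed_tokens(features, weight, bias):
    """Project raw box features to M channels, then add position information."""
    tokens = features @ weight + bias  # linear stand-in for the embedding network
    return tokens + sinusoidal_encoding(tokens.shape[0], tokens.shape[1])

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 9))  # N = 5 images, 9 raw features each
weight = rng.normal(size=(9, 16))   # M = 16 channels
bias = np.zeros(16)
E = embed_tokens(features, weight, bias)  # sequence of first feature vectors, shape (N, M)
```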

[0064] In step S13, a sequence of second feature vectors is generated based on the sequence of the first feature vectors.

[0065] For example, in target tracking, continuous observations of the position of a target object (represented by a detection box) correspond to different motion patterns. In the feature space, since different positional attributes of the first feature vector (e.g., the coordinates and relative position changes of the detection box corresponding to a first feature vector) are distributed in different channels, it is difficult to model the relationship between different first feature vectors. Therefore, this disclosure adds a Dynamic Multilayer Perceptron to fuse channel-level features in the first feature vector to generate a sequence of second feature vectors.

[0066] Figure 3 shows a flowchart of generating a sequence of second feature vectors according to some embodiments of the present disclosure.

[0067] As shown in Figure 3, step S13 includes steps S131 and S132.

[0068] In step S131, for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, a related element E[k,j] of the element E[i,j] is determined, wherein the related element E[k,j] is the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j, and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal.

[0069] Here, the related element E[k,j] is an element related to the element E[i,j]. For example, for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector, the target tracking model determines that the element E[k,j] corresponding to the j-th channel of another first feature vector in the sequence is related to the element E[i,j]; the element E[k,j] is thus an element that needs attention.

[0070] In some embodiments, determining, for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, the related element E[k,j] of the element E[i,j] includes: using a first fully connected layer to calculate an offset δ_j of the element E[i,j] based on the i-th first feature vector, where δ_j is an integer; and calculating k based on the offset δ_j.

[0071] Figure 4 shows a schematic diagram of determining a second feature vector according to some embodiments of the present disclosure.

[0072] As shown in Figure 4, each column of squares represents a first feature vector e_i in E, and each square represents an element. The light-colored square containing a pentagram represents the element E[i,j], and the dark squares represent the related elements E[k,j]. The offset δ_j is the distance between the element E[i,j] and the related element E[k,j] of the same channel. For example, the related element E[k,j] is found by offsetting two units to the left or one unit to the right from the element E[i,j].

[0073] The first fully connected (FC) layer is used to predict the offset δ_j of the element E[i,j] in the j-th channel. For example, e_i is taken as the input to the first fully connected layer, and the output is the set of offsets δ_j of all elements of e_i.

[0074] In some embodiments, calculating k based on the offset δ_j includes: determining k based on the sum of i and δ_j. For example, k = i + δ_j.

[0075] In step S132, the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors is determined according to the value of the related element E[k,j].

[0076] Because the offset of the element E[k,j] relative to the element E[i,j] is not constrained in step S131, E[k,j] is searched for within the entire sequence of first feature vectors. In step S132, the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors is then determined according to E[k,j], so that the second feature vectors can aggregate global channel information.

[0077] In some embodiments, the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors is positively correlated with the value of the element E[k,j].
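Steps S131 and S132 can be sketched as a per-channel gather. The sketch makes several assumptions that the text does not specify: offsets are made integer by rounding the output of the first fully connected layer, out-of-range indices are clipped to the sequence, the second feature vector simply copies E[k,j] (the simplest positively correlated choice), and indices are 0-based. It also does not enforce that i and k differ. The name `gather_related_elements` is hypothetical.

```python
import numpy as np

def gather_related_elements(E, offset_weight, offset_bias):
    """Per-channel gather along the token axis.

    E: (N, M) sequence of first feature vectors.
    offset_weight: (M, M), offset_bias: (M,): parameters of the first FC layer.
    Returns (second, k) with second[i, j] = E[k[i, j], j].
    """
    n_tokens, _ = E.shape
    raw = E @ offset_weight + offset_bias      # one raw offset per token and channel
    delta = np.rint(raw).astype(int)           # integer offsets delta_j (rounding is an assumption)
    i = np.arange(n_tokens)[:, None]
    k = np.clip(i + delta, 0, n_tokens - 1)    # k = i + delta_j, clipped to the sequence (assumption)
    second = np.take_along_axis(E, k, axis=0)  # pick E[k, j] independently for every channel j
    return second, k

E = np.arange(12.0).reshape(4, 3)              # N = 4 tokens, M = 3 channels
offset_weight = np.zeros((3, 3))               # zero weights so the bias alone sets the offsets
offset_bias = np.array([1.0, -1.0, 2.0])
second, k = gather_related_elements(E, offset_weight, offset_bias)
```

Since each channel follows its own offset, a single second feature vector can mix values taken from several different time steps.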

[0078] For example, the i-th second feature vector is calculated according to the following formula.

[0079]

[0080] In step S14, the detection box of the target object in the target image is determined based on the sequence of the second feature vectors.

[0081] After obtaining the sequence of the second feature vectors, the dynamic multilayer perceptron further extracts features.

[0082] Figure 5 shows a schematic diagram of a dynamic multilayer perceptron according to some embodiments of the present disclosure.

[0083] As shown in Figure 5, the dynamic multilayer perceptron includes a channel fusion layer (CFL), which consists of a dynamic fully connected layer (DyFC) and an identity transformation layer.

[0084] Figure 6 shows a schematic diagram of a dynamic fully connected layer according to some embodiments of the present disclosure.

[0085] In Figure 6, the dynamic fully connected layer first generates the second feature vector and then further generates the third feature vector.

[0086] In some embodiments, determining the detection box of the target object in the target image based on the sequence of second feature vectors includes: using a second fully connected layer to generate the i-th third feature vector in the sequence of third feature vectors based on the i-th second feature vector in the sequence of second feature vectors; and determining the detection box of the target object in the target image based on the sequence of third feature vectors.

[0087] For example, the i-th third feature vector is calculated using the following formula.

[0088]

[0089] In some embodiments, determining the detection box of a target object in a target image based on the sequence of third feature vectors includes: generating a sequence of fourth feature vectors based on the sequence of first feature vectors and the sequence of third feature vectors; generating a sequence of fifth feature vectors based on the sequence of detection boxes of target objects in multiple images using an attention network; and determining the detection box of the target object in the target image based on the sequence of fourth feature vectors and the sequence of fifth feature vectors.

[0090] For example, the identity transformation layer in Figure 5 preserves the original information of the first feature vector, and the output of the identity transformation layer is the same as the first feature vector. The i-th fourth feature vector is determined according to the output of the identity transformation layer and the i-th third feature vector.

[0091] In some embodiments, generating a sequence of fourth feature vectors based on a sequence of first feature vectors and a sequence of third feature vectors includes: generating the i-th fourth feature vector in the sequence of fourth feature vectors based on a weighted sum of the i-th first feature vector and the i-th third feature vector.

[0092] For example, the output of the identity transformation layer and the i-th third feature vector are taken as inputs to the channel fusion layer, and their weighted sum along the channel dimension is calculated (i.e., the two element values in the same channel are weighted and summed). The calculation formula is as follows.

[0093]

[0094] where ⊙ denotes the Hadamard product.

[0095] The weights ω_I and ω_T are calculated using the following formula:

[0096]

[0097] where the averaged quantities are the mean values of the two inputs of the channel fusion layer in the channel dimension, W_I and W_T are learnable parameters, and softmax(·) denotes the normalization operation in the channel dimension.

[0098] By utilizing the identity transformation layer and calculating the weighted sum of the i-th first feature vector and the i-th third feature vector, the target tracking model can be made more stable and easier to converge during training.
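
The channel fusion described above can be illustrated with a minimal, non-limiting Python sketch. The exact formulas are not reproduced in this text, so the sketch follows one plausible reading that is an assumption rather than the disclosed formula: each input is summarized by its mean over the channels, the mean is scaled by a per-channel learnable parameter to give one logit per channel, and the two logits of a channel are softmax-normalized against each other. The function name channel_fusion and the parameter names w_i and w_t are hypothetical.

```python
import math

def channel_fusion(e_identity, e_third, w_i, w_t):
    """Fuse two M-channel vectors with per-channel softmax weights.

    Assumed reading (not confirmed by the source): each branch is
    summarized by its mean over channels, scaled by a learnable
    per-channel parameter, and the two logits of a channel are
    normalized so that omega_i + omega_t == 1 for that channel.
    """
    m = len(e_identity)
    mean_i = sum(e_identity) / m
    mean_t = sum(e_third) / m
    fused = []
    for j in range(m):
        logit_i = w_i[j] * mean_i
        logit_t = w_t[j] * mean_t
        top = max(logit_i, logit_t)  # subtract the max for numerical stability
        exp_i = math.exp(logit_i - top)
        exp_t = math.exp(logit_t - top)
        omega_i = exp_i / (exp_i + exp_t)
        omega_t = exp_t / (exp_i + exp_t)
        # Element-wise (Hadamard) weighting of each branch, then summation.
        fused.append(omega_i * e_identity[j] + omega_t * e_third[j])
    return fused
```

With all parameters set to zero, both weights equal 0.5 and the fused vector is the plain average of the two inputs, which gives a convenient sanity check.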

[0099] Figure 7 shows a schematic diagram of a target tracking model according to some embodiments of the present disclosure.

[0100] As shown in Figure 7, the attention model is extended so that the attention model and the dynamic multilayer perceptron model that fuses channel-level features operate in parallel. Each dual-granularity coding layer consists of two parts: multiple multi-head self-attention (MHSA) layers and multiple dynamic multilayer perceptron layers. The MHSA layers are joined by residual connections, as are the dynamic multilayer perceptron layers.

[0101] In some embodiments, determining the detection box of the target object in the target image based on the sequence of the fourth feature vector and the sequence of the fifth feature vector includes: determining the detection box of the target object in the target image based on the weighted sum of the sequence of the fourth feature vector and the sequence of the fifth feature vector.

[0102] For example, the output of the attention network is the sequence of fifth feature vectors, MHSA(E^{l-1}), and the output of the dynamic multilayer perceptron is the sequence of fourth feature vectors, DyMLP(E^{l-1}). The output of the dual-granularity coding layer is calculated according to the following formula.

[0103] DIF(E^{l-1}) = MHSA(E^{l-1}) + DyMLP(E^{l-1})

[0104] Layer normalization (LN) is applied to DIF(E^{l-1}), and the normalization result is summed with the sequence of first feature vectors. The calculation formula is shown below.

[0105]

[0106] Then, the result is passed through the feedforward network, after which layer normalization and summation are performed. The calculation formula is as follows:

[0107]

[0108] where l denotes the l-th coding layer, E^{l-1} denotes the output of the (l-1)-th coding layer, and FFN denotes a feedforward neural network.
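
The data flow of one dual-granularity coding layer can be sketched in Python as follows. This is a non-limiting illustration under stated assumptions: the callables mhsa, dymlp and ffn are placeholders supplied by the caller, the second summation is assumed to add the input of the feedforward network back to its normalized output, and layer normalization is shown without learnable scale and shift. The helper names layer_norm, add and coding_layer are hypothetical.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def add(seq_a, seq_b):
    """Element-wise sum of two sequences of feature vectors."""
    return [[a + b for a, b in zip(va, vb)] for va, vb in zip(seq_a, seq_b)]

def coding_layer(e_prev, mhsa, dymlp, ffn):
    """One dual-granularity coding layer applied to the previous output.

    mhsa, dymlp and ffn are caller-supplied callables that map a sequence
    of vectors to a sequence of the same shape (placeholders here).
    """
    # DIF(E) = MHSA(E) + DyMLP(E): the two branches run in parallel.
    dif = add(mhsa(e_prev), dymlp(e_prev))
    # Layer normalization, then summation with the layer input.
    hidden = add([layer_norm(v) for v in dif], e_prev)
    # Feedforward network, layer normalization, then summation
    # (assumed here to add the feedforward input back).
    return add([layer_norm(v) for v in ffn(hidden)], hidden)
```

Stacking L such layers amounts to calling coding_layer repeatedly, feeding each output to the next call.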

[0109] In some embodiments, the plurality of images include a previous image of the target image. Determining a detection box of a target object in the target image based on a sequence of second feature vectors includes: determining a second change in the attributes of the detection box of the target object in the target image relative to the attributes of the detection box of the target object in the previous image of the target image based on the sequence of second feature vectors; and determining a detection box of the target object in the target image based on the second change and the detection box of the target object in the previous image of the target image.

[0110] For example, after feature enhancement through L coding layers, the features are processed by a pooling layer, and the prediction head then predicts the second change of the detection box of the target object. The pooling layer is, for example, a mean pooling layer, and the prediction head is, for example, a linear layer.

[0111] The second change includes, for example, the changes in the center coordinates C_x and C_y, the width w, and the height h of the detection box of the target object in the target image relative to the detection box in the previous image.

[0112] In some embodiments, the attributes of the detection boxes of the target objects in the target image are determined based on the sum of the second change amount and the attributes of the detection boxes of the target objects in the previous image of the target image, thereby obtaining a set of detection boxes of one or more target objects in the target image.
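
The prediction step above (mean pooling, a linear prediction head, and summation with the previous detection box) can be sketched in Python as follows. This is a non-limiting illustration: the box layout (C_x, C_y, w, h), the function names mean_pool, linear_head and predict_box, and the weight layout (four rows of M values) are assumptions made for the example.

```python
def mean_pool(features):
    """Average N feature vectors (each with M channels) into one M-vector."""
    n = len(features)
    return [sum(vec[j] for vec in features) / n
            for j in range(len(features[0]))]

def linear_head(pooled, weights, bias):
    """Linear layer mapping the pooled M-vector to four attribute changes."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]

def predict_box(prev_box, features, weights, bias):
    """Add the predicted second change to the previous (cx, cy, w, h) box."""
    delta = linear_head(mean_pool(features), weights, bias)
    return [p + d for p, d in zip(prev_box, delta)]
```

Because the head predicts a change rather than an absolute box, an all-zero prediction leaves the previous detection box unchanged.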

[0113] For example, performing object detection on the target image yields a set D_t of detection boxes of one or more target objects in the image. Using the Hungarian algorithm, the detection results in D_t are assigned to the trajectories of the target objects based on the spatial similarity between the predicted set of detection boxes and D_t.
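
The assignment step can be illustrated with the following non-limiting Python sketch. Two assumptions are made: spatial similarity is measured by intersection over union (IoU), and, to keep the example free of external dependencies, an exhaustive search over permutations is substituted for the Hungarian algorithm. The exhaustive search finds the same kind of optimal one-to-one assignment but is practical only for a handful of boxes. The function names iou and assign and the threshold min_iou are hypothetical.

```python
from itertools import permutations

def iou(box_a, box_b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def assign(predicted, detections, min_iou=0.3):
    """Return (track_index, detection_index) pairs maximizing total IoU.

    Brute-force stand-in for the Hungarian algorithm. Pairs whose IoU is
    below min_iou are left unassigned (an assumed gating rule).
    """
    n_t, n_d = len(predicted), len(detections)
    size = max(n_t, n_d)  # slots with index >= n_d mean "no detection"
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(size), n_t):
        pairs = [(t, d) for t, d in enumerate(perm)
                 if d < n_d and iou(predicted[t], detections[d]) >= min_iou]
        score = sum(iou(predicted[t], detections[d]) for t, d in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    return best_pairs
```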

[0114] For detection results, those not assigned to existing trajectories are initialized as new trajectories. For target objects, if no detection result is assigned to a trajectory for that target object, the trajectory is marked as lost. If the trajectory loss time exceeds a given threshold, the target object is considered to be outside the field of view, and its trajectory is removed from the trajectory set. Lost trajectories can also be re-tracked during the assignment step.
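
The trajectory bookkeeping described above can be sketched in Python as follows. This is a non-limiting illustration in which each trajectory is a dictionary with a box and a lost counter measured in frames. The dictionary layout, the function name update_tracks and the default threshold max_lost are assumptions made for the example.

```python
def update_tracks(tracks, detections, matches, max_lost=30):
    """Apply one frame of assignments and return the surviving tracks.

    tracks: list of {"box": [...], "lost": int} dictionaries.
    matches: list of (track_index, detection_index) pairs.
    """
    matched_tracks = {t for t, _ in matches}
    matched_dets = {d for _, d in matches}
    for t, d in matches:
        tracks[t]["box"] = detections[d]
        tracks[t]["lost"] = 0       # a re-found trajectory is no longer lost
    for t, track in enumerate(tracks):
        if t not in matched_tracks:
            track["lost"] += 1      # no detection assigned in this frame
    # Remove trajectories that have been lost longer than the threshold.
    survivors = [tr for tr in tracks if tr["lost"] <= max_lost]
    # Every unassigned detection initializes a new trajectory.
    for d, box in enumerate(detections):
        if d not in matched_dets:
            survivors.append({"box": box, "lost": 0})
    return survivors
```

Lost trajectories stay in the list until the threshold is exceeded, so they remain candidates in later assignment steps and can be re-tracked.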

[0115] According to some embodiments of this disclosure, an image sequence is obtained, wherein the image sequence includes a target image and multiple images preceding the target image. A sequence of first feature vectors is generated based on a sequence of detection boxes of the target object in the multiple images, wherein the sequence of first feature vectors includes N first feature vectors, and each first feature vector includes M elements corresponding one-to-one with M channels, where N and M are both positive integers. A sequence of second feature vectors is generated based on the sequence of first feature vectors, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, determining the related element E[k,j] of the element E[i,j], where the related element E[k,j] denotes the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j, and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors. Based on the sequence of second feature vectors, the detection box of the target object in the target image is determined. In the feature extraction process, the second feature vector is determined by finding the related element of each element and using the value of the related element. Channel-level features are thereby fused into the second feature vector, channel information is aggregated, and target tracking is performed based on the channel information, which improves the accuracy of target tracking.

[0116] Furthermore, this disclosure achieves target tracking based on the motion information of the target object in the video image. When the target appearance is highly uniform or the motion pattern is complex, compared to the Kalman filter, the method of this disclosure is not subject to the linear constraints of the system state, thus improving the accuracy of target tracking.

[0117] Figure 8 shows a schematic diagram of a training method for a target tracking model according to some embodiments of the present disclosure.

[0118] As shown in Figure 8, the training method for the target tracking model includes steps S21 to S25. In some embodiments, the training method for the target tracking model is performed by a training device for the target tracking model.

[0119] In step S21, an image sequence is obtained, wherein the image sequence includes the target image and multiple images preceding the target image.

[0120] In step S22, using the target tracking model, a sequence of first feature vectors is generated based on the sequence of detection boxes of target objects in multiple images. The sequence of first feature vectors includes N first feature vectors, and each first feature vector includes M elements corresponding one-to-one with M channels, where N and M are both positive integers.

[0121] In step S23, a sequence of second feature vectors is generated based on the sequence of first feature vectors, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, determining the related element E[k,j] of the element E[i,j], where the related element E[k,j] denotes the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j, and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors.

[0122] In step S24, a prediction result of the detection box of the target object in the target image is generated based on the sequence of the second feature vector.

[0123] Steps S21-S24 are similar to steps S11-S14 described above and will not be described again here.

[0124] In step S25, the target tracking model is trained based on the prediction results. For example, the ground truth values of the attributes of the detection boxes of the target objects in the target image are used as labels, and the parameters of the target tracking model are updated based on the labels and the prediction results.
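
The parameter update in step S25 can be illustrated with the following non-limiting Python sketch. The loss function and optimizer are not specified in this text, so a mean squared error loss and plain gradient descent are assumed, and only a linear prediction head is updated to keep the example short. The function name train_step and the learning rate lr are hypothetical.

```python
def train_step(weights, bias, pooled, target, lr=0.1):
    """One gradient-descent step for a linear head under squared error.

    weights: one row of M values per predicted attribute.
    pooled: M-vector fed to the head. target: ground-truth labels.
    Updates weights and bias in place and returns the loss measured
    before the update.
    """
    pred = [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]
    errors = [p - t for p, t in zip(pred, target)]
    loss = sum(e * e for e in errors) / len(errors)
    for r, err in enumerate(errors):
        grad = 2.0 * err / len(errors)   # derivative of the loss w.r.t. pred[r]
        for c, x in enumerate(pooled):
            weights[r][c] -= lr * grad * x
        bias[r] -= lr * grad
    return loss
```

Calling train_step repeatedly on the same sample should yield a decreasing loss when the learning rate is small enough, which is a quick way to check the gradient signs.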

[0125] According to the training method of the target tracking model of the embodiments of this disclosure, an image sequence is obtained, wherein the image sequence includes a target image and multiple images preceding the target image. A sequence of first feature vectors is generated based on a sequence of detection boxes of the target object in the multiple images, wherein the sequence of first feature vectors includes N first feature vectors, and each first feature vector includes M elements corresponding one-to-one with M channels, where N and M are both positive integers. A sequence of second feature vectors is generated based on the sequence of first feature vectors, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, determining the related element E[k,j] of the element E[i,j], where the related element E[k,j] denotes the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j, and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors. Based on the sequence of second feature vectors, the detection box of the target object in the target image is determined. In the feature extraction process, by finding the related elements and using their values, channel-level features are fused into the second feature vector, channel information is aggregated, and target tracking is performed based on the channel information, thereby improving the accuracy of target tracking.

[0126] Furthermore, this disclosure achieves target tracking based on the motion information of the target object in the video image. When the target appearance is highly uniform or the motion pattern is complex, compared to the Kalman filter, the method of this disclosure is not subject to the linear constraints of the system state, thus improving the accuracy of target tracking.

[0127] In some embodiments, determining the detection box of the target object in the target image based on the sequence of second feature vectors includes: using a second fully connected layer to generate the i-th third feature vector in the sequence of third feature vectors based on the i-th second feature vector in the sequence of second feature vectors; and determining the detection box of the target object in the target image based on the sequence of third feature vectors. During training, since the target tracking model has an identity transformation layer, calculating the weighted sum of the i-th first feature vector and the i-th third feature vector makes the target tracking model more stable and easier to converge during training.

[0128] Figure 9 shows a block diagram of a target tracking device according to some embodiments of the present disclosure.

[0129] As shown in Figure 9, the target tracking device 9 includes an acquisition unit 91, a first generation unit 92, a second generation unit 93, and a determination unit 94.

[0130] The acquisition unit 91 is configured to acquire an image sequence, wherein the image sequence includes a target image and multiple images preceding the target image, for example, by performing step S11 shown in Figure 1.

[0131] The first generation unit 92 is configured to generate a sequence of first feature vectors based on a sequence of detection boxes of target objects in multiple images, for example, by performing step S12 shown in Figure 1. The sequence of first feature vectors includes N first feature vectors, and each first feature vector includes M elements corresponding one-to-one with M channels, where N and M are both positive integers.

[0132] The second generation unit 93 is configured to generate a sequence of second feature vectors based on the sequence of first feature vectors, for example, by performing step S13 shown in Figure 1, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, determining the related element E[k,j] of the element E[i,j], where the related element E[k,j] denotes the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j, and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors.

[0133] The determining unit 94 is configured to determine the detection box of the target object in the target image based on the sequence of second feature vectors, for example, by performing step S14 shown in Figure 1.

[0134] In some embodiments, the determining unit 94 is further configured to: generate the i-th third feature vector in the sequence of third feature vectors based on the i-th second feature vector in the sequence of second feature vectors using the second fully connected layer; and determine the detection box of the target object in the target image based on the sequence of third feature vectors.

[0135] In some embodiments, the determining unit 94 is further configured to: generate a sequence of fourth feature vectors based on a sequence of first feature vectors and a sequence of third feature vectors; generate a sequence of fifth feature vectors based on a sequence of detection boxes of target objects in multiple images using an attention network; and determine a detection box of a target object in a target image based on a sequence of fourth feature vectors and a sequence of fifth feature vectors.

[0136] In some embodiments, the determining unit 94 is further configured to generate the i-th fourth feature vector in the sequence of fourth feature vectors based on the weighted sum of the i-th first feature vector and the i-th third feature vector.

[0137] In some embodiments, the determining unit 94 is further configured to: determine a detection box of a target object in the target image based on a weighted sum of a sequence of fourth feature vectors and a sequence of fifth feature vectors.

[0138] In some embodiments, the first generation unit 92 is further configured to: for the i-th image of a plurality of images, calculate a first change in the attribute of the detection box of the target object in the i-th image relative to the attribute of the detection box of the target object in the (i-1)-th image; determine the features of the detection box of the i-th image based on the first change in the attribute of the detection box of the target object in the i-th image relative to the attribute of the detection box of the target object in the (i-1)-th image; and generate an i-th first feature vector based on the features of the detection box of the i-th image.
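
The first change described above can be sketched in Python as follows. This is a non-limiting illustration that assumes each detection box is stored as (C_x, C_y, w, h) and that the first image in the window has no predecessor. How the resulting changes are embedded into M-channel first feature vectors is not specified here and is left out of the sketch. The function name first_changes is hypothetical.

```python
def first_changes(boxes):
    """Per-image change of (cx, cy, w, h) relative to the previous image.

    boxes: one box per image, oldest first. Returns len(boxes) - 1 changes,
    because the first image has no predecessor.
    """
    return [[cur - prev for cur, prev in zip(boxes[i], boxes[i - 1])]
            for i in range(1, len(boxes))]
```

Working with changes instead of absolute coordinates makes the input depend on the motion of the target object rather than on where it happens to be in the image.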

[0139] In some embodiments, the plurality of images includes a preceding image of the target image, and the determining unit 94 is further configured to: determine a second change in the attribute of the detection box of the target object in the target image relative to the attribute of the detection box of the target object in the preceding image of the target image, based on the sequence of the second feature vectors; and determine the detection box of the target object in the target image based on the second change and the detection box of the target object in the preceding image of the target image.

[0140] According to some embodiments of the target tracking apparatus disclosed herein, during the feature extraction process, by searching for related elements and utilizing the values of the related elements, channel-level features are fused into the second feature vector, channel information is aggregated, and target tracking is performed based on the channel information, thereby improving the accuracy of target tracking.

[0141] Figure 10 shows a block diagram of a training apparatus for a target tracking model according to some embodiments of the present disclosure.

[0142] As shown in Figure 10, the training device 10 for the target tracking model includes an acquisition unit 101, a first generation unit 102, a second generation unit 103, a determination unit 104, and a training unit 105.

[0143] The acquisition unit 101 is configured to acquire an image sequence, wherein the image sequence includes a target image and multiple images preceding the target image, for example, by performing step S21 shown in Figure 8.

[0144] The first generation unit 102 is configured to generate a sequence of first feature vectors based on a sequence of detection boxes of target objects in multiple images, for example, by performing step S22 shown in Figure 8. The sequence of first feature vectors includes N first feature vectors, and each first feature vector includes M elements corresponding one-to-one with M channels, where N and M are both positive integers.

[0145] The second generation unit 103 is configured to generate a sequence of second feature vectors based on the sequence of first feature vectors, for example, by performing step S23 shown in Figure 8, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, determining the related element E[k,j] of the element E[i,j], where the related element E[k,j] denotes the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j, and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors.

[0146] The determining unit 104 is configured to determine a detection box of a target object in a target image based on the sequence of second feature vectors, for example, by performing step S24 shown in Figure 8.

[0147] The training unit 105 is configured to train the target tracking model based on the prediction results, for example, by performing step S25 shown in Figure 8.

[0148] In some embodiments, the second generation unit 103 is further configured to: utilize the first fully connected layer to calculate, based on the i-th first feature vector, the offset δ_j of the element E[i,j], where δ_j is an integer; and calculate k based on the offset δ_j.

[0149] In some embodiments, the second generation unit 103 is further configured to: determine k based on the sum of i and δ_j.
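
The offset-based selection of the related element can be illustrated with the following non-limiting Python sketch. Several details are assumptions made for the example: indices are 0-based, the real-valued output of the fully connected layer is rounded to obtain the integer offset, an index that falls outside the sequence is clamped to its ends, an index equal to i is moved to a neighbor so that k differs from i, and the second feature vector simply copies E[k,j], which is the simplest positively correlated mapping. The names related_index, gather_channels, fc_weights and fc_bias are hypothetical.

```python
def related_index(i, offset, n):
    """Index k = i + offset, kept inside [0, n - 1] and different from i."""
    k = min(max(i + offset, 0), n - 1)
    if k == i:  # assumed fallback so that k != i
        k = i - 1 if i == n - 1 else i + 1
    return k

def gather_channels(first, fc_weights, fc_bias):
    """Build second feature vectors by reading channel j from position k.

    first: N vectors of M channels, with N >= 2.
    fc_weights (M rows of M values) and fc_bias (M values) stand in for
    the first fully connected layer, which yields one offset per channel.
    """
    n, m = len(first), len(first[0])
    second = []
    for i, vec in enumerate(first):
        row = []
        for j in range(m):
            raw = sum(w * x for w, x in zip(fc_weights[j], vec)) + fc_bias[j]
            delta_j = round(raw)              # integer offset for channel j
            k = related_index(i, delta_j, n)  # k = i + delta_j, adjusted
            row.append(first[k][j])           # copy the related element
        second.append(row)
    return second
```

Because each channel has its own offset, different channels of the same second feature vector can draw on different positions of the sequence, which is how channel-level information is aggregated.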

[0150] According to some embodiments of the present disclosure, the training apparatus for the target tracking model improves the accuracy of target tracking by finding relevant elements and using the values ​​of the relevant elements during the feature extraction process, fusing channel-level features in the second feature vector, aggregating channel information, and performing target tracking based on the channel information.

[0151] Figure 11 shows a block diagram of an electronic device according to other embodiments of the present disclosure.

[0152] As shown in Figure 11, the electronic device 11 includes a memory 1101 and a processor 1102 coupled to the memory 1101. The memory 1101 is used to store instructions for executing the target tracking method or the training method of the target tracking model. The processor 1102 is configured to execute, based on the instructions stored in the memory 1101, the target tracking method or the training method of the target tracking model in any of the embodiments of this disclosure.

[0153] Figure 12 shows a block diagram of a computer system for implementing some embodiments of the present disclosure.

[0154] As shown in Figure 12, the computer system 120 can be embodied in the form of a general-purpose computing device. The computer system 120 includes a memory 1210, a processor 1220, and a bus 1200 connecting different system components.

[0155] The memory 1210 may include, for example, system memory, non-volatile storage media, etc. The system memory may store, for example, an operating system, application programs, a boot loader, and other programs. The system memory may include volatile storage media, such as random access memory (RAM) and / or cache memory. The non-volatile storage media may store, for example, instructions for executing the target tracking method or the training method of the target tracking model in any of the embodiments of this disclosure. Non-volatile storage media include, but are not limited to, disk storage, optical storage, flash memory, etc.

[0156] The processor 1220 can be implemented using a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete hardware components such as discrete gates or transistors. Accordingly, each module, such as the decision module and the determination module, can be implemented by the central processing unit (CPU) running instructions in memory to execute the corresponding steps, or by dedicated circuitry to execute the corresponding steps.

[0157] Bus 1200 can use any of the various bus architectures. For example, bus architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.

[0158] The computer system 120 may also include an input / output interface 1230, a network interface 1240, and a storage interface 1250. These interfaces 1230, 1240, and 1250, along with the memory 1210 and the processor 1220, can be connected via a bus 1200. The input / output interface 1230 provides a connection interface for input / output devices such as a monitor, mouse, and keyboard. The network interface 1240 provides a connection interface for various networked devices. The storage interface 1250 provides a connection interface for external storage devices such as floppy disks, USB flash drives, and SD cards.

[0159] Hereinafter, various aspects of this disclosure are described with reference to flowchart illustrations and / or block diagrams of target tracking methods, target tracking model training methods, apparatus, and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations thereof, can be implemented by computer-readable program instructions.

[0160] These computer-readable program instructions are provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable device to produce a machine, such that execution of the instructions by the processor produces means for implementing the functions specified in one or more boxes of the flowchart and / or block diagram.

[0161] These computer-readable program instructions may also be stored in a computer-readable storage medium. These instructions cause a computer to operate in a particular manner so as to produce an article of manufacture that includes instructions implementing the functions specified in one or more boxes of the flowchart and / or block diagram.

[0162] This disclosure may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects.

[0163] The target tracking method, target tracking model training method and apparatus, and computer-readable storage medium described in the above embodiments improve the efficiency and accuracy of target tracking.

[0164] This concludes the detailed description of the target tracking method, target tracking model training method and apparatus, and computer-readable storage medium according to the present disclosure. To avoid obscuring the concept of this disclosure, some details known in the art have not been described. Those skilled in the art will fully understand how to implement the technical solutions disclosed herein based on the above description.

Claims

1. A target tracking method, comprising: obtaining an image sequence, wherein the image sequence includes a target image and multiple images preceding the target image; generating a sequence of first feature vectors based on a sequence of detection boxes of target objects in the multiple images, including: for each target object, converting the observed values of the position of the target object in the images at past time points into feature vectors, and using the feature vectors as the first feature vectors of the target object, wherein the sequence of first feature vectors includes N first feature vectors, and each first feature vector includes M elements corresponding one-to-one with M channels, where N and M are both positive integers; generating a sequence of second feature vectors based on the sequence of first feature vectors, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, using a first fully connected layer to calculate the offset δ_j of the element E[i,j] according to the i-th first feature vector, and determining k according to the sum of i and δ_j, where δ_j is an integer, the related element E[k,j] denotes the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j, and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors, wherein the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors is positively correlated with the value of the related element E[k,j]; and determining the detection box of the target object in the target image based on the sequence of second feature vectors.

2. The target tracking method according to claim 1, wherein determining the detection box of the target object in the target image based on the sequence of second feature vectors comprises: using a second fully connected layer, generating the i-th third feature vector in a sequence of third feature vectors based on the i-th second feature vector in the sequence of second feature vectors; and determining the detection box of the target object in the target image based on the sequence of third feature vectors.

3. The target tracking method according to claim 2, wherein determining the detection box of the target object in the target image based on the sequence of third feature vectors comprises: generating a sequence of fourth feature vectors based on the sequence of first feature vectors and the sequence of third feature vectors; using an attention network, generating a sequence of fifth feature vectors based on the sequence of detection boxes of target objects in the multiple images; and determining the detection box of the target object in the target image based on the sequence of fourth feature vectors and the sequence of fifth feature vectors.

4. The target tracking method according to claim 3, wherein generating the sequence of fourth feature vectors based on the sequence of first feature vectors and the sequence of third feature vectors comprises: generating the i-th fourth feature vector in the sequence of fourth feature vectors according to a weighted sum of the i-th first feature vector and the i-th third feature vector.

5. The target tracking method according to claim 3, wherein determining the detection box of the target object in the target image based on the sequence of fourth feature vectors and the sequence of fifth feature vectors comprises: determining the detection box of the target object in the target image according to a weighted sum of the sequence of fourth feature vectors and the sequence of fifth feature vectors.

6. The target tracking method according to claim 1, wherein generating the sequence of first feature vectors based on the sequence of detection boxes of target objects in the multiple images comprises: for the i-th image of the multiple images, calculating a first change in the attribute of the detection box of the target object in the i-th image relative to the attribute of the detection box of the target object in the (i-1)-th image; determining the features of the detection box of the i-th image according to the first change in the attribute of the detection box of the target object in the i-th image relative to the attribute of the detection box of the target object in the (i-1)-th image; and generating the i-th first feature vector according to the features of the detection box of the i-th image.

7. The target tracking method according to claim 6, wherein the attribute includes at least one of the following: the height, the width, and the width-to-height ratio of the detection box, and the coordinates of the center point of the detection box.

8. The target tracking method according to claim 1, wherein the multiple images include a previous image of the target image, and determining the detection box of the target object in the target image based on the sequence of second feature vectors comprises: determining, based on the sequence of second feature vectors, a second change in the attribute of the detection box of the target object in the target image relative to the attribute of the detection box of the target object in the previous image of the target image; and determining the detection box of the target object in the target image based on the second change and the detection box of the target object in the previous image of the target image.

9. A method for training a target tracking model, comprising: acquiring an image sequence, wherein the image sequence includes a target image and multiple images preceding the target image; generating, using the target tracking model, a sequence of first feature vectors based on a sequence of detection boxes of target objects in the multiple images, including: for each target object, converting the observed values of the position of the target object in the images at the past … time points into a feature vector, and using the feature vector as the first feature vector of the target object, wherein the sequence of first feature vectors includes N first feature vectors, each first feature vector includes M elements corresponding one-to-one with M channels, and N and M are both positive integers; generating a sequence of second feature vectors based on the sequence of first feature vectors, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, calculating, using a first fully connected layer and according to the i-th first feature vector, an offset of the element E[i,j]; determining k according to the sum of i and the offset, where k is an integer, the related element E[k,j] represents the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors, wherein the value of the element corresponding to the j-th channel of the i-th second feature vector is positively correlated with the value of the element E[k,j]; generating a predicted detection box of the target object in the target image based on the sequence of second feature vectors; and training the target tracking model based on the prediction result.
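
For illustration only, the step that generates the second feature vectors can be sketched in plain Python. The claim fixes only the structure: a fully connected layer yields an offset for each element, the related index k is obtained from the sum of i and the offset, k differs from i, and the output element is positively correlated with E[k,j]. Everything else here is an assumption: the layer is a bare linear map with hand-picked weights, the offset is rounded to the nearest integer, k is clamped to the valid range and moved to a neighbour if it would equal i, and the output simply copies E[k,j]. Indices are 0-based in the code, whereas the claim counts from 1. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # N rows (first feature vectors) x M columns (channels)


def linear(x: List[float], weights: Matrix, bias: List[float]) -> List[float]:
    """Stand-in for the first fully connected layer: one offset per channel."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def related_index(i: int, offset: float, n: int) -> int:
    """Return k from i + offset (assumed: rounded, clamped, forced to differ from i)."""
    k = i + int(round(offset))
    k = max(0, min(n - 1, k))
    if k == i:
        k = i - 1 if i > 0 else i + 1
    return k


def second_feature_vectors(E: Matrix, weights: Matrix, bias: List[float]) -> Matrix:
    """Build the second feature vectors by reading each channel j from row k."""
    n, m = len(E), len(E[0])
    if n < 2:
        raise ValueError("at least two first feature vectors are needed so that k != i")
    F: Matrix = []
    for i in range(n):
        offsets = linear(E[i], weights, bias)
        # Copying E[k][j] is the simplest positively correlated choice.
        F.append([E[related_index(i, offsets[j], n)][j] for j in range(m)])
    return F


# N = 3 first feature vectors with M = 2 channels each.
E = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
# Zero weights and a fixed bias: channel 0 looks one step ahead, channel 1 one step back.
weights = [[0.0, 0.0], [0.0, 0.0]]
bias = [1.0, -1.0]

F = second_feature_vectors(E, weights, bias)
print(F)  # [[2.0, 20.0], [3.0, 10.0], [2.0, 20.0]]
```

Because the offset is computed per channel, different channels of the same output vector can draw on different time steps, which is the channel-level fusion the claim describes. In a trained model the rounding step would need a differentiable substitute (for example interpolation between neighbouring rows); the claim does not say how this is handled.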

10. A target tracking device, comprising: an acquisition unit configured to acquire an image sequence, wherein the image sequence includes a target image and multiple images preceding the target image; a first generation unit configured to generate a sequence of first feature vectors based on a sequence of detection boxes of target objects in the multiple images, including: for each target object, converting the observed values of the position of the target object in the images at the past … time points into a feature vector, and using the feature vector as the first feature vector of the target object, wherein the sequence of first feature vectors includes N first feature vectors, each first feature vector includes M elements corresponding one-to-one with M channels, and N and M are both positive integers; a second generation unit configured to generate a sequence of second feature vectors based on the sequence of first feature vectors, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, calculating, using a first fully connected layer and according to the i-th first feature vector, an offset of the element E[i,j]; determining k according to the sum of i and the offset, where k is an integer, the related element E[k,j] represents the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors, wherein the value of the element corresponding to the j-th channel of the i-th second feature vector is positively correlated with the value of the element E[k,j]; and a determining unit configured to determine the detection box of the target object in the target image based on the sequence of second feature vectors.

11. A training device for a target tracking model, comprising: an acquisition unit configured to acquire an image sequence, wherein the image sequence includes a target image and multiple images preceding the target image; a first generation unit configured to use a target tracking model to generate a sequence of first feature vectors based on a sequence of detection boxes of target objects in the multiple images, including: for each target object, converting the observed values of the position of the target object in the images at the past … time points into a feature vector, and using the feature vector as the first feature vector of the target object, wherein the sequence of first feature vectors includes N first feature vectors, each first feature vector includes M elements corresponding one-to-one with M channels, and N and M are both positive integers; a second generation unit configured to generate a sequence of second feature vectors based on the sequence of first feature vectors, including: for the element E[i,j] corresponding to the j-th channel of the i-th first feature vector in the sequence of first feature vectors, calculating, using a first fully connected layer and according to the i-th first feature vector, an offset of the element E[i,j]; determining k according to the sum of i and the offset, where k is an integer, the related element E[k,j] represents the element corresponding to the j-th channel of the k-th first feature vector in the sequence of first feature vectors, i, j and k are all natural numbers, i ≤ N, k ≤ N, j ≤ M, and i and k are not equal; and determining, according to the value of the related element E[k,j], the value of the element corresponding to the j-th channel of the i-th second feature vector in the sequence of second feature vectors, wherein the value of the element corresponding to the j-th channel of the i-th second feature vector is positively correlated with the value of the element E[k,j]; a determining unit configured to determine the detection box of the target object in the target image based on the sequence of second feature vectors; and a training unit configured to train the target tracking model based on the prediction result.

12. An electronic device, comprising: a memory; and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the target tracking method according to any one of claims 1 to 8, or the method for training a target tracking model according to claim 9.

13. A computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the target tracking method according to any one of claims 1 to 8, or the method for training a target tracking model according to claim 9.