[0028] As previously described, the present invention proposes a target tracking method based on a task-distinguished joint detection network. Specific embodiments of the present invention will be described below with reference to the accompanying drawings.
[0029] (1) Overall process
[0030] The present invention proposes a task-distinguished joint detection network to realize video multi-object tracking. The overall framework of the task-distinguished joint network is shown in Figure 1 and consists of three parts: (1) the backbone network; (2) the multi-feature fusion network; (3) the multi-task branches. These three parts are also the three steps of the method proposed by the present invention.
[0031] For the current input frame, the DLA backbone network (as shown in Figure 2) is first used to extract shared features for the target detection task and the target re-identification feature extraction task; the DLA outputs the feature maps of stages 1 to N. The lower-layer feature maps better retain the low-level information of the original image, such as edges, texture, and spatial distribution, which is more advantageous for the target detection task that requires accurate localization; in the higher-layer feature maps, spatial information is gradually lost while the high-level semantic information related to the re-identification task becomes gradually more prominent, making them more suitable for confirming target identity. Therefore, the present invention proposes to selectively learn the appropriate features according to the characteristics of each task, and to use them as the input of the subsequent multi-feature fusion network.
[0032] In the multi-feature fusion network, targeted fusion is performed according to the different tasks. Specifically, the target detection task is more concerned with target position, and accurate localization requires more low-level features, so it fuses the feature maps of stage 1 to stage M (M < N); the structure of the multi-feature fusion network is shown in Figure 3. When the IDA module is used to link the multi-scale feature maps, the lower-resolution features are upsampled, and interpolation and aggregation of the features are performed iteratively, integrating the features of the stages from shallow to deep to form a gradually deeper decoder; the finally output deeply fused high-resolution features serve as the input of the subsequent target detection and re-identification feature extraction task branches. DLA outputs N different stages, and N is preferably 4. In the multi-feature fusion, the target detection task fuses the feature maps of stage 1 to stage M, with M preferably 3; the target re-identification feature extraction task fuses the feature maps of stage 1 to stage N, with N preferably 4.
[0033] Finally, the fused features are input to the target detection task branch and the re-identification feature extraction task branch, and each task branch is trained under a different loss function constraint to complete the target detection task and the target re-identification feature extraction task. In this way, while balancing the different tasks, the difference in feature emphasis between the tasks is also taken into account; the features of the two tasks are distinguished, improving the accuracy of both target detection and target re-identification features.
[0034] Among them, the target detection task branch is composed of a heatmap branch, a size branch, and an offset branch, which localize the targets in the current frame; the target re-identification feature extraction branch extracts the embedded representation of each target from the full-image embedding vector cube at the target center point position obtained by the target detection task, which is used for computing target appearance similarity, thereby determining the target identity ID and achieving multi-target tracking.
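As a non-limiting illustration of the overall flow of paragraphs [0030] to [0034], the following Python sketch wires the three parts together. All component names (backbone, det_fusion, reid_fusion, det_branches, reid_branch) are hypothetical placeholders rather than the exact implementation of the invention, and m = 3, n = 4 follow the preferred values given above.

```python
# Hypothetical glue code for the overall pipeline: backbone -> multi-feature
# fusion -> multi-task branches. Component names are illustrative placeholders.
def track_frame(frame, backbone, det_fusion, reid_fusion,
                det_branches, reid_branch, m=3, n=4):
    """frame: image tensor of shape (1, 3, H, W); m, n: preferred M = 3, N = 4."""
    stages = backbone(frame)                 # stage 1..N feature maps
    det_feat = det_fusion(stages[:m])        # detection fuses stages 1..M
    reid_feat = reid_fusion(stages[:n])      # re-ID fuses stages 1..N
    heatmap, size, offset = [b(det_feat) for b in det_branches]
    embeddings = reid_branch(reid_feat)      # full-image embedding cube
    return heatmap, size, offset, embeddings
```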
[0035] (2) Backbone network
[0036] The backbone network DLA extracts the shared features required by the target detection task and the target re-identification feature extraction task.
[0037] The present invention uses the Deep Layer Aggregation (DLA) network as the backbone network; its complete network structure is shown in Figure 2. Its core modules are the hierarchical deep aggregation (HDA) module, indicated by the dotted-line frames, and the iterative deep aggregation (IDA) module, indicated by the dotted-line arrows. The dashed frames in the figure indicate aggregation nodes, and the dotted arrows indicate 2× upsampling processes. The HDA module is a tree-linked hierarchical structure that can better propagate features and gradients, while the IDA module is responsible for linking the features of the different stages; a stage corresponds to each HDA module.
[0038] In the backbone network DLA, each HDA module outputs an aggregation result of the corresponding resolution, i.e., the topmost aggregation node of each dotted-line frame in Figure 2, and the IDA module links these aggregation nodes. On the one hand, the HDA module fuses semantic information by aggregating in the channel direction; on the other hand, the IDA module achieves the fusion of spatial information by aggregating in the resolution and scale direction.
[0039] Finally, the DLA outputs the feature maps of different scales of stage 1 to stage N, with sizes denoted C × H × W, where H × W is the spatial resolution and C is the number of channels of each feature map, as the input of the subsequent multi-feature fusion network.
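The following PyTorch sketch illustrates, under simplifying assumptions, a backbone that returns one aggregation result per stage in the spirit of the HDA/IDA description above. TinyStage and TinyBackbone are hypothetical simplified stand-ins, not the actual DLA structure of Figure 2.

```python
import torch
import torch.nn as nn

class TinyStage(nn.Module):
    """Simplified stand-in for one HDA stage: two convolution branches plus an
    aggregation node that fuses them along the channel direction."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(
            nn.Conv2d(c_out, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        self.agg = nn.Conv2d(2 * c_out, c_out, 1)  # channel-wise aggregation

    def forward(self, x):
        f1 = self.branch1(x)
        f2 = self.branch2(f1)
        return self.agg(torch.cat([f1, f2], dim=1))

class TinyBackbone(nn.Module):
    """Returns the aggregation results of stages 1..N (N = 4 preferred),
    from high resolution (H/4) down to low resolution (H/32)."""
    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels[0], 7, stride=4, padding=3),  # H/4 x W/4
            nn.BatchNorm2d(channels[0]), nn.ReLU(inplace=True))
        cs = list(channels)
        self.stages = nn.ModuleList(
            [TinyStage(cs[i - 1] if i else cs[0], cs[i],
                       stride=1 if i == 0 else 2)
             for i in range(len(cs))])

    def forward(self, x):
        x = self.stem(x)
        outs = []
        for stage in self.stages:
            x = stage(x)
            outs.append(x)   # one aggregation node output per stage
        return outs          # stage 1 (high res) .. stage N (low res)
```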
[0040] (3) Multi-feature fusion network
[0041] For the features of different scales of stage 1 to stage N output by the DLA, a multi-task hierarchical multi-feature fusion structure can be used, which enables the target detection task and the target re-identification feature extraction task to share the parameters of one multi-feature fusion network while each fuses the features more advantageous to its own task; alternatively, a multi-task independent multi-feature fusion structure can be used, which constructs two mutually independent feature fusion networks with non-shared parameters. The dimension of the obtained fused features is H/4 × W/4 × 64, where H × W is the resolution of the model input image.
[0042] (3.1) Multi-task hierarchical multi-feature fusion structure: from the feature maps of different scales of stage 1 to stage N output by the DLA, the feature maps of stage 1 to stage M are selected as the input of the multi-feature fusion network for the subsequent target detection task, and the feature maps of stage 1 to stage N are selected as the input of the multi-feature fusion network for the subsequent target re-identification feature extraction task. The target detection task uses relatively low-level features that are rich in spatial information, while the target re-identification feature extraction task, through the fusion of high- and low-level features, further fuses the higher-level features in which target identity semantics are more prominent.
[0043] The multi-task hierarchical multi-feature fusion structure allows the target detection task and the target re-identification feature extraction task to each fuse the features more advantageous to its own task within one multi-feature fusion network. The multi-feature fusion network uses the IDA feature fusion network, i.e., feature fusion is performed by the IDA module, whose input is the feature maps of the different stages output by the DLA. When the IDA module is used to link the multi-scale feature maps, the low-resolution features are upsampled, and interpolation and aggregation of the features are performed iteratively, fusing the features of the stages from shallow to deep to form a gradually deeper decoder and finally output deeply fused high-resolution features.
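A minimal PyTorch sketch of such an IDA-style fusion is given below, assuming the stage feature maps are ordered from high resolution (stage 1) to low resolution (stage N). The 1×1 channel projections and the aggregation-node convolutions are illustrative choices, not the exact IDA module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDAFusion(nn.Module):
    """Illustrative IDA-style fusion: iteratively upsample the lower-resolution
    feature map and aggregate it with the next higher-resolution stage,
    forming a gradually deeper decoder (a sketch, not the exact IDA)."""
    def __init__(self, stage_channels, out_channels=64):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, out_channels, 1) for c in stage_channels])
        self.node = nn.ModuleList(
            [nn.Conv2d(2 * out_channels, out_channels, 3, padding=1)
             for _ in stage_channels[:-1]])

    def forward(self, stages):
        # stages: list ordered stage 1 (high res) .. stage K (low res)
        feats = [p(s) for p, s in zip(self.proj, stages)]
        x = feats[-1]                       # deepest, lowest-resolution stage
        for i in range(len(feats) - 2, -1, -1):
            x = F.interpolate(x, size=feats[i].shape[-2:],
                              mode='bilinear', align_corners=False)
            x = self.node[i](torch.cat([feats[i], x], dim=1))  # aggregate
        return x    # deeply fused high-resolution map, 64 channels at H/4 x W/4
```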
[0044] (3.2) Multi-task independent feature fusion structure: from the feature maps of different scales of stage 1 to stage N output by the DLA, the feature maps of stage 1 to stage M are selected as the input of the multi-feature fusion network for the subsequent target detection task, and the feature maps of stage 1 to stage N are selected as the input of the multi-feature fusion network for the subsequent target re-identification feature extraction task. Two independent multi-feature fusion networks, mutually independent with non-shared parameters, are constructed for the target detection task and the target re-identification feature extraction task; each fuses the multi-stage features for subsequent target detection and re-identification feature extraction, respectively. Each multi-feature fusion network uses an IDA feature fusion network.
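Assuming the illustrative IDAFusion class sketched above, the two fusion structures of (3.1) and (3.2) might be wired as follows, where `stages` denotes the stage 1 to stage N feature maps output by the backbone and M = 3, N = 4 are the preferred values. The channel tuple matches the hypothetical TinyBackbone, not a specified parameter of the invention.

```python
# (3.1) Hierarchical (parameter-sharing) variant: one fusion network serves
# both tasks; calling it on the first M stages reuses the same projections,
# so the two tasks share parameters on the common stages.
shared_fusion = IDAFusion(stage_channels=(64, 128, 256, 512))
det_feat = shared_fusion(stages[:3])   # detection fuses stages 1..M (M = 3)
reid_feat = shared_fusion(stages)      # re-ID fuses stages 1..N (N = 4)

# (3.2) Independent variant: two fusion networks with non-shared parameters.
det_fusion = IDAFusion(stage_channels=(64, 128, 256))
reid_fusion = IDAFusion(stage_channels=(64, 128, 256, 512))
det_feat = det_fusion(stages[:3])
reid_feat = reid_fusion(stages)
```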
[0045] (4) Multi-task branch
[0046] After obtaining the fused features that distinguish the target detection task and the re-identification feature extraction task, these fused features are input to the target detection task branch and the re-identification feature extraction task branch. The branches have the same network structure but are trained under different loss function constraints. The dimension of each branch's prediction is H/4 × W/4 × S, where H × W is the resolution of the model input image and S is the number of output channels of the respective branch. Each branch takes the fused features as input, first passes through a convolution layer, then a ReLU activation layer, and finally outputs the prediction through another convolution layer.
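A minimal sketch of one such prediction branch is given below, assuming a PyTorch implementation. The intermediate channel width (256) and the re-identification embedding dimension (128) are illustrative assumptions, since the text specifies only the convolution - ReLU - convolution structure and the output dimension H/4 × W/4 × S.

```python
import torch.nn as nn

def make_branch(in_channels=64, head_channels=256, s=1):
    """One prediction branch: conv -> ReLU -> conv, outputting S channels
    at H/4 x W/4 resolution."""
    return nn.Sequential(
        nn.Conv2d(in_channels, head_channels, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(head_channels, s, 1))

heatmap_branch = make_branch(s=1)    # target center heatmap
size_branch = make_branch(s=2)       # target height and width
offset_branch = make_branch(s=2)     # center point offset refinement
reid_branch = make_branch(s=128)     # embedding cube (dimension assumed)
```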
[0047] In the target detection task branch, the target detection features output by the multi-feature fusion network are input to the heatmap branch, the size branch, and the offset branch. The heatmap branch is constrained by a size-adaptive pixel-wise logistic regression loss function, while the size branch and the offset branch are trained with the L1 loss. The heatmap branch determines the target center point position, the size branch determines the target height and width, and the offset branch refines the offset of the target center point position, thereby localizing the targets in the current frame. In the target re-identification feature extraction task branch, the re-identification features output by the multi-feature fusion network are input to the embedding head; treating each target identity as a class, the branch is trained with a classification loss function through a convolution layer - ReLU activation layer - convolution layer structure, and the extracted features are expressed as embedding vectors. According to the target center point positions obtained by the target detection task, the embedding of each target is extracted from the full-image embedding vector cube, the appearance similarity between targets is computed, and the target ID is determined by the similarity calculation result. Finally, the target detection branch localizes the target positions, the re-identification feature extraction branch matches targets by computing the similarity of the extracted embedding vectors, and multi-target tracking is ultimately realized.
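The following sketch illustrates, under stated assumptions, how target embeddings might be read out of the full-image embedding cube at the detected center points and compared by appearance similarity. The top-k decoding and the cosine similarity are illustrative choices; the text specifies only that similarity over the extracted embeddings determines the target ID.

```python
import torch
import torch.nn.functional as F

def decode_and_embed(heatmap, embeddings, k=100):
    """Pick k candidate center points from the heatmap, then read the
    embedding vector of each target out of the full-image embedding cube
    at those positions (k = 100 is an illustrative assumption)."""
    b, _, h, w = heatmap.shape
    scores, inds = torch.topk(torch.sigmoid(heatmap).view(b, -1), k)
    ys = torch.div(inds, w, rounding_mode='floor')    # center y coordinates
    xs = inds % w                                     # center x coordinates
    emb = embeddings.view(b, embeddings.shape[1], -1)            # (B, D, H*W)
    emb = emb.gather(2, inds.unsqueeze(1).expand(-1, emb.shape[1], -1))
    return scores, xs, ys, F.normalize(emb, dim=1)    # unit-norm embeddings

def appearance_similarity(emb_curr, emb_prev):
    """Cosine similarity between current detections and existing tracks;
    IDs are then assigned from this matrix (e.g. via Hungarian matching)."""
    return emb_curr.transpose(1, 2) @ emb_prev        # (B, K_curr, K_prev)
```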
[0048] The above disclosure is only a specific embodiment of the present invention; variations that those skilled in the art can conceive based on the idea provided by the present invention shall all fall within the protection scope of the present invention.