A real-time target tracking method based on a hierarchical Siamese network
A pyramid feature fusion module and a position-aware prediction module in a hierarchical Siamese network address the weak feature representation and inaccurate position prediction of Siamese trackers in complex scenarios, achieving higher tracking accuracy and robustness while maintaining real-time tracking speed.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- QINGDAO UNIV OF TECH
- Filing Date
- 2022-09-27
- Publication Date
- 2026-06-12
AI Technical Summary
Existing Siamese network target tracking methods struggle to fully utilize multi-level feature information in complex scenarios such as similar interference, rapid movement, and occlusion. This results in insufficient feature representation and inaccurate position prediction, and these methods perform especially poorly when learning from difficult samples.
A hierarchical Siamese network is adopted, which integrates multi-level features through a pyramid feature fusion module and combines it with a location-aware prediction module for learning difficult samples. An improved ResNet-50 backbone network is used to extract features, and a pyramid feature fusion module and a location-aware prediction module are constructed to improve the accuracy of feature representation and localization.
By effectively utilizing multi-level feature information, the accuracy and robustness of target tracking are improved. It can adapt to complex scenes, maintain real-time tracking speed, and enhance the model's tracking performance under similar background interference, rapid movement, and occlusion conditions.
Figure CN115661195B_ABST
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, and more specifically to a real-time target tracking method using hierarchical Siamese networks.
Background Technology
[0002] Object tracking, a fundamental and challenging research topic in computer vision, has wide applications in video surveillance, autonomous driving, human-computer interaction, and medical diagnosis. Object tracking refers to the ability to accurately predict the target's position and size, among other key information, in subsequent frames, given the initial position of a target in a video sequence. With the development of deep learning, researchers have increasingly applied it to object tracking methods. One approach is to apply pre-trained deep network features to traditional correlation filter trackers; however, while improving tracking accuracy, this often reduces tracking speed, making real-time tracking impossible. Another approach is Siamese network tracking, which, due to its balance between accuracy and speed, has gradually become a mainstream research direction, and its results are playing an increasingly important role in scientific research and applications.
[0003] Although Siamese network-based tracking methods have made significant progress, they may still fail in complex scenarios such as similar interference, rapid movement, and occlusion. Firstly, existing methods often use the features of the last convolutional layer to represent the target. While deep features carry rich semantic information, such methods do not fully exploit the spatial information in lower-level features. Some methods do use multiple layers for feature representation, but they either use the layers individually or fuse them by interpolation and similar strategies, which can lose information at different levels and fails to fully exploit the multi-level features of deep networks for strong feature representation. Secondly, in the location prediction process, existing methods typically use an intersection-over-union (IoU) loss between the predicted box and the ground truth for location regression. Such methods rely on a high IoU and struggle to learn from difficult samples that have little or no overlap with the target. Therefore, designing a target tracking method that fully utilizes the multi-level features of deep networks to improve target feature representation and enhance accurate localization is an urgent problem to be solved.
Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention provides a real-time target tracking method based on hierarchical Siamese networks. The method proposes a pyramid feature fusion module that integrates multi-level hierarchical features. While preserving the semantic properties of high-level features, it introduces shallow features of the same resolution through a top-down structure and smooths the fused features, mitigating the information loss caused by using features from different layers individually or by simple interpolation. A position-aware prediction module is also proposed, which cascades high-level features to low-level features and uses a position-aware loss for hard sample learning. Through training and optimization, the potential of Siamese networks in tracking is more fully exploited, improving tracking accuracy and robustness.
[0005] This invention is achieved through the following technical solution: a real-time target tracking method using hierarchical Siamese networks, characterized by comprising the following steps:
[0006] S1: Use a Siamese sub-network to extract features from the template image and search image of the video frame sequence;
[0007] In the initial frame of the video sequence, a template frame image z is cropped with the target object as the center, and a search frame image x is cropped in the current frame. The template frame image and the search frame image are then fed into the template branch and the search branch in the Siamese sub-network, respectively, for feature extraction.
[0008] S2: Construct a pyramid feature fusion module that integrates multi-level hierarchical features.
[0009] Using the Siamese sub-network in step S1, features from convolutional layers 3, 4, and 5 are extracted for use in the pyramid feature fusion module, thereby constructing a feature pyramid with information at different levels. The convolutional features of layers 3, 4, and 5 are first reduced in channel number by a 1×1 convolution operation to obtain the processed features $\varphi_3(x)$, $\varphi_4(x)$, and $\varphi_5(x)$, as well as $\varphi_3(z)$, $\varphi_4(z)$, and $\varphi_5(z)$. Then, the processed features are added element-wise to the corresponding shallow features in a top-down manner. Finally, a 3×3 convolution operation smooths the fused features of the different layers and learns semantic relevance. By fusing multi-level features in the pyramid feature fusion module, fused feature maps of convolutional layers 3, 4, and 5 are obtained to construct a more discriminative target representation.
[0010] S3: Construct a hierarchical location-aware prediction module
[0011] The location-aware prediction module includes multiple location-aware prediction heads. Each location-aware prediction head includes two subtasks: a classification branch that classifies the target from the background and a regression branch that provides the target bounding box regression.
[0012] For a single location-aware prediction head, the fused layer-3, layer-4, and layer-5 features $\Phi_s(x)$ and $\Phi_s(z)$ of the search image x and the template image z, obtained by the pyramid feature fusion module in step S2, are copied as $[\Phi_s(x)]_{cls}$, $[\Phi_s(z)]_{cls}$ and $[\Phi_s(x)]_{reg}$, $[\Phi_s(z)]_{reg}$ and passed to the classification and regression branches. The classification feature map $P_s^{cls} = [\Phi_s(x)]_{cls} \star [\Phi_s(z)]_{cls}$ and the regression feature map $P_s^{reg} = [\Phi_s(x)]_{reg} \star [\Phi_s(z)]_{reg}$ are then computed, where $\star$ denotes the cross-correlation operation; $[\Phi_s(x)]_{cls}$ and $[\Phi_s(z)]_{cls}$ are the copies of the fused features of the search frame image x and the template frame image z used for the classification branch; $[\Phi_s(x)]_{reg}$ and $[\Phi_s(z)]_{reg}$ are the copies of the same fused features used for the regression branch; and $P_s^{cls}$ and $P_s^{reg}$ are the classification feature map and the regression feature map, respectively.
[0013] Since a single-level prediction head may degrade tracking performance when faced with similar interference or significant target changes, multi-level prediction heads are used to construct the position-aware prediction module and progressively refine the target's position and changes. This yields the feature maps of the classification and regression branches of the position-aware prediction module, $P^{cls} = \sum_s w_s^{cls} P_s^{cls}$ and $P^{reg} = \sum_s w_s^{reg} P_s^{reg}$, where $w_s^{cls}$ and $w_s^{reg}$ are the weights of each prediction head in the classification and regression branches, respectively.
[0014] S4: Model Training and Online Tracking
[0015] In model training, a large dataset is used for end-to-end training. Pairs of template images and search images are cropped in advance to train the hierarchical Siamese network. During training, stochastic gradient descent is used to optimize the network parameters and gradually reduce the overall loss of the proposed hierarchical Siamese network until the model performance no longer improves.
[0016] In online tracking, given a video sequence to be tracked, template frame images are acquired as described in step 1, and template features are extracted using a Siamese subnetwork. In subsequent sequence frames, search frame features are extracted based on the tracking results of the previous frame. After obtaining the template frame and search frame image features, they are fed into the pyramid feature fusion module to obtain fused low-level, mid-level, and high-level features, respectively. The obtained fused features are then input into the three position-aware prediction heads of the hierarchical position-aware prediction module to obtain three classification feature maps and three regression feature maps. The three classification feature maps and three regression feature maps are then weighted and fused to obtain the fused classification and regression results, thereby obtaining the target prediction box for the current frame. The prediction box with the highest score is selected as the prediction result for the current frame.
[0017] Furthermore, in step S1, the template frame image z has a size of 127×127×3, and the search frame image x has a size of 255×255×3.
[0018] Furthermore, in step S1, the Siamese sub-network uses an improved ResNet-50 as its backbone network. The improvement to ResNet-50 is as follows: the stride of convolutional layers 4 and 5 is set to 1 to increase the spatial size of the feature map and retain more detailed information, while dilation rates of 2 and 4 are used, respectively, to increase the receptive field. The i-th convolutional layer features of the template frame and the search frame are thus obtained from the Siamese sub-network.
[0019] Furthermore, in step S2, the fused layer-4 features are computed as $\Phi_4(z) = \mathrm{Conv}_{3\times3}\big(\varphi_4(z) + \Phi_5(z)\big)$ and $\Phi_4(x) = \mathrm{Conv}_{3\times3}\big(\varphi_4(x) + \Phi_5(x)\big)$, where $\Phi_5(z) = \varphi_5(z)$ and $\Phi_5(x) = \varphi_5(x)$ are the fused layer-5 features, $\Phi_4(z)$ is the fused layer-4 feature of the template frame image, and $\Phi_4(x)$ is the fused layer-4 feature of the search frame image. The fused layer-3 features are computed in the same way as the fused layer-4 features, and the layer-5 feature is output directly after the 1×1 convolution.
[0020] Furthermore, in step S3, each point in the generated classification feature map represents the confidence of a positive or negative sample, and each point in the generated regression feature map represents the offset between the predicted box and the ground-truth bounding box. The intersection over union (IoU) between the predicted box A and the ground-truth box B is calculated from these offsets. The regression loss function $L_{reg}$ is then defined in terms of the IoU and the outer bounding box C, the smallest bounding box that contains both A and B.
[0021] Furthermore, in step S3, the loss function of a single prediction head is expressed as $L = \lambda_1 L_{cls} + \lambda_2 L_{reg}$, where $\lambda_1$ and $\lambda_2$ are trade-off parameters and are set to 1. The overall loss function of the hierarchical Siamese network is $L = \lambda_1 \sum_s L_{cls} + \lambda_2 \sum_s L_{reg}$, where s indexes the cascade stages and $L_{cls}$ is the cross-entropy loss used for classification.
[0022] Furthermore, in step S4, the large dataset includes COCO, ImageNet DET, and ImageNet VID.
[0023] Furthermore, in step S4, the specific method for optimizing the network parameters and gradually reducing the overall loss of the proposed hierarchical Siamese network is as follows: the network parameters are optimized using stochastic gradient descent for a total of 20 training epochs with a batch size of 28. In the first 5 epochs, the learning rate is increased from 0.001 to 0.005; in the last 15 epochs, the learning rate is decreased from 0.005 to 0.0005, gradually reducing the overall loss until the model's performance no longer improves.
[0024] Furthermore, in step S4, the fused classification result and regression result are obtained. The classification result represents the classification score at each position, and the regression result represents the predicted target box description. The regression position corresponding to the maximum classification score is the target prediction box of the current frame.
[0025] The beneficial effects of this invention are as follows:
[0026] This invention provides a real-time target tracking method based on hierarchical Siamese networks. Through a cascaded architecture, the method effectively utilizes multi-level features and performs hard sample learning to achieve accurate localization, adapting well to complex situations such as similar background interference, rapid movement, and occlusion. Specifically, the pyramid feature fusion module achieves multi-level feature fusion: it fuses low-level features of the same resolution in a top-down manner while preserving high-level semantic information, and then smooths the fused features, ensuring effective fusion of features from different levels. Because of its simple structure, it maintains real-time tracking speed without significant computational overhead. In addition, the position-aware prediction module learns from hard samples, further improving localization accuracy.
Description of the Drawings
[0027] The invention will now be further described with reference to the accompanying drawings.
[0028] Figure 1 is a diagram of the overall network architecture of the present invention;
[0029] Figure 2 is a diagram of the pyramid feature fusion module architecture of the present invention.
Detailed Implementation
[0030] The present invention will now be described in further detail with reference to the accompanying drawings and specific embodiments.
[0031] This invention provides a real-time tracking method based on a hierarchical Siamese network, whose overall network architecture is shown in Figure 1. The network comprises three parts: a Siamese sub-network, a pyramid feature fusion module, and a location-aware prediction module. The Siamese sub-network extracts shallow and deep features from the template frame and the search frame. The pyramid feature fusion module fuses multi-level features to obtain a discriminative target representation. The location-aware prediction module cascades the location-aware prediction heads to successively refine the target position and target changes, and introduces a location-aware loss to learn from difficult samples, ensuring tracking accuracy.
[0032] I. Feature extraction is performed on the template image and search image of the video frame sequence using a Siamese sub-network;
[0033] In the initial frame of the video sequence, a 127×127×3 template frame image z is cropped with the target object as the center, and a 255×255×3 search frame image x is cropped from the current frame. The Siamese subnetwork contains a template branch and a search branch. The two branches have the same network structure and share the same parameters. The template frame image and the search frame image are fed into the template branch and the search branch, respectively, for feature extraction. In the subsequent tracking task, they are sent to a common embedding space for similarity learning.
[0034] In this embodiment, the Siamese sub-network uses an improved ResNet-50 as the backbone network. To make ResNet-50 suitable for the hierarchical Siamese network tracking method proposed in this invention, this embodiment reduces the stride to obtain higher spatial resolution and uses dilated convolutions to increase the receptive field. Therefore, based on the Siamese sub-network, different convolutional layer features of the template frame and search frame can be obtained separately. The specific operation is as follows:
[0035] The stride of convolutional layers 4 and 5 is set to 1 to increase the spatial size of the feature maps and retain more detailed information. Dilation rates of 2 and 4 are used, respectively, to increase the receptive field. The features of the i-th convolutional layer of the template frame and the search frame can thus be obtained separately from the Siamese sub-network.
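The effect of replacing stride with dilation can be checked with the standard convolution output-size formula. The sketch below is illustrative only: the 31-pixel input width, the 3×3 kernel, and the padding values are assumptions chosen for the example, not figures taken from this embodiment.

```python
def conv_output_size(size, kernel=3, stride=1, padding=0, dilation=1):
    """Spatial output size of a 2D convolution along one axis."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def receptive_span(kernel=3, dilation=1):
    """Width of the input region covered by one dilated kernel."""
    return dilation * (kernel - 1) + 1


# Hypothetical 31-wide feature map: a stride-2 convolution roughly halves it.
strided = conv_output_size(31, kernel=3, stride=2, padding=1)

# With stride 1 and padding equal to the dilation rate, the size is preserved
# while the kernel covers a wider input span.
dilated_2 = conv_output_size(31, kernel=3, stride=1, padding=2, dilation=2)
dilated_4 = conv_output_size(31, kernel=3, stride=1, padding=4, dilation=4)

print(strided, dilated_2, dilated_4)                 # 16 31 31
print(receptive_span(3, 2), receptive_span(3, 4))    # 5 9
```

Under these assumptions, layers 4 and 5 keep the resolution of their input, which is what allows the later fusion step to add features from different layers element-wise.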
[0036] II. Constructing a Pyramid Feature Fusion Module that Integrates Multi-Level Hierarchical Features
[0037] Using the Siamese sub-network in step I, features from convolutional layers 3, 4, and 5 are extracted for use in the pyramid feature fusion module, thereby constructing a feature pyramid with information at different levels. Hierarchical representations can be obtained from these feature maps.
[0038] These three feature layers achieve the same spatial resolution using operations such as dilated convolutions in Siamese networks, but can capture different levels of information depending on their receptive fields. While preserving the semantic information of high-level features, the pyramid feature fusion module introduces low-level information with the same resolution and smooths the fused features, thereby mitigating the information gap between different levels caused by simple interpolation fusion strategies or the use of different feature layers individually.
[0039] The pyramid feature fusion module is shown in Figure 2. The convolutional features of layers three, four, and five are first reduced in channel number by a 1×1 convolution operation to obtain the processed features $\varphi_3(z)$, $\varphi_4(z)$, and $\varphi_5(z)$. Then, the processed features are added element-wise to the corresponding shallow features in a top-down manner. Finally, a 3×3 convolution operation smooths the fused features of the different layers and learns semantic relevance.
[0040] Taking convolutional layer four as an example, the fused features are computed as $\Phi_4(z) = \mathrm{Conv}_{3\times3}\big(\varphi_4(z) + \Phi_5(z)\big)$ and $\Phi_4(x) = \mathrm{Conv}_{3\times3}\big(\varphi_4(x) + \Phi_5(x)\big)$, where $\Phi_5(z) = \varphi_5(z)$ and $\Phi_5(x) = \varphi_5(x)$ are the fused layer-5 features. The fused layer-3 features are computed in the same way as the fused layer-4 features, and the layer-5 feature is output directly after the 1×1 convolution. Thus, by fusing multi-level features step by step through the pyramid feature fusion module, fused feature maps of convolutional layers three, four, and five are obtained, respectively, to construct a more discriminative target representation.
[0041] This pyramid feature fusion module has a simple structure and does not incur a large computational burden, thus ensuring real-time tracking.
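The top-down fusion described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: feature maps are nested lists indexed as `[channel][row][column]`, the 1×1 convolution weights are hand-picked, and the learned 3×3 convolution is replaced by a fixed 3×3 mean filter. All function names are hypothetical.

```python
def conv1x1(feat, weights):
    """1x1 convolution: each output channel is a weighted sum of input channels."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(wk * feat[k][i][j] for k, wk in enumerate(row))
              for j in range(w)] for i in range(h)] for row in weights]


def add(a, b):
    """Element-wise addition of two feature maps with identical shape."""
    return [[[a[c][i][j] + b[c][i][j] for j in range(len(a[c][i]))]
             for i in range(len(a[c]))] for c in range(len(a))]


def smooth3x3(feat):
    """Stand-in for the learned 3x3 convolution (assumption): a per-channel
    3x3 mean filter with zero padding."""
    out = []
    for ch in feat:
        h, w = len(ch), len(ch[0])
        out.append([[sum(ch[i + di][j + dj]
                         for di in (-1, 0, 1) for dj in (-1, 0, 1)
                         if 0 <= i + di < h and 0 <= j + dj < w) / 9.0
                     for j in range(w)] for i in range(h)])
    return out


def pyramid_fuse(c3, c4, c5, w3, w4, w5):
    """Top-down fusion: layer 5 passes through, layers 4 and 3 add the fused
    feature from the layer above and are then smoothed."""
    phi3, phi4, phi5 = conv1x1(c3, w3), conv1x1(c4, w4), conv1x1(c5, w5)
    fused5 = phi5
    fused4 = smooth3x3(add(phi4, fused5))
    fused3 = smooth3x3(add(phi3, fused4))
    return fused3, fused4, fused5


def constant_map(channels, size, value):
    """Helper for the example: a feature map filled with one value."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]


# Hypothetical inputs: same 3x3 resolution, different channel counts.
c3, c4, c5 = constant_map(2, 3, 1.0), constant_map(4, 3, 1.0), constant_map(4, 3, 1.0)
w3, w4, w5 = [[0.5, 0.5]], [[0.25] * 4], [[0.25] * 4]   # reduce each to 1 channel
f3, f4, f5 = pyramid_fuse(c3, c4, c5, w3, w4, w5)
```

Because all three layers share one spatial resolution, no upsampling or interpolation is needed before the element-wise addition; the 1×1 convolution only has to bring the channel counts into agreement.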
[0042] III. Constructing a Cascaded Location-Aware Prediction Module
[0043] The location-aware prediction module includes multiple location-aware prediction heads. Each location-aware prediction head includes two subtasks: a classification branch that classifies the target from the background, and a regression branch that provides the bounding box regression.
[0044] For a single location-aware prediction head, this invention takes the fused layer-3, layer-4, and layer-5 features $\Phi_s(x)$ and $\Phi_s(z)$ of the search image x and the template image z, obtained by the pyramid feature fusion module in step II, and copies them as $[\Phi_s(x)]_{cls}$, $[\Phi_s(z)]_{cls}$ and $[\Phi_s(x)]_{reg}$, $[\Phi_s(z)]_{reg}$ for the classification and regression branches. The classification feature map $P_s^{cls} = [\Phi_s(x)]_{cls} \star [\Phi_s(z)]_{cls}$ and the regression feature map $P_s^{reg} = [\Phi_s(x)]_{reg} \star [\Phi_s(z)]_{reg}$ are then computed, where $\star$ denotes the cross-correlation operation.
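As a rough illustration of the cross-correlation step, the sketch below slides a template feature over a search feature and records the inner product at each position. It assumes single-channel 2D maps and valid (unpadded) positions; the text does not specify the multi-channel variant, and the function name and toy values are hypothetical.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (valid positions only) and return
    the response map of inner products."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    return [[sum(search[i + u][j + v] * template[u][v]
                 for u in range(th) for v in range(tw))
             for j in range(sw - tw + 1)]
            for i in range(sh - th + 1)]


# Toy search feature containing the template pattern at its center.
search = [[0, 0, 0, 0],
          [0, 1, 2, 0],
          [0, 3, 4, 0],
          [0, 0, 0, 0]]
template = [[1, 2],
            [3, 4]]

response = cross_correlate(search, template)
print(response)   # [[4, 11, 6], [14, 30, 14], [6, 11, 4]]
```

The response peaks where the search region best matches the template, which is the property the classification branch relies on to separate target from background.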
[0045] Since a single-level prediction head may degrade tracking performance when faced with similar interference or significant target changes, multi-level prediction heads are used to construct the position-aware prediction module and progressively refine the target's position and changes. Writing $P_s^{cls}$ and $P_s^{reg}$ for the classification and regression feature maps of the s-th prediction head, the feature maps of the classification and regression branches of the position-aware prediction module are $P^{cls} = \sum_s w_s^{cls} P_s^{cls}$ and $P^{reg} = \sum_s w_s^{reg} P_s^{reg}$, where $w_s^{cls}$ and $w_s^{reg}$ are the weights of each prediction head in the classification and regression branches, respectively.
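The weighted combination of the prediction heads can be sketched as a per-location weighted sum. The three maps and the weights below are made-up values for illustration; how the weights are obtained is not stated in the text, and `fuse_heads` is a hypothetical name.

```python
def fuse_heads(maps, weights):
    """Weighted sum of same-sized 2D maps, one map per prediction head."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(wt * m[i][j] for m, wt in zip(maps, weights))
             for j in range(w)] for i in range(h)]


# Hypothetical 1x2 classification maps from three prediction heads.
cls_maps = [[[0.2, 0.8]],
            [[0.4, 0.6]],
            [[0.6, 0.4]]]
cls_weights = [0.5, 0.3, 0.2]   # hypothetical per-head weights

fused_cls = fuse_heads(cls_maps, cls_weights)
```

The same function applies unchanged to the regression branch, with one call per box coordinate channel.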
[0046] In the generated classification feature map, each point represents the confidence score of a positive or negative sample; in the generated regression feature map, each point represents the offset between the predicted box and the ground-truth bounding box. The intersection over union (IoU) between the predicted box A and the ground-truth box B can be calculated from these offsets. The regression loss function $L_{reg}$ is then defined in terms of the IoU and the outer bounding box C, the smallest bounding box that contains both A and B. For a single prediction head, the loss function can be expressed as $L = \lambda_1 L_{cls} + \lambda_2 L_{reg}$, where $\lambda_1$ and $\lambda_2$ are trade-off parameters and are empirically set to 1.
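The exact regression loss is not reproduced in this text. One well-known loss that is built from the IoU and the smallest enclosing box C is the generalized IoU (GIoU) loss, $1 - \mathrm{IoU} + |C \setminus (A \cup B)| / |C|$. The sketch below assumes that form purely for illustration; it may differ from the loss actually used. Boxes are `(x1, y1, x2, y2)` tuples and all names are hypothetical.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; empty boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def giou_style_loss(a, b):
    """Assumed GIoU-form loss: 1 - IoU + |C \\ (A u B)| / |C|."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    # C: smallest box that contains both A and B.
    c_area = area((min(a[0], b[0]), min(a[1], b[1]),
                   max(a[2], b[2]), max(a[3], b[3])))
    iou = inter / union
    return 1.0 - iou + (c_area - union) / c_area


same = giou_style_loss((0, 0, 2, 2), (0, 0, 2, 2))   # perfect overlap
near = giou_style_loss((0, 0, 1, 1), (2, 0, 3, 1))   # disjoint, close
far = giou_style_loss((0, 0, 1, 1), (4, 0, 5, 1))    # disjoint, farther away
```

A plain IoU loss equals 1 for every pair of non-overlapping boxes, so it cannot tell `near` from `far`. With the enclosing-box term the loss grows with the gap between the boxes, which gives a usable training signal for samples with little or no overlap.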
[0047] The hierarchical location-aware prediction module cascades multiple prediction heads to leverage the multi-layer features obtained from the pyramid feature fusion module. The first stage directly uses the features of the last layer of the Siamese sub-network, the s-th stage receives fused features from a certain layer and the layers above it, and so on. Therefore, the overall loss function of the hierarchical Siamese network is $L = \lambda_1 \sum_s L_{cls} + \lambda_2 \sum_s L_{reg}$, where s indexes the cascade stages and $L_{cls}$ is the cross-entropy loss used for classification.
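The overall loss simply sums the per-stage classification and regression losses and weights the two sums. The sketch below illustrates this with a binary cross-entropy for a single location per stage; the probabilities and regression losses are made-up values, and the function names are hypothetical.

```python
import math


def cross_entropy(prob_positive, label):
    """Binary cross-entropy for one location; label is 1 (target) or 0 (background)."""
    p = prob_positive if label == 1 else 1.0 - prob_positive
    return -math.log(p)


def total_loss(cls_losses, reg_losses, lambda1=1.0, lambda2=1.0):
    """lambda1 * (sum of stage classification losses)
    + lambda2 * (sum of stage regression losses)."""
    return lambda1 * sum(cls_losses) + lambda2 * sum(reg_losses)


# Hypothetical outputs of three cascade stages for one positive sample.
cls_losses = [cross_entropy(0.9, 1), cross_entropy(0.8, 1), cross_entropy(0.7, 1)]
reg_losses = [0.30, 0.20, 0.10]

loss = total_loss(cls_losses, reg_losses)   # both trade-off parameters set to 1
```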
[0048] IV. Model Training and Online Tracking
[0049] In model training, this invention uses large datasets with high-quality annotations, such as COCO, ImageNet DET, and ImageNet VID, for end-to-end training. Before the video frame images from these datasets are fed into the network, they are cropped to obtain 127×127 template frame images and 255×255 search frame images. During training, stochastic gradient descent is used to optimize the network parameters for a total of 20 training epochs with a batch size of 28. In the first 5 epochs, the learning rate is increased from 0.001 to 0.005; in the subsequent 15 epochs, the learning rate is decreased from 0.005 to 0.0005, gradually reducing the overall loss of the proposed hierarchical Siamese network until the model's performance no longer improves.
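The learning-rate schedule above can be sketched as a function of the epoch index. The text gives only the endpoints (0.001 to 0.005 over the first 5 epochs, then 0.005 to 0.0005 over the remaining 15), so the linear warm-up and geometric decay used here are assumptions, as are the function and parameter names.

```python
def learning_rate(epoch, warmup_epochs=5, total_epochs=20,
                  lr_start=0.001, lr_peak=0.005, lr_end=0.0005):
    """Learning rate for a 0-based epoch index."""
    if epoch < warmup_epochs:
        # Linear warm-up from lr_start to lr_peak (interpolation is an assumption).
        t = epoch / (warmup_epochs - 1)
        return lr_start + t * (lr_peak - lr_start)
    # Geometric decay from lr_peak to lr_end (interpolation is an assumption).
    decay_epochs = total_epochs - warmup_epochs
    t = (epoch - warmup_epochs) / (decay_epochs - 1)
    return lr_peak * (lr_end / lr_peak) ** t


schedule = [learning_rate(epoch) for epoch in range(20)]
```

In a training loop, `schedule[epoch]` would be assigned to the optimizer's learning rate at the start of each epoch.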
[0050] In online tracking, given a video sequence to be tracked, an optimized hierarchical Siamese network is used for automatic tracking to obtain the tracking result. The first frame of the given video sequence is used as the template frame image, and subsequent video frame images are used as search frame images. These are fed into a weight-shared Siamese sub-network for multi-layer feature extraction. After obtaining the features of the template frame and search frame images, they are fed into a pyramid feature fusion module to obtain fused low-level, mid-level, and high-level features, respectively. The obtained fused features are then input into the three position-aware prediction heads of the hierarchical position-aware prediction module to obtain three classification feature maps and three regression feature maps. The three classification feature maps and three regression feature maps are then weighted and fused to obtain the fused classification result and regression result. The classification result represents the classification score at each position, and the regression result represents the predicted target box description. The regression position corresponding to the maximum classification score is the target result for the current frame.
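The final selection step, taking the regression output at the location with the highest fused classification score, can be sketched as follows. The score map and boxes are made-up values, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and `select_box` is a hypothetical name.

```python
def select_box(cls_scores, reg_boxes):
    """Return the box and score at the location with the highest score."""
    rows, cols = len(cls_scores), len(cls_scores[0])
    i, j = max(((r, c) for r in range(rows) for c in range(cols)),
               key=lambda rc: cls_scores[rc[0]][rc[1]])
    return reg_boxes[i][j], cls_scores[i][j]


# Hypothetical fused 2x2 classification scores and per-location boxes.
cls_scores = [[0.1, 0.3],
              [0.9, 0.2]]
reg_boxes = [[(0, 0, 10, 10), (5, 0, 15, 10)],
             [(0, 5, 10, 15), (5, 5, 15, 15)]]

box, score = select_box(cls_scores, reg_boxes)
print(box, score)   # (0, 5, 10, 15) 0.9
```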
[0051] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A real-time target tracking method using hierarchical Siamese networks, characterized by comprising the following steps:
S1: Use a Siamese sub-network to extract features from the template frame image and the search frame image of the video frame sequence. In the initial frame of the video frame sequence, a template frame image z is cropped with the target object as the center, and a search frame image x is cropped in the current frame. The template frame image and the search frame image are then fed into the template branch and the search branch of the Siamese sub-network, respectively, for feature extraction.
S2: Construct a pyramid feature fusion module that integrates multi-level hierarchical features. Using the Siamese sub-network from step S1, features from convolutional layers 3, 4, and 5 are extracted for use in the pyramid feature fusion module, thus constructing a feature pyramid with information at different levels. The convolutional features from layers 3, 4, and 5 are first reduced in channel number by a 1×1 convolution operation to obtain the processed features $\varphi_3(x)$, $\varphi_4(x)$, and $\varphi_5(x)$, as well as $\varphi_3(z)$, $\varphi_4(z)$, and $\varphi_5(z)$. Then, the processed features are added element-wise to the corresponding shallow features in a top-down manner. Finally, a 3×3 convolution operation smooths the fused features of the different layers and learns semantic relevance. By fusing multi-level features through the pyramid feature fusion module, the fused feature maps of the third, fourth, and fifth convolutional layers are obtained to construct a more discriminative target representation.
S3: Construct a hierarchical location-aware prediction module. The location-aware prediction module includes multiple location-aware prediction heads. Each location-aware prediction head includes two subtasks: a classification branch that separates the target from the background and a regression branch that regresses the target bounding box. For a single location-aware prediction head, the fused layer-3, layer-4, and layer-5 features $\Phi_s(x)$ and $\Phi_s(z)$ of the search frame image x and the template frame image z, obtained by the pyramid feature fusion module in step S2, are copied as $[\Phi_s(x)]_{cls}$, $[\Phi_s(z)]_{cls}$ and $[\Phi_s(x)]_{reg}$, $[\Phi_s(z)]_{reg}$ and passed to the classification branch and the regression branch. The classification feature map $P_s^{cls} = [\Phi_s(x)]_{cls} \star [\Phi_s(z)]_{cls}$ and the regression feature map $P_s^{reg} = [\Phi_s(x)]_{reg} \star [\Phi_s(z)]_{reg}$ are then computed, where $\star$ denotes the cross-correlation operation; $[\Phi_s(x)]_{cls}$ and $[\Phi_s(z)]_{cls}$ are the copies of the fused features of the search frame image x and the template frame image z used for the classification branch; $[\Phi_s(x)]_{reg}$ and $[\Phi_s(z)]_{reg}$ are the copies of the same fused features used for the regression branch; and $P_s^{cls}$ and $P_s^{reg}$ are the classification feature map and the regression feature map, respectively. Since a single-level prediction head may degrade tracking performance when faced with similar interference or significant target changes, multi-level prediction heads are used to construct the position-aware prediction module and progressively refine the target position and changes. This yields the feature maps of the classification and regression branches of the position-aware prediction module, $P^{cls} = \sum_s w_s^{cls} P_s^{cls}$ and $P^{reg} = \sum_s w_s^{reg} P_s^{reg}$, where $w_s^{cls}$ and $w_s^{reg}$ are the weights of each prediction head in the classification and regression branches, respectively.
S4: Model training and online tracking. In model training, a large dataset is used for end-to-end training. Pairs of template frame images and search frame images are cropped in advance to train the hierarchical Siamese network. During training, stochastic gradient descent is used to optimize the network parameters and gradually reduce the overall loss of the proposed hierarchical Siamese network until the model performance no longer improves. In online tracking, given a sequence of video frames to be tracked, the template frame image is acquired as described in step S1, and template features are extracted using the Siamese sub-network. In subsequent frames, search frame features are extracted based on the tracking result of the previous frame. The template frame and search frame image features are then fed into the pyramid feature fusion module to obtain fused low-level, mid-level, and high-level features, respectively. The fused features are input into the three position-aware prediction heads of the hierarchical position-aware prediction module to obtain three classification feature maps and three regression feature maps. The three classification feature maps and three regression feature maps are then weighted and fused to obtain the fused classification and regression results, thereby obtaining the target result for the current frame.
2. The real-time target tracking method using hierarchical Siamese networks according to claim 1, characterized in that, in step S1, the template frame image z has a size of 127×127×3, and the search frame image x has a size of 255×255×3.
3. The real-time target tracking method using hierarchical Siamese networks according to claim 1, characterized in that, in step S1, the Siamese sub-network uses an improved ResNet-50 as its backbone network. The improvement to ResNet-50 is as follows: the stride of convolutional layers 4 and 5 is set to 1 to increase the spatial size of the feature map and retain more detailed information; at the same time, dilation rates of 2 and 4 are used, respectively, to increase the receptive field. The i-th convolutional layer features of the template frame and the search frame are thus obtained from the Siamese sub-network.
4. The real-time target tracking method using hierarchical Siamese networks according to claim 1, characterized in that, in step S2, the fused layer-4 features are computed as $\Phi_4(z) = \mathrm{Conv}_{3\times3}\big(\varphi_4(z) + \Phi_5(z)\big)$ and $\Phi_4(x) = \mathrm{Conv}_{3\times3}\big(\varphi_4(x) + \Phi_5(x)\big)$, where $\Phi_5(z) = \varphi_5(z)$ is the fused layer-5 feature of the template frame image, $\Phi_4(z)$ is the fused layer-4 feature of the template frame image, $\Phi_4(x)$ is the fused layer-4 feature of the search frame image, and $\Phi_5(x) = \varphi_5(x)$ is the fused layer-5 feature of the search frame image; the fused layer-3 features are computed in the same way as the fused layer-4 features, and the layer-5 feature is output directly after the 1×1 convolution.
5. The real-time target tracking method using hierarchical Siamese networks according to claim 1, characterized in that, in step S3, each point in the generated classification feature map represents the confidence of a positive or negative sample, and each point in the generated regression feature map represents the offset between the predicted box and the ground-truth bounding box. The intersection over union (IoU) between the predicted box A and the ground-truth box B is calculated from these offsets. The regression loss function $L_{reg}$ is then defined in terms of the IoU and the outer bounding box C, the smallest bounding box that contains both A and B.
6. The real-time target tracking method using hierarchical Siamese networks according to claim 1, characterized in that, in step S3, the loss function of a single prediction head is expressed as $L = \lambda_1 L_{cls} + \lambda_2 L_{reg}$, where the trade-off parameters $\lambda_1$ and $\lambda_2$ are set to 1; the overall loss function of the hierarchical Siamese network is $L = \lambda_1 \sum_s L_{cls} + \lambda_2 \sum_s L_{reg}$, where s indexes the cascade stages, $L_{cls}$ is the cross-entropy loss used for classification, and $L_{reg}$ is the regression loss function.
7. The real-time target tracking method using hierarchical Siamese networks according to claim 1, characterized in that, in step S4, the large dataset includes COCO, ImageNet DET, and ImageNet VID.
8. The real-time target tracking method using hierarchical Siamese networks according to claim 1, characterized in that, in step S4, the specific method for optimizing the network parameters and gradually reducing the overall loss of the proposed hierarchical Siamese network is as follows: the network parameters are optimized using stochastic gradient descent for a total of 20 training epochs with a batch size of 28. In the first 5 epochs, the learning rate is increased from 0.001 to 0.005; in the last 15 epochs, the learning rate is decreased from 0.005 to 0.0005, gradually reducing the overall loss until the model's performance no longer improves.
9. The real-time target tracking method using hierarchical Siamese networks according to claim 1, characterized in that, in step S4, the fused classification result and regression result are obtained. The classification result represents the classification score at each position, and the regression result represents the predicted target box description. The regression position corresponding to the maximum classification score is the target prediction box of the current frame.