A frequency domain alignment method for robust visual target tracking under hazy weather conditions

By employing frequency domain alignment and consistency distillation mechanisms, combined with feature extraction by teacher and student networks, the method addresses the instability of target tracking under hazy weather conditions and achieves stable tracking in hazy environments.

CN122289712A · Pending · Publication Date: 2026-06-26 · BEIJING INST OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: BEIJING INST OF TECH
Filing Date: 2026-05-25
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

During target tracking in visible light images under hazy weather conditions, atmospheric scattering and attenuation lead to decreased image contrast, loss of texture details, and weakened edge information, resulting in unstable target representation, difficulty in cross-frame matching, tracking drift, and even tracking failure.

Method used

Teacher and student networks are used to extract features from template and search images. Frequency domain decomposition is performed to obtain low-frequency and high-frequency information. A joint representation is constructed by combining spatial domain features. The stability of feature representation is improved by frequency domain representation alignment mechanism and enhanced consistency distillation mechanism. Frequency domain-aware embedding is constructed to optimize the target tracking model.

Benefits of technology

Under hazy weather conditions, it significantly improves the accuracy and stability of target tracking, reduces tracking drift, and enhances the model's ability to locate targets in hazy environments.

✦ Generated by Eureka AI based on patent content.


Abstract

A frequency domain alignment method for robust visual target tracking under hazy weather conditions is disclosed, belonging to the field of intelligent driving environmental perception. The method involves: extracting features from template and search images using teacher and student networks; performing frequency domain decomposition on the extracted feature maps to extract low-frequency and high-frequency features; simultaneously extracting global and local features in the spatial domain; and fusing these features to construct a frequency domain-aware embedding. Frequency domain representation alignment is performed using the frequency domain-aware embeddings from the teacher and student networks; the frequency domain alignment loss is calculated; and consistency distillation is performed on the enhanced view to optimize the target response distribution. The student network is trained using the total loss function to obtain a trained target tracking model; and frequency domain alignment of target tracking is achieved based on the trained target tracking model. This invention improves the accuracy and robustness of target tracking under hazy weather conditions.

Description

Technical Field

[0001] This invention relates to a frequency domain alignment method for robust visual target tracking under hazy weather conditions, belonging to the field of intelligent driving environmental perception.

Background Technology

[0002] Intelligent driving vehicles acquire information about surrounding objects such as vehicles and pedestrians through sensors. Among these, visible light cameras can capture information about the appearance, texture, contour, and motion changes of targets in the scene, making them an effective source of information for environmental perception in intelligent driving. By modeling and matching the appearance features of targets in video sequences, continuous localization of targets in subsequent frames can be achieved, which can be used to develop various types of tracking algorithms.

[0003] In hazy weather conditions, suspended particles in the atmosphere scatter and attenuate propagating light, leading to decreased contrast, loss of texture details, weakened edge information, and blurred target appearance in images acquired by visible light cameras. These image degradation issues significantly interfere with visual target tracking based on visible light images, affecting the accuracy of intelligent driving target tracking algorithms under hazy conditions, and consequently causing target localization errors, tracking drift, or even tracking failure. Optimizing visible light image target tracking methods under hazy conditions can improve the reliability of visual perception systems and maintain the stability of intelligent driving environmental perception algorithms.

[0004] Existing research on visual target tracking under hazy weather conditions rarely involves frequency domain methods. Wu et al. (Wu H, Yao S, Huang F, et al. Lvptrack: High performance domain adaptive uav tracking with label aligned visual prompt tuning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2025, 39(8): 8395-8403) adaptively adjust the feature representation through learnable prompts to reduce the impact of haze degradation on the representation. This method relies mainly on spatial domain features and therefore struggles to fully compensate for the loss of high-frequency detail and structural information under hazy conditions. Gao et al. (Gao Y, Xu W, Lu Y. Let you see in haze and sandstorm: Two-in-one low-visibility enhancement network[J]. IEEE Transactions on Instrumentation and Measurement, 2023, 72: 1-12) achieve dehazing enhancement by reconstructing the correlation between color channels with a multilayer perceptron module. Chen et al. (Chen T, Fu J, Jiang W, et al. SRKTDN: Applying super resolution method to dehazing task[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 487-496) combine super-resolution detail recovery to improve image quality under non-uniform fog conditions. Both enhancement methods aim to improve perceptual quality, but better perceptual quality does not necessarily mean that the key texture and structural information required for target tracking is effectively recovered. Under hazy imaging conditions, low-frequency components usually dominate while high-frequency details are suppressed, so frequency domain modeling is important for recovering trackable texture and edge information. Tang et al. (Tang C, Wang X, Bai Y, et al. Learning spatial-frequency transformer for visual object tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(9): 5102-5116) use frequency domain information as part of representation learning to improve tracking robustness, but impose no explicit constraints on the relationship between the frequency domain representation and the spatial domain representation, which makes it difficult to form a stable and discriminative target representation under haze degradation.

Summary of the Invention

[0005] To address the problems of decreased image contrast, loss of texture details, weakened edge information, and degraded target appearance caused by atmospheric scattering and attenuation during visible light image target tracking in hazy weather conditions, leading to unstable target representation, difficulty in cross-frame matching, tracking drift, and even tracking failure, this invention aims to provide a frequency domain alignment method for robust visual target tracking under hazy weather conditions. The method involves feature extraction from the input template image and search image; frequency domain decomposition of the extracted features to obtain low-frequency and high-frequency information; construction of a joint representation combining spatial and frequency domain features; and improvement of feature representation stability in degraded scenarios through a frequency domain representation alignment mechanism. Furthermore, an enhanced view is constructed for the search region, and an enhanced consistency distillation mechanism is introduced to constrain the consistency of high-confidence responses, thereby improving the model's target localization capability and tracking robustness under hazy weather conditions. In practical applications, the video sequence to be tracked is input into the trained target tracking model, which outputs the predicted position of the target in the current frame, achieving stable target tracking under hazy weather conditions.

[0006] The objective of this invention is achieved through the following technical solutions.

[0007] This invention discloses a frequency domain alignment method for robust visual target tracking under hazy weather conditions, comprising the following steps:

[0008] Step 1: Use the teacher network and student network to extract feature maps from the training samples; obtain the teacher branch feature map output by the teacher network and the student branch feature map output by the student network.

[0009] Furthermore, a teacher network and a student network are constructed, each including a backbone feature extraction network and a target prediction head. The backbone feature extraction network adopts the ViT backbone network, which is used to perform deep feature extraction on the template image and the search image respectively, thereby obtaining the corresponding feature map representations.

[0010] The input template image and the input search image are passed through the teacher network and the student network, yielding the teacher branch feature maps and the student branch feature maps respectively. For ease of description, these can be denoted as:

[0011] (1) (2)

[0012] Here, the two mappings are the feature extraction mappings of the teacher network and of the student network, and their outputs are the feature maps of the teacher network and of the student network, respectively.
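Equations (1) and (2) can be written compactly as below. This is a sketch only: the symbols $z$ (template image), $x$ (search image), $\Phi_t$ and $\Phi_s$ (teacher and student feature extraction mappings), and $F_t$ and $F_s$ (their output feature maps) are notation assumed here and are not taken from the source. Feeding both images jointly to each mapping is likewise an assumption.

```latex
F_t = \Phi_t(z, x) \quad (1) \qquad\qquad F_s = \Phi_s(z, x) \quad (2)
```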

[0013] In this invention, the teacher network provides a more stable and discriminative feature or response distribution, while the student network serves as the actual optimization object, learning a target representation with stronger robustness to haze degradation under supervised constraints. In step one, standard tracking training samples are constructed using template images and search images; simultaneously, the teacher-student dual-network architecture provides a representational basis and source of supervision for subsequent frequency domain representation alignment and enhanced consistency distillation.

[0014] Furthermore, training samples are obtained, including template images and search images, which contain haze degradation features. The haze degradation features are mainly manifested as decreased image contrast, weakened texture details, blurred edge contours, and degraded target appearance information. The template image is used to provide prior appearance information of the target to be tracked, and the search image is used to provide candidate region information of the target to be tracked in the current frame.

[0015] Step 2: Perform discrete wavelet transform on the teacher and student branch feature maps obtained in Step 1 to obtain low-frequency and high-frequency features respectively; use spatial domain context modeling mechanism to extract global and local features; finally, fuse the low-frequency features, high-frequency features, global features and local features to obtain frequency domain-aware embedding and frequency domain alignment loss.

[0016] 1) Perform frequency domain decomposition on the input feature map. Let any input feature map be denoted as:

[0017] (3)

[0018] Here, the three dimensions are the number of channels, the height of the feature map, and the width of the feature map.

[0019] Perform the Haar discrete wavelet transform on each channel of the feature map. For the two-dimensional feature of each channel, first apply the Haar wavelet transform along the row direction to obtain the low-frequency approximation coefficients and the high-frequency detail coefficients:

[0020] (4) (5)

[0021] Here, the two results are the low-frequency approximation coefficients and the corresponding high-frequency detail coefficients obtained by applying the Haar row transform to each row. They are indexed by the row index and by the column index after sampling with a stride of 2.

[0022] Continuing the Haar wavelet transform along the column direction on the low-frequency approximation coefficients and on the high-frequency detail coefficients yields one low-frequency subband and three high-frequency subbands in different directions:

[0023] (6)

[0024] (7)

[0025] (8)

[0026] (9)

[0027] Here, the four subbands are the low-frequency component and the horizontal, vertical, and diagonal high-frequency components. Convolution mapping and normalization are then applied to the low-frequency and high-frequency components to obtain the final low-frequency and high-frequency features:

[0028] (10)

[0029] (11)

[0030] Here, the low-frequency features mainly represent the overall structural information and background trend information of the target, and the high-frequency features mainly characterize target edges, textures, and local details. The operators in the equations denote the concatenation operation, the convolution operation, and the batch normalization operation.

[0031] Through the above design, the frequency domain information in the input feature map is explicitly decomposed into different frequency bands. When haze degrades fine detail, the network can therefore rely on the relatively stable low-frequency structural information while retaining as much high-frequency edge and texture information as possible, which improves the integrity of the target representation.
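The row-then-column decomposition described in 1) can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: the function name `haar_dwt2`, the orthonormal 1/√2 scaling, the requirement of even height and width, and the subband names `ll`, `lh`, `hl`, `hh` are assumptions made here. The source specifies only a Haar transform along rows and then columns with stride-2 sampling, and which detail subband is called horizontal or vertical depends on convention.

```python
import numpy as np

def haar_dwt2(feat):
    """Single-level 2D Haar DWT applied independently to each channel.

    feat: array of shape (C, H, W) with even H and W.
    Returns (ll, lh, hl, hh), each of shape (C, H/2, W/2).
    """
    _, h, w = feat.shape
    if h % 2 or w % 2:
        raise ValueError("H and W must be even for a single-level Haar transform")
    s = np.sqrt(2.0)
    # Row direction: combine horizontally adjacent samples (stride 2 over columns).
    even_cols, odd_cols = feat[:, :, 0::2], feat[:, :, 1::2]
    low = (even_cols + odd_cols) / s    # low-frequency approximation coefficients
    high = (even_cols - odd_cols) / s   # high-frequency detail coefficients
    # Column direction: combine vertically adjacent samples (stride 2 over rows).
    ll = (low[:, 0::2, :] + low[:, 1::2, :]) / s
    lh = (low[:, 0::2, :] - low[:, 1::2, :]) / s
    hl = (high[:, 0::2, :] + high[:, 1::2, :]) / s
    hh = (high[:, 0::2, :] - high[:, 1::2, :]) / s
    return ll, lh, hl, hh

# Toy feature map with 4 channels of size 8 x 8 (hypothetical sizes).
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))
ll, lh, hl, hh = haar_dwt2(feat)

# Splice the three detail subbands along the channel axis.
high_bands = np.concatenate([lh, hl, hh], axis=0)
```

In this sketch `ll` and `high_bands` would be the inputs to the convolution and batch normalization stages, which are omitted because the source does not recover their kernel sizes or channel counts.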

[0032] 2) In the spatial domain, to supplement frequency domain information and improve the discriminative power of the representation, both global and local features are extracted from the input feature map. Given the input features, a convolutional layer first performs downsampling to obtain intermediate feature maps. This operation reduces computational overhead and expands the receptive field, providing a more suitable feature base for subsequent global and local modeling.

[0033] A global attention module is introduced in the global branch, and a local attention module is introduced in the local branch. The input features and attention outputs are added element-wise using a skip fusion method, and then normalized to obtain the final global and local features.

[0034] (12)

[0035] (13)

[0036] Here, the outputs are the global features and the local features, obtained with a global attention operation and a local attention operation applied to the feature map produced by convolutional downsampling.

[0037] In this design, global features are primarily used to capture long-range dependencies and overall semantic structure, while local features are mainly used to capture fine-grained structural information within local regions. These two features complement the low-frequency and high-frequency features in the frequency domain, enhancing the target representation capability under haze degradation conditions.
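The skip fusion described for equations (12) and (13) has the following general shape. The symbols are assumed here for illustration and do not come from the source: $X'$ is the feature map after convolutional downsampling, $\mathrm{GA}$ and $\mathrm{LA}$ are the global and local attention operations, $\mathrm{Norm}$ is the normalization step, and $F_g$ and $F_l$ are the global and local features.

```latex
F_g = \mathrm{Norm}\bigl(X' + \mathrm{GA}(X')\bigr) \quad (12)
\qquad
F_l = \mathrm{Norm}\bigl(X' + \mathrm{LA}(X')\bigr) \quad (13)
```

Both branches share the same residual structure and differ only in the attention operator, so the attention output refines the downsampled features rather than replacing them.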

[0038] 3) The low-frequency features, high-frequency features, global features, and local features obtained above are then fused to obtain the final frequency-domain-aware embedding. For the student network and the teacher network, this gives:

[0039] (14)

[0040] (15)

[0041] Here, the two results are the frequency-domain-aware embeddings of the student network and of the teacher network, respectively.

[0042] By constructing the frequency domain-aware embedding, the target tracking model can simultaneously utilize low-frequency structure, high-frequency details, global semantics, and local texture information, overcoming the problem of unstable single spatial domain representation under hazy weather conditions, and providing a more stable and discriminative feature foundation for subsequent frequency domain representation alignment and consistency distillation.

[0043] Step 3: Construct an enhanced view from the search image and obtain the response distribution of the teacher network on the enhanced view; select samples or locations to participate in distillation based on the confidence level of the response distribution, and calculate the consistency distillation loss based on the difference in response distribution between the student network and the teacher network on the gated samples or locations.

[0044] 1) Since the teacher network typically provides a more accurate and stable feature distribution, this invention uses the frequency-domain-aware embedding of the teacher network as the alignment target and applies a frequency domain representation alignment loss to the frequency-domain-aware embedding of the student network. This allows the student network's feature distribution on haze-degraded samples to gradually approximate that of the teacher network, improving its adaptability to missing details, weakened edges, and degraded target appearance.

[0045] The basic idea behind this frequency domain representation alignment loss is to constrain the representational differences between the teacher and student branches within the fused frequency-domain-aware embedding space, thereby reducing the distributional deviation between the two branches in the joint frequency-space domain representation. This can be denoted as:

[0046] (16)

[0047] Here, the alignment constraint function constrains the distribution difference between the student network embedding and the teacher network embedding.

[0048] Through this frequency domain representation alignment constraint, the student network can learn a more robust frequency-domain-aware representation during training, enabling the model to retain strong structural recognition and detail discrimination capabilities even in hazy scenarios.
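The source does not state which alignment constraint function is used. As one hypothetical instance, the sketch below uses a mean squared error between the two embeddings, with the teacher embedding treated as a fixed target. The function name `alignment_loss` and the choice of mean squared error are assumptions for illustration only.

```python
import numpy as np

def alignment_loss(student_emb, teacher_emb):
    """Mean squared difference between student and teacher embeddings.

    Mean squared error is an assumed stand-in for the unspecified
    alignment constraint function; the teacher embedding is the target.
    """
    student_emb = np.asarray(student_emb, dtype=float)
    teacher_emb = np.asarray(teacher_emb, dtype=float)
    if student_emb.shape != teacher_emb.shape:
        raise ValueError("student and teacher embeddings must share a shape")
    return float(np.mean((student_emb - teacher_emb) ** 2))

# Toy embeddings with hypothetical shape (channels, height, width).
teacher = np.ones((2, 3, 3))
student = np.zeros((2, 3, 3))
loss_far = alignment_loss(student, teacher)    # every element differs by 1
loss_same = alignment_loss(teacher, teacher)   # identical embeddings
```

Any other distance or divergence between the two embeddings could take the place of the squared error without changing the surrounding training procedure.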

[0049] 2) Under hazy weather conditions, the appearance distribution of the search area changes significantly, which can easily lead to unstable peak values and spatial shifts in the target response map. To enhance the model's robustness to such perturbations, this invention further introduces an enhanced consistency distillation constraint.

[0050] An augmented view is constructed from the search image and input into the teacher network to obtain the teacher response map on the augmented view, which is used as a more reliable distillation supervision signal. To avoid unreliable supervision from the teacher network on difficult augmented samples, this invention constructs a gating mechanism based on the sharpness of the response map peak. Specifically, for the response map output by the teacher network on the augmented view, the difference between the maximum response value and the second-largest response value is used as the confidence index, namely:

[0051] (17)

[0052] Further, combine quantile thresholds and fixed thresholds to construct the gate threshold:

[0053] (18)

[0054] Here, the first term is the 0.85-quantile threshold of the confidence index, the second is a fixed threshold, and the result is the final gating threshold.

[0055] When the confidence index of a sample or location meets the gating condition, the sample or location is considered to have high supervisory reliability and can participate in the subsequent distillation loss calculation; otherwise, it is not included in the distillation loss calculation. In this way, supervisory signals with low confidence, high noise, or unstable response can be effectively filtered out, thereby avoiding interference from unreliable target distributions to student network training.
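The gating rule can be sketched as follows. The peak margin (largest minus second-largest response) and the 0.85 quantile come from the text above. Combining the quantile threshold and the fixed threshold with `max`, gating per sample rather than per location, the default fixed threshold of 0.1, and the function names are all assumptions made for this illustration.

```python
import numpy as np

def peak_margin(response_maps):
    """Confidence index per sample: largest minus second-largest response.

    response_maps: array of shape (N, H, W).
    """
    flat = response_maps.reshape(response_maps.shape[0], -1)
    # After partitioning, the last two entries are the two largest, ascending.
    top2 = np.partition(flat, -2, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def gate_mask(conf, quantile=0.85, fixed_threshold=0.1):
    """Boolean mask of samples that pass the gate, plus the threshold used.

    Taking the larger of the two thresholds is an assumed combination rule.
    """
    q_thr = float(np.quantile(conf, quantile))
    thr = max(q_thr, fixed_threshold)
    return conf >= thr, thr

# Three toy teacher response maps of size 2 x 2.
maps = np.zeros((3, 2, 2))
maps[0, 0, 0], maps[0, 0, 1] = 0.9, 0.1    # sharp peak
maps[1, 0, 0], maps[1, 0, 1] = 0.5, 0.45   # ambiguous peak
maps[2, 0, 0], maps[2, 0, 1] = 0.6, 0.3    # moderate peak
conf = peak_margin(maps)
mask, thr = gate_mask(conf)
```

Only the sample with the sharp peak passes the gate in this toy batch, so the ambiguous response map would be excluded from the distillation loss.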

[0056] Furthermore, to improve the stability of the teacher distribution and suppress fluctuations during training, an exponential moving average is used to update the teacher network output, resulting in a smoother target distribution:

[0057] (19)

[0058] Here, the coefficient is that of the exponential moving average, and the other two terms are the updated teacher target distribution at a given iteration and the current output of the teacher network at that iteration.
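A minimal sketch of the exponential moving average update is given below. The name `ema_update`, the default coefficient of 0.99, and the convention that the coefficient weights the previous target (rather than the current output) are assumptions, since the source equation is not recoverable.

```python
import numpy as np

def ema_update(prev_target, current_output, alpha=0.99):
    """Blend the previous smoothed target with the current teacher output.

    alpha weights the previous target (assumed convention); a value near 1
    gives a slowly changing, smoother target distribution.
    """
    return alpha * prev_target + (1.0 - alpha) * current_output

# Toy response distributions over three locations.
prev = np.array([0.7, 0.2, 0.1])
cur = np.array([0.1, 0.2, 0.7])
smoothed = ema_update(prev, cur, alpha=0.9)
```

Because the update is a convex combination, the smoothed target remains a valid probability distribution whenever both inputs are.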

[0059] At gated samples or locations, the KL divergence is used to align the student network response distribution with the teacher network response distribution, resulting in an enhanced consistency distillation loss:

[0060] (20)

[0061] Here, the terms are the weight of each sample or location that passes the gate, the temperature coefficient, the response distribution of the teacher network, the response distribution of the student network, and a small constant that prevents the denominator from being zero.

[0062] Therefore, the enhanced consistency distillation constraint does not blindly apply consistency supervision to all samples. Instead, it uses a combination mechanism of "enhanced view - confidence gating - EMA smoothing - KL alignment" to perform distribution alignment only in high-confidence regions, thereby effectively improving training stability and target localization reliability under hazy weather conditions.
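The gated KL alignment can be sketched as below. The exact form of the loss is not recoverable from the source, so several choices are assumptions: both distributions are obtained by a temperature-scaled softmax over flattened response maps, the loss is normalized by the sum of the gate weights plus a small constant, and the same constant is also added inside the logarithms. The function names and default values are hypothetical.

```python
import numpy as np

def softmax(logits, temperature):
    """Temperature-scaled softmax along the last axis."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gated_kl_distillation(student_logits, teacher_logits, gate_weights,
                          temperature=2.0, eps=1e-8):
    """Gate-weighted KL(teacher || student) over flattened response maps.

    student_logits, teacher_logits: arrays of shape (N, K).
    gate_weights: array of shape (N,), zero for samples that fail the gate.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=1)
    # eps keeps the denominator non-zero when no sample passes the gate.
    return float(np.sum(gate_weights * kl) / (np.sum(gate_weights) + eps))

rng = np.random.default_rng(0)
teacher_logits = rng.standard_normal((4, 16))
student_logits = rng.standard_normal((4, 16))
gates = np.array([1.0, 0.0, 1.0, 0.0])
loss = gated_kl_distillation(student_logits, teacher_logits, gates)
```

Samples with a zero gate weight contribute nothing, which matches the intent of applying consistency supervision only in high-confidence regions.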

[0063] Step 4: Construct the total loss and optimize the student network. The student network is optimized and updated using the weighted sum of the classification loss, localization loss, frequency domain representation alignment loss, and distillation loss as the total loss, resulting in a trained target tracking model. Frequency domain alignment for target tracking is then achieved based on this trained target tracking model.

[0064] Specifically, the total loss includes the classification loss, the bounding box regression loss, the generalized intersection-over-union (GIoU) loss, the frequency domain representation alignment loss, and the enhanced consistency distillation loss. The overall training objective can be expressed as:

[0065] (21)

[0066] Here, the coefficients are the weighting coefficients of the corresponding loss terms.
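A weighted sum consistent with this description would take the following form. The symbols are assumed here, not taken from the source: $\mathcal{L}_{cls}$, $\mathcal{L}_{reg}$, $\mathcal{L}_{giou}$, $\mathcal{L}_{align}$, and $\mathcal{L}_{dist}$ stand for the five listed loss terms, and $\lambda_1,\dots,\lambda_4$ for the weighting coefficients. Leaving the classification loss unweighted is also an assumption.

```latex
\mathcal{L}_{total} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{reg} + \lambda_2 \mathcal{L}_{giou}
                    + \lambda_3 \mathcal{L}_{align} + \lambda_4 \mathcal{L}_{dist} \quad (21)
```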

[0067] After training, the target tracking model receives template images and search images as input in practical applications. It first extracts features, then constructs a frequency domain-aware embedding, and outputs the position prediction result of the target to be tracked in the current frame through the target prediction head, thereby achieving stable tracking of the target under hazy weather conditions.

[0068] Beneficial effects:

[0069] 1. This invention discloses a frequency domain alignment method for robust visual target tracking under hazy weather conditions. By combining frequency domain features and spatial domain features and employing a frequency domain-aware representation construction mechanism, it significantly enhances the robustness of the target tracking model in low-visibility environments such as haze. Even with reduced image contrast and loss of detail caused by haze, the model effectively preserves the target's key structural information and texture details, thereby improving the accuracy and stability of target tracking.

[0070] 2. This invention discloses a frequency domain alignment method for robust visual target tracking under hazy weather conditions. By introducing an enhanced consistency distillation mechanism, combined with a confidence gating strategy and an exponential moving average mechanism, the distillation training process between the teacher network and the student network is optimized. By applying consistency distillation constraints only on high-confidence samples, the interference of noise supervision on model training is avoided, the stability of the training process is improved, and the drift phenomenon occurring during target tracking is reduced.

[0071] 3. This invention discloses a frequency domain alignment method for robust visual target tracking under hazy weather conditions. Through a frequency domain representation alignment mechanism, it effectively aligns the frequency-domain-aware embeddings of the teacher network and the student network, improving the student network's ability to represent targets. Especially in hazy environments, the stable feature distribution provided by the teacher network enables the student network to quickly adapt to haze degradation, reducing instability in the feature space and thus improving the accuracy of target tracking and localization.

Attached Figure Description

[0072] Figure 1 is a schematic diagram of the overall framework of the tracking method disclosed in this invention;

[0073] Figure 2 is a schematic diagram of the frequency domain representation alignment module in step two of the tracking method disclosed in this invention;

[0074] Figure 3 is a schematic diagram of the Haar wavelet transform in step two of the tracking method disclosed in this invention;

[0075] Figure 4 is a schematic diagram of the visualization results of tracking on haze environment data using the tracking method disclosed in this invention.

Detailed Implementation

[0076] To better illustrate the purpose, content, and advantages of this invention, the following description, in conjunction with the accompanying drawings and examples, further explains the invention.

[0077] This embodiment discloses a frequency domain alignment method for robust visual target tracking under hazy weather conditions. As shown in Figure 1, the overall process includes training sample construction, teacher-student dual-network feature extraction, frequency domain-aware representation construction, joint consistency constraint training, and target prediction output. The teacher network provides relatively stable feature or response distribution supervision, while the student network, as the actual optimization object, learns a target representation that is more robust to haze degradation under frequency domain alignment constraints and enhanced consistency distillation constraints.

[0078] This embodiment discloses a frequency domain alignment method for robust visual target tracking under hazy weather conditions. The specific implementation steps are as follows:

[0079] Step 1: Obtain training samples and construct the teacher and student networks. First, obtain training samples, which consist of template images and search images. The template images provide prior appearance information of the target to be tracked, while the search images provide candidate region information for the target in the current frame. Both the template and search images contain haze degradation features, including decreased image contrast, blurred edges, weakened texture details, and loss of local region information.

[0080] Furthermore, a teacher network and a student network are constructed. Both the teacher network and the student network include a backbone feature extraction network, a frequency domain-aware representation module, and a target prediction head. Preferably, the backbone feature extraction network adopts the ViT backbone network. The teacher network and the student network can have the same backbone structure, but their parameter update methods differ. The student network, as the main optimization object, updates parameters through backpropagation; the teacher network is mainly used to provide a relatively stable feature distribution and response distribution, and can serve as a source of distillation supervision.

[0081] The template image and the search image are input into the teacher network and the student network respectively for feature extraction, which can be represented as:

[0082] (1) (2)

[0083] Here, the two mappings are the feature extraction mappings of the teacher network and of the student network, and their outputs are the feature map output by the teacher network and the feature map output by the student network, respectively.

[0084] In this embodiment, both the feature maps output by the teacher network and the feature maps output by the student network are used as inputs for the subsequent construction of the frequency-domain-aware representation. Since the teacher network can provide a more stable and discriminative target representation, introducing it gives the student network more reliable representation constraints and response distribution supervision under hazy weather conditions, thereby alleviating the target appearance drift caused by haze degradation.

[0085] Step 2: Construct frequency-domain-aware representations of the teacher and student feature maps obtained in Step 1. This step mainly includes two parts: frequency domain decomposition and spatial domain modeling. The different features from the frequency and spatial domains are then fused to obtain a frequency-domain-aware embedding. As shown in Figure 2 and Figure 3, the frequency domain decomposition part uses the Haar discrete wavelet transform to explicitly split the feature map into frequency bands.

[0086] 1) Let any input feature map be denoted as:

[0087] (3)

[0088] Here, the three dimensions are the number of channels, the height of the feature map, and the width of the feature map.

[0089] For the two-dimensional feature of each channel, first perform a Haar wavelet transform along the row direction to obtain the low-frequency approximation coefficients and the high-frequency detail coefficients:

[0090] (4) (5)

[0091] Here, the two results are the low-frequency approximation coefficients and the corresponding high-frequency detail coefficients obtained by applying the Haar transform to each row. They are indexed by the row index and by the index obtained after grouping with a stride of 2 in the column direction.

[0092] After the low-frequency approximation coefficients and the high-frequency detail coefficients are obtained, the Haar wavelet transform is performed along the column direction to obtain one low-frequency component and three high-frequency components in different directions:

[0093] (6)

[0094] (7)

[0095] (8)

[0096] (9)

[0097] Here, the first is the low-frequency component, which mainly preserves the overall structural information and background trend information of the target. The other three are the high-frequency components in different directions, which mainly reflect changes in the target's edges, textures, and details.

[0098] Furthermore, the three high-frequency components are concatenated, and convolution mapping and normalization are performed on the low-frequency components and the concatenated high-frequency components respectively to obtain the final low-frequency features and high-frequency features:

[0099] (10)

[0100] (11)

[0101] Here, the low-frequency features mainly represent the overall structural information and background trend information of the target, and the high-frequency features mainly characterize target edges, textures, and local details. The operators in the equations denote the concatenation operation, the convolution operation, and the batch normalization operation.

[0102] Through the above processing, the frequency domain information in the input feature map can be explicitly decomposed into low-frequency structural information and high-frequency detail information. For visual images under hazy weather conditions, atmospheric scattering and attenuation effects often lead to a weakening of high-frequency texture information. Therefore, explicitly modeling the high-frequency components is beneficial for recovering edge and texture details related to target localization. At the same time, the low-frequency components can preserve the overall contour and global semantics of the target, which is beneficial for maintaining the stability of the target representation when details are insufficient.
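The weakening of high-frequency information can be checked numerically on a toy image. The sketch below applies the standard atmospheric scattering model (hazy = clear × t + airlight × (1 − t)) and compares Haar detail energy before and after. It operates on a raw image rather than on network feature maps, and the spatially uniform transmission `t`, the airlight value, and the random test image are simplifying assumptions.

```python
import numpy as np

def haar_detail_energy(img):
    """Total energy of the three single-level Haar detail subbands of a 2D array."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = img[1::2, 0::2], img[1::2, 1::2]   # bottom-left, bottom-right
    d1 = (a + b - c - d) / 2.0
    d2 = (a - b + c - d) / 2.0
    d3 = (a - b - c + d) / 2.0
    return float(np.sum(d1 ** 2) + np.sum(d2 ** 2) + np.sum(d3 ** 2))

rng = np.random.default_rng(0)
clear = rng.uniform(0.0, 1.0, size=(64, 64))   # stand-in for a haze-free image

t, airlight = 0.4, 0.9                          # hypothetical haze parameters
hazy = clear * t + airlight * (1.0 - t)

ratio = haar_detail_energy(hazy) / haar_detail_energy(clear)
```

Haar detail coefficients cancel any constant offset and scale linearly with the input, so under uniform transmission the detail energy shrinks by a factor of t squared, while the low-frequency band absorbs the added airlight. This is consistent with low-frequency components dominating in hazy images.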

[0103] 2) In addition to frequency domain modeling, this implementation also extracts global and local features simultaneously in the spatial domain to enhance the feature representation capability.

[0104] Given the input features, a convolutional layer first performs downsampling to obtain an intermediate feature map. This operation reduces the computational complexity of subsequent operations and expands the receptive field, allowing the network to capture contextual information more fully.

[0105] Subsequently, a global attention module is introduced into the global branch, and a local attention module is introduced into the local branch. A skip fusion structure then adds the input features and the attention outputs element-wise. After normalization, the final global and local features are obtained:

[0106] (12)

[0107] (13)

[0108] where the two outputs are the global features and the local features, which are produced by the global attention operation and the local attention operation respectively, both applied to the feature map after convolutional downsampling.
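Read literally, the skip fusion described above could take the following form. This is a sketch under stated assumptions: $X_d$ denotes the feature map after convolutional downsampling, $\mathrm{GA}$ and $\mathrm{LA}$ the global and local attention operations, $\mathrm{Norm}$ the normalization, and it is assumed that the skip connection carries $X_d$ itself.

```latex
\begin{aligned}
F_{g} &= \mathrm{Norm}\big(X_{d} + \mathrm{GA}(X_{d})\big), \\
F_{l} &= \mathrm{Norm}\big(X_{d} + \mathrm{LA}(X_{d})\big).
\end{aligned}
```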

[0109] In this design, global features are primarily used to capture long-range dependencies and overall semantic structure, while local features are mainly used to capture fine-grained structural information within local regions. These two features complement the low-frequency and high-frequency features in the frequency domain, enhancing the target representation capability under haze degradation conditions.

[0110] 3) The low-frequency features, high-frequency features, global features, and local features obtained above are fused to obtain the final frequency-domain-aware embedding.

[0111] For the student network and the teacher network, respectively, we obtain:

[0112] (14)

[0113] (15)

[0114] where the two embeddings are the frequency-domain-aware embedding of the student network and the frequency-domain-aware embedding of the teacher network.

[0115] By constructing the frequency domain-aware embedding, the target tracking model can simultaneously utilize low-frequency structure, high-frequency details, global semantics, and local texture information, overcoming the problem of unstable single spatial domain representation under hazy weather conditions, and providing a more stable and discriminative feature foundation for subsequent frequency domain representation alignment and consistency distillation.

[0116] Step 3: Using the frequency-domain-aware embedding of the teacher network and the frequency-domain-aware embedding of the student network obtained in Step 2, joint consistency constraints are applied. These joint consistency constraints include a frequency domain representation alignment constraint and an enhanced consistency distillation constraint.

[0117] 1) Since the teacher network typically provides a relatively more stable and accurate feature distribution, this invention uses the frequency-domain-aware embedding of the teacher network as the alignment target and imposes a frequency-domain representation alignment constraint on the frequency-domain-aware embedding of the student network.

[0118] This allows the student network's feature distribution on haze-degraded samples to gradually approximate that of the teacher network, improving its adaptability to detail loss, edge weakening, and target appearance degradation. The frequency domain alignment loss can be expressed as:

[0119] (16)

[0120] where the alignment constraint function measures the distributional difference between the student network embedding and the teacher network embedding.
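The text does not specify the alignment constraint function, so the following is only one possible instance, chosen for illustration: the mean squared distance between L2-normalized embeddings, with the teacher embedding treated as a fixed target. The names `l2_normalize` and `alignment_loss` are hypothetical, and flat Python lists stand in for embedding tensors.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length; eps guards against a zero norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / (norm + eps) for a in v]


def alignment_loss(student_emb, teacher_emb):
    """Mean squared distance between normalized embeddings (assumed choice)."""
    s = l2_normalize(student_emb)
    t = l2_normalize(teacher_emb)  # teacher acts as the alignment target
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


# Embeddings that differ only in scale are treated as already aligned.
same = alignment_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal embeddings give a clearly positive loss.
different = alignment_loss([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Other distribution-level measures would fit the same description; the normalization step is a design choice that makes the loss insensitive to overall feature magnitude.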

[0121] By introducing this frequency domain representation alignment constraint, the student network can learn a more robust frequency domain perception representation during training, enabling the model to retain strong structural recognition and detail discrimination capabilities even in hazy scenarios.

[0122] 2) Under hazy weather conditions, the appearance distribution of the search area changes significantly, which can easily cause the peak of the target response map to become unstable and spatially offset.

[0123] To enhance robustness to such perturbations, this invention further introduces an enhanced consistency distillation constraint. First, an enhanced view is constructed from the search image and input into the teacher network to obtain the teacher response map on the enhanced view, which is used as the distillation supervision signal.

[0124] To avoid unreliable supervision from the teacher network on difficult augmented samples, this implementation constructs a gating mechanism based on the sharpness of the response map peak. Specifically, for the response map output by the teacher network on the augmented view, the difference between the maximum response value and the second-largest response value is used as the confidence index. The gating threshold is then constructed by combining a quantile threshold and a fixed threshold:

[0125] (17)

[0126] where the quantile threshold is the 0.85 quantile of the confidence index, the fixed threshold is a preset constant, and the result is the final gating threshold.
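The gating step can be sketched as follows. Several details are assumptions, since the text does not give them: the quantile and fixed thresholds are combined with `max()`, the fixed threshold value `0.1` is hypothetical, the quantile uses linear interpolation, and each response map is flattened to a list. Only the top-two peak difference and the 0.85 quantile come from the description.

```python
def peak_confidence(response):
    """Difference between the largest and second-largest response values."""
    top1, top2 = sorted(response, reverse=True)[:2]
    return top1 - top2


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def gate(responses, q=0.85, fixed_threshold=0.1):
    """Return confidence values, the gating threshold, and a pass/fail mask."""
    conf = [peak_confidence(r) for r in responses]
    # Assumption: the stricter of the two thresholds is used.
    threshold = max(quantile(conf, q), fixed_threshold)
    mask = [c > threshold for c in conf]
    return conf, threshold, mask


# Four flattened teacher response maps from one batch of augmented views.
responses = [
    [0.9, 0.1, 0.05],   # sharp single peak
    [0.5, 0.45, 0.4],   # ambiguous peak
    [0.7, 0.2, 0.1],
    [0.6, 0.55, 0.1],   # two competing peaks
]
conf, threshold, mask = gate(responses)
```

With these inputs only the first sample, whose peak is clearly separated from the runner-up, passes the gate. Combining the thresholds with `min()` instead would admit more samples; which combination is intended is not stated in the text.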

[0127] When the confidence index of a sample or location meets the gating condition, the sample or location is considered to have high supervisory reliability and can participate in the subsequent distillation loss calculation; otherwise, it is not included in the distillation loss calculation. In this way, supervisory signals with low confidence, high noise, or unstable response can be effectively filtered out, thereby avoiding interference from unreliable target distributions to student network training.

[0128] To improve the stability of the teacher distribution and suppress fluctuations during training, an exponential moving average is used to update the teacher network output, resulting in a smoother target distribution:

[0129] (18)

[0130] where the equation involves the coefficient of the exponential moving average, the updated teacher target distribution at a given iteration, and the current output of the teacher network at that iteration. This smoothing reduces the random fluctuations caused by a single augmentation perturbation, making the distillation target provided by the teacher network more stable.
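A minimal sketch of this smoothing is shown below, assuming the common convention in which the coefficient weights the previous smoothed value and its complement weights the current output. The function name `ema_update` and the coefficient `0.9` used in the demonstration are illustrative assumptions.

```python
def ema_update(previous, current, alpha=0.99):
    """Blend the previous smoothed target with the current teacher output."""
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(previous, current)]


# Start from a uniform target and feed two noisy teacher outputs in turn.
target = [0.25, 0.25, 0.25, 0.25]
for teacher_output in ([0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]):
    target = ema_update(target, teacher_output, alpha=0.9)
```

Because each update is a convex combination of two distributions, the smoothed target remains a valid distribution, and a single outlying teacher output shifts it only slightly.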

[0131] For gated samples or locations, the response distributions of the teacher and student networks are aligned using KL divergence to obtain the enhanced consistency distillation loss:

[0132] (19)

[0133] where the symbols denote, in turn, the weight corresponding to a sample or position that passes the gate, the temperature coefficient, the response distribution of the teacher network, the response distribution of the student network, and a small constant that prevents the denominator from being zero.
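The gated KL alignment can be sketched as below. The exact formula is not recoverable from the text, so the following choices are assumptions: responses are treated as logits and softened by a temperature softmax, the per-sample KL terms are weighted by the gate weights, and the sum is divided by the total weight plus a small constant. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the maximum for stability."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, weights,
                      temperature=2.0, eps=1e-8):
    """Weighted mean KL divergence over samples that passed the gate."""
    total = 0.0
    for t, s, w in zip(teacher_logits, student_logits, weights):
        if w == 0.0:
            continue  # gated-out samples contribute nothing
        total += w * kl_divergence(softmax(t, temperature),
                                   softmax(s, temperature), eps)
    return total / (sum(weights) + eps)  # eps avoids division by zero


teacher = [[2.0, 1.0, 0.0]]
loss_same = distillation_loss(teacher, [[2.0, 1.0, 0.0]], [1.0])
loss_diff = distillation_loss(teacher, [[0.0, 1.0, 2.0]], [1.0])
loss_gated_out = distillation_loss(teacher, [[0.0, 1.0, 2.0]], [0.0])
```

The last call shows the role of the small constant: when every sample in a batch is gated out, the loss is zero rather than undefined.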

[0134] Therefore, the enhanced consistency distillation constraint does not blindly apply consistency supervision to all samples. Instead, it uses a combination mechanism of "enhanced view - confidence gating - EMA smoothing - KL alignment" to perform distribution alignment only in high-confidence regions, thereby effectively improving training stability and target localization reliability under hazy weather conditions.

[0135] Step 4: Construct the total loss and optimize the student network. The student network is optimized and updated using the weighted sum of the classification loss, localization losses, frequency domain representation alignment loss, and distillation loss as the total loss, resulting in the trained target tracking model. Specifically, the total loss includes the classification loss, the bounding box regression loss, the generalized intersection-over-union (GIoU) loss, the frequency domain representation alignment loss, and the enhanced consistency distillation loss. The overall training objective can be expressed as:

[0136] (20)

[0137] where the weight coefficients of the loss terms are set to 1.0, 5.0, 2.0, 0.5, and 20.0, respectively. During training, the student network parameters are optimized and updated by minimizing the total loss function to obtain the trained target tracking model. The teacher network parameters can either be updated from the student network parameters through an exponential moving average, or be kept frozen to serve as a stable supervision source.
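Written out, a weighted sum of the five loss terms named above would take the following form. The symbol names are assumptions introduced for illustration, and the mapping of the stated values 1.0, 5.0, 2.0, 0.5, and 20.0 onto $\lambda_1$ through $\lambda_5$ assumes they follow the order in which the losses are listed.

```latex
\mathcal{L}_{\mathrm{total}}
  = \lambda_{1}\,\mathcal{L}_{\mathrm{cls}}
  + \lambda_{2}\,\mathcal{L}_{\mathrm{box}}
  + \lambda_{3}\,\mathcal{L}_{\mathrm{GIoU}}
  + \lambda_{4}\,\mathcal{L}_{\mathrm{align}}
  + \lambda_{5}\,\mathcal{L}_{\mathrm{cd}}
```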

[0138] In a preferred embodiment, model training can employ the AdamW optimizer for parameter updates, with a batch size of 64 and 50 training epochs. The learning rate and weight decay coefficient can be set according to the actual dataset and tracking task requirements. Furthermore, the learning rate can be decayed at predetermined epochs, for example by reducing it to a certain percentage of its initial value at the 40th epoch. The EMA update coefficient of the teacher network is preferably set to 0.99, and the temperature coefficient in the consistency distillation is preferably set to 2.0, although the invention is not limited to these specific values.
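The preferred settings can be collected into a small configuration sketch. The values for the optimizer, batch size, epoch count, decay epoch, EMA coefficient, and temperature come from the paragraph above; `base_lr` and `decay_factor` are hypothetical placeholders because the text leaves them open, epochs are assumed to be counted from 1, and plain lists of floats stand in for network parameters.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    optimizer: str = "AdamW"
    batch_size: int = 64
    epochs: int = 50
    decay_epoch: int = 40
    base_lr: float = 1e-4      # hypothetical: not specified in the text
    decay_factor: float = 0.1  # hypothetical: "a certain percentage"
    teacher_ema: float = 0.99
    temperature: float = 2.0


def learning_rate(cfg, epoch):
    """Step schedule: base rate before decay_epoch, scaled rate from it on."""
    return cfg.base_lr * (cfg.decay_factor if epoch >= cfg.decay_epoch else 1.0)


def ema_teacher_update(teacher_params, student_params, decay):
    """Move each teacher parameter a small step toward the student's value."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


cfg = TrainConfig()
lr_before = learning_rate(cfg, 39)
lr_after = learning_rate(cfg, 40)
new_teacher = ema_teacher_update([1.0], [0.0], cfg.teacher_ema)
```

With a coefficient of 0.99 the teacher changes by only 1% of the gap to the student per update, which is what makes it a slowly varying supervision source.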

[0139] After training, the target tracking model receives template images and search images as input in practical applications. It first extracts features, then constructs a frequency domain-aware embedding, and outputs the position prediction result of the target to be tracked in the current frame through the target prediction head, thereby achieving stable tracking of the target under hazy weather conditions.

[0140] Specifically, in the initial frame of the video, a template image is generated based on the bounding box or a manually specified target region. In subsequent frames, a search region for the current frame is constructed based on the target position of the previous frame, and the template image and the search region are input together into the trained target tracking model. The model outputs the classification score and position regression result of the target in the current frame, and determines the predicted position of the target to be tracked in the current frame accordingly.
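The frame-by-frame procedure can be sketched as a short loop. Everything here is illustrative: `model` and `crop` are hypothetical callables supplied by the caller, boxes are `(x, y, w, h)` tuples, and the search-region area factor `scale=4.0` is an assumed value that the text does not specify. The stub functions exist only to make the example runnable.

```python
def search_region(prev_box, scale=4.0):
    """Square region centred on the previous box; scale is an assumed factor."""
    x, y, w, h = prev_box
    cx, cy = x + w / 2.0, y + h / 2.0
    side = (w * h * scale) ** 0.5
    return (cx - side / 2.0, cy - side / 2.0, side, side)


def track(frames, init_box, model, crop):
    """Build the template once, then localize the target in each later frame."""
    template = crop(frames[0], init_box)
    boxes = [init_box]
    for frame in frames[1:]:
        region = search_region(boxes[-1])
        _score, box = model(template, crop(frame, region), region)
        boxes.append(box)
    return boxes


def stub_crop(frame, box):
    """Stand-in for image cropping: just remember which box was requested."""
    return (frame, box)


def stub_model(template, search, region):
    """Stand-in tracker: target sits one pixel right of the region centre."""
    _, (_, _, w, h) = template
    rx, ry, side, _ = region
    return 1.0, (rx + side / 2.0 - w / 2.0 + 1.0, ry + side / 2.0 - h / 2.0, w, h)


boxes = track(["f0", "f1", "f2"], (10.0, 10.0, 4.0, 4.0), stub_model, stub_crop)
```

Each prediction becomes the centre of the next search region, so an error in one frame propagates to later frames; this is why response stability under haze matters for avoiding drift.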

[0141] Because the model incorporates frequency domain representation alignment constraints and enhanced consistency distillation constraints during the training phase, it can more effectively preserve the overall structural information and local detail information of the target in hazy environments, and improve its adaptability to target appearance degradation, low contrast, and background interference, thereby enhancing the accuracy and stability of visual target tracking. As shown in Figure 4, the method of the present invention can achieve a relatively stable target tracking effect on hazy environmental data.

[0142] To verify the effectiveness of the proposed method in visual target tracking under hazy weather conditions, the DTB70-Haze dataset and the AVisT dataset were used to test the method, and comparative experiments were conducted with several existing mainstream target tracking methods. The experimental platform used the Ubuntu operating system and the PyTorch deep learning framework, and the model training and testing were completed under the NVIDIA RTX 3090 series GPU environment. The experimental results are shown in Tables 1 and 2.

[0143] On the DTB70-Haze test set, the method of this invention achieved an AUC of 66.42% and a Precision of 87.36%, both outperforming existing excellent target tracking methods such as UMDATrack, ARTrackV2, and ODTrack. On the AVisT test set, the method of this invention also achieved an AUC of 52.49% and a Precision of 49.35%, maintaining relatively stable target localization capabilities even in complex and degraded environments. Experimental results show that the frequency domain alignment mechanism and enhanced consistency distillation strategy proposed in this invention can effectively enhance the target tracking accuracy and robustness of the model under hazy weather conditions.

[0144] Table 1 Comparative Experiment Results of DTB70-Haze Test Set

[0145]

[0146] Table 2 Comparative Experiment Results of AVisT Test Set

[0147]

[0148] The above detailed description further illustrates the purpose, technical solution, and beneficial effects of the invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A frequency domain alignment method for robust visual target tracking under hazy weather conditions, characterized in that it includes the following steps: Step 1: extract feature maps from the training samples using the teacher network and the student network, obtaining the teacher branch feature map output by the teacher network and the student branch feature map output by the student network; Step 2: perform a discrete wavelet transform on the teacher and student branch feature maps obtained in Step 1 to obtain low-frequency and high-frequency features; use a spatial domain context modeling mechanism to extract global and local features; and fuse the low-frequency features, high-frequency features, global features, and local features to obtain the frequency domain-aware embedding and the frequency domain alignment loss; Step 3: construct an enhanced view from the search image and obtain the response distribution of the teacher network on the enhanced view; select samples or locations to participate in distillation based on the confidence level of the response distribution, and calculate the consistency distillation loss based on the difference in response distribution between the student network and the teacher network on the gated samples or locations; Step 4: construct the total loss function, which is obtained by a weighted combination of the classification loss, localization loss, generalized intersection-over-union loss, frequency domain alignment loss, and consistency distillation loss; optimize and update the student network based on the total loss function to obtain a trained target tracking model, and achieve frequency domain alignment of target tracking based on the trained target tracking model.

2. The frequency domain alignment method for robust visual target tracking under hazy weather conditions as described in claim 1, characterized in that: the discrete wavelet transform in Step 2 is a Haar discrete wavelet transform, which performs pairwise sampling transformations on the input feature map along both the row and column directions with a step size of 2 to obtain four frequency sub-bands. Specifically: 1) for each row index of the branch feature map, the low-frequency approximation coefficients and the high-frequency detail coefficients are calculated from pairs of adjacent columns: (1) (2) where the indices are the row index and the column-direction pairwise sampling index; 2) for each column of the above low-frequency approximation coefficients and high-frequency detail coefficients, four sub-bands are calculated from pairs of adjacent rows, namely the low-frequency sub-band, the horizontal high-frequency sub-band, the vertical high-frequency sub-band, and the diagonal high-frequency sub-band: (3) (4) (5) (6) where the additional index is the row-direction pairwise sampling index; 3) the low-frequency sub-band is used to form the low-frequency feature, and the horizontal, vertical, and diagonal high-frequency sub-bands are concatenated along a predetermined dimension to form the high-frequency feature: (7) (8) where the operators are the normalization operation, the convolution operation, and the concatenation operation.

3. The frequency domain alignment method for robust visual target tracking under hazy weather conditions as described in claim 1, characterized in that: The method for extracting global and local features using the spatial domain context modeling mechanism described in step two is as follows: The feature map is downsampled to expand the receptive field and reduce computational overhead. Global features are extracted from the downsampled feature map using a global attention operator, and local features are extracted using a local attention operator. The global attention operator is used to establish correlations in the global range of the feature map to obtain global features, and the local attention operator is used to establish correlations in the local neighborhood range to obtain local features.

4. The frequency domain alignment method for robust visual target tracking under hazy weather conditions as described in claim 1, characterized in that: the method in Step 3 for selecting samples or locations to participate in distillation based on the confidence level of the response distribution is as follows: for the response distribution of the teacher network on the augmented view, the confidence index is calculated: (9) where the two terms are the maximum response value and the second-largest response value of the response distribution; a quantile threshold is set based on the distribution of the confidence index in the current training batch or in a preset statistical window, and is combined with a fixed threshold to obtain the final gating threshold: (10) when the response value exceeds the gating threshold, the corresponding sample is determined to pass the gate and the consistency distillation constraint is applied; when the response value does not exceed the gating threshold, no consistency distillation constraint is applied to the corresponding sample.

5. The frequency domain alignment method for robust visual target tracking under hazy weather conditions as described in claim 1, characterized in that: in Step 3, for the samples subject to the consistency distillation constraint as described in claim 4, the KL divergence is used to calculate the difference in response distribution between the student network and the teacher network to obtain the consistency distillation loss. The calculation formula is as follows: (11) where the symbols denote the weight coefficient corresponding to a sample that passes the gate, the temperature coefficient, and the response distributions of the student network and the teacher network, respectively.

6. The frequency domain alignment method for robust visual target tracking under hazy weather conditions as described in claim 1, characterized in that: the total loss function described in Step 4 is: (12) where the loss terms are the classification loss, the localization loss, the generalized intersection-over-union loss, the frequency domain representation alignment loss, and the consistency distillation loss, each multiplied by its corresponding weighting coefficient; the student network parameters are updated by minimizing the total loss function.

7. The frequency domain alignment method for robust visual target tracking under hazy weather conditions as described in claim 1, characterized in that: The training samples consist of template images and search images containing haze degradation features.

8. The frequency domain alignment method for robust visual target tracking under hazy weather conditions as described in claim 1, characterized in that: Both the teacher network and the student network include a backbone feature extraction network and a target prediction head; feature maps are extracted through the backbone feature extraction network.

9. The frequency domain alignment method for robust visual target tracking under hazy weather conditions as described in any one of claims 1 to 8, characterized in that: the training samples include template images and search images, which contain haze degradation features. The haze degradation features are mainly manifested as decreased image contrast, weakened texture details, blurred edge contours, and degraded target appearance information. The template images are used to provide prior appearance information of the target to be tracked, and the search images are used to provide candidate region information of the target to be tracked in the current frame.