A frequency selection based spatio-temporal autoregressive single target tracking method
By introducing a frequency selection mechanism and a spatiotemporal autoregressive mechanism, the computational and memory bottlenecks of the Transformer model in single-target tracking are solved, enhancing the model's target tracking capability in complex backgrounds and achieving high-precision and robust single-target tracking.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: KUNMING UNIV OF SCI & TECH
- Filing Date: 2026-04-27
- Publication Date: 2026-06-26
AI Technical Summary
Existing single-target tracking methods based on the Transformer model face computational and memory bottlenecks when handling long sequences and real-time tracking tasks. Furthermore, traditional methods have limitations in capturing dynamic changes in targets and struggle to effectively distinguish targets from the background in complex contexts.
A frequency selection mechanism is introduced to optimize spatial feature extraction, and a spatiotemporal autoregressive mechanism is combined to capture the spatiotemporal dependence of the target. The model's ability to model the target's motion laws is enhanced by a spatial-frequency collaborative backbone network and a spatiotemporal autoregressive network.
It improves the accuracy and robustness of single-target tracking, enables stable target tracking in dynamic environments, and enhances the model's adaptability in complex contexts.
Figure CN122089787B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision, and specifically to a spatiotemporal autoregressive single-target tracking method based on frequency selection.
Background Technology
[0002] Single-object tracking is a key task in computer vision, aiming to track and locate changes in the position of a target object in real time from a series of video frames or image sequences, based on the target's initial appearance. Single-object tracking is widely used in many practical scenarios, especially in video surveillance, robot navigation, drone tracking, and augmented reality. However, with the dynamic changes in the target's appearance and the increasing complexity of the background, building a model that can effectively distinguish the target from the background has become a major challenge in single-object tracking.
[0003] Currently, single-object tracking methods based on the Transformer model show significant advantages in capturing long-range dependencies and modeling global information. However, because the self-attention mechanism has quadratic complexity in the sequence length, the Transformer model may face significant computational and memory bottlenecks when handling long sequences and real-time tracking tasks. Furthermore, traditional single-object tracking methods typically rely on static spatial features, which limits their ability to capture dynamic changes in targets. A frequency selection mechanism can dynamically optimize the selection of frequency components based on the target's motion characteristics, enabling the model to capture key target features more accurately and thus enhancing its target recognition capability. A spatiotemporal autoregressive mechanism, by fusing spatial and temporal information, captures the dynamic changes of targets across the spatiotemporal dimensions, further enhancing the modeling of the target region.
Summary of the Invention
[0004] To address the aforementioned issues, this invention provides a frequency-selective spatiotemporal autoregressive single-target tracking method. This invention optimizes the spatial feature extraction process by introducing a frequency selection mechanism and combines it with a spatiotemporal autoregressive mechanism to capture the spatiotemporal dependencies of the target. This method enhances the model's ability to model the target's motion patterns, and significantly improves target tracking accuracy and robustness, especially in dynamic environments such as irregular target motion, complex backgrounds, or occlusion.
[0005] To achieve the above objectives, the present invention provides the following technical solution: a spatiotemporal autoregressive single-target tracking method based on frequency selection, specifically including the following steps:
[0006] S1. Based on the public dataset, after extracting the image pair data corresponding to the template frame and the search frame, preprocess the image pair data to generate a one-dimensional sequence of template frames and a one-dimensional sequence of search frames. After linear mapping to a preset dimension, one-dimensional positional codes are embedded to form a template word sequence and a search region word sequence. The template word sequence and the search region word sequence are then concatenated in order to generate a concatenated word sequence.
[0007] S2. Based on the concatenated word sequence and the Vision Mamba model, a space-frequency collaborative backbone network is constructed to output spatial features.
[0008] The space-frequency collaborative backbone network includes a preset number of cascaded space-frequency collaborative modules;
[0009] Each space-frequency collaborative module includes a global representation modeling module and a local feature enhancement module;
[0010] S3. Based on spatial features, a spatiotemporal autoregressive network is constructed to output target perception enhanced by spatiotemporal attention;
[0011] The spatiotemporal autoregressive network includes a preset number of stacked spatiotemporal autoregressive modules;
[0012] The spatiotemporal autoregressive module includes a temporal attention module, a multi-head attention module, a feedforward neural network module, a gated historical feature update module, and a spatiotemporal feature fusion module;
[0013] S4. The target perception representation enhanced by spatiotemporal attention is reshaped into a two-dimensional spatial feature map and input into a fully convolutional neural network, which outputs the category probability distribution, the bounding box, and the local offset. The network is trained by constructing an overall loss function to obtain the total loss value, and target localization is then performed, thus completing the spatiotemporal autoregressive single-target tracking method based on frequency selection.
[0014] Preferably, step S1 specifically includes the following steps:
[0015] S1.1 Based on the public dataset, extract the image pair data corresponding to the template frame and the search frame, where each frame is characterized by its image height and width; preprocess the template frame and the search frame to obtain the template-frame one-dimensional sequence and the search-frame one-dimensional sequence;
[0016] The preprocessing operation specifically involves: dividing each image into blocks of a preset pixel size, and flattening the resulting image blocks into the template-frame one-dimensional sequence and the search-frame one-dimensional sequence, where the block size is the preset resolution of each image patch, and the sequence lengths are the preset numbers of blocks for the template frame and the search frame, respectively.
[0017] S1.2 After mapping the template frame one-dimensional sequence and the search frame one-dimensional sequence to a preset dimension through a trainable linear mapping layer, one-dimensional positional encoding is embedded respectively to generate template word sequence and search region word sequence;
[0018] The expressions for generating the template word sequence and the search region word sequence are as follows:
[0019]
[0020]
[0021] In the formulas, the symbols denote, in order: the template word sequence, the search region word sequence, the preset linear mapping parameters, the learnable one-dimensional positional encoding of the template, and the learnable one-dimensional positional encoding of the search region;
[0022] S1.3. Concatenate the template word sequence and the search area word sequence in order to generate a concatenated word sequence;
[0023] The expression for generating the concatenated word sequence is as follows:
[0024]
[0025] In the formula, the result denotes the concatenated word sequence.
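Step S1 can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the patch size of 16, the embedding dimension of 64, and the randomly initialized weights are hypothetical, and the 128×128 and 256×256 frame sizes are the ones given in the embodiment.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches, each flattened to a vector."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by the patch size"
    x = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape((h // p) * (w // p), p * p * c)

rng = np.random.default_rng(0)
P, D = 16, 64                              # assumed patch size and embedding dimension
template = rng.random((128, 128, 3))       # template frame
search = rng.random((256, 256, 3))         # search frame

seq_z = patchify(template, P)              # template-frame one-dimensional sequence
seq_x = patchify(search, P)                # search-frame one-dimensional sequence

E = rng.normal(0.0, 0.02, (P * P * 3, D))            # stand-in for the trainable linear mapping
pos_z = rng.normal(0.0, 0.02, (seq_z.shape[0], D))   # stand-in for the template positional encoding
pos_x = rng.normal(0.0, 0.02, (seq_x.shape[0], D))   # stand-in for the search positional encoding

tokens_z = seq_z @ E + pos_z               # template word sequence
tokens_x = seq_x @ E + pos_x               # search region word sequence
tokens = np.concatenate([tokens_z, tokens_x], axis=0)  # concatenated word sequence
```

With these assumed sizes the template contributes 64 tokens and the search region 256, so the backbone would receive a sequence of 320 tokens.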
[0026] Preferably, step S2 specifically includes the following steps:
[0027] S2.1. Based on the concatenated word sequence and the Vision Mamba model, a space-frequency collaborative backbone network is defined.
[0028] The expression for the space-frequency collaborative backbone network is:
[0029]
[0030] In the formula, the symbols denote, in order: the one-dimensional convolution operation, the feature modeling process of each layer's space-frequency collaborative module in the backbone network, and the concatenated word sequence;
[0031] S2.2 Based on the defined space-frequency collaborative backbone network, the output features of the previous layer are fed simultaneously into three parallel branches: the bidirectional state-space model (SSM) module branch, the global selective frequency enhancement module branch, and the global adaptive weight fusion module branch. These branches output the global features, the global enhancement features, and the global weight features, which are added element by element to obtain the global representation features, thus completing the construction of the global representation modeling module.
[0032] S2.3 Based on the global representation features, three parallel branches (the hybrid gating module branch, the local selective frequency enhancement module branch, and the local adaptive weight fusion module branch) output their respective local features, which are added element by element to obtain the spatial features, thus completing the construction of the space-frequency collaborative module.
[0033] Preferably, step S2.2 specifically includes the following steps:
[0034] S2.2.1 Constructing the Bidirectional State-Space Model (SSM) Module Branch: Input the output features of the previous layer, and after performing normalization and linear projection operations, obtain the value vector and query vector; input the value vector into the Bidirectional State-Space Model (SSM) module, and then perform linear projection to generate the input matrix, output matrix, and time scale parameters; initialize the state transition matrix based on the high-order polynomial projection operator matrix, and after discretizing the state transition matrix and output matrix, perform SSM forward and backward calculations to obtain the global features;
[0035] The expression for obtaining the global features is as follows:
[0036]
[0037] In the formula, the symbols denote, in order: the global features, the forward features, the backward features, and a linear layer;
[0038] S2.2.2 Constructing the global selective frequency enhancement module branch: After splitting the output features of the previous layer into template word sequences and search region word sequences, input them into the global selective frequency enhancement module respectively to obtain template enhanced features and search enhanced features. Then, they are concatenated to output the global enhanced features.
[0039] The specific calculation process for outputting the global enhanced features is expressed as follows:
[0040]
[0041]
[0042]
[0043] where the symbols denote, in order: the global template word sequence, the global search region word sequence, the Gaussian error linear unit (GELU) activation function, the two-dimensional fast Fourier transform, the inverse two-dimensional fast Fourier transform, the global enhanced features, the template enhanced features, and the search enhanced features;
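One possible reading of this branch is sketched below. The source names only the split, the two-dimensional FFT, its inverse, GELU, and the concatenation; the per-frequency filter `freq_weight` and the point at which GELU is applied are assumptions made for illustration.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the Gaussian error linear unit."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def freq_enhance(tokens, grid, freq_weight):
    """tokens: (grid*grid, D). freq_weight: (grid, grid, D), a hypothetical learnable filter."""
    n, d = tokens.shape
    fmap = tokens.reshape(grid, grid, d)              # back to a 2D token grid
    spec = np.fft.fft2(fmap, axes=(0, 1))             # 2D FFT over the spatial axes
    spec = spec * freq_weight                         # select / re-weight frequency components
    out = np.fft.ifft2(spec, axes=(0, 1)).real        # inverse 2D FFT
    return gelu(out).reshape(n, d)

def global_freq_branch(x, n_z, grid_z, grid_x, w_z, w_x):
    """Split into template and search tokens, enhance each, then concatenate."""
    z, s = x[:n_z], x[n_z:]
    return np.concatenate([freq_enhance(z, grid_z, w_z),
                           freq_enhance(s, grid_x, w_x)], axis=0)

rng = np.random.default_rng(0)
D = 8
x = rng.normal(size=(4 * 4 + 8 * 8, D))               # 16 template tokens + 64 search tokens
w_z = rng.random((4, 4, D))
w_x = rng.random((8, 8, D))
enhanced = global_freq_branch(x, 16, 4, 8, w_z, w_x)  # same shape as x
```

Processing the template and search tokens separately keeps each FFT on a square grid of its own size, which is why the sequence is split before the transform.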
[0044] S2.2.3, Constructing the global adaptive weight fusion module branch: The output features of the previous layer are weighted by a preset global learnable scaling factor to obtain global weight features;
[0045] The expression for obtaining the global weight features is as follows:
[0046]
[0047] In the formula, the symbols denote the global weight features and the preset global learnable scaling factor;
[0048] S2.2.4. Based on global features, global enhancement features, and global weight features, element-wise summation is performed to obtain global representation features;
[0049] The calculation expression for the global representation features is as follows:
[0050]
[0051] In the formula, the result denotes the global representation features.
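Read together, steps S2.2.1 to S2.2.4 suggest a three-term sum of the following shape. All symbol names here are hypothetical, since the original notation is not available: $X^{(l-1)}$ is the output of the previous layer, $F_{\mathrm{ssm}}$ the global features, $F_{\mathrm{freq}}$ the global enhanced features, and $\alpha$ the global learnable scaling factor.

```latex
G^{(l)} \;=\; F^{(l)}_{\mathrm{ssm}} \;+\; F^{(l)}_{\mathrm{freq}} \;+\; \alpha \, X^{(l-1)}
```

Under this reading, the third term acts as a learnable skip connection that controls how much of the unprocessed input is carried into the global representation.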
[0052] Preferably, step S2.3 specifically includes the following steps:
[0053] S2.3.1 Constructing the hybrid gating module branches: After normalizing the global representation features and expanding the dimensions by one-dimensional convolution channels, the features are split into first features and second features by halving the channel dimensions. These features are then input into the two branches of the hybrid gating module, which output local features and local gating weights respectively. The local features and local gating weights are then gated and fused to obtain local enhanced features.
[0054] The two branches of the hybrid gating module specifically include: a local feature extraction branch constructed from one-dimensional convolution, depthwise convolution, and channel attention, and a gating weight generation branch constructed from a linear projection and the Gaussian error linear unit (GELU) activation function;
[0055] The expression for the local enhancement feature is as follows:
[0056]
[0057] where the symbols denote, in order: the local enhanced features, the channel attention operation, the depthwise convolution operation, the first features, the second features, the Gaussian error linear unit (GELU) activation function, and element-wise multiplication;
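The hybrid gating branch can be sketched as below. The source does not fix the internals of each operation, so the squeeze-and-excitation style channel attention, the kernel-size-1 convolutions written as matrix products, the depthwise kernel size of 3, and all weight shapes are assumptions.

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def depthwise_conv1d(x, k):
    """x: (N, C); k: (K, C) with odd K. Per-channel sliding filter along the token axis."""
    pad = k.shape[0] // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return sum(xp[i:i + x.shape[0]] * k[i] for i in range(k.shape[0]))

def channel_attention(x, w1, w2):
    """Squeeze-and-excitation style channel re-weighting (an assumed form)."""
    s = x.mean(axis=0)                                   # squeeze over tokens: (C,)
    a = 1.0 / (1.0 + np.exp(-(np.maximum(s @ w1, 0.0) @ w2)))
    return x * a

def hybrid_gate(g, w_expand, w_local, k_dw, w1, w2, w_gate):
    h = layer_norm(g) @ w_expand                         # normalize, expand channels to 2C
    f1, f2 = np.split(h, 2, axis=-1)                     # first and second features
    local = channel_attention(depthwise_conv1d(f1 @ w_local, k_dw), w1, w2)
    gate = gelu(f2 @ w_gate)                             # local gating weights
    return local * gate                                  # gated fusion

rng = np.random.default_rng(0)
N, C = 10, 8
g = rng.normal(size=(N, C))
params = dict(
    w_expand=rng.normal(size=(C, 2 * C)) * 0.1,
    w_local=rng.normal(size=(C, C)) * 0.1,
    k_dw=rng.normal(size=(3, C)) * 0.1,
    w1=rng.normal(size=(C, C // 2)) * 0.1,
    w2=rng.normal(size=(C // 2, C)) * 0.1,
    w_gate=rng.normal(size=(C, C)) * 0.1,
)
out = hybrid_gate(g, **params)                           # local enhanced features, (N, C)
```

Because the gate multiplies the local features element by element, a gate value near zero suppresses a channel at a token and a larger value lets it through.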
[0058] S2.3.2 Constructing the Local Selective Frequency Enhancement Module Branch: After splitting the global representation features into local template features and local search features, input them into the local selective frequency enhancement module to obtain local template frequency enhancement features and local search frequency enhancement features; concatenate the local template frequency enhancement features and local search frequency enhancement features to output the local enhancement features;
[0059] The specific calculation expression for outputting local enhanced features is as follows:
[0060]
[0061]
[0062]
[0063] In the formula, the symbols denote, in order: the local template frequency enhancement features, the local search frequency enhancement features, the two-dimensional fast Fourier transform, the one-dimensional convolution operation, the local enhancement features, the local template features, and the local search features;
[0064] S2.3.3, Constructing the Local Adaptive Weight Fusion Module Branch: By pre-setting a local learnable scaling factor, the global representation features are weighted to obtain local weight features;
[0065] The expression for obtaining the local weighted features is as follows:
[0066]
[0067] In the formula, the symbols denote, in order: the local weight features, the preset local learnable scaling factor, and the global representation features;
[0068] S2.3.4. Based on local enhancement features, local augmentation features, and local weight features, element-wise summation is performed to obtain spatial features, thus completing the construction of the space-frequency coordination module;
[0069] The expression for the spatial features is as follows:
[0070]
[0071] In the formula, the result denotes the spatial features.
[0072] Preferably, step S3 specifically includes the following steps:
[0073] S3.1 Taking the spatial features as the current frame, the features enhanced with historical information are obtained through the temporal attention module;
[0074] Specifically: for the current time step, the historical aggregated features are those accumulated up to the previous time step, and the current frame is set to the spatial features. First, the key, the value, and the query are obtained from the historical aggregated features and the current frame through learnable linear mapping layers, expressed as follows:
[0075]
[0076]
[0077]
[0078] Then, using keys, values, and queries as inputs, temporal attention modeling is performed to obtain intermediate spatial features. The expression is as follows:
[0079]
[0080] where the symbols denote, in order: the attention calculation function, the normalized exponential (softmax) function, the transpose operation, and the dimension of the key vectors; a residual connection and normalization are then applied to obtain the features enhanced with historical information;
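The temporal attention step follows the usual scaled dot-product pattern, which can be sketched as below. The assignment of the query to the current frame and of the key and value to the historical aggregated features is an assumption, as are the shapes and the random weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def temporal_attention(cur, hist, wq, wk, wv):
    """cur: (N, D) current-frame features; hist: (M, D) historical aggregated features."""
    q = cur @ wq                                   # query from the current frame (assumed)
    k = hist @ wk                                  # key from the history (assumed)
    v = hist @ wv                                  # value from the history (assumed)
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) # softmax(Q K^T / sqrt(d_k))
    mid = attn @ v                                 # intermediate spatial features
    return layer_norm(cur + mid), attn             # residual connection + normalization

rng = np.random.default_rng(0)
N, M, D = 6, 4, 8
cur = rng.normal(size=(N, D))
hist = rng.normal(size=(M, D))
wq, wk, wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
enhanced, attn = temporal_attention(cur, hist, wq, wk, wv)
```

Dividing by the square root of the key dimension keeps the logits in a range where the softmax does not saturate as the feature dimension grows.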
[0081] S3.2 Input the historical information-enhanced features and the current frame, and obtain the structured information-enhanced features through the multi-head attention module;
[0082] Specifically: the inputs to the multi-head attention layer are obtained from the historical-information-enhanced features and the current frame through learnable linear mapping layers:
[0083]
[0084]
[0085]
[0086] The multi-head attention layer consists of several temporal attention sub-layers connected in parallel. The outputs of the sub-layers are concatenated and then linearly projected by an output matrix to re-integrate the spatial features, as expressed below:
[0087]
[0088] where the symbols denote, in order: the output of the multi-head attention layer, the multi-head attention computation, the concatenation operation, and the output of each temporal attention sub-layer in the multi-head attention layer; after a residual connection and normalization, the features with enhanced structured information are obtained;
[0089] S3.3 Input spatial information enhancement features, and through the feedforward neural network module, output structured information enhancement features;
[0090] Specifically, fully connected layers perform linear projection and affine transformation, and a nonlinear activation function is introduced to enhance expressive power. The output of the feedforward neural network is defined as:
[0091]
[0092] In the formula, the symbols denote a fully connected layer and the Gaussian error linear unit (GELU) activation function;
[0093] Then, after a residual connection and normalization, the features with enhanced structured information are obtained;
[0094] S3.4 Input the structured-information-enhanced features, and output the historical aggregated features through the gated historical feature update module;
[0095] Specifically: the structured-information-enhanced features are fused by weighting with the aggregated features of the previous frame to obtain the historical aggregated feature representation:
[0096]
[0097]
[0098] where the symbols denote, in order: the gate vector, the activation function, and a learnable linear mapping. The result is written to the dynamic feature pool as the historical aggregated representation for the corresponding time step;
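A gated update of this kind can be sketched as a convex combination of the new features and the previous history. The sigmoid activation, the concatenation fed to the linear mapping, and the weight shapes are assumptions; the source states only that a gate vector is produced by an activation applied to a learnable linear mapping.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_history_update(feat, hist_prev, w, b):
    """feat, hist_prev: (N, D). w: (2D, D), b: (D,) form the hypothetical linear mapping."""
    gate = sigmoid(np.concatenate([feat, hist_prev], axis=-1) @ w + b)  # gate vector in (0, 1)
    return gate * feat + (1.0 - gate) * hist_prev                       # weighted fusion

rng = np.random.default_rng(0)
N, D = 5, 8
feat = rng.normal(size=(N, D))          # structured-information-enhanced features
hist_prev = rng.normal(size=(N, D))     # aggregated features of the previous frame
w = rng.normal(size=(2 * D, D)) * 0.1
b = np.zeros(D)
hist_new = gated_history_update(feat, hist_prev, w, b)
```

Each output element lies between the corresponding elements of the new features and the previous history, so the gate decides per channel how quickly the history is overwritten.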
[0099] S3.5 Input spatial features and historical aggregation features, and through the spatiotemporal feature fusion module, obtain the target perception after spatiotemporal attention enhancement, and complete the construction of the spatiotemporal autoregressive module;
[0100] Specifically, the attention map is obtained by calculating the dot product of spatial features and historical aggregated features. The calculation expression is as follows:
[0101]
[0102] where the symbols denote, in order: the parameter-free dot product, the preset number of blocks of the search frame, and the number of historical features; the attention map then weights the feature vectors of all search regions in the current frame to obtain the target perception representation enhanced by spatiotemporal attention.
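The fusion step can be illustrated as follows. The dot-product attention map comes from the text; reducing it over the history axis by a mean and normalizing the result with a softmax are assumptions chosen to turn the map into one weight per search token.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatiotemporal_fusion(search_feats, hist_feats):
    """search_feats: (N_x, D) search-region tokens; hist_feats: (M, D) historical features."""
    attn = search_feats @ hist_feats.T               # (N_x, M) parameter-free dot product
    weights = softmax(attn.mean(axis=1), axis=0)     # one weight per search token (assumed)
    return weights[:, None] * search_feats, weights, attn

hist_feats = np.array([[1.0, 0.0], [1.0, 0.0]])
search_feats = np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]])
fused, weights, attn = spatiotemporal_fusion(search_feats, hist_feats)
```

In this toy input only the second search token is aligned with the history, so it receives the largest weight.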
[0103] Preferably, step S4 specifically includes the following steps:
[0104] S4.1 The target perception representation enhanced by spatiotemporal attention is reshaped into a two-dimensional spatial feature map and then input into a fully convolutional neural network to output the class probability distribution, the bounding box, and the local offset.
[0105] S4.2 Based on the category probability distribution, bounding box, and local offset, a total loss function is constructed for training to obtain the total loss value and perform target localization;
[0106] The overall loss function is expressed as:
[0107]
[0108] In the formula, the terms denote the weighted focal loss, the generalized intersection-over-union (GIoU) loss, and the regularization loss, and the two coefficients are the preset GIoU loss parameter and the preset regularization parameter.
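A weighted sum of this kind can be sketched as below. The GIoU computation is the standard one; treating the regularization term as a mean absolute (L1) box error, passing the focal loss in as a precomputed number, and the coefficient values 2.0 and 5.0 are all assumptions for illustration.

```python
def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest box enclosing both inputs
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull

def total_loss(focal, pred_box, gt_box, lam_giou, lam_reg):
    """focal: precomputed weighted focal loss. The L1 term stands in for the regularization loss."""
    l_giou = 1.0 - giou(pred_box, gt_box)
    l_reg = sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / 4.0
    return focal + lam_giou * l_giou + lam_reg * l_reg

pred = (0.10, 0.10, 0.50, 0.50)
gt = (0.15, 0.10, 0.55, 0.50)
loss = total_loss(0.3, pred, gt, lam_giou=2.0, lam_reg=5.0)  # hypothetical coefficients
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes, because the enclosing-box penalty still changes as the prediction moves toward the ground truth.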
[0109] S4.3. The position with the highest probability in the category probability distribution is taken as the target location; the local offset is applied to refine this location and the target size is obtained, generating the final bounding box, thus completing the spatiotemporal autoregressive single-target tracking method based on frequency selection.
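The localization step can be sketched as below, assuming a square score map, a two-channel offset map holding sub-cell shifts, and a two-channel size map holding width and height normalized to the search region. These tensor layouts are assumptions; the source names only the three outputs.

```python
import numpy as np

def decode_box(score, offset, size, search_px):
    """score: (S, S); offset: (2, S, S) as (dx, dy) within a cell; size: (2, S, S) as (w, h) in [0, 1]."""
    s = score.shape[0]
    iy, ix = np.unravel_index(np.argmax(score), score.shape)  # highest-probability cell
    cx = (ix + offset[0, iy, ix]) / s * search_px             # refine the centre with the local offset
    cy = (iy + offset[1, iy, ix]) / s * search_px
    w = size[0, iy, ix] * search_px                           # target size in pixels
    h = size[1, iy, ix] * search_px
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2     # (x1, y1, x2, y2)

S = 16
score = np.zeros((S, S))
score[4, 8] = 1.0                         # peak at row 4, column 8
offset = np.full((2, S, S), 0.5)
size = np.stack([np.full((S, S), 0.25), np.full((S, S), 0.5)])
box = decode_box(score, offset, size, search_px=256)
```

The offset recovers the sub-cell precision lost when the search region is reduced to a coarse score map.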
[0110] Compared with existing technologies, this invention provides a frequency-selective spatiotemporal autoregressive single-target tracking method, which has the following advantages:
[0111] The method captures the spatial features of targets in dynamic environments, effectively enhancing the ability to distinguish targets from the background and thus achieving more stable and reliable target tracking. Combining the frequency selection mechanism with the Vision Mamba model enables the model to select the most discriminative target features across different frequency bands, thereby enhancing target recognition in complex environments. The spatiotemporal autoregressive mechanism uses historical information to model the dynamic changes of the target. Combined with a temporal attention mechanism, the method captures the correlation between historical aggregated features and current spatial features, enhancing the model's understanding of the target's motion patterns in the temporal dimension. Together, these techniques improve the accuracy of single-target tracking and enhance the model's adaptability in complex dynamic environments, making this invention an efficient and robust technical solution for single-target tracking tasks in various scenarios.
Attached Figure Description
[0112] Figure 1 is a flowchart of the steps of the present invention;
[0113] Figure 2 is a visualization of the prediction results of the present invention on the OTMJ dataset.
Detailed Implementation
[0114] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0115] As shown in Figure 1, a spatiotemporal autoregressive single-target tracking method based on frequency selection specifically includes the following steps:
[0116] S1. Based on the public dataset, extract the corresponding image pairs of the template frame and the search frame, perform preprocessing operations on the image pairs to generate a one-dimensional sequence of the template frame and a one-dimensional sequence of the search frame; after linear mapping to a preset dimension, embed one-dimensional positional codes to form a template word sequence and a search region word sequence, and concatenate the template word sequence and the search region word sequence in order to generate a concatenated word sequence.
[0117] Specifically, the following steps are included:
[0118] S1.1 Based on the public dataset, extract the image pair data corresponding to the template frame and the search frame, where each frame is characterized by its image height and width; preprocess the template frame and the search frame to obtain the template-frame one-dimensional sequence and the search-frame one-dimensional sequence;
[0119] The preprocessing operation specifically involves: dividing each image into blocks of a preset pixel size, and then flattening the resulting image blocks into the template-frame one-dimensional sequence and the search-frame one-dimensional sequence, where the block size is the resolution of each image patch, and the sequence lengths are the numbers of blocks for the template frame and the search frame, respectively.
[0120] In this embodiment, the image pair corresponding to the template frame and the search frame is read by index from the publicly available dataset OTMJ, and the input template frame and search frame are preprocessed: the template frame is set to 128×128 pixels and the search frame to 256×256 pixels. The images are then split into blocks of the preset patch resolution and flattened into the two one-dimensional sequences.
[0121] S1.2 After mapping the template frame one-dimensional sequence and the search frame one-dimensional sequence to a preset dimension through a trainable linear mapping layer, one-dimensional positional encoding is embedded respectively to generate template word sequence and search region word sequence;
[0122] In this embodiment, the trainable linear mapping layer, with its preset linear mapping parameters, maps the template-frame and search-frame one-dimensional sequences into a latent space of the preset dimension; the learnable one-dimensional positional encodings for the template and the search region are then embedded into the mapped results to form the final template word sequence and search region word sequence, expressed as follows:
[0123]
[0124]
[0125] In the formulas, the symbols denote, in order: the template word sequence, the search region word sequence, the template-frame one-dimensional sequence, the search-frame one-dimensional sequence, the preset linear mapping parameters, the learnable one-dimensional positional encoding of the template, and the learnable one-dimensional positional encoding of the search region.
[0126] S1.3. Concatenate the template word sequence and the search area word sequence in order to generate a concatenated word sequence;
[0127] In this embodiment, the template word sequence and the search region word sequence are concatenated in order, expressed as follows:
[0128]
[0129] The generated concatenated word sequence serves as the input to the space-frequency collaborative backbone network.
[0130] S2. Based on concatenated word sequences and the Vision Mamba model, a space-frequency collaborative backbone network is constructed to output spatial features. In terms of hierarchical design, each layer of the network includes two parts: global representation and local enhancement. First, the global receptive field is extracted with linear complexity using Vision Mamba as the core. Second, selective frequency enhancement is performed on local features using a hybrid gating mechanism. Both are supplemented by an adaptive weight fusion mechanism and a selective frequency enhancement module to obtain a frequency-enhanced discriminative spatial feature representation.
[0131] S2.1. Based on the concatenated word sequence and the Vision Mamba model, a space-frequency collaborative backbone network is defined.
[0132] The space-frequency collaborative backbone network includes a preset number of cascaded space-frequency collaborative modules;
[0133] Each space-frequency collaborative module includes a global representation modeling module, a local feature enhancement module, a one-dimensional convolution, and residual connections;
[0134] In this embodiment, the space-frequency collaborative backbone network is composed of 24 cascaded space-frequency collaborative modules. The function of the entire network can be summarized as follows:
[0135]
[0136] In the formula, the symbols denote, in order: a one-dimensional convolution operation with a kernel size of 4, the feature modeling process of each layer's space-frequency collaborative module in the backbone network, and the concatenated word sequence.
[0137] S2.2 Based on the defined space-frequency collaborative backbone network, the output features of the previous layer are fed simultaneously into three parallel branches: the bidirectional state-space model (SSM) module branch, the global selective frequency enhancement module branch, and the global adaptive weight fusion module branch. These branches output the global features, the global enhancement features, and the global weight features, which are added element by element to obtain the global representation features, thus completing the construction of the global representation modeling module.
[0138] Specifically, given the output of the previous layer, the current space-frequency collaborative module first processes it through three parallel branches simultaneously to model its global representation features, which includes the following steps:
[0139] S2.2.1 Constructing the Bidirectional State-Space Model (SSM) Module Branch: Input the output features of the previous layer, and after performing normalization and linear projection operations, obtain the value vector and query vector; input the value vector into the Bidirectional State-Space Model (SSM) module, and then perform linear projection to generate the input matrix, output matrix, and time scale parameters; initialize the state transition matrix based on the high-order polynomial projection operator (HiPPO) matrix, and after discretizing the state transition matrix and output matrix, perform SSM forward and backward calculations to obtain the global features;
[0140] First, the output features of the previous layer are normalized, which calibrates and constrains the input distribution; a linear projection then yields the value vector and the query vector. Next, the bidirectional state-space model (SSM) module processes the value vector to obtain an intermediate feature, expressed as follows:
[0141]
[0142] where the activation symbol denotes an adaptive activation function.
[0143] Then, a linear projection is applied to this intermediate feature to generate the input matrix, the output matrix, and the time scale parameter, expressed as follows:
[0144]
[0145]
[0146]
[0147] where the symbols denote, in order: exponentiation, a predefined parameter, and a linear layer. The state transition matrix is initialized from the HiPPO matrix. Using the zero-order hold technique and the time scale parameter, the continuous-time matrices are then discretized, expressed as follows:
[0148]
[0149]
[0150] where the symbol denotes the identity matrix. Next, the forward features and the backward features are computed with the state-space model (SSM), expressed as follows:
[0151]
[0152]
[0153] where the two operators denote the forward SSM and backward SSM modeling processes, respectively. The specific SSM calculation is expressed as follows:
[0154]
[0155]
[0156] where the symbols denote, in order: the current input to the SSM, the current hidden state, the previous hidden state, and the current output of the SSM. Finally, the output of the current branch, the global features, is obtained as follows:
[0157] .
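The zero-order hold discretization and the forward and backward scans described above can be sketched for a single input channel as follows. This is a simplification under stated assumptions: the state matrix is diagonal, the time scale parameter and the input and output matrices are fixed rather than generated from the input, the state matrix values are arbitrary rather than HiPPO-initialized, and the two directions are merged by a plain sum instead of a linear layer.

```python
import numpy as np

def zoh_discretize(a, b, delta):
    """Zero-order hold for a diagonal state matrix: a, b have shape (N,), delta is a scalar.

    A_bar = exp(delta * a);  B_bar = (delta * a)^-1 (exp(delta * a) - 1) * delta * b
    """
    a_bar = np.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def ssm_scan(x, a_bar, b_bar, c):
    """Recurrence h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t over a scalar sequence x."""
    h = np.zeros_like(a_bar)
    y = np.empty(len(x))
    for t, xt in enumerate(x):
        h = a_bar * h + b_bar * xt
        y[t] = c @ h
    return y

def bidirectional_ssm(x, a_bar, b_bar, c):
    fwd = ssm_scan(x, a_bar, b_bar, c)                 # forward features
    bwd = ssm_scan(x[::-1], a_bar, b_bar, c)[::-1]     # backward features, re-aligned
    return fwd + bwd                                   # assumed merge; the source uses a linear layer

a = -np.arange(1.0, 5.0)          # stable diagonal state matrix (hypothetical values)
b = np.ones(4)
c = np.ones(4) / 4
a_bar, b_bar = zoh_discretize(a, b, delta=0.1)
x = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
y = bidirectional_ssm(x, a_bar, b_bar, c)
```

Each scan costs time linear in the sequence length, which is the property the method relies on to avoid the quadratic cost of self-attention.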
[0158] S2.2.2 Constructing the global selective frequency enhancement module branch: After splitting the output features of the previous layer into template word sequences and search region word sequences, input them into the global selective frequency enhancement module respectively to obtain template enhanced features and search enhanced features. Then, they are concatenated to output the global enhanced features.
[0159] This invention introduces a selective frequency enhancement module as a frequency-domain aid which, without changing the main structure, uses the two-dimensional fast Fourier transform to characterize the features globally and to enhance high-frequency components in a targeted way. Specifically, the input is split into a global template word sequence and a global search region word sequence. Both are then input into a global selective frequency enhancement module based on the two-dimensional fast Fourier transform to obtain the template enhanced features and the search enhanced features, refined in the frequency domain. Finally, these are concatenated to obtain the global enhanced features output by the current branch. The specific calculation process is expressed as follows:
[0160]
[0161]
[0162]
[0163] where the transform operators denote the two-dimensional fast Fourier transform and the inverse two-dimensional fast Fourier transform, respectively;
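As a rough illustration of frequency-domain enhancement, the sketch below transforms a small feature grid with a naive two-dimensional discrete Fourier transform, scales the components at or above a cutoff frequency, and transforms back. The fixed gain, the cutoff rule, and the function names are assumptions made for illustration; the learnable parts of the module are not modeled here.

```python
import cmath

def dft2(grid, inverse=False):
    """Naive 2-D discrete Fourier transform of an H x W grid (illustration only)."""
    h, w = len(grid), len(grid[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = sign * 2 * cmath.pi * (u * y / h + v * x / w)
                    total += grid[y][x] * cmath.exp(1j * angle)
            out[u][v] = total / (h * w) if inverse else total
    return out

def enhance_high_freq(grid, gain=1.5, cutoff=1):
    """Scale every frequency whose wrapped index sum reaches the cutoff (assumed rule)."""
    h, w = len(grid), len(grid[0])
    spec = dft2(grid)
    for u in range(h):
        for v in range(w):
            fu, fv = min(u, h - u), min(v, w - v)  # wrapped frequency indices
            if fu + fv >= cutoff:
                spec[u][v] *= gain
    return [[val.real for val in row] for row in dft2(spec, inverse=True)]

boosted = enhance_high_freq([[1.0, 3.0], [5.0, 7.0]], gain=1.5)
```

With the default cutoff only the mean (DC) component is left unchanged, so deviations from the mean are amplified by the gain.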
[0164] S2.2.3, Constructing the global adaptive weight fusion module branch: The output features of the previous layer are weighted by a preset global learnable scaling factor to obtain global weight features;
[0165] This invention uses a global adaptive weight fusion module with a preset global learnable scaling factor to perform dynamic weight allocation, which alleviates representation mismatch and improves space-frequency collaborative modeling capability. The output global weight features are expressed as follows:
[0166] .
[0167] S2.2.4. Based on global features, global enhancement features, and global weight features, element-wise summation is performed to obtain global representation features;
[0168] In summary, the expression for the global representation feature is:
[0169] .
[0170] S2.3 Based on the global representation features, through the parallel hybrid gating module branch, local selective frequency enhancement module branch, and local adaptive weight fusion module branch, the local enhancement features, local augmentation features, and local weight features are output and then added element by element to obtain the spatial features, thus completing the construction of the space-frequency coordination module.
[0171] The global representation features serve as input to the local feature enhancement part, which also consists of three parallel branches. With the hybrid gating module as its core, it selectively amplifies or suppresses local features to adapt to the spatial variation characteristics of remotely sensed images. Specifically, it includes the following steps:
[0172] S2.3.1 Constructing the hybrid gating module branch: After the global representation features are normalized and their channel dimension is expanded by a one-dimensional convolution, the features are split in half along the channel dimension into first features and second features. These are input into the two branches of the hybrid gating module, which output local features and local gating weights respectively. The local features and local gating weights are then fused by gating to obtain the local enhancement features.
[0173] The two branches of the hybrid gating module are: a local feature extraction branch built from one-dimensional convolution, depthwise convolution, and channel attention, and a gating weight generation branch built from linear projection and the Gaussian error linear unit (GELU) activation function;
[0174] First, the global representation features are normalized. A one-dimensional convolution then doubles the channel dimension, and the result is split in half along the channel dimension into the first features and the second features, which are fed into the two branches of the hybrid gating module. In the first branch, one-dimensional convolution, depthwise convolution, and channel attention operations are used to obtain the features captured by local convolution. In the second branch, linear projection and the Gaussian error linear unit activation function are used to generate the gating weights. The expression for the local enhancement features output by the hybrid gating module branch is shown below:
[0175]
[0176] where the left-hand side denotes the local enhancement features, and the operators denote the channel attention operation, a 3×3 depthwise convolution operation, and element-wise multiplication ($\odot$), respectively.
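The gated fusion in the hybrid gating branch can be sketched for a single token as follows. The local branch is passed in as a placeholder callable standing in for the convolution and channel attention stack, and `gate_weights` is a hypothetical linear projection; both are assumptions for illustration.

```python
import math

def gelu(x):
    """Exact Gaussian error linear unit."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def hybrid_gate(features, local_branch, gate_weights):
    """Split channels in half, then fuse local features with GELU gating weights."""
    half = len(features) // 2
    first, second = features[:half], features[half:]
    # Placeholder for Conv1d -> depthwise convolution -> channel attention.
    local = local_branch(first)
    # Gating weights: linear projection of the second half followed by GELU.
    gate = [gelu(sum(w * s for w, s in zip(row, second))) for row in gate_weights]
    # Element-wise multiplication of local features and gating weights.
    return [l * g for l, g in zip(local, gate)]

identity = [[1.0, 0.0], [0.0, 1.0]]
out = hybrid_gate([1.0, 2.0, 0.0, 3.0], lambda x: x, identity)
```

A gate value near zero suppresses the corresponding local channel, while a large positive gate amplifies it.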
[0177] S2.3.2 Constructing the local selective frequency enhancement module branch: After splitting the global representation features into local template features and local search features, input them into the local selective frequency enhancement module to obtain local template frequency enhancement features and local search frequency enhancement features; concatenate the local template frequency enhancement features and local search frequency enhancement features to output the local augmentation features;
[0178] Similarly, the input global representation features are decomposed into local template features and local search features. These are then input separately into the local selective frequency enhancement module to obtain the local template frequency enhancement features and the local search frequency enhancement features. Finally, the two are concatenated to obtain the local augmentation features output by the current branch. The specific calculation process is as follows:
[0179]
[0180]
[0181]
[0182] S2.3.3 Constructing the local adaptive weight fusion module branch: the global representation features are weighted by a preset local learnable scaling factor to obtain the local weight features;
[0183] This invention uses a local adaptive weight fusion module with a local learnable scaling factor to perform dynamic weight allocation and obtain the local weight features:
[0184]
[0185] S2.3.4. Based on local enhancement features, local augmentation features, and local weight features, element-wise summation is performed to obtain spatial features, thus completing the construction of the space-frequency coordination module;
[0186] In summary, the output of the local feature enhancement part, i.e., the output spatial features of the space-frequency coordination module at the given layer, is:
[0187] .
[0188] S3. Based on spatial features, a spatiotemporal autoregressive network is constructed to aggregate historical information and enhance the spatial feature representation of the target. Under the hierarchical paradigm, historical information is injected as prior context to guide the temporal autoregressive attention modeling of the current frame. Then, a feedforward neural network is used to perform nonlinear transformation on the intermediate representation to improve the representation ability. Subsequently, a gating mechanism is used to guide the dynamic updating of historical information. Finally, a parameter-free dot product is used to enhance the spatial feature representation of the target region.
[0189] The proposed spatiotemporal autoregressive network consists of 12 identical stacked spatiotemporal autoregressive modules. Its core is an attention mechanism that explicitly models the coupled interaction between dependent frames, providing temporal context at each time step and dynamically updating the target feature representation.
[0190] Specifically, the following steps are included:
[0191] S3.1 After setting the current frame as a spatial feature, the feature enhanced with historical information is obtained through the temporal attention module;
[0192] Let the current time be $t$. The historical aggregated feature and the feature representation extracted from the current frame are first passed through learnable linear mapping layers to obtain the key ($K$), query ($Q$), and value ($V$), expressed as follows:
[0193]
[0194]
[0195]
[0196] Then, with the key, value, and query as inputs, temporal attention modeling is performed to obtain the intermediate spatial features:
[0197] $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
[0198] where $\mathrm{Attention}(\cdot)$ represents the attention calculation function, $\mathrm{softmax}(\cdot)$ represents the normalized exponential function, $T$ indicates the transpose operation, and $d_k$ represents the dimension of the key vector. Next, the intermediate spatial features are combined with the input of the attention module through a residual connection and normalized to obtain the features enhanced with historical information.
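The attention calculation described here follows the standard scaled dot-product form, which can be sketched in plain Python as below. The nested-list tensor layout and the function names are illustrative assumptions; the learnable linear mappings that produce the query, key, and value are omitted.

```python
import math

def softmax(row):
    """Numerically stable normalized exponential over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for lists of query, key, and value vectors."""
    d_k = len(k[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

result = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 4.0]])
```

A zero query gives equal scores for both keys, so the output is the plain average of the two value vectors.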
[0199] S3.2 Input the historical information-enhanced features and the current frame, and obtain the structured information-enhanced features through the multi-head attention module;
[0200] This invention passes the features enhanced with historical information and the current frame through learnable linear mapping layers to obtain the inputs of the multi-head attention:
[0201]
[0202]
[0203]
[0204] The multi-head attention layer consists of 16 temporal attention sub-layers connected in parallel. The outputs of the sub-layers are concatenated and then linearly projected by an output matrix to re-integrate the spatial features:
[0205] $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_{16})\,W^{O}$
[0206] where the left-hand side represents the output of the multi-head attention layer, $\mathrm{MultiHead}(\cdot)$ indicates the multi-head attention computation, $\mathrm{Concat}(\cdot)$ indicates the concatenation operation, $W^{O}$ is the output projection matrix, and $\mathrm{head}_i$ indicates the output of the $i$-th temporal attention sub-layer in the multi-head attention layer. Similarly, the output of the multi-head attention layer is combined with its input through a residual connection and normalized to obtain the features with enhanced structured information;
[0207] S3.3 Input spatial information enhancement features, and through the feedforward neural network module, output structured information enhancement features;
[0208] Feedforward neural networks are often used for structured feature modeling. Fully connected layers perform linear projection and affine transformation, and a nonlinear activation function is introduced to enhance expressive power. The feedforward output is:
[0209]
[0210] The feedforward output is combined with its input through a residual connection and normalized to obtain the features with enhanced structured information;
[0211] S3.4 Input structured information to enhance features, and output historical aggregated features through the gated historical feature update module;
[0212] This invention takes into account the temporal continuity of the target state and employs a gating mechanism to guide the dynamic updating of historical features. Specifically, a gating vector performs weighted fusion between the input structured-information-enhanced features and the aggregated features of the previous frame to obtain the historical aggregated feature representation:
[0213]
[0214]
[0215] where the gating vector controls the weighted fusion ratio, the activation function produces the gate values, and the remaining symbol represents a learnable linear mapping. The result is written to the dynamic feature pool as the historical aggregated representation of the corresponding moment;
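A gated update of this kind can be sketched as follows. The specific form (a sigmoid gate computed from the concatenated features, then a convex combination) is a common choice assumed here for illustration; the parameter names `w` and `bias` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_update(current, history, w, bias):
    """Assumed form: g = sigmoid(W [current; history] + b); out = g*current + (1-g)*history."""
    joined = current + history  # concatenate the two feature vectors
    gate = [sigmoid(sum(wi * x for wi, x in zip(row, joined)) + b)
            for row, b in zip(w, bias)]
    return [g * c + (1.0 - g) * h for g, c, h in zip(gate, current, history)]

# With zero weights the gate is 0.5, so the update is a plain average.
zero_w = [[0.0] * 4 for _ in range(2)]
updated = gated_update([2.0, 4.0], [0.0, 0.0], zero_w, [0.0, 0.0])
```

A gate near 1 favors the current frame, while a gate near 0 preserves the accumulated history, which is how the fusion ratio adapts over time.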
[0216] S3.5 Input spatial features and historical aggregation features, and through the spatiotemporal feature fusion module, obtain the target perception after spatiotemporal attention enhancement, and complete the construction of the spatiotemporal autoregressive module;
[0217] In the spatiotemporal feature fusion stage, this invention applies a parameter-free dot product between the spatial feature representation extracted by the space-frequency collaborative backbone network and the historical aggregated features to adaptively amplify the discriminative response of the target region in the current frame. Specifically, the attention map is obtained by computing the dot product between the spatial feature output of the current space-frequency collaborative backbone network and the historical aggregated features:
[0218]
[0219] where the operator represents the parameter-free dot product. Finally, the attention map is used to weight all word features in the current frame to obtain the target perception representation enhanced by spatiotemporal attention.
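The parameter-free weighting can be sketched as below: each search-region feature is scored by its dot product with the historical features, and the scores reweight the features. Averaging over the historical features and normalizing with a softmax are assumptions made for illustration.

```python
import math

def attention_map(tokens, history):
    """Score each token by its mean dot product with the history, then normalize."""
    scores = [sum(sum(a * b for a, b in zip(tok, h)) for h in history) / len(history)
              for tok in tokens]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reweight(tokens, weights):
    """Scale every token feature vector by its attention weight."""
    return [[w * x for x in tok] for tok, w in zip(tokens, weights)]

tokens = [[1.0, 0.0], [0.0, 1.0]]
weights = attention_map(tokens, history=[[1.0, 0.0]])
enhanced = reweight(tokens, weights)
```

No learnable parameters are involved: the token that aligns with the history receives the larger weight.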
[0220] S4. Target localization. The enhanced features are reshaped into a two-dimensional spatial feature map, which is then input into a fully convolutional neural network to obtain the target's category probability distribution. Weighted focal loss and regression loss are used for target localization.
[0221] S4.1 Target perception based on spatiotemporal attention enhancement is reshaped into a two-dimensional spatial feature map and then input into a fully convolutional neural network to output the class probability distribution, bounding box and local offset.
[0222] In this invention, the weighted target-aware representation sequence is first reshaped into a two-dimensional spatial feature map and then input into a fully convolutional neural network consisting of multiple stacked convolutional layers. After processing by 12 stacked convolution, batch normalization, and nonlinear activation layers, the network outputs the category probability distribution, the normalized bounding box size, and the local offsets used to compensate for discretization errors caused by reduced resolution. The category probability distribution provides crucial information for subsequent accurate target identification and localization.
[0223] S4.2 Based on the category probability distribution, bounding box, and local offset, the overall loss function is constructed for training to obtain the total loss value and perform target localization;
[0224] The overall loss function includes the weighted focal loss, the regularization loss, and the generalized intersection-over-union loss;
[0225] During training, the weighted focal loss is used to optimize target prediction accuracy, and the regression loss improves the accuracy of bounding box localization. Specifically, the regression loss consists of the regularization loss and the generalized intersection-over-union loss. The final overall loss function can be represented as a weighted combination of these sub-loss functions:
[0226] $L_{total} = L_{focal} + \lambda_{GIoU} L_{GIoU} + \lambda_{reg} L_{reg}$
[0227] where $L_{focal}$ is the weighted focal loss, $L_{GIoU}$ is the generalized intersection-over-union loss, $L_{reg}$ is the regularization loss, and $\lambda_{GIoU}$ and $\lambda_{reg}$ are the preset generalized intersection-over-union loss parameter and the preset regularization parameter, used to adjust the relative weights between the loss terms;
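The weighted combination of loss terms can be sketched as follows for boxes given as (x1, y1, x2, y2). Treating the regularization term as an L1 distance between box coordinates, and the default weights 2.0 and 5.0, are assumptions for illustration only; the classification loss is passed in as a precomputed number.

```python
def giou(a, b):
    """Generalized intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def total_loss(cls_loss, pred, target, lam_giou=2.0, lam_reg=5.0):
    """cls_loss + lam_giou * (1 - GIoU) + lam_reg * L1 (weights are hypothetical)."""
    return cls_loss + lam_giou * (1.0 - giou(pred, target)) + lam_reg * l1_distance(pred, target)

loss = total_loss(0.5, (0.1, 0.1, 0.5, 0.5), (0.1, 0.1, 0.5, 0.5))
```

Unlike plain intersection over union, the GIoU term still provides a useful signal when the predicted and ground-truth boxes do not overlap.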
[0228] S4.3. Take the position with the highest probability in the category probability distribution as the target position, apply the local offset to locate and obtain the target size, generate the final bounding box, and complete the spatiotemporal autoregressive single target tracking method based on frequency selection.
[0229] During inference, this invention selects the position with the highest score in the category probability distribution as the position of the target.
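The decoding step can be sketched as below: find the highest-scoring cell, refine it with the local offset, and combine it with the predicted size. The map layouts (row-major grids holding a score, an (dx, dy) offset, and a normalized (w, h) size per cell) and the function name are illustrative assumptions.

```python
def decode_box(score_map, offset_map, size_map):
    """Return a normalized (x1, y1, x2, y2) box from the peak of the score map."""
    rows, cols = len(score_map), len(score_map[0])
    # Position with the highest classification score.
    best_r, best_c = max(((r, c) for r in range(rows) for c in range(cols)),
                         key=lambda rc: score_map[rc[0]][rc[1]])
    dx, dy = offset_map[best_r][best_c]  # compensates for the reduced resolution
    w, h = size_map[best_r][best_c]      # normalized box size
    cx = (best_c + dx) / cols
    cy = (best_r + dy) / rows
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

scores = [[0.1, 0.2], [0.9, 0.3]]
offsets = [[(0.5, 0.5)] * 2 for _ in range(2)]
sizes = [[(0.5, 0.5)] * 2 for _ in range(2)]
box = decode_box(scores, offsets, sizes)
```

Multiplying the normalized box by the search region size would give pixel coordinates.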
[0230] Based on the detailed implementation description, the invention will be further illustrated through experiments. The data used, computer configuration, and experimental results are as follows:
[0231] 1. Experimental Data: This invention uses the OTMJ dataset as the benchmark dataset for experimental verification. The OTMJ dataset (released in 2024) is a comprehensive mountain jungle target tracking dataset containing 24 carefully labeled visible light target tracking sequences. These sequences were collected by drones, covering mountain jungle scenes from different times, locations, and altitudes, totaling more than 1200 video frames with a resolution of 640×512.
[0232] 2. Experimental Setup: The experimental part of this invention was implemented using the PyTorch 2.1 framework, with CUDA version 11.8 selected. Model training was performed on eight NVIDIA 4090 GPUs, while model testing was completed on a single NVIDIA 4060ti GPU. During training, the AdamW optimizer was used, combined with the cosine annealing strategy in the OneCycleLR learning rate scheduler to dynamically adjust the learning rate. Training was conducted on four classic visible light image datasets (GOT-10K, LaSOT, COCO, and TrackingNet), with 300 training epochs, an initial learning rate of 0.0004, and weight decay of 0.0001. During training, the batch size was 100, and after 240 epochs, the learning rate was reduced to one-tenth of its original value. The training set configuration followed the official recommendations. This invention uses the area under the success rate curve (AUC) and precision as evaluation metrics, with AUC being the primary evaluation metric.
[0233] To demonstrate the effectiveness of this method, five classic single-target tracking methods were selected for comparative experiments: OSTrack (ECCV 2022), SiamTPN (WACV 2022), AVTrack (ICML 2024), EVPTrack (AAAI 2024), and SUTrack-224 (AAAI 2025).
[0234] 3. Experimental Results: Following the steps outlined above, experiments were conducted on the OTMJ dataset to validate the predictions. The results are shown in Table 1. Visualizations of the original images, ground truth, and prediction results for part of the data are shown in Figure 2.
[0235] Table 1: Prediction results of the OTMJ dataset
[0236]
[0237] In the above prediction results, all indicators of this invention are the optimal values.
[0238] As can be seen from Table 1, the area under the success rate curve and the precision of this invention on the OTMJ dataset both reach the optimal values, which indicates that this method identifies the spatial features of targets in mountain jungle scenes more effectively than existing technologies. Figure 2 shows the prediction results of the present invention, where t is the current time, and t+1, t+2, t+3, t+4 and t+5 represent consecutive video frames. It illustrates that the present invention can cope with the complex and changing environment of mountain jungle scenes, thereby improving the accuracy and stability of target tracking in such scenes.
[0239] The specific embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those skilled in the art, various changes can be made without departing from the spirit of the present invention.
Claims
1. A spatiotemporal autoregressive single-target tracking method based on frequency selection, characterized in that it includes the following steps: S1. Based on a public dataset, extract the image pair data of the template frame and the search frame, and preprocess the image pair data to generate a template frame one-dimensional sequence and a search frame one-dimensional sequence; after linear mapping to a preset dimension, embed one-dimensional positional encodings to form a template word sequence and a search region word sequence; concatenate the template word sequence and the search region word sequence in order to generate a concatenated word sequence; S2. Based on the concatenated word sequence and the Vision Mamba model, output spatial features by constructing a space-frequency collaborative backbone network; the space-frequency collaborative backbone network includes a preset number of cascaded space-frequency coordination modules; the space-frequency coordination module includes a global representation modeling module and a local feature enhancement module; S2.1 Based on the concatenated word sequence and the Vision Mamba model, define the space-frequency collaborative backbone network; the expression for the space-frequency collaborative backbone network is: ; in the formula, the operators denote a one-dimensional convolution operation and the feature modeling process of each layer's space-frequency coordination module in the backbone network, and the input is the concatenated word sequence; S2.2 In the defined space-frequency collaborative backbone network, the output of the previous layer is taken as the output features of the previous layer; the output features of the previous layer are input simultaneously into the parallel bidirectional state-space model (SSM) module branch, the global selective frequency enhancement module branch, and the global adaptive weight fusion module branch for processing; after the global features, global enhancement features, and global weight features are output, they are added element by element to obtain the global representation features, thus completing the construction of the global representation modeling module; S2.3 Based on the global representation features, through the parallel hybrid gating module branch, local selective frequency enhancement module branch, and local adaptive weight fusion module branch, the local enhancement features, local augmentation features, and local weight features are output and then added element by element to obtain the spatial features, thus completing the construction of the space-frequency coordination module; S3. Based on the spatial features, a spatiotemporal autoregressive network is constructed to output the target perception enhanced by spatiotemporal attention; the spatiotemporal autoregressive network includes a preset number of stacked spatiotemporal autoregressive modules; the spatiotemporal autoregressive module includes a temporal attention module, a multi-head attention module, a feedforward neural network module, a gated historical feature update module, and a spatiotemporal feature fusion module; S3.1 After setting the current frame as the spatial features, obtain the features enhanced with historical information through the temporal attention module; S3.2 Input the features enhanced with historical information and the current frame, and obtain the structured-information-enhanced features through the multi-head attention module; S3.3 Input the spatial-information-enhanced features, and output the structured-information-enhanced features through the feedforward neural network module; S3.4 Input the structured-information-enhanced features, and output the historical aggregated features through the gated historical feature update module; S3.5 Input the spatial features and the historical aggregated features, and through the spatiotemporal feature fusion module obtain the target perception enhanced by spatiotemporal attention, completing the construction of the spatiotemporal autoregressive module; S4. The target perception enhanced by spatiotemporal attention is reshaped into a two-dimensional spatial feature map and then input into a fully convolutional neural network, which outputs the category probability distribution, bounding box, and local offset; an overall loss function is then constructed for training to obtain the total loss value and perform target localization, thus completing the spatiotemporal autoregressive single-target tracking method based on frequency selection.
2. The spatiotemporal autoregressive single-target tracking method based on frequency selection according to claim 1, characterized in that S1 specifically includes the following steps: S1.1 Based on the public dataset, extract the image pair data of the template frame and the search frame, whose dimensions are given by the height and width of the template frame image and the height and width of the search frame, respectively; after preprocessing the template frame and the search frame, a template frame one-dimensional sequence and a search frame one-dimensional sequence are obtained; the preprocessing operation specifically involves: dividing the image into blocks according to preset pixels, and flattening the divided image blocks into the template frame one-dimensional sequence and the search frame one-dimensional sequence, whose shapes are determined by the preset resolution of each image patch and by the preset numbers of blocks for the template frame and the search frame, respectively; S1.2 After mapping the template frame one-dimensional sequence and the search frame one-dimensional sequence to a preset dimension through a trainable linear mapping layer, one-dimensional positional encodings are embedded respectively to generate the template word sequence and the search region word sequence; the expressions for generating the template word sequence and the search region word sequence are as follows: ; ; in the formula, the symbols denote the template word sequence, the search region word sequence, the preset linear mapping parameters, the learnable one-dimensional positional encoding of the template, and the learnable one-dimensional positional encoding of the search region; S1.3 Concatenate the template word sequence and the search region word sequence in order to generate the concatenated word sequence; the expression for generating the concatenated word sequence is as follows: ; in the formula, the result is the concatenated word sequence.
3. The spatiotemporal autoregressive single-target tracking method based on frequency selection according to claim 2, characterized in that, S2.2 specifically includes the following steps: S2.2.1 Constructing the Bidirectional State-Space Model (SSM) Module Branch: Input the output features of the previous layer, and after performing normalization and linear projection operations, obtain the value vector and query vector; input the value vector into the Bidirectional State-Space Model (SSM) module, and then perform linear projection to generate the input matrix, output matrix, and time scale parameters; initialize the state transition matrix based on the high-order polynomial projection operator matrix, and after discretizing the state transition matrix and output matrix, perform SSM forward and backward calculations to obtain the global features; The expression for obtaining the global features is as follows: ; In the formula, As a global feature, Forward features It is a backward feature; Linear layer; S2.2.2 Constructing the global selective frequency enhancement module branch: After splitting the output features of the previous layer into template word sequences and search region word sequences, input them into the global selective frequency enhancement module respectively to obtain template enhanced features and search enhanced features. Then, they are concatenated to output the global enhanced features. The specific calculation process for outputting the global enhanced features is expressed as follows: ; ; ; in, For global template word sequence, For the global search region, word sequence Represents the activation function of the Gaussian error linear unit. This represents a two-dimensional Fast Fourier Transform. This represents the inverse two-dimensional fast Fourier transform. 
The remaining symbols denote the global enhancement features, the template enhancement features, and the search enhancement features; S2.2.3 Constructing the global adaptive weight fusion module branch: the output features of the previous layer are weighted by a preset global learnable scaling factor to obtain the global weight features; the expression for obtaining the global weight features is as follows: ; in the formula, the symbols denote the global weight features and the preset global learnable scaling factor; S2.2.4 Based on the global features, global enhancement features, and global weight features, element-wise summation is performed to obtain the global representation features; the calculation expression for the global representation features is as follows: ; in the formula, the result denotes the global representation features.
4. The spatiotemporal autoregressive single-target tracking method based on frequency selection according to claim 2, characterized in that, S2.3 specifically includes the following steps: S2.3.1 Constructing the hybrid gating module branches: After normalizing the global representation features and expanding the dimensions by one-dimensional convolution channels, the features are split into first features and second features by halving the channel dimensions. These features are then input into the two branches of the hybrid gating module, which output local features and local gating weights respectively. The local features and local gating weights are then gated and fused to obtain local enhanced features. The two branches of the hybrid gating module specifically include: a local feature extraction branch constructed from one-dimensional convolution, depthwise convolution, and channel attention, and a gating weight generation branch constructed from linear projection, Gaussian error, and linear unit activation function; The expression for the local enhancement feature is as follows: ; in, It is a local enhancement feature. 
For channel attention operations, For depthwise convolution operations, As the first feature, As the second feature, Represents the activation function of the Gaussian error linear unit; For element-wise multiplication; S2.3.2 Constructing the Local Selective Frequency Enhancement Module Branch: After splitting the global representation features into local template features and local search features, input them into the local selective frequency enhancement module to obtain local template frequency enhancement features and local search frequency enhancement features; concatenate the local template frequency enhancement features and local search frequency enhancement features to output the local enhancement features; The specific calculation expression for outputting local enhanced features is as follows: ; ; ; In the formula, This is a local template frequency enhancement feature. This is a feature to enhance the local search frequency. This represents a two-dimensional Fast Fourier Transform. This represents a one-dimensional convolution operation. This is a local enhancement feature; Local template features; For local search features; S2.3.3, Constructing the Local Adaptive Weight Fusion Module Branch: By pre-setting a local learnable scaling factor, the global representation features are weighted to obtain local weight features; The expression for obtaining the local weighted features is as follows: ; In the formula, Local weight features To preset a locally learnable scaling factor, As a global representation feature; S2.3.
4. Based on local enhancement features, local augmentation features, and local weight features, element-wise summation is performed to obtain spatial features, thus completing the construction of the space-frequency coordination module; The expression for the spatial features is as follows: ; In the formula, It is a spatial feature.
5. The spatiotemporal autoregressive single-target tracking method based on frequency selection according to claim 1, characterized in that, S3 specifically includes the following steps: Let the current time be Then the historical aggregation feature is denoted as , will the current frame After setting as spatial features First, and The keys are obtained through learnable linear mapping layers. Query Sum The expression is as follows: ; ; ; Then, using keys, values, and queries as inputs, temporal attention modeling is performed to obtain intermediate spatial features. The expression is as follows: ; in, This represents the attention calculation function; Represents the normalized exponential function, This indicates the transpose operation. Represent the dimension of the key vector; then... and After performing residual connections and normalization, we obtain features enhanced with historical information. ; Will and The input to the multi-head attention layer is obtained through a learnable linear mapping layer. ): ; ; ; The multi-head attention layer is composed of The system consists of several time-attention sub-layers connected in parallel. The outputs of each sub-layer are concatenated and then mapped to a single output matrix. Linear projection is used to re-integrate spatial features, as expressed below: ; in, This represents the output of the multi-head attention layer. This indicates multi-head attention computation. This indicates a splicing operation. Indicating the first in the multi-head attention layer The output of each time attention sublayer; and After performing residual connections and normalization, we obtain features with enhanced structured information. ; A fully connected layer is used to perform linear projection and affine transformation, and a nonlinear activation function is introduced to enhance expressive power. The output of the feedforward neural network... Defined as: ; In the formula, It is a fully connected layer. 
Represents the activation function of the Gaussian error linear unit; Then and After performing residual connections and normalization, we obtain features with enhanced structured information. ; Will Aggregation features with the previous frame We perform weighted fusion to obtain the historical aggregated feature representation: ; ; in, Represents the gate vector, express Activation function Represents a learnable linear mapping. It will be updated to the dynamic feature pool as Historical aggregation representation of moments; Attention maps are obtained by calculating spatial features and historical aggregation features using dot products. The calculation expression is as follows: ; in, Represents the parameterless dot product. The preset number of blocks for the search frame. For historical characteristics quantity; then We weight the feature vectors of all search regions in the current frame to obtain the target perception representation after spatiotemporal attention enhancement. .
6. The spatiotemporal autoregressive single-target tracking method based on frequency selection according to claim 1, characterized in that S4 specifically includes the following steps: S4.1 The target perception enhanced by spatiotemporal attention is reshaped into a two-dimensional spatial feature map and then input into a fully convolutional neural network to output the category probability distribution, bounding box, and local offset; S4.2 Based on the category probability distribution, bounding box, and local offset, an overall loss function is constructed for training to obtain the total loss value and perform target localization; S4.3 Taking the position with the highest probability in the category probability distribution as the target position, the local offset is applied to locate the target and obtain the target size, generating the final bounding box, thus completing the spatiotemporal autoregressive single-target tracking method based on frequency selection.
7. The spatiotemporal autoregressive single-target tracking method based on frequency selection according to claim 6, characterized in that, in S4.2, the expression of the overall loss function is: ; in the formula, the terms are the weighted focal loss, the generalized intersection-over-union loss, and the regularization loss, and the two coefficients are the preset generalized intersection-over-union loss parameter and the preset regularization parameter.