A causal constraint feature dimension linear modulation method and a target tracking method
By introducing a causal constraint feature dimension linear modulation method into visual target tracking and explicitly injecting historical distribution priors, the invention addresses the insufficient robustness of feature extraction networks under low resolution and resource constraints, improves feature recognition and anti-drift capabilities, and achieves more stable target tracking results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NORTHWESTERN POLYTECHNICAL UNIV
- Filing Date
- 2026-01-19
- Publication Date
- 2026-06-19

Figure CN122244092A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of visual target tracking technology, and particularly relates to a causal constraint feature dimension linear modulation method and a target tracking method.
Background Technology
[0002] The task of visual target tracking is to continuously output the position of an arbitrary target in subsequent frames after the first frame of a video sequence has been given. In real-world scenarios, targets often undergo deformation, rotation, scale changes, illumination changes, occlusion, and the appearance of similar interfering objects, causing the target's appearance to change significantly over time, which in turn leads to tracking drift or target loss.
[0003] In engineering applications, to meet constraints of real-time performance, power consumption, and cost, tracking systems often need to operate under resource-constrained conditions, such as limited computing power or video memory on edge devices, or the need to process longer video sequences. Under these conditions, the input image resolution is often reduced to decrease computational load and video memory usage. However, low resolution leads to loss of target details, weakened texture information, and blurred boundaries, making feature extraction networks more susceptible to background and similar interference, resulting in decreased target discrimination ability. This is especially true in scenarios involving long sequences, strong interference, and occlusion recovery, where cumulative errors and drift are prone to occur.
[0004] Existing tracking methods for changes in target appearance are typically spatiotemporal information propagation / cue methods. These methods utilize historical context by propagating tokens or cue information between frames to reduce the instability caused by discrete threshold updates. However, these methods often rely on deep attention interactions for temporal fusion, and may still face problems of insufficient robustness or computational overhead under resource-constrained or low-resolution conditions.
[0005] Therefore, there is a need for a target tracking technology that can introduce historical priors and suppress error accumulation in the early stages of feature extraction without significantly increasing computational and memory overhead, so as to improve robustness and stability in complex scenarios.
Summary of the Invention
[0006] The purpose of this invention is to provide a linear modulation method for causal constraint feature dimensions and a target tracking method. By adding shallow feature processing, the method enhances the adaptability and robustness to complex appearance changes such as deformation, occlusion recovery and interference from similar targets.
[0007] This invention adopts the following technical solution: a linear modulation method for causal constraint feature dimensions, comprising the following steps: obtaining the search region token sequence output by the backbone network based on the search image of the current frame; injecting, based on the historical conditional modulation method, the historical context vector of the historical memory queue into the search region token sequence at the channel level to generate the modulated search region token sequence, wherein the historical memory queue includes several search region token sequences that are adjacent to the current frame and located before the current frame; and concatenating the modulated search region token sequence with the template token sequence output by the backbone network to form a complete token sequence, which is then fed into the deep attention encoder of the backbone network to extract deep features of the search image in the current frame.
[0008] Another technical solution of the present invention: a target tracking method, including the above-mentioned linear modulation method of causal constraint feature dimension.
[0009] Another technical solution of the present invention: a linear modulation device for causal constraint feature dimension, comprising a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor implements the above-described method when executing the computer program.
[0010] The beneficial effects of this invention are: by explicitly injecting historical distribution priors in the shallow feature extraction stage through CausalFiLM, this invention enables the model to obtain effective constraints on the evolution of historical appearance before entering the deep interaction module, thereby improving feature recognition and anti-drift capabilities under conditions such as interference from similar targets, occlusion recovery, and low-resolution input.
Attached Figure Description
[0011] Figure 1 is a schematic diagram of a linear modulation method for causal constraint feature dimensions according to an embodiment of the present invention.
Detailed Implementation
[0012] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.
[0013] In recent years, high-resolution image input versions of visual trackers have often outperformed low-resolution image input versions of the same methods. However, current mainstream visual target tracking algorithms are largely Transformer-based trackers, whose attention cost grows quadratically with the length of the input token sequence. High-resolution image input therefore comes with higher computational cost and lower efficiency. How to effectively utilize historical information, enhance feature discriminative power, and suppress error accumulation under constraints such as low resolution or limited resources is thus a crucial problem that current tracking technologies need to address.
[0014] In existing technologies, methods such as SiamFC proposed by Bertinetto et al. and subsequent methods such as SiamRPN, although they utilize depth features for matching, only use the first frame as a fixed template. No matter how the target is deformed in subsequent videos, they cannot introduce new appearance information, resulting in limited long-term tracking performance.
[0015] While STARK proposed by Yan et al. and MixFormer proposed by Cui et al. introduce dynamic template mechanisms, they rely on designing additional prediction head networks (such as score prediction branches) to evaluate the quality of the current frame and setting a fixed threshold to decide whether to replace the old template with the current frame. This discrete "update" operation not only disrupts the continuity of spatiotemporal information but also heavily depends on the tuning of hyperparameters.
[0016] Similarly, STMTrack proposed by Fu et al. introduces a complex memory network to store historical information. Although it utilizes spatiotemporal context, it leads to increased complexity in the model structure and reduced running efficiency.
[0017] The EVPTrack proposed by Shi et al. is most similar to this method. It successfully avoids the decision-making problem in traditional dynamic template updates by introducing explicit visual cues, but its temporal information fusion mainly relies on the token interaction process of deep Transformer.
[0018] Under resource-constrained or low-resolution input conditions, the decrease in discriminative power due to insufficient shallow feature details is more pronounced. Without shallow conditional enhancement mechanisms for historical states, the model may still exhibit insufficient robustness in scenarios with similar interference, long-term occlusion, and error accumulation. Furthermore, the computational and memory overhead of cue fusion and token interaction imposes resource constraints during long-sequence training or edge deployment.
[0019] Shallow features generally refer to a set of feature maps (or token maps) that retain high spatial resolution even after the backbone network has undergone one downsampling / patch merging process. Their primary information type typically falls between pure low-level texture and high-level semantics.
[0020] (1) Combination of local shapes and contours: edges / corner points have been aggregated into more stable contours and local geometry.
[0021] (2) Part-level patterns: such as a part of the structure of the target (wheel / head outline / clothing local texture block), but usually not yet to the stable category semantics.
[0022] (3) Stronger invariance but still retains localization details: more resistant to illumination / slight deformation / noise than stage 1; retains more spatial details than stage 3, suitable for fine localization. (Here, "stage" refers to the three stages of the backbone network HiViT. The first two stages mainly feature shallow features, while the latter stage 3 is a deep attention layer.)
[0023] The technical disadvantages of shallow features in existing technologies: (1) The shallow feature stage lacks historical prior constraints.
[0024] Existing cue-based target tracking methods primarily utilize temporal information through deep token interactions and attention fusion. However, they often lack explicit conditionalization mechanisms for historical states in the shallow stages of feature extraction networks. Consequently, in scenarios with low-resolution input, similar interference, or occlusion recovery, shallow representations are easily affected by background textures and noise, resulting in insufficient target discriminability.
[0025] (2) Shallow time series modeling has a high engineering cost.
[0026] Existing solutions for shallow temporal augmentation often rely on additional temporal attention, memory networks, or multi-branch interaction modules, which can easily lead to increased computational load, memory usage, and inference latency. This makes it difficult to simultaneously achieve real-time performance and long-sequence information mining under resource-constrained deployment or long-sequence training conditions.
[0027] (3) The use of historical information lacks a continuous time window mechanism.
[0028] Existing tracking methods, regardless of whether they employ explicit online template updates or implicit memory writing / spatiotemporal token propagation, lack sufficient ability to model and constrain historical appearance information. For example, online template updates rely on confidence thresholds or scoring branches for discrete replacements, making it difficult for historical information to form a continuous and accumulative trajectory of appearance evolution. While implicit memory writing or token propagation methods can avoid explicit threshold settings, their historical information is often mixed in the form of deep attention interactions, lacking explicit constraints on maintaining the historical context. This can lead to historical priors being less effective in long sequences and increasing the risk of feature drift and tracking instability under conditions such as occlusion, similar interference, or sudden appearance changes.
[0029] (4) Feature modeling lacks adaptive adjustment capabilities based on historical conditions.
[0030] Existing feature extraction networks rely on deep attention to dynamically model the temporal evolution of target appearance, but lack the ability to dynamically adjust to historical states. They are unable to adaptively enhance or suppress features in the current frame based on changes in historical appearance distribution, thus limiting their robustness in scenarios with drastic changes in appearance and interference from similar targets.
[0031] To address the aforementioned shortcomings, the technical problems that this invention aims to solve include: (1) How to explicitly inject historical spatiotemporal priors into the shallow stage of feature extraction without significantly increasing computational load and memory usage, so as to improve feature discriminability in low-resolution and strong interference scenarios; (2) How to achieve shallow temporal information injection in a lightweight manner, so as to take into account both real-time reasoning and long temporal context utilization; (3) How to maintain a stable and controllable target appearance history window and standardize the statistical method and writing timing of historical context, so as to improve the encoder's effective utilization of historical appearance context temporal information; (4) How to enable the feature extraction network to adaptively adjust the current frame channel response according to the historical state, so as to enhance the adaptability to target appearance evolution.
[0032] This invention discloses a linear modulation method for causal constraint feature dimensions. As shown in Figure 1, the process includes the following steps: obtaining the search region token sequence output by the backbone network based on the search image of the current frame; injecting the historical context vector of the historical memory queue into the search region token sequence at the channel level based on the historical conditional modulation method to generate the modulated search region token sequence, wherein the historical memory queue includes several search region token sequences that are adjacent to the current frame and located before the current frame; and concatenating the modulated search region token sequence with the template token sequence output by the backbone network to form a complete token sequence, and feeding it into the deep attention encoder of the backbone network to perform deep feature extraction of the search image of the current frame.
[0033] This invention explicitly injects historical distribution priors into the shallow feature extraction stage through CausalFiLM, enabling the model to obtain effective constraints on the evolution of historical appearances before entering the deep interaction module. This improves feature recognition and anti-drift capabilities under conditions such as interference from similar targets, occlusion recovery, and low-resolution input.
[0034] This invention proposes a cue-based target tracking method and system that incorporates a causal constraint-based linear modulation mechanism (CausalFiLM). The method integrates this modulation mechanism into existing explicit visual cue tracking architectures, forming an improved tracker, CFPTrack (Causal FiLM Prompted Tracker). Generally, existing explicit visual cue tracking architectures mainly include: a backbone network for extracting and encoding template and search region features, a cue encoder for generating explicit cues, and a prediction head for outputting the target location.
[0035] Without altering the existing main processes of cue construction, cue propagation, and localization prediction, CFPTrack inserts a Causal FiLM mechanism branch between the shallow feature extraction stage and the deep interaction stage of the backbone network. This branch modulates the shallow search features of the current frame with historical conditions, allowing historical appearance priors to be explicitly injected before entering deep interaction. This improves localization stability in complex interference scenarios while also ensuring engineering deployability under low resolution and resource constraints.
[0036] In target tracking tasks, the search region of the current frame is a local region determined based on the target state of the previous frame. The aim is to narrow the target search range, reduce computational complexity, and avoid background interference. Its core logic is to utilize the continuity of target motion (smoothness assumption) to limit the region where the target may appear in the current frame. The main methods for obtaining this information include: fixed expansion based on the target bounding box of the previous frame, adaptive expansion based on motion prediction, and dynamic adjustment based on target state feedback.
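As a minimal sketch of the first option above (fixed expansion around the previous bounding box), the following Python function builds a square search window centred on the previous target. The function name, the geometric-mean sizing rule, and the expansion factor of 4 are illustrative assumptions and are not specified in this description.

```python
from typing import Tuple


def search_region(
    prev_box: Tuple[float, float, float, float],  # (cx, cy, w, h) of the previous target
    factor: float,                                 # hypothetical expansion factor
    img_w: int,
    img_h: int,
) -> Tuple[float, float, float, float]:
    """Return a search window (x0, y0, x1, y1) centred on the previous target."""
    cx, cy, w, h = prev_box
    # Side length: expand the geometric-mean size of the previous box by `factor`.
    side = factor * (w * h) ** 0.5
    half = side / 2.0
    # Clip to the image so the crop never leaves the frame.
    x0 = max(0.0, cx - half)
    y0 = max(0.0, cy - half)
    x1 = min(float(img_w), cx + half)
    y1 = min(float(img_h), cy + half)
    return (x0, y0, x1, y1)


region = search_region((100.0, 80.0, 40.0, 10.0), factor=4.0, img_w=640, img_h=480)
```

A larger factor tolerates faster motion at the cost of admitting more background into the search image.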
[0037] Specifically, for the search image of the current t-th frame, shallow search feature token sequences are first extracted through Patch Embedding and the first few layers of the backbone network (stages 1 and 2). Meanwhile, the system maintains a historical memory queue, Mem, updated in a first-in, first-out manner, which stores the shallow search region feature token sequences of the most recent frames. When performing tracking and inference on the current frame, the shallow features of the most recent frames are read from the historical memory queue and statistically aggregated by the historical spatiotemporal context encoding to obtain a conditional vector representing the historical distribution. This conditional vector, which carries rich temporal information, serves as the key input to the historical conditional modulation unit: based on the historical appearance evolution of the target, it is used to modulate the shallow search region tokens of the current frame channel by channel, yielding the modulated shallow search features. This allows historical priors to directly affect the shallow representation of the current frame at low overhead, improving the suppression of feature instability caused by factors such as similar interference, occlusion, and appearance evolution.
[0038] After shallow modulation is completed, the modulated tokens are concatenated with the explicit visual cues (including multi-scale cues and spatiotemporal cues) to obtain a complete token input sequence. This sequence is then fed into the deep attention encoder, which completes feature interaction and fusion across tokens and outputs a high-level semantic feature representation for localization. Finally, the prediction head outputs the target position of the current frame.
[0039] CausalFiLM (Causal FiLM) mechanism and parameter generation based on causal constraints.
[0040] Existing technologies mostly rely on deep attention interaction modules for implicit temporal modeling, and rarely explicitly introduce historical distribution priors in the shallow feature stage. This leads to feature drift and unstable localization in complex scenarios such as low-resolution input, similar interference, and occlusion recovery.
[0041] To address this limitation, this invention proposes a feature dimension linear modulation mechanism based on causal constraints (CausalFiLM), which aims to construct a simple and efficient historical condition injection mechanism in the shallow feature space, enabling subsequent feature interactions and modeling to be effectively iterated and propagated based on the injected historical appearance prior feature representation.
[0042] In its implementation, this mechanism consists of three cooperating functional units: first, a history memory queue, which maintains a finite-length set of shallow historical features and provides a readable historical window; second, a History Spatio-Temporal Context Encoding (STCE) module, which performs spatio-temporal statistical aggregation and lightweight encoding on the features in the history memory queue to generate channel-level history context vectors; and third, a History-Conditioned Modulation (HCM) unit, which generates channel modulation parameters based on the historical context and performs linear modulation of the feature dimension on the shallow feature token of the current frame, thereby explicitly injecting the historical appearance prior before entering the deep interaction module. These three units together form a unidirectional information flow that satisfies temporal and causal constraints. The history memory and its encoding results are used to modulate the current frame, and the history memory is updated after the current frame modulation and inference are completed, thus achieving stable temporal injection under causal constraints.
[0043] (1) Modulation position and scope of action.
[0044] In this invention, a causal constraint-based linear feature dimension modulation mechanism (CausalFiLM) is used to perform historical conditional modulation on the shallow feature representation output by the backbone network. Therefore, the modulation position of this mechanism is set after the shallow encoding stage of the backbone network and before entering the deep feature interaction module, specifically between the output of stage 2 of the backbone network and the deep attention encoder (stage 3). This position is chosen because, at this point, the shallow features have completed preliminary structured encoding, still retaining sufficient local appearance and texture information, and the feature representation has a certain degree of stability, making it suitable as a feature carrier for historical appearance statistics and prior injection. Completing historical conditional modulation before entering deep attention interaction allows subsequent multi-layer attention interactions to iteratively propagate based on the injected historical prior representation, thereby enhancing the continuous utilization and robustness of cross-frame information without significantly increasing the computational complexity of attention.
[0045] This invention proposes a feature dimension linear modulation mechanism set in the shallow stage of the backbone network and located before the deep interaction module. Through channel-level linear modulation driven by historical context, the shallow features of the current frame are recalibrated at the channel level with historical conditions before entering the deep encoder, realizing the explicit injection of historical appearance distribution prior and improving tracking stability in low-resolution and complex interference scenarios.
[0046] (2) Historical memory queue management.
[0047] To preserve prior information about historical appearances, this invention establishes a historical memory queue that maintains a time window of target appearance information. The queue caches the shallow feature maps of the last T frames, denoted as:
$$\mathrm{Mem}_t = \{F_{t-T}, F_{t-T+1}, \dots, F_{t-1}\}, \quad F_i \in \mathbb{R}^{B \times C \times H \times W} \tag{1}$$
where $F_i$ is the shallow token feature sequence generated from the search region image of the $i$-th frame after shallow feature extraction in stages 1 and 2 of the backbone network. The queue supports writing the shallow features of the current frame, maintaining queue elements in chronological order, resetting during sequence initialization or restart, and providing historical context information to the historical context encoding module. B is the batch size, i.e., the number of samples in a mini-batch during training; C is the hidden dimension (i.e., the number of channels) of the shallow feature tokens; and H and W are the height and width of the token map.
[0048] Regarding the management of the historical memory queue, this invention adopts a fixed capacity setting and follows a first-in, first-out (FIFO) update principle. When the queue reaches its capacity, newly written features replace the earliest written features and are placed at the end of the queue, so that the historical context always reflects the most recent appearance distribution while preserving temporal order. To reduce training instability and memory overhead caused by cross-frame backpropagation, and to make the historical spatiotemporal context encoding branch a conditional statistic rather than an explicit temporal backpropagation link, gradient truncation is applied during training to the shallow features written to the queue and to their subsequent statistical aggregation, so that the historical context acts as stable conditional prior information during the training phase. Furthermore, the timing control of writing to the historical memory queue (i.e., a historical context management strategy based on causal relationships) will be further explained as a key mechanism and experimentally verified in subsequent sections.
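The fixed-capacity FIFO behaviour described above can be sketched with Python's `collections.deque`. The class name `HistoryMemory` and its methods are hypothetical helpers, not names from this description, and copying the array stands in for gradient truncation, which in an autodiff framework would be a detach or stop-gradient operation.

```python
from collections import deque

import numpy as np


class HistoryMemory:
    """Fixed-capacity FIFO buffer of shallow feature maps (hypothetical helper)."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self._frames = deque(maxlen=capacity)

    def reset(self) -> None:
        """Clear the window at sequence initialisation or restart."""
        self._frames.clear()

    def write(self, feat: np.ndarray) -> None:
        """Append a (B, C, H, W) feature map as an independent copy."""
        # The copy stands in for gradient truncation (assumption).
        self._frames.append(np.array(feat, copy=True))

    def read(self) -> np.ndarray:
        """Return the stored frames stacked as (T', B, C, H, W), oldest first."""
        return np.stack(list(self._frames), axis=0)

    def __len__(self) -> int:
        return len(self._frames)


mem = HistoryMemory(capacity=3)
for i in range(5):
    mem.write(np.full((1, 2, 4, 4), float(i)))
window = mem.read()  # holds frames 2, 3, 4; frames 0 and 1 were evicted
```

Note that `read()` on an empty queue raises an error in this sketch, so a caller needs its own first-frame handling.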
[0049] For historical memory update mechanisms, sampling / writing strategies based on thresholds or quality scores can also be adopted. For example, the current frame can be written into historical memory only when the classification confidence is higher than a threshold or a certain quality indicator meets the condition, in order to reduce the pollution of abnormal frames. However, such strategies usually rely on manually set thresholds and hyperparameters, which are highly sensitive to different scenarios and data distributions. Furthermore, key frames may be missed or misselected under conditions such as target occlusion, rapid movement, or similar interference, thus affecting the integrity and stability of the historical context. The engineering implementation conditions are also more stringent.
[0050] (3) History Spatio-Temporal Context Encoding (STCE).
[0051] This invention proposes a Historical Spatiotemporal Context Encoding (STCE) module, which works with a fixed-length historical memory queue to form a controllable historical window. It performs spatiotemporal statistical aggregation and lightweight encoding on shallow historical features to generate channel-level historical context vectors, compressing multi-frame historical information into a conditional representation usable for modulation, thus achieving low-overhead historical context construction. Regarding the maintenance of historical memory, the following management strategy is adopted: first, the modulation and localization inference of the current frame are completed based on the existing historical memory; only then are the shallow features of the current frame written into the historical memory queue according to the first-in, first-out (FIFO) rule. This ensures that the current frame does not participate in the statistics that generate its own modulation parameters, standardizing the statistical method and writing timing of historical context.
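The read-then-write ordering can be illustrated with a short sketch. The function `track_step` is hypothetical, the modulation and localization steps are elided, and the zero statistic returned when no history exists yet is an assumed fallback that this description does not specify.

```python
from collections import deque

import numpy as np


def track_step(queue: deque, feat_t: np.ndarray) -> np.ndarray:
    """Return the channel statistic used for frame t, then write frame t."""
    if queue:
        # The statistic is computed only from frames strictly before t.
        stat = np.stack(list(queue), axis=0).mean(axis=(0, 3, 4))  # (B, C)
    else:
        # Assumed fallback for the first frame: no history, zero statistic.
        stat = np.zeros(feat_t.shape[:2])
    # ... modulation and localisation of frame t would happen here ...
    queue.append(feat_t.copy())  # written only after inference on frame t
    return stat


queue = deque(maxlen=2)
stats = [track_step(queue, np.full((1, 1, 2, 2), float(t))) for t in range(4)]
values = [float(s[0, 0]) for s in stats]  # [0.0, 0.0, 0.5, 1.5]
```

With constant-valued frames 0, 1, 2, 3 and a window of two, the statistic seen by frame 3 is the mean of frames 1 and 2 and never includes frame 3 itself.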
[0052] The deep feature interaction and modeling module of the cue-based tracking framework encoder typically consists of multiple attention blocks. Explicitly introducing cross-frame interaction at this stage presents practical challenges such as selecting the insertion layer, defining the token scope, and a significant increase in computational and memory load. Furthermore, cross-frame information is easily weakened or buried during multiple interactions within the deep attention layers. Based on these considerations, this invention injects historical context information after shallow feature extraction by the encoder and before deep attention, completing cross-frame modulation at that point so that temporal consistency is not compromised by subsequent complex interactions. Specifically, after shallow feature extraction, the historical appearance prior is encoded into a compact context vector, which is then used to conditionally modulate the shallow token sequence of the current frame. This allows the subsequent deep attention layers to interact and localize on representations that are already constrained by history.
[0053] To this end, a historical spatiotemporal context encoder (STCE) is designed, which maps the historical memory queue $\mathrm{Mem}_t$ to a context vector containing historical priors. The generation of the historical context vector can be abstractly represented as:
$$c_t = \phi\big(\mathcal{A}(\mathrm{Mem}_t)\big) \tag{2}$$
where $c_t$ is the historical context vector of the current frame; $\mathcal{A}(\cdot)$ is a spatiotemporal statistical aggregation operator that compresses the historical sequence into channel statistics over the time and space dimensions; $\phi(\cdot)$ is a lightweight embedding map used for representation alignment and scaling of the statistics; and $\mathrm{Mem}_t$ is the historical memory sequence.
[0054] In this implementation, for the spatiotemporal statistical aggregation operator $\mathcal{A}(\cdot)$: since CausalFiLM modulation is centered on channel response recalibration, it is necessary to construct a channel-aligned description vector of the historical distribution that is independent of specific spatial locations and stable under changes in the order of historical frames, while avoiding additional spatial structure assumptions and computational burden. Therefore, this embodiment directly uses global mean aggregation as the spatiotemporal statistical aggregation operator, compressing the historical features over time and space to obtain a statistical vector along the channel dimension, where T is the number of historical frames, H the height, and W the width. Specifically, given the historical queue $\mathrm{Mem}_t = \{F_{t-T}, \dots, F_{t-1}\}$, mean aggregation is performed on the historical features over both the time and spatial dimensions to compute the channel statistical vector $s_t$ for the current batch:
$$s_t = \frac{1}{T H W} \sum_{\tau=1}^{T} \sum_{h=1}^{H} \sum_{w=1}^{W} F_{t-\tau}[:, :, h, w], \quad s_t \in \mathbb{R}^{B \times C} \tag{3}$$
[0055] Here ":" indicates that the entire index range is taken along that dimension; that is, $F_{t-\tau}[:, :, h, w]$ denotes the feature vector at spatial location $(h, w)$ in the shallow feature map of the $\tau$-th most recent historical frame, taken over all batches and all channels. These vectors are summed over $\tau$, $h$, and $w$ and then averaged.
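The global mean aggregation described above reduces to a single mean over the time and spatial axes. The sketch below assumes the history is stacked as an array of shape (T, B, C, H, W); the function name `channel_mean` and the random test data are illustrative only.

```python
import numpy as np


def channel_mean(history: np.ndarray) -> np.ndarray:
    """Average a (T, B, C, H, W) history over time and space, giving (B, C)."""
    return history.mean(axis=(0, 3, 4))


rng = np.random.default_rng(0)
history = rng.normal(size=(4, 2, 8, 6, 6))  # T=4, B=2, C=8, H=W=6
s = channel_mean(history)

# Order invariance: shuffling the historical frames leaves the statistic unchanged.
s_shuffled = channel_mean(history[[2, 0, 3, 1]])
```

The shuffled check reflects the requirement that the statistic be stable under changes in the order of historical frames.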
[0056] Mean statistics can provide stable historical appearance evolution information at very low computational cost, smooth out occlusion and local noise, and effectively maintain the long-term bias of shallow input features in terms of texture, color, and local structure, making them suitable as the conditional input for generating the subsequent channel modulation parameters. In addition to the mean, the aggregation operator can also be replaced with a weighted mean, an exponential moving average, or second-order statistics (such as variance) in different implementations to adapt to the appearance variation and noise characteristics of different scenarios.
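Two of the alternative aggregators mentioned above can be sketched as follows. The exponential weighting scheme, the `decay` parameter, and both function names are assumptions chosen for illustration; this description does not fix their form.

```python
import numpy as np


def weighted_channel_mean(history: np.ndarray, decay: float) -> np.ndarray:
    """Exponentially weighted mean over time; history is (T, B, C, H, W), oldest first."""
    T = history.shape[0]
    # The newest frame gets weight 1; each older frame is scaled by `decay`.
    w = decay ** np.arange(T - 1, -1, -1, dtype=float)
    w /= w.sum()
    per_frame = history.mean(axis=(3, 4))           # (T, B, C)
    return np.tensordot(w, per_frame, axes=(0, 0))  # (B, C)


def channel_variance(history: np.ndarray) -> np.ndarray:
    """Second-order statistic: per-channel variance over time and space, (B, C)."""
    return history.var(axis=(0, 3, 4))


hist = np.stack([np.full((1, 1, 2, 2), v) for v in (0.0, 1.0, 2.0)])
ema = weighted_channel_mean(hist, decay=0.5)  # weights 1/7, 2/7, 4/7
var = channel_variance(hist)
```

With `decay=1.0` the weighted mean coincides with the plain global mean, so the mean aggregator is a special case of this sketch.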
[0057] For historical spatiotemporal information extraction, a spatiotemporal feature extraction method based on 3D convolution (Conv3D) can be adopted. This involves stacking shallow features from several historical frames and inputting them into a 3D convolutional network to learn spatiotemporal correlations, then fusing the output into the features of the current frame. This approach can explicitly model local spatiotemporal patterns, but its computational cost and parameter count are typically higher, and it is quite sensitive to the length of the input sequence and the feature resolution. Furthermore, in the experiments of this invention, this alternative method also failed to achieve a performance improvement comparable to CausalFiLM; instead, it showed a performance decline, indicating that it cannot exert its expected advantages under current cue-based tracking frameworks and low-resolution settings.
[0058] While methods such as cross-attention, 3D convolution, or thresholding can formally achieve historical information injection and temporal interaction, they often suffer from significantly increased computational and memory burdens, slower training convergence, or ultimately, performance degradation in the low-resolution and resource-constrained deployment scenarios emphasized in this invention. In contrast, the CausalFiLM and its historical context management strategy of this invention can achieve effective historical prior injection with lower engineering costs and obtain more stable overall results.
[0059] (4) History-Conditioned Modulation (HCM) and Parameter Generation.
[0060] In this step, the historical conditional modulation method includes: mapping the historical context vector to channel-level modulation parameters; expanding the channel-level modulation parameters to the same dimension as the search region token sequence; and performing channel-level conditional modulation on the search region token sequence with the expanded channel-level modulation parameters.
[0061] This invention proposes a history-conditional modulation unit (HCM), which uses a lightweight parameter generator to map the history context vector into channel scale and bias modulation parameters, and limits the modulation intensity through bounded amplitude constraints, making the modulation process stable and controllable, easy to integrate with the prompt-based tracking main process, and achieving effective shallow timing injection with low computational and memory overhead.
[0062] After the historical context vector $c_t$ is obtained, this invention employs historical conditional modulation to inject this context into the shallow token sequence of the current frame at the channel level. Let the shallow token sequence of the current-frame search region be $X_t \in \mathbb{R}^{N \times C}$ (N is the number of tokens and C is the number of channels). Channel modulation parameters are generated through a lightweight parameter generation network g(⋅), and an amplitude constraint is imposed on them to control the modulation intensity and improve training stability: $[\Delta\gamma_t, \Delta\beta_t] = \alpha \cdot \tanh(g(c_t))$ (4) where t represents the index of the current frame, $\Delta\gamma_t$ is the channel scale increment, $\Delta\beta_t$ is the channel offset increment, and $\alpha$ is an amplitude factor used to limit the upper bound of the modulation amplitude, avoiding feature distribution drift and convergence instability caused by excessive modulation in the early stages of training or under abnormal samples. The symmetric, bounded hyperbolic tangent naturally constrains the modulation parameters to a stable region, and is simple to implement and numerically stable. In practical implementation, the parameter generation network g(⋅) is a lightweight multilayer perceptron (MLP) that maps the historical context vector $c_t$ to channel-level modulation parameters. Specifically, g(⋅) consists of several fully connected layers and nonlinear activations, its output dimension is 2C, and the scale increment $\Delta\gamma_t$ and bias increment $\Delta\beta_t$ are obtained by splitting the output along the channel dimension.
[0063] To facilitate channel-by-channel modulation of the token representation, $\Delta\gamma_t$ and $\Delta\beta_t$ are extended via broadcasting to $\mathbb{R}^{N \times C}$, and channel-level conditional modulation is performed on the shallow token sequence $X_t$ of the current-frame search region. This process can be equivalently written as an affine transformation or in residual injection form, where the residual form is expressed as: $\tilde{X}_t = X_t + \Delta\gamma_t \odot X_t + \Delta\beta_t$ (5) where $\odot$ represents the Hadamard product, $\tilde{X}_t$ is the modulated search region token sequence, and $X_t$ is the token sequence of the search region before modulation. The modulated sequence injects historical priors as a controlled term into shallow features. While preserving the original representation backbone, it enhances the target-related channels and suppresses noise channels, thereby improving the matching and localization stability in the subsequent deep encoder interaction stage. It provides features with a prior constraint consistent with the evolution of the target appearance, thus improving robustness in complex interference scenarios without significantly increasing computational and memory burden.
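The parameter generation and residual modulation described above can be sketched as follows. This is an illustrative sketch only: it assumes a two-layer MLP with a ReLU activation, the hyperbolic tangent bound named in the claims, and a residual form in which the bounded scale increment multiplies the tokens and the bounded offset is added. All function and argument names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Fully connected layer; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def generate_modulation(context: Vector, w1: Matrix, b1: Vector,
                        w2: Matrix, b2: Vector, alpha: float) -> Tuple[Vector, Vector]:
    """Map a context vector to bounded (scale increment, offset increment)."""
    hidden = [max(0.0, h) for h in linear(context, w1, b1)]  # ReLU (assumed)
    raw = linear(hidden, w2, b2)                             # length 2C
    bounded = [alpha * math.tanh(r) for r in raw]            # each value in [-alpha, alpha]
    channels = len(bounded) // 2
    return bounded[:channels], bounded[channels:]            # split along channels


def modulate(tokens: Matrix, delta_gamma: Vector, delta_beta: Vector) -> Matrix:
    """Residual form x + delta_gamma * x + delta_beta, broadcast over all tokens."""
    return [[x + dg * x + db for x, dg, db in zip(tok, delta_gamma, delta_beta)]
            for tok in tokens]
```

Because the increments are bounded by `alpha`, each channel is rescaled by a factor within [1 - alpha, 1 + alpha], so the modulated tokens stay close to the original representation.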
[0064] In view of the invention’s purpose of introducing historical priors and improving robustness in complex scenarios at the shallow stage, other time-series interaction or historical information injection methods can theoretically achieve similar functions, but their engineering costs or actual effects are obviously insufficient. Specifically, these include the following alternatives and their differences in technical effects.
[0065] For shallow temporal information injection, a cross-attention-based temporal interaction module can be used. This module uses the current frame's shallow features as a query and historical memory features as key / value pairs for explicit alignment and fusion. While this approach offers stronger expressive power, it requires large-scale attention calculations on historical tokens in practice, significantly increasing computational load and memory consumption. Furthermore, it exhibits slow convergence and decreased training stability during training. In the experimental verification of this invention, replacing CausalFiLM with cross-attention leads to slower training convergence and varying degrees of degradation in final metrics, making it difficult to balance performance and efficiency in low-resolution and resource-constrained scenarios.
[0066] In this invention, the historical memory queue is a fixed-length first-in-first-out queue, and the historical context management strategy and optimization objective follow the order of modulation first, update second. That is, the modulated search region token sequence is added to the historical memory queue only after it has been generated.
[0067] Experiments have shown that the strategy of updating the current frame information before modulation for managing the historical context queue may lead to information leakage risks. This invention implements a queue management mechanism of modulation before updating. Specifically, when processing the current frame t, the system first retrieves the historical features of the most recent T frames from the memory queue. At this time, the queue strictly excludes the information of the current frame to ensure that the stored content is a pure historical distribution.
[0068] The system generates the channel scale and offset modulation parameters from the retrieved historical features, performs channel-level linear modulation on the search region features of the current frame to obtain enhanced features, and then feeds the enhanced features into the deep network for subsequent interaction and prediction. After completing the modulation and inference preparation steps described above, the system performs a queue update operation, which pushes the current frame features, after gradient truncation, to the end of the queue while removing the oldest frame. This strategy, through strict timing control, ensures that the conditional input of CausalFiLM modulation always originates from past historical appearance information, thus fundamentally cutting off the path by which current frame noise feeds back to itself through the queue, and effectively avoiding the risk of error states becoming locked in and accumulating during tracking.
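The modulate-then-update order described above can be sketched with a bounded deque. The class and function names are hypothetical, the context is a plain channel mean, and the queue is assumed to hold at least one frame before the first call. The sketch stores the modulated tokens, as stated in paragraph [0066]; in an autodiff framework the stored copy would additionally be detached from the computation graph.

```python
from collections import deque
from typing import Callable, Deque, List

Frame = List[List[float]]  # N tokens x C channels


class HistoryQueue:
    """Fixed-length first-in-first-out memory of past search-region tokens."""

    def __init__(self, length: int) -> None:
        self.frames: Deque[Frame] = deque(maxlen=length)

    def context(self) -> List[float]:
        """Channel means over every stored frame and token."""
        tokens = [tok for frame in self.frames for tok in frame]
        n_channels = len(tokens[0])
        return [sum(t[c] for t in tokens) / len(tokens) for c in range(n_channels)]

    def push(self, frame: Frame) -> None:
        """Append a copy; the oldest frame is dropped once the queue is full."""
        self.frames.append([list(tok) for tok in frame])


def track_step(queue: HistoryQueue, tokens: Frame,
               modulate_fn: Callable[[Frame, List[float]], Frame]) -> Frame:
    context = queue.context()                 # 1) read history; current frame is absent
    modulated = modulate_fn(tokens, context)  # 2) modulate the current frame
    queue.push(modulated)                     # 3) only now update the queue
    return modulated
```

Swapping steps 1 and 3 would let the current frame influence its own modulation parameters, which is the leakage path the strategy is designed to close.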
[0069] During the training phase, data is organized using the sequence unrolling method proposed by EVPTrack. Each training batch consists of N video sequences, with each sequence continuously sampled for M frames. The model iteratively trains the sequences according to the time steps, simulating a realistic tracking process.
[0070] To ensure consistent model behavior during training and inference phases and effectively mitigate the risk of cross-sequence data contamination, this invention implements a strict historical appearance memory queue maintenance strategy. Specifically, in training mode, the system initializes the queue at the beginning of each batch computation and immediately clears it after batch processing, thereby strictly blocking feature interference between different batches. In inference mode, the queue is initialized only at the beginning of the video sequence and is continuously and dynamically maintained during subsequent frame-level processing. Furthermore, to address the cold start problem caused by insufficient historical frames (T frames) at the beginning of the sequence, the system adopts an adaptive processing method of copying and filling the first frame until the queue accumulates sufficient historical features, thereby effectively eliminating performance fluctuations that may be caused by initial distribution differences and ensuring a smooth start to the tracking process.
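One way to read the cold-start handling above is to pad a fresh queue with copies of the first frame's features whenever a new sequence begins. The sketch below follows that reading; the function name is hypothetical and the exact padding scheme of the original implementation is not specified in the text.

```python
from collections import deque
from typing import Deque, List

Frame = List[List[float]]  # N tokens x C channels


def start_sequence(first_frame: Frame, length: int) -> Deque[Frame]:
    """Create a fresh queue for a new sequence, padded with first-frame copies.

    Calling this at the start of every sequence also discards any history
    left over from a previous sequence.
    """
    return deque(([list(tok) for tok in first_frame] for _ in range(length)),
                 maxlen=length)
```

Each later `append` then replaces one padded copy with a real historical frame, so the queue always holds `length` entries and the aggregated statistics are defined from the first tracked frame onward.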
[0071] Furthermore, to ensure that the module does not disrupt the convergence trajectory of the baseline training in the early stages of training, this invention employs identity initialization for the final layer of the parameter generation network, ensuring that the scale and bias increments are approximately zero at the initial stage. This allows the modulation module to initially behave as an approximately identical mapping, after which the model learns from data when to modulate, which channels to modulate, and at what amplitude.
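The effect of this initialization can be checked with a short derivation. It assumes that the final layer of the generator has weight $W_L$ and bias $b_L$ acting on a hidden activation $h$, that the increments are bounded as $\alpha \tanh(\cdot)$, and that modulation takes the residual form $X_t + \Delta\gamma_t \odot X_t + \Delta\beta_t$. The symbols $W_L$, $b_L$, and $h$ are illustrative names, not taken from the original.

```latex
\begin{aligned}
W_L = 0,\; b_L = 0 \;&\Rightarrow\; g(c_t) = W_L h + b_L = 0 \\
&\Rightarrow\; [\Delta\gamma_t, \Delta\beta_t] = \alpha \tanh(0) = 0 \\
&\Rightarrow\; \tilde{X}_t = X_t + 0 \odot X_t + 0 = X_t
\end{aligned}
```

Since the hyperbolic tangent has unit slope at zero, gradients still reach the final layer, so the module can move away from the identity mapping as training proceeds.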
[0072] Tracking models and datasets.
[0073] In constructing the tracking model, a cue-based Transformer tracking architecture is used as the foundation, with improvements made using EVPTrack as the implementation platform. Specifically, following the EVPTrack model settings, HiViT-Base is selected as the backbone network of the image-prompt encoder, and a masked autoencoder is employed for parameter initialization. Building upon this, this invention introduces a causal constraint-based historical feature modulation (CausalFiLM) mechanism in the shallow layers of the backbone network, thereby forming the improved low-resolution cue-based tracker CFPTrack. Given that this invention targets resource-constrained and low-resolution image input scenarios, it primarily trains and evaluates a model variant with an input resolution of 224×224 to highlight its robustness and engineering deployability advantages under low-resolution conditions.
[0074] Regarding training data, to ensure the model's generalization ability and scene coverage, this invention uses mixed data from four mainstream public datasets for offline training, including LaSOT, GOT-10k, TrackingNet, and COCO. In the testing and evaluation phase, to comprehensively measure the model's tracking ability in different scenarios, it is validated on three benchmark datasets with different task characteristics: LaSOT (long-term tracking benchmark), LaSOT_ext (extended benchmark), and GOT-10k (general target tracking benchmark). Among them, LaSOT_ext contains more similar-object interference and more complex appearance-change sequences than LaSOT, and can therefore more fully demonstrate the improved robustness of this invention in complex interference scenarios.
[0075] This invention fully considers the continuity of target motion in video sequences and the diversity of real-world environments. Multiple data augmentation techniques are applied to the input image. Specifically, these include horizontal flipping to simulate changes in target orientation and brightness jitter to simulate instability in lighting conditions. For image size, the template image is cropped to 112×112 pixels, and the search region is cropped to 224×224 pixels.
[0076] This invention generally follows the inference paradigm of the cue-based tracking method EVPTrack, that is, without relying on the explicit template update strategy of traditional threshold decision, it guides feature fusion and localization prediction through explicit visual cues. In terms of cue construction, consistent with EVPTrack, it adopts two types of complementary cues to cover typical challenges such as scale changes and temporal appearance evolution. The first is multi-scale cues, which form cue tokens of different granularities by extracting multi-resolution features from the template image to enhance the representation ability of target scale changes and local details; the second is spatiotemporal cues, which explicitly inject short-term historical information into the cue representation of the current frame by propagating and updating spatiotemporal tokens between adjacent frames, thereby improving the adaptability to deformation, occlusion and appearance changes.
[0077] Unlike existing cue-based strategies, this invention introduces a causal constraint-based historical feature modulation mechanism (CausalFiLM) before the aforementioned cue-driven deep interaction. In each frame's inference, a historical context is first constructed based on a historical memory queue, and channel-level modulation is performed on the shallow features of the current frame. This is then concatenated with multi-scale and spatiotemporal cues before being input into the subsequent encoder to complete interaction fusion. By explicitly injecting historical distribution priors at the shallow stage, this invention can more effectively suppress feature drift and enhance localization stability in low-resolution and highly interfering scenarios.
[0078] In terms of inference output and post-processing, this invention maintains the same prediction process as cue-based tracking: the prediction head outputs target candidate boxes and confidence scores, and in post-processing, a Hanning window penalty is applied to the classification score to improve trajectory smoothness. Finally, the candidate box with the highest score is selected as the tracking result of the current frame.
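The window penalty step can be illustrated with a minimal sketch. It assumes a multiplicative penalty on a 2D score map whose center corresponds to the previous target position; trackers differ in the exact form (for example, some blend the window with the scores instead), and the function names here are hypothetical.

```python
import math
from typing import List, Tuple


def hann(n: int) -> List[float]:
    """Symmetric Hann window of length n (zero at both ends for n > 1)."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window_penalty(scores: List[List[float]]) -> List[List[float]]:
    """Multiply a score map by a 2D Hann window centered on the search region."""
    rows, cols = len(scores), len(scores[0])
    win_r, win_c = hann(rows), hann(cols)
    return [[scores[r][c] * win_r[r] * win_c[c] for c in range(cols)]
            for r in range(rows)]


def best_location(scores: List[List[float]]) -> Tuple[int, int]:
    """Return (row, col) of the highest score after the window penalty."""
    penalized = apply_window_penalty(scores)
    flat = [(penalized[r][c], r, c)
            for r in range(len(penalized)) for c in range(len(penalized[0]))]
    _, row, col = max(flat)
    return row, col
```

The window down-weights candidates far from the center of the search region, which discourages sudden jumps to distant distractors and yields smoother trajectories.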
[0079] To quantify the performance of this invention, the following five evaluation metrics are mainly used. Success rate (AUC): the Intersection over Union (IoU) between the predicted bounding box ($B_p$) and the true bounding box ($B_{gt}$) is calculated for each frame; the IoU threshold is varied within the range [0, 1], the success rate at each threshold is plotted as a curve, and the area under the curve (AUC) is the value of this metric: $\mathrm{IoU} = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$ (6) Precision (P): the Euclidean distance between the center point of the predicted bounding box ($(x_p, y_p)$) and the center point of the true bounding box ($(x_{gt}, y_{gt})$) is calculated, and precision is the proportion of frames in a video sequence whose center point error is less than a specific threshold (usually 20 pixels): $d = \sqrt{(x_p - x_{gt})^2 + (y_p - y_{gt})^2}$ (7) Normalized precision ($P_{norm}$): this metric builds on the precision calculation and introduces the target size as a normalization factor. Specifically, the center point error is divided by the diagonal length of the target's true bounding box to eliminate the influence of target size on distance error assessment.
[0080] Average Overlap (AO): This metric represents the average IoU value between the predicted bounding box and the ground truth bounding box across all frames in the entire video sequence.
[0081] Computational cost (Floating Point Operations, FLOPs): Records the number of frames processed per second on a single GPU and the computational complexity of the model during inference.
[0082] Of the five metrics mentioned above, the smaller the FLOPs value, the lower the computational cost of the model and the more lightweight the model; the larger the AUC, P, $P_{norm}$, and AO values, the higher the tracking accuracy and the stronger the robustness of the model.
[0083] To verify the effectiveness and deployability of the method of this invention under low-resolution input conditions, this section compares it with other methods on three publicly available benchmarks: LaSOT, LaSOT_ext, and GOT10K. The comparison methods cover classic trackers and representative tracking methods from recent years; several methods in the table below use low-resolution settings such as 224 or 256 to allow performance comparison at similar computational cost. Regarding evaluation metrics, LaSOT and LaSOT_ext use AUC, normalized precision ($P_{norm}$), and precision (P), while GOT10K uses AO, $SR_{0.5}$, and $SR_{0.75}$.
[0084] Table 1 presents the main comparative results. The method of this invention achieved stable and representative improvements on all three benchmarks. Using the closest prior art, EVPTrack-224, as the baseline, this invention achieved 70.6 / 80.9 / 77.5 (AUC / $P_{norm}$ / P) on LaSOT, compared with 70.4 / 80.9 / 77.2 for EVPTrack-224, maintaining consistent overall performance with a slight improvement. This demonstrates that on conventional long-term test sets, this invention can improve localization stability without altering the core flow of the original cue-based tracking framework. More importantly, on the more challenging LaSOT_ext, the present invention achieves 50.2 / 61.1 / 57.5, a more significant improvement over EVPTrack-224's 48.7 / 59.5 / 55.1: AUC, $P_{norm}$, and P increase by approximately +1.5 / +1.6 / +2.4, respectively. These results demonstrate that by introducing historical conditional modulation and employing a causal historical context management strategy at the shallow stage, this invention can more effectively suppress drift and error accumulation in scenarios with similar interference and complex appearance evolution, thus exhibiting stronger robustness gains on more challenging data distributions.
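The accuracy metrics above can be sketched in a few functions. This is a simplified illustration, not an official benchmark toolkit: boxes are assumed to be `(x, y, w, h)` tuples, the area under the success curve is approximated by averaging success rates over evenly spaced thresholds, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left corner plus size


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between the two box centers."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))


def success_auc(preds: Sequence[Box], gts: Sequence[Box], num_thresholds: int = 21) -> float:
    """Mean success rate over IoU thresholds evenly spaced in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(1 for o in overlaps if o > t) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


def precision(preds: Sequence[Box], gts: Sequence[Box], pixel_threshold: float = 20.0) -> float:
    """Fraction of frames whose center error is below the pixel threshold."""
    errors = [center_error(p, g) for p, g in zip(preds, gts)]
    return sum(1 for e in errors if e < pixel_threshold) / len(errors)


def normalized_center_error(pred: Box, gt: Box) -> float:
    """Center error divided by the diagonal of the ground-truth box."""
    return center_error(pred, gt) / math.hypot(gt[2], gt[3])


def average_overlap(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Mean IoU over all frames (AO)."""
    overlaps: List[float] = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(overlaps) / len(overlaps)
```

Normalized precision would then be computed like `precision`, but thresholding `normalized_center_error` instead of the raw pixel distance.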
[0085] Table 1
[0086] On GOT10K, this invention achieves 73.7 / 84.0 / 72.2 (AO / $SR_{0.5}$ / $SR_{0.75}$), compared to 73.3 / 83.6 / 70.7 for EVPTrack-224. AO and $SR_{0.5}$ improve steadily, and the more stringent $SR_{0.75}$ improves by approximately +1.5, indicating that the present invention yields a stronger gain in scenarios that demand high-precision localization, and can maintain more reliable localization quality when the target state changes or interference increases.
[0087] Furthermore, compared to the low-resolution versions of various representative methods, this invention is at a leading or competitive level on LaSOT and LaSOT_ext within the same or similar resolution settings. For example, on LaSOT_ext, compared to representative methods with 256 inputs, the present invention still maintains an advantage or a similar level, indicating that the shallow history prior injection mechanism of the present invention can effectively compensate for the decrease in discriminability caused by insufficient detail at lower input resolutions, thereby balancing performance and deployment cost.
[0088] In summary, the experimental results verify the effectiveness of the present invention in low-resolution and resource-constrained scenarios, and the main gains are concentrated on benchmarks with more interference and higher appearance evolution complexity, which is in line with the design goal of the present invention for robust tracking in complex scenarios.
[0089] To further verify the role of each part in the method of the present invention, ablation experiments and analyses were also conducted.
[0090] (1) Queue length ablation.
[0091] To determine the optimal timing window size, this invention conducted a comparative experiment on the length q of the historical memory queue under a 224 resolution configuration and a pre-modulation enqueue strategy. The experimental results are shown in Table 2 below.
[0092] Table 2
[0093] Ablation experiments were conducted on the LaSOT dataset with different historical information memory queue lengths. The results show that the model performs best on the LaSOT dataset with an AUC score of 70.6 when the queue length is set to q = 4. In contrast, excessively short queues (q = 2) or excessively long queues (q = 8) lead to slight performance decreases, to 70.2 and 70.0 respectively. This phenomenon reveals the inherent trade-off mechanism in the selection of temporal prior length: under sufficient training paradigms, a moderate historical window can provide robust temporal context to aid tracking; however, an excessively short window cannot provide sufficient historical statistical information, while an excessively long window may introduce stale features, leading to error accumulation or overfitting to specific appearances in long-term tracking.
[0094] (2) Ablation of historical context management strategies.
[0095] To evaluate the sensitivity of the CausalFiLM module to historical context management strategies, ablation experiments were conducted with an input resolution of 224. Except for the context writing strategy, the training and model configuration remained consistent, including a batch size of 4×8 (N=4, M=8, i.e., each batch contains 4 video sequences, each consisting of one template frame plus 8 search region frames), a historical context queue length of 4, and the same shallow insertion positions and parameter generation network configuration.
[0096] Two historical context writing methods are compared: one is a writing strategy that updates the appearance information of the current frame before modulation, that is, the shallow features of the current frame are first written into the first-in-first-out historical queue, and then the modulation parameters are generated based on the queue statistics including the current frame and the current frame feature modulation is completed; the other is a writing strategy that modulates first and then updates the appearance information of the current frame, that is, the modulation parameters are generated only based on the historical queue to complete the current frame feature modulation and localization inference, and then the shallow features of the current frame are written into the queue for use by subsequent frames.
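The difference between the two writing strategies can be seen on a toy example in which each frame is reduced to a single scalar feature and the context is the queue mean. The function names and the scalar simplification are assumptions made purely for illustration.

```python
from collections import deque
from typing import Sequence


def context_update_first(history: Sequence[float], current: float, length: int) -> float:
    """Write the current frame into the queue, then compute the context."""
    queue = deque(history, maxlen=length)
    queue.append(current)            # current frame enters the statistics
    return sum(queue) / len(queue)


def context_modulate_first(history: Sequence[float], current: float, length: int) -> float:
    """Compute the context from history only, then write the current frame."""
    queue = deque(history, maxlen=length)
    context = sum(queue) / len(queue)
    queue.append(current)            # available to later frames only
    return context
```

With a stable history of `[1.0, 1.0, 1.0, 1.0]` and an abnormal current frame of `9.0`, the update-first context shifts to `3.0`, whereas the modulate-first context stays at `1.0`, so the abnormal frame cannot bias its own modulation.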
[0097] Table 3
[0098] The experimental results are shown in Table 3. On LaSOT, the differences between the two strategies are small; on the more challenging LaSOT_ext, the strategy of inference before updating achieves a more significant improvement. This result demonstrates that writing the current frame information into the historical queue after inference is complete helps reduce the risk of abnormal frames immediately contaminating historical statistics, thereby improving tracking stability and robustness in complex interference scenarios.
[0099] (3) Analysis of computational complexity and parameter overhead.
[0100] To evaluate the cost of the proposed shallow timing modulation module in engineering deployment, this invention statistically analyzes the number of model parameters and computational cost (FLOPs) and compares it with a representative low-resolution tracker.
[0101] Statistical results show that introducing CausalFiLM on top of EVPTrack-224 only slightly increases the number of model parameters, by a mere 0.45% compared to the baseline. Furthermore, because modulation parameter generation and channel modulation are located in shallow layers with short computational paths, the overall impact on FLOPs is negligible, delivering performance gains without significantly increasing the inference computational burden. In contrast, some larger-scale trackers (such as SeqTrack) have significantly higher parameter counts and FLOPs. Despite similar input resolutions, their computational costs are less favorable for resource-constrained or real-time deployment scenarios.
[0102] Table 4
[0103] (4) Effectiveness analysis of the feature dimension linear modulation (CausalFiLM) mechanism based on causal constraints.
[0104] This invention introduces a Causal Feature Dimension Linear Modulation (CausalFiLM) mechanism into the shallow stages of the backbone network, explicitly injecting the historical appearance distribution prior into the shallow feature representation of the current frame before entering the deep Transformer interaction module. The core objective of this design is to enhance feature discriminativity and suppress drift in low-resolution and more interference-prone scenes, while maintaining the computational efficiency of the end-to-end cue propagation framework as much as possible.
[0105] Experimental results show that introducing CausalFiLM onto the EVPTrack-224 baseline creates a novel tracker, CFPTrack, which incorporates historical modulation cues. This model achieves stable gains across multiple benchmarks. 1) On LaSOT_ext, more significant improvements were achieved (e.g., more obvious improvements in the AUC and precision metrics), indicating that shallow history conditional modulation has a stronger robustness gain for sequences with stronger similar interference and more complex appearance evolution. 2) On GOT-10k, both AO and $SR_{0.75}$ improved, indicating that the module brings benefits in both overall generalization and strict-threshold success rate metrics. 3) Based on complexity statistics, this gain is achieved with almost no increase in FLOPs, making it suitable for resource-constrained or real-time deployment scenarios.
[0106] Considering fairness and length, this invention mainly emphasizes the improved robustness and engineering efficiency advantages demonstrated on low-resolution settings and more interference-prone benchmarks (such as LaSOT_ext and GOT-10k).
[0107] In the research and verification process of this invention, an improved tracking method incorporating causal historical feature modulation was proposed and implemented, focusing on the explicit visual cue tracking paradigm. Specifically, based on the overall end-to-end process of the cue tracker, this invention adds a causal constraint-based linear modulation mechanism (CausalFiLM) to the shallow layers of the backbone network. This allows the appearance distribution prior represented by the historical memory queue to be explicitly injected into the shallow features of the current frame in a lightweight manner, thus addressing the problem of insufficient utilization of shallow temporal priors in existing cue-based methods. Simultaneously, for the management of the historical memory queue, a context management strategy is adopted that first completes modulation and localization inference and then updates the historical memory. This keeps current-frame information out of the historical statistics used to modulate that same frame, enhances the stability of long-sequence tracking, and suppresses error accumulation. Experimental results show that this invention achieves stable gains on multiple public benchmarks under low-resolution input settings, and the gains are more significant on more challenging extended datasets, demonstrating improved robustness to similar interference, occlusion recovery, and complex appearance evolution scenarios. With almost no increase in computational load, this invention further improves tracking robustness over the baseline model. Further ablation analysis verified the critical role of history memory queue configuration and writing strategy in performance.
In summary, this invention demonstrates that introducing shallow causal history modulation into a cue-based tracking framework is an effective and lightweight improvement path, providing a feasible technical solution for robust target tracking under resource-constrained conditions.
[0108] The present invention also discloses a target tracking method, including the above-mentioned linear modulation method of causal constraint feature dimension.
[0109] This invention provides a technical solution for introducing shallow causal history modulation into a cue-based target tracking framework. This improves tracking robustness and stability in complex scenarios without relying on traditional online template update threshold decisions and complex scoring branches. Compared to existing cue-based tracking methods that primarily rely on deep interaction for temporal modeling while neglecting the utilization of historical priors in the shallow stage, this invention explicitly injects historical distribution priors into the shallow feature extraction stage using CausalFiLM. This allows the model to obtain effective constraints on historical appearance evolution before entering the deep interaction module, thereby improving feature recognition and anti-drift capabilities under conditions such as similar target interference, occlusion recovery, and low-resolution input. Furthermore, this invention achieves efficient utilization of shallow temporal information with lower computational and memory overhead, exhibiting good real-time performance and engineering deployment adaptability.
[0110] The present invention also discloses a linear modulation device for causal constraint feature dimension, including a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor implements the above-described method when executing the computer program.
[0111] The present invention also discloses an embodiment that provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in the above-described method embodiments.
[0112] The present invention also provides a computer program product that, when run on a data storage device, enables the data storage device to implement the steps in the above-described method embodiments.
[0113] If the integrated unit module is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present invention can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include at least: any entity or device capable of carrying computer program code to a storage device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks.
[0114] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0115] Those skilled in the art will recognize that the algorithmic steps of the various examples described in conjunction with the embodiments disclosed in this invention can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.
[0116] It should be noted that the data used in the implementation of this invention were all collected or gathered through legal and compliant channels, and the collection and gathering activities fully comply with the requirements of relevant laws, regulations and industry standards; the existing technical methods involved in this invention were also obtained and used through legal and compliant means.
Claims
1. A linear modulation method for causal constraint feature dimensions, characterized in that it includes the following steps: obtaining the search region token sequence output by the backbone network based on the search image of the current frame; based on the historical conditional modulation method, injecting the historical context vector of the historical memory queue into the search region token sequence in a channel-level manner to generate the modulated search region token sequence, wherein the historical memory queue includes several search region token sequences that are adjacent to the current frame and located before the current frame; and concatenating the modulated search region token sequence with the template token sequence output by the backbone network to form a complete token sequence, which is then fed into the deep attention encoder of the backbone network to extract deep features of the search image in the current frame.
2. The linear modulation method for causal constraint feature dimensions as described in claim 1, characterized in that the historical conditional modulation method includes: mapping the historical context vector to channel-level modulation parameters; extending the channel-level modulation parameters to the same dimension as the search region token sequence; and performing channel-level conditional modulation on the search region token sequence using the extended channel-level modulation parameters.
3. The linear modulation method for causal constraint feature dimensions as described in claim 2, characterized in that mapping the historical context vector to channel-level modulation parameters comprises: $(\Delta\gamma_t, \Delta\beta_t) = \alpha \cdot \tanh\big(g_\theta(h_t)\big)$, where $t$ is the index of the current frame, $h_t$ is the historical context vector of the current frame, $\Delta\gamma_t$ is the channel scale increment, $\Delta\beta_t$ is the channel offset increment, $\alpha$ is the amplitude factor, $\tanh$ is the hyperbolic tangent function, and $g_\theta$ is the parameter generation network, which is implemented as a lightweight network.
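4. The linear modulation method for causal constraint feature dimensions as described in claim 3, characterized in that the channel-level conditional modulation comprises an affine transformation or a residual form; in the residual form: $\tilde{X}_t = X_t + \Delta\gamma_t \odot X_t + \Delta\beta_t$, where $\tilde{X}_t$ is the modulated search region token sequence, $X_t$ is the search region token sequence before modulation, $\Delta\gamma_t$ and $\Delta\beta_t$ are the channel scale and offset increments extended to the dimensions of $X_t$, and $\odot$ is the Hadamard product.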
5. The linear modulation method for causal constraint feature dimensions as described in any one of claims 2-4, characterized in that the historical memory queue is a fixed-length first-in-first-out queue; and after the modulated search region token sequence is generated, it is added to the historical memory queue.
6. The linear modulation method for causal constraint feature dimensions as described in claim 5, characterized in that the historical context vector is generated as follows: $h_t = \phi\big(\mathcal{A}(M_t)\big)$, where $h_t$ is the historical context vector of the current frame, $\mathcal{A}$ is a spatiotemporal statistical aggregation operator that compresses the historical memory sequence into channel statistics along both the temporal and spatial dimensions, $\phi$ is a lightweight embedding map used for representation alignment and scaling of the channel statistics, and $M_t$ is the historical memory sequence.
7. The linear modulation method for causal constraint feature dimensions as described in claim 6, characterized in that global mean aggregation, weighted mean, exponential moving average, or second-order statistics is used as the spatiotemporal statistical aggregation operator.
8. The linear modulation method for causal constraint feature dimensions as described in claim 5, characterized in that the historical context vector is generated using cross-attention or spatiotemporal feature extraction based on 3D convolution.
9. A target tracking method, characterized in that the method comprises the linear modulation method for causal constraint feature dimensions as described in any one of claims 1-8.
10. A linear modulation apparatus for causal constraint feature dimensions, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the method as described in any one of claims 1-8 is implemented.
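For illustration only, the per-frame flow recited in claims 1-7 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation of this invention. All function names are hypothetical. Global mean aggregation is chosen from the options in claim 7. The lightweight embedding map of claim 6 is treated as an identity. The parameter generation network of claim 3 is stood in for by two linear weight matrices. The first frame, which has no history, is assumed to pass through unmodulated. The amplitude factor value 0.1 is arbitrary. Tokens are represented as lists of per-channel floats so that the example has no third-party dependencies.

```python
import math
from collections import deque


def aggregate_history(memory):
    """Global mean over all frames and tokens: one statistic per channel."""
    channels = len(memory[0][0])
    count = sum(len(frame) for frame in memory)
    return [
        sum(tok[c] for frame in memory for tok in frame) / count
        for c in range(channels)
    ]


def generate_params(context, w_gamma, w_beta, alpha):
    """Bounded increments alpha * tanh(W @ context), one value per channel.

    The two weight matrices are an assumed stand-in for the lightweight
    parameter generation network.
    """
    def linear(weights):
        return [sum(w * x for w, x in zip(row, context)) for row in weights]

    d_gamma = [alpha * math.tanh(v) for v in linear(w_gamma)]
    d_beta = [alpha * math.tanh(v) for v in linear(w_beta)]
    return d_gamma, d_beta


def modulate(tokens, d_gamma, d_beta):
    """Residual channel-level modulation: x + d_gamma * x + d_beta.

    Reusing the same per-channel parameters for every token plays the role
    of extending them to the dimensions of the token sequence.
    """
    return [
        [x + g * x + b for x, g, b in zip(tok, d_gamma, d_beta)]
        for tok in tokens
    ]


def track_step(tokens, memory, w_gamma, w_beta, alpha=0.1):
    """Modulate the current tokens using past frames only, then update the queue."""
    if memory:
        context = aggregate_history(memory)
        d_gamma, d_beta = generate_params(context, w_gamma, w_beta, alpha)
        out = modulate(tokens, d_gamma, d_beta)
    else:
        # Assumption: with no history yet, the tokens pass through unchanged.
        out = [list(tok) for tok in tokens]
    memory.append(out)  # deque(maxlen=K) discards the oldest frame itself
    return out


# Usage: two tokens with two channels each, history limited to three frames.
demo_memory = deque(maxlen=3)
demo_weights = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(5):
    demo_out = track_step([[1.0, 2.0], [3.0, 4.0]], demo_memory,
                          demo_weights, demo_weights)
```

Because the queue is read before the current frame is appended, the modulation of frame t depends only on frames before t, which is the causal constraint. The tanh bound keeps each increment within the amplitude factor, so the residual form stays close to an identity mapping when the history is uninformative.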