A target tracking method based on RGB-E in a complex driving scene
By combining RGB images and event camera data with the MDNet multi-domain learning network, and using the ASIE cross-spatial information extraction module and the DMSTF dual-modal spatiotemporal fusion module, the stability and adaptability issues of target tracking in complex driving scenarios are solved, and efficient target tracking in changing environments is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGSU UNIV OF TECH
- Filing Date
- 2025-03-17
- Publication Date
- 2026-06-23
AI Technical Summary
The problem addressed is how to efficiently fuse RGB and event data in complex driving environments and extract meaningful information from them, so as to improve the stability and adaptability of target tracking, especially in scenarios involving multiple targets, fast movement, and occlusion.
The method uses the MDNet multi-domain learning network with RGB images and event stream data generated by event cameras. Feature extraction and fusion are performed by the ASIE cross-spatial information extraction module and the DMSTF dual-modal spatiotemporal fusion module. An adaptive template update mechanism is added to the prediction head to predict the bounding box of the target object.
The method improves the performance and accuracy of target tracking, especially in low-light, strong-light, or dynamic scenes. It adaptively focuses on the target area, enhances image details, and improves usability and stability in complex driving scenarios.
Figure CN120260009B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to an RGB-E-based target tracking method for complex driving scenarios.
Background Technology
[0002] With the rapid development of autonomous driving technology, target tracking in complex driving scenarios has become one of the key technologies in intelligent driving systems. During autonomous driving, vehicles need to perceive and track surrounding objects in real time within complex environments, including other vehicles, pedestrians, and traffic signs. However, due to the variability of the environment, such as changes in lighting, weather conditions, and obstructions, traditional target tracking methods often struggle to reliably handle these complex situations.
[0003] Most existing target tracking methods rely on a single perception modality, such as using only RGB cameras for image processing. While RGB images can provide rich color information and detail, they perform poorly in low-light, bright-light, or dynamic scenes, especially in complex driving scenarios such as nighttime or inclement weather, where the quality and usability of RGB images deteriorate significantly.
[0004] To overcome these challenges, researchers began experimenting with combining multimodal information, including traditional RGB images with event stream data generated by event cameras (E). Event cameras, with their high temporal resolution and sensitivity to fast motion, offer significant advantages in complex scenes characterized by low light, dynamic changes, and rapid movement.
[0005] The RGB-E fusion target tracking method can fully leverage the complementarity of the two perception modalities, and improve the target detection and tracking capabilities in complex driving environments by combining the color information of RGB images and the temporal information of event cameras.
[0006] However, despite the good results achieved by RGB-E fusion methods in some applications, existing technologies still face many challenges. First, efficiently fusing RGB data with event data and extracting meaningful information remains a pressing problem. Second, targets in complex driving scenarios are often affected by multiple dynamic factors, including other vehicles, pedestrians, and complex road conditions, making target tracking more difficult in situations involving multiple targets, rapid movement, and occlusion. Therefore, RGB-E-based target tracking methods still need further optimization to adapt to the changing conditions in complex driving scenarios and improve their usability and stability in practical applications.
Summary of the Invention
[0007] This invention provides an RGB-E-based target tracking method for complex driving scenarios to address the problems existing in the prior art.
[0008] The technical solutions adopted in this invention are as follows:
[0009] A target tracking method based on RGB-E in complex driving scenarios, comprising the following steps:
[0010] Real-time traffic video is collected and converted into two modalities, an RGB image sequence and an event sequence, to construct an RGB-E sequence dataset;
[0011] An MDNet multi-domain learning network is built based on the MDNet framework. The MDNet multi-domain learning network performs initial convolution and feature extraction on RGB image sequences and event sequences respectively, then performs feature fusion on RGB image sequences and event sequences, and finally inputs them into the prediction head. The prediction head obtains the bounding box of the target object and outputs the tracking result of the current frame.
[0012] Furthermore, the MDNet multi-domain learning network includes a feature extraction network, a feature fusion network, and a prediction head. The feature extraction network performs initial convolution and feature extraction on the RGB image sequence and the event sequence in two channels respectively. Then, the feature fusion network performs feature merging and feature fusion operations. Finally, the fused features of the RGB image sequence and the event sequence after feature fusion are input into the prediction head.
[0013] Furthermore, residual connections are added after each convolutional layer in the feature extraction network.
[0014] Furthermore, after the initial convolution operation of each channel in the feature extraction network, the designed ASIE cross-spatial information extraction module extracts the sub-features obtained after the initial convolution operation of each channel. The ASIE cross-spatial information extraction module is an improvement based on the EMA attention mechanism. The implementation process of the ASIE cross-spatial information extraction module is as follows:
[0015] 1) The multi-scale features are divided into multiple sub-features according to the set training round B: $X_i = [X_0, X_1, \ldots, X_{B-1}]$;
[0016] 2) Employing multi-scale convolution and attention mechanisms, average pooling is performed along the horizontal and vertical directions in space to adaptively capture local and global features, cross-scale information, and cross-modal information, while preserving long-distance dependencies in the horizontal direction and positional information in the vertical direction. The mathematical function is:
[0017]
[0018] These represent the features after pooling in the horizontal and vertical directions, respectively, and AvgPool represents the average pooling operation.
[0019] Furthermore, the feature fusion network includes a DMSTF dual-modal spatiotemporal fusion module, which is an improvement based on the MSF module. The DMSTF dual-modal spatiotemporal fusion module implements the following:
[0020] 1) The features extracted by the ASIE cross-spatial information extraction module in the same channel are concatenated with the feature map after the initial convolution operation;
[0021] 2) The features extracted by the ASIE cross-spatial information extraction module in different channels are fused.
[0022] Furthermore, the features extracted by the ASIE cross-spatial information extraction module in the same channel are concatenated with the feature map after the initial convolution operation. The process is as follows:
[0023] 11) After the RGB data and Event data of the two modalities are processed through independent initial convolution operations, the basic features are extracted to obtain the corresponding feature maps $F_{RGB}$ and $F_E$;
[0024] 12) Extract local features using the ASIE cross-spatial information extraction module to obtain corresponding feature maps.
[0025] 13) Feature splicing within the same channel:
[0026] Within the RGB channel, $F_{RGB}$ and the corresponding ASIE feature map are concatenated along the channel dimension to form a new concatenated feature.
[0027]
[0028] Within the Event channel, the same concatenation process is performed to obtain...
[0029] Furthermore, the features extracted by the ASIE cross-spatial information extraction module from different channels are fused. The process is as follows:
[0030] 21) First, perform a channel transformation on the two concatenated features, then use a 1×1 convolution so that both channels have the same number of feature channels:
[0031]
[0032] 22) Obtain the weights of each feature using the softmax function, and then perform weighted feature fusion.
[0033]
[0034]
[0035]
[0036] 23) The final fused feature $F_{fuse}$ is input into the subsequent prediction head as a spatiotemporal feature representation.
[0037] Furthermore, a time-weighted adaptive template update mechanism is added to the prediction head. In each frame, the similarity between the template and the target images in the most recent and historical frames is calculated at different scales. The scale factor with the highest similarity is selected as the scale change amount of the current frame, and the template is updated to adapt to the scale change of the target.
[0038] Furthermore, the adaptive template update mechanism is as follows:
[0039] $T_{new} = \alpha T_{old} + (1 - \alpha) T_{current}$
[0040] Here $T_{new}$ represents the updated template frame, $T_{old}$ represents the template of the previous frame, $T_{current}$ represents the current template frame, and $\alpha$ is the update weight. The mechanism gradually merges information from past and current frames so that the template changes slowly and avoids abrupt changes.
[0041] The present invention has the following beneficial effects:
[0042] When extracting features from RGB images and event images, the ASIE-based cross-spatial information extraction module is used to model global features while preserving local contextual information.
[0043] The DMSTF dual-modal spatiotemporal fusion module proposed in this invention can effectively fuse target features in complex scenes. This method adaptively focuses on the target region, improving target visibility under complex conditions, thereby significantly enhancing the performance and accuracy of target tracking. Without increasing model computation too much, it effectively utilizes contextual information in shallow features to enhance image details and improve the enhancement effect.
[0044] The MDNet network model based on RGB images can achieve high accuracy and success rate in most common scenarios. The dual-modal model combined with Event images makes up for its shortcomings in low-light, strong-light, or dynamic scenes, especially in complex driving scenarios such as night or bad weather.
Description of the Drawings
[0045] Figure 1 is a structural diagram of the MDNet multi-domain learning network.
[0046] Figure 2 is a structural diagram of the ASIE cross-spatial information extraction module.
[0047] Figure 3 compares the accuracy and success rate of the present invention with those of mainstream trackers on the VisEvent test set.
[0048] Figure 4 compares the accuracy and success rate of the present invention with those of mainstream trackers on the COESOT test set.
[0049] Figure 5 compares the visualization results of the present invention with those of mainstream trackers on the VisEvent and COESOT test sets.
Detailed Implementation
[0050] The invention will now be further described with reference to the accompanying drawings.
[0051] As shown in Figure 1, the present invention provides a target tracking method based on RGB-E in complex driving scenarios, comprising the following steps:
[0052] Step 1: Collect real-time traffic video and convert the real-time video into two modalities: RGB image sequence and event sequence, to construct an RGB-E sequence dataset;
[0053] Step 2: Establish an MDNet multi-domain learning network based on the MDNet framework. The MDNet multi-domain learning network performs initial convolution and feature extraction on the RGB image sequence and event sequence respectively, then performs feature fusion on the RGB image sequence and event sequence, and finally inputs it into the prediction head. The prediction head obtains the bounding box of the target object and outputs the tracking result of the current frame.
[0054] The specific process of step one is as follows:
[0055] 1.1) Use the event camera to capture video to obtain an AEDAT 4 file, including real-world road conditions, vehicles, pedestrians, and other scenes;
[0056] 1.2) Use the AEDAT 4 Python toolkit to preprocess the AEDAT 4 files acquired by the event camera to obtain event image representations;
[0057] 1.3) Divide the event sequence according to the microsecond-level timestamps, and generate an event window for each microsecond-level timestamp. Each event window represents the event sequence within a time interval.
[0058] 1.4) Collect event data for each event window, including trigger timestamp, pixel position information, and direction of light intensity change;
[0059] 1.5) Take the event data of each event window as a sample, construct an event sequence dataset, correspond RGB images and event images one by one, keep them synchronized, and construct an RGB-E sequence dataset.
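The windowing in steps 1.3 to 1.5 can be sketched as follows. This is only an illustration under stated assumptions: the events are taken to be already decoded from the AEDAT 4 file into arrays of timestamp (in microseconds), pixel position, and polarity; the decoding itself is not shown. The function name `events_to_frames`, the fixed window length `window_us`, and the two-channel polarity-count representation are hypothetical choices, not details given in the patent.

```python
import numpy as np

def events_to_frames(t_us, x, y, polarity, height, width, window_us):
    """Accumulate events into fixed-length time windows (assumed representation).

    Returns an array of shape (num_windows, 2, height, width): channel 0
    counts negative-polarity events, channel 1 counts positive-polarity ones.
    """
    t_us = np.asarray(t_us, dtype=np.int64)
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    pol = (np.asarray(polarity) > 0).astype(np.int64)
    if t_us.size == 0:
        return np.zeros((0, 2, height, width), dtype=np.int32)
    # Index of the window each event falls into, relative to the first event.
    win = (t_us - t_us.min()) // window_us
    num_windows = int(win.max()) + 1
    frames = np.zeros((num_windows, 2, height, width), dtype=np.int32)
    # np.add.at accumulates correctly when several events hit the same pixel.
    np.add.at(frames, (win, pol, y, x), 1)
    return frames

# Toy stream: five events on a 4x4 sensor, grouped into 1000-microsecond windows.
t = [0, 10, 999, 1000, 2500]
xs = [0, 0, 1, 2, 3]
ys = [0, 0, 1, 2, 3]
ps = [1, 1, 0, 1, 0]
frames = events_to_frames(t, xs, ys, ps, height=4, width=4, window_us=1000)
```

Each slice `frames[k]` can then be paired with the RGB frame covering the same time interval to keep the two modalities synchronized.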
[0060] In step two, the MDNet multi-domain learning network includes a feature extraction network, a feature fusion network, and a prediction head. The feature extraction network performs initial convolution and feature extraction on the RGB image sequence and the event sequence in two channels respectively. Then, the feature fusion network performs feature merging and feature fusion operations. Finally, the fused features of the RGB image sequence and the event sequence are input into the prediction head.
[0061] Each channel in the feature extraction network comprises three convolutional layers. After each convolutional layer, residual blocks with 96, 256, and 512 channels are added respectively. The output of each residual block equals the sum of the input and the result processed by the convolutional layer, improving gradient flow and mitigating the gradient vanishing problem in deep models. Residual connections enhance the preservation of low-level features and the optimization of high-level features, improving feature representation and making the network more stable in complex scenarios such as varying lighting and occlusion. The calculation for each residual block is as follows:
[0062] $y_i = x_i + \mathrm{ReLU}(\mathrm{Conv}(x_i))$
[0063] Here $x_i$ is the input feature, $\mathrm{Conv}(x_i)$ represents the convolution operation, and ReLU is the activation function.
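The residual computation above can be illustrated with a small NumPy sketch. It assumes a 3×3 convolution with zero padding and equal input and output channel counts, so that the sum $x_i + \mathrm{ReLU}(\mathrm{Conv}(x_i))$ is defined; the kernel size, the padding, and the helper names `conv3x3_same` and `residual_block` are assumptions for illustration, not details from the patent.

```python
import numpy as np

def conv3x3_same(x, weight):
    """3x3 convolution with zero padding.

    x: (C_in, H, W), weight: (C_out, C_in, 3, 3) -> output (C_out, H, W).
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w), dtype=x.dtype)
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], patch)
    return out

def residual_block(x, weight):
    """y = x + ReLU(Conv(x)); requires C_out == C_in so the sum is defined."""
    return x + np.maximum(conv3x3_same(x, weight), 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))
w = rng.standard_normal((8, 8, 3, 3)) * 0.1
y = residual_block(x, w)
```

Because the ReLU output is non-negative, `y` is never smaller than `x` element-wise, and with all-zero weights the block reduces to the identity.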
[0064] Because traditional CNNs mainly rely on local convolutions, they struggle to capture long-distance spatial dependencies. After the initial convolution operation of each channel in the feature extraction network, the designed ASIE cross-spatial information extraction module extracts the sub-features obtained after the initial convolution operation of each channel to enhance the feature extraction capability of deep learning models in target tracking tasks.
[0065] The ASIE cross-spatial information extraction module is an improvement on the EMA attention mechanism. The ASIE cross-spatial information extraction module includes a 1×1 Conv layer, a 3×3 convolutional layer, a sigmoid function, a softmax function, and a Matmul function (see Figure 2).
[0066] The feature extraction process is consistent for both channels. Taking one channel as an example, the implementation process of the ASIE cross-spatial information extraction module is as follows:
[0067] Step 2.1: Extract the multi-scale features obtained from the third convolutional layer of the feature extraction network. The sub-features $X_i$ are obtained by grouping the features along the channel direction according to training round B:
[0068] $X_i = [X_0, X_1, \ldots, X_{B-1}]$
[0069] Step 2.2: The sub-features are sent to the ASIE cross-spatial information extraction module, where average pooling is performed along the horizontal and vertical directions in space to obtain... It not only captures long-distance dependencies in the horizontal direction but also preserves positional relationships in the vertical direction, which helps the network to more accurately locate objects of interest.
[0070]
[0071] These represent the features after pooling in the horizontal and vertical directions, respectively, and AvgPool represents the average pooling operation.
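The directional average pooling of step 2.2 can be sketched in NumPy as follows. The names `pool_h` and `pool_w` are placeholders chosen for this sketch, since the symbols used in the patent's formula are not reproduced in this text; the `(C, H, W)` layout is likewise an assumption.

```python
import numpy as np

def directional_avg_pool(x):
    """Average-pool a feature map along each spatial direction.

    x: (C, H, W). Returns (pool_h, pool_w), where
    pool_h averages across each row    -> shape (C, H, 1)
    pool_w averages down each column   -> shape (C, 1, W)
    """
    pool_h = x.mean(axis=2, keepdims=True)
    pool_w = x.mean(axis=1, keepdims=True)
    return pool_h, pool_w

x = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
pool_h, pool_w = directional_avg_pool(x)
```

Each output keeps one spatial axis intact, which is what lets the module retain position along that axis while summarizing the other.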
[0072] Step 2.3: The two pooled features are concatenated along the spatial dimension. The intermediate feature map F is then obtained through the 1×1 Conv layer in the ASIE cross-spatial information extraction module. F encodes features in the horizontal and vertical directions of space:
[0073]
[0074] Here the input to the 1×1 Conv layer is the feature obtained after spatial-dimension concatenation, and F represents the intermediate feature map;
[0075] Step 2.4: The intermediate feature map F is processed by the sigmoid function and normalization to obtain $X_c$. The sigmoid function compresses feature values to (0, 1), ensuring that they are neither too large nor too small. Normalization further standardizes the data, keeping it within a fixed range and avoiding numerical instability.
[0076] Step 2.5: $X_c$ and the original sub-features X, after passing through the 3×3 convolutional layer, are combined using softmax and average pooling, and finally merged into a new feature $F_{new}$ using the Matmul function. The Matmul function improves feature interaction and enhances the relevance of global information.
[0077] $X_c = \mathrm{sigmoid}(F)$
[0078] $F_{new} = \mathrm{Concat}(X, X_c)$
[0079] Here $X_c$ represents the feature after applying the sigmoid function and normalization, and $F_{new}$ represents the new feature obtained by concatenating $X_c$ with the initial sub-features.
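The two formulas $X_c = \mathrm{sigmoid}(F)$ and $F_{new} = \mathrm{Concat}(X, X_c)$ can be illustrated with a minimal sketch. It assumes that F has already been brought to the same shape as X and that concatenation is along the channel axis; the text does not state how the shapes are matched. The normalization step, the 3×3 convolution, the softmax, and the Matmul interaction are omitted, and the function name `asie_gate_and_concat` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def asie_gate_and_concat(x, f):
    """x, f: (C, H, W) with equal shapes (assumed).

    Returns (x_c, f_new), where f_new has shape (2C, H, W).
    """
    x_c = sigmoid(f)                           # X_c = sigmoid(F), values in (0, 1)
    f_new = np.concatenate([x, x_c], axis=0)   # F_new = Concat(X, X_c)
    return x_c, f_new

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 6))
f = rng.standard_normal((4, 6, 6))
x_c, f_new = asie_gate_and_concat(x, f)
```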
[0080] The feature fusion network in this invention includes a DMSTF dual-modal spatiotemporal fusion module, which is an improvement on the MSF module. The DMSTF dual-modal spatiotemporal fusion module implements the following:
[0081] 1) The features extracted by the ASIE cross-spatial information extraction module in the same channel are concatenated with the feature map after the initial convolution operation. Through concatenation, the model can simultaneously utilize local detailed information and global macroscopic information, enabling it to learn more robust representations and improve its ability to detect and classify targets at different scales. Subsequent layers reuse low-level information, reducing information loss, improving gradient propagation efficiency, enhancing feature representation capabilities, and improving the model's robustness, feature diversity, and ability to distinguish targets from background. This allows the network to maintain a high recognition rate even in complex scenes with different backgrounds, lighting, and viewing angles. The process is as follows:
[0082] 11) After the RGB data and Event data of the two modalities are processed through independent initial convolution operations, the basic features are extracted to obtain the corresponding feature maps $F_{RGB}$ and $F_E$;
[0083] 12) Extract local features using the ASIE cross-spatial information extraction module to obtain corresponding feature maps.
[0084] 13) Feature splicing within the same channel:
[0085] Within the RGB channel, $F_{RGB}$ and the corresponding ASIE feature map are concatenated along the channel dimension to form a new concatenated feature.
[0086]
[0087] Within the Event channel, the same concatenation process is performed to obtain...
[0088] The concatenated feature maps $F_{RGB\_C}$ and $F_{E\_C}$ are then compressed through a 1×1 convolution in the feature fusion network to reduce redundant information and are normalized to improve feature representation capabilities.
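The same-channel concatenation followed by 1×1 compression can be sketched as follows. The 1×1 convolution is written as a per-pixel linear map over channels. The form of the normalization (per-channel standardization over spatial positions) and the names `conv1x1` and `concat_and_compress` are assumptions made for this sketch; the patent does not specify them.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: x (C_in, H, W), weight (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

def concat_and_compress(f_init, f_asie, weight, eps=1e-5):
    """Concatenate along channels, compress with a 1x1 conv, then normalize."""
    f_cat = np.concatenate([f_init, f_asie], axis=0)   # (C1 + C2, H, W)
    f_out = conv1x1(f_cat, weight)                     # (C_out, H, W)
    # Assumed normalization: zero mean, unit variance per output channel.
    mean = f_out.mean(axis=(1, 2), keepdims=True)
    std = f_out.std(axis=(1, 2), keepdims=True)
    return (f_out - mean) / (std + eps)

rng = np.random.default_rng(0)
f_rgb = rng.standard_normal((4, 6, 6))        # feature map after initial convolution
f_rgb_asie = rng.standard_normal((4, 6, 6))   # stand-in for the ASIE feature map
w = rng.standard_normal((4, 8))               # compress 8 channels back to 4
f_rgb_c = concat_and_compress(f_rgb, f_rgb_asie, w)
```

The Event channel would apply the same function to its own pair of feature maps.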
[0089] 2) The features extracted by the ASIE cross-spatial information extraction module in different channels are fused. Features in a single channel may be incomplete, resulting in insufficient feature expression ability. Fusion combines information from different channels, avoids information loss, enhances the complementarity of the dual-channel features, learns richer patterns, makes features more diverse, alleviates the gradient vanishing problem, improves the generalization ability of the network, reduces the risk of overfitting, and improves adaptability to complex scenarios. The process is as follows:
[0090] 21) First, perform a channel transformation on the two concatenated features, then use a 1×1 convolution so that both channels have the same number of feature channels:
[0091]
[0092] 22) Obtain the weights of each feature using the softmax function, and then perform weighted feature fusion.
[0093]
[0094]
[0095]
[0096] 23) The final fused feature $F_{fuse}$ is input into the subsequent prediction head as a spatiotemporal feature representation, improving the model's target tracking ability in complex scenarios.
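A softmax-weighted fusion of two modality features can be sketched as below. Since the fusion formulas themselves are not reproduced in this text, the sketch assumes one plausible form: an element-wise softmax over the two channel-aligned features, with the resulting weights used for a weighted sum. The function name `softmax_fuse` and this choice of weighting are assumptions, not the patent's stated equations.

```python
import numpy as np

def softmax_fuse(f_rgb, f_e):
    """f_rgb, f_e: (C, H, W) with equal shapes. Returns (w_rgb, w_e, f_fuse)."""
    stacked = np.stack([f_rgb, f_e], axis=0)                 # (2, C, H, W)
    shifted = stacked - stacked.max(axis=0, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    weights = exp / exp.sum(axis=0, keepdims=True)           # softmax over modalities
    f_fuse = weights[0] * f_rgb + weights[1] * f_e           # weighted feature fusion
    return weights[0], weights[1], f_fuse

rng = np.random.default_rng(0)
f_rgb = rng.standard_normal((4, 6, 6))
f_e = rng.standard_normal((4, 6, 6))
w_rgb, w_e, f_fuse = softmax_fuse(f_rgb, f_e)
```

Under this assumed form the two weights sum to one at every position, so the fused value always lies between the two modality values.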
[0097] A time-weighted adaptive template update mechanism is added after the two fully connected layers in the prediction head. The recent appearance of the target is more important, so it is given a time weight, making the most recent frame have a greater influence. This allows the tracker to adjust the template as the target changes, thereby improving tracking performance. The template update mechanism is as follows:
[0098] $T_{new} = \alpha T_{old} + (1 - \alpha) T_{current}$
[0099] Here $T_{new}$ represents the updated template frame, $T_{old}$ represents the template of the previous frame, $T_{current}$ represents the current template frame, and $\alpha$ is the update weight. The mechanism gradually merges information from past and current frames so that the template changes slowly and avoids abrupt changes.
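The update rule is a convex combination of the old and current templates, which a few lines of NumPy make concrete. The function name `update_template` and the restriction of `alpha` to [0, 1] are assumptions of this sketch; the patent does not give a value or range for the update weight.

```python
import numpy as np

def update_template(t_old, t_current, alpha):
    """T_new = alpha * T_old + (1 - alpha) * T_current."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * t_old + (1.0 - alpha) * t_current

t_old = np.zeros((2, 2))
t_current = np.ones((2, 2))
t_new = update_template(t_old, t_current, alpha=0.9)
```

With `alpha=0.9` the template moves one tenth of the way toward the current frame at each update; a smaller `alpha` gives the most recent frame more influence.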
[0100] The ASIE cross-spatial information extraction module and DMSTF dual-modal spatiotemporal fusion module designed in this invention can more accurately capture changes in various regions of an image when processing complex images, effectively avoiding local enhancement and obtaining clearer, higher-quality features.
[0101] Figure 3 compares the accuracy and success rate of the present invention with those of mainstream trackers on the VisEvent test set. As Figure 3 shows, the method of the present invention achieves higher accuracy and success rate than existing methods on VisEvent, a large-scale image-event single-target tracking benchmark dataset.
[0102] Figure 4 compares the accuracy and success rate of the present invention with those of mainstream trackers on the COESOT test set. As Figure 4 shows, the method of the present invention achieves higher accuracy and success rate than existing methods on COESOT, a large-scale, long-term RGB-E dual-modal single-object benchmark dataset.
[0103] Figure 5 compares the visualization results of the present invention with those of mainstream trackers on the VisEvent and COESOT test sets. As Figure 5 shows, the method of the present invention achieves better results than other mainstream methods in complex scenarios such as low light, fast movement, target occlusion, and out-of-view conditions, and can robustly track targets in most complex scenarios.
[0104] The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several improvements without departing from the principle of the present invention, and these improvements should also be considered within the scope of protection of the present invention.
Claims
1. A target tracking method based on RGB-E in complex driving scenarios, characterized in that the steps are as follows:
Real-time traffic video is collected and converted into two modalities, an RGB image sequence and an event sequence, to construct an RGB-E sequence dataset;
An MDNet multi-domain learning network is built based on the MDNet framework. The MDNet multi-domain learning network performs initial convolution and feature extraction on RGB image sequences and event sequences respectively, then performs feature fusion on RGB image sequences and event sequences, and finally inputs them into the prediction head. The prediction head obtains the bounding box of the target object and outputs the tracking result of the current frame.
The MDNet multi-domain learning network includes a feature extraction network, a feature fusion network, and a prediction head. The feature extraction network performs initial convolution and feature extraction on the RGB image sequence and the event sequence in two channels respectively. Then, the feature fusion network performs feature merging and feature fusion operations. Finally, the fused features of the RGB image sequence and the event sequence are input into the prediction head.
The feature fusion network includes a DMSTF dual-modal spatiotemporal fusion module, which is an improvement on the MSF module. The DMSTF dual-modal spatiotemporal fusion module implements the following:
1) The features extracted by the ASIE cross-spatial information extraction module in the same channel are concatenated with the feature map after the initial convolution operation;
2) The features extracted by the ASIE cross-spatial information extraction module from different channels are fused;
The features extracted by the ASIE cross-spatial information extraction module in the same channel are concatenated with the feature map after the initial convolution operation. The process is as follows:
11) After the RGB data and Event data of the two modalities are processed through independent initial convolution operations, the basic features are extracted to obtain the corresponding feature maps $F_{RGB}$ and $F_E$;
12) Local features are extracted using the ASIE cross-spatial information extraction module to obtain the corresponding feature maps;
13) Feature concatenation within the same channel: within the RGB channel, $F_{RGB}$ and the corresponding ASIE feature map are concatenated along the channel dimension to form the new concatenated feature $F_{RGB\_C}$; within the Event channel, the same concatenation process is performed to obtain $F_{E\_C}$;
The features extracted from different channels by the ASIE cross-spatial information extraction module are fused. The process is as follows:
21) First, a channel transformation is performed on the two concatenated features, then a 1×1 convolution is used so that both channels have the same number of feature channels;
22) The weights of each feature are obtained using the softmax function, and the features are then weighted and fused;
23) The final fused feature $F_{fuse}$ is input into the subsequent prediction head as a spatiotemporal feature representation.
2. The target tracking method based on RGB-E in complex driving scenarios as described in claim 1, characterized in that: Residual connections are added after each convolutional layer in the feature extraction network.
3. The target tracking method based on RGB-E in complex driving scenarios as described in claim 1, characterized in that: After the initial convolution operation of each channel in the feature extraction network, the sub-features obtained after the initial convolution operation of each channel are extracted by the designed ASIE cross-spatial information extraction module. The ASIE cross-spatial information extraction module is an improvement based on the EMA attention mechanism. The implementation process of the ASIE cross-spatial information extraction module is as follows:
1) The multi-scale features obtained from the third convolutional layer in the feature extraction network are divided into multiple sub-features according to the set training round B: $X_i = [X_0, X_1, \ldots, X_{B-1}]$;
2) Employing multi-scale convolution and attention mechanisms, average pooling is performed along the horizontal and vertical spatial directions to adaptively capture local and global features, cross-scale information, and cross-modal information, while preserving long-distance dependencies in the horizontal direction and positional information in the vertical direction. The pooling is expressed with AvgPool, the average pooling operation, and yields two features that represent the results of pooling in the horizontal and vertical directions, respectively.
4. The target tracking method based on RGB-E in complex driving scenarios as described in claim 1, characterized in that: The prediction head incorporates a time-weighted adaptive template update mechanism. In each frame, the similarity between the template and the target images in the most recent and historical frames is calculated at different scales. The scale factor with the highest similarity is selected as the scale change amount for the current frame, and the template is updated to adapt to the scale change of the target.
5. The RGB-E-based target tracking method in complex driving scenarios as described in claim 4, characterized in that: The time-weighted adaptive template update mechanism is as follows: $T_{new} = \alpha T_{old} + (1 - \alpha) T_{current}$, where $T_{new}$ represents the updated template frame, $T_{old}$ represents the template of the previous frame, $T_{current}$ represents the current template frame, and $\alpha$ is the update weight. The mechanism gradually merges information from past and current frames so that the template changes slowly and avoids abrupt changes.