Unmanned aerial vehicle navigation feature extraction method based on target perception and multi-branch fusion

By combining visual Transformer and convolutional neural network, and introducing a low-rank multi-head attention mechanism and a lightweight gated convolutional feedforward module, the problems of global modeling and feature fusion in UAV autonomous navigation are solved, thereby improving the autonomous navigation performance and robustness of UAVs in complex environments.

CN122244616A · Pending · Publication Date: 2026-06-19 · CHENGDU UNIV OF INFORMATION TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHENGDU UNIV OF INFORMATION TECH
Filing Date
2026-03-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing UAV autonomous navigation systems, convolutional neural networks have limited receptive fields and lack global modeling capabilities. The fusion of state information and image features is superficial, and the deployment cost of Transformer structures is high, resulting in insufficient autonomous navigation performance and system robustness in complex environments.

Method used

We adopt a target perception and multi-branch fusion approach, combining a visual Transformer with a convolutional neural network, introducing a low-rank multi-head attention mechanism and a lightweight gated convolutional feedforward module, and using an improved CBAM attention module to enhance feature extraction. We fuse visual and task state information to form a joint feature representation.

Benefits of technology

It significantly improves the autonomous decision-making and navigation performance of UAVs in complex environments, reduces computational complexity, adapts to embedded UAV platforms, enhances the global modeling and target perception capabilities of the model, and strengthens the accuracy and robustness of navigation strategies.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

This invention discloses a UAV navigation feature extraction method based on target perception and multi-branch fusion, comprising the following steps: acquiring a depth image of the UAV, preprocessing it, and constructing a dual-channel input image; constructing a final input sequence from the depth image through image patch embedding and position encoding, and inputting the final input sequence into a visual Transformer feature extraction branch to obtain global perception features; inputting the dual-channel input image into a convolutional neural network feature extraction branch to obtain local detail features; fusing the global perception features and local detail features to obtain an overall perception feature vector; and inputting the overall perception feature vector into a policy network to establish a navigation action decision model, and outputting navigation actions based on the current perception state. By introducing a visual Transformer structure and a target state guidance token mechanism, the global modeling capability and target perception capability of the feature extraction module are significantly improved.

Description

Technical Field

[0001] This invention belongs to the field of UAV navigation technology and specifically relates to a UAV navigation feature extraction method based on target perception and multi-branch fusion.

Background Technology

[0002] Existing autonomous navigation systems for unmanned aerial vehicles (UAVs) largely rely on depth images as perception input. A feature extraction network generates a state representation, which is then input into a decision network to generate control actions. Traditional feature extraction methods are typically based on CNN (Convolutional Neural Network) architectures. While these methods perform well in local feature extraction, possessing advantages such as mature structure and high computational efficiency, they suffer from significant shortcomings in global modeling capabilities. This is particularly true when dealing with distant obstacles, sparse targets, or large-scale background scenes, where CNN models struggle to capture dependencies between long-distance regions, impacting the navigation strategy's understanding of the overall scene.

[0003] To compensate for the global perception deficiency of CNNs, some studies have introduced the ViT (Visual Transformer) structure, which uses a multi-head self-attention mechanism to achieve global modeling between image patches. However, existing ViT models still have several problems in their application in UAV navigation tasks: on the one hand, the standard ViT structure consumes a lot of computational resources and is not suitable for real-time deployment on embedded flight control platforms; on the other hand, its structural design usually ignores the guiding role of target-related state information (such as relative position, orientation angle, etc.) in navigation tasks, failing to reflect the perception requirements of navigation tasks, thus limiting the model's task adaptability.

[0004] Furthermore, existing feature extraction methods often employ direct concatenation or simple addition between image branches and state vectors, resulting in insufficient information integration depth. The modulating effect of state information on visual features is not fully utilized, leading to extracted features lacking a clear expression of navigation intent. Additionally, some fusion models have high structural complexity, making them difficult to deploy and run efficiently on resource-constrained platforms.

[0005] In summary, the mainstream methods for feature extraction in existing UAV navigation generally suffer from the following drawbacks: (1) CNNs have limited receptive fields and lack global modeling capabilities; (2) the fusion of state information and image features is shallow, resulting in weak task perception capabilities; (3) Transformer structures have high deployment costs and limited practicality; (4) multi-branch fusion strategies are simple and struggle to balance expressive power and computational efficiency. These problems severely restrict the autonomous navigation performance and system robustness of UAVs in complex and dynamic environments, necessitating the development of an improved feature extraction method that integrates task perception mechanisms, possesses efficient global modeling capabilities, and is easy to deploy.

Summary of the Invention

[0006] To address the aforementioned shortcomings in existing technologies, the UAV navigation feature extraction method based on target perception and multi-branch fusion provided by this invention solves the key problems existing in the image feature extraction module of current UAV autonomous navigation tasks, including the limited receptive field of convolutional neural networks, insufficient utilization of target state information, shallow feature fusion methods, and high deployment cost of Transformer structures on resource-constrained platforms.

[0007] To achieve the aforementioned objectives, the present invention employs the following technical solution: a UAV navigation feature extraction method based on target perception and multi-branch fusion, comprising the following steps: S1. Acquire depth images of the UAV, perform preprocessing, and construct a dual-channel input image; S2. Construct the final input sequence from the depth image through image patch embedding and position encoding. Input the final input sequence into the visual Transformer feature extraction branch to obtain global perception features. S3. Input the dual-channel input image into the feature extraction branch of the convolutional neural network to obtain local detail features; S4. Fuse the global perception features with the local detail features to obtain the overall perception feature vector; S5. Input the overall perception feature vector into the policy network, establish a navigation action decision model through end-to-end training, and output navigation actions based on the current perception state.

[0008] Furthermore, S1 includes the following sub-steps: S11. Acquire the depth image of the UAV and standardize it to obtain a standardized depth grayscale image. In S11, the standardization uses an image scaling formula of the form $I(i,j) = 255 - 255 \cdot \min\left(d(i,j),\, d_{\max}\right) / d_{\max}$, where $d_{\max}$ is the maximum depth range, $\min(\cdot)$ takes the minimum value, and $d(i,j)$ is the depth value at pixel $(i,j)$; S12. Create a state information matrix based on the depth image; S13. Concatenate the standardized depth grayscale image and the state information matrix along the channel dimension to generate a dual-channel input image.

[0009] Furthermore, S2 includes the following sub-steps: S21. Obtain the Patch embedding vectors by performing a two-dimensional convolution operation on the depth image, and flatten them to obtain the Patch embedding sequence; S22. Add special token information to the Patch embedding sequence to obtain the initial input sequence of the Transformer; S23. Add position encoding to the initial input sequence to obtain the final input sequence with fused position information; S24. Input the final input sequence into the visual Transformer feature extraction branch to obtain the global perception features output by that branch. The visual Transformer feature extraction branch is a multi-layer structure. Each layer includes a first Group Norm layer, a low-rank multi-head attention mechanism layer, a second Group Norm layer, and a lightweight gated convolutional feedforward module connected in sequence. Within each layer, the output of the low-rank multi-head attention mechanism layer is added to the input of the first Group Norm layer through a residual connection, and the sum is input to the second Group Norm layer. The output of the lightweight gated convolutional feedforward module is added to the input of the second Group Norm layer through a residual connection to obtain the output of the layer.

[0010] Furthermore, in S22, the special token information includes the category token and the target token. The category token is added to the Patch embedding sequence as follows: expand the category token along the batch dimension and concatenate it at the beginning of the Patch embedding sequence to obtain the first sequence form, $X_1 = \mathrm{Concat}(x_{cls}, X_p)$, where $x_{cls}$ is the category token parameter, $X_p$ is the Patch embedding sequence, and $\mathrm{Concat}(\cdot)$ is the concatenation operation. The target token is added to the Patch embedding sequence as follows: the target state vector $s$ of the UAV relative to the target is converted through the linear projection layer parameters $W_t$ into the target token $x_{tgt}$; the target token is then concatenated after the category token to obtain the second sequence form, $X_2 = \mathrm{Concat}(x_{cls}, x_{tgt}, X_p)$.

[0011] Furthermore: S23 specifically refers to: For each token in the initial input sequence of the Transformer, construct a corresponding position encoding vector. Add the initial input sequence of the Transformer to the corresponding position encoding vector element by element to obtain the final input sequence with fused position information.
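The token assembly of S22 and the position encoding of S23 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `build_input_sequence`, the bias-free projection matrix `w_t`, and the use of a precomputed position table `pos` are all hypothetical choices, and the text does not say whether the position encoding is learned or fixed.

```python
import numpy as np

def build_input_sequence(x_p, cls_token, state, w_t, pos):
    """Assemble [category token, target token, patches] and add position codes.

    x_p:       (B, N, D) Patch embedding sequence
    cls_token: (1, 1, D) category token parameter
    state:     (B, S)    target state vector of the UAV relative to the target
    w_t:       (S, D)    linear projection weights (assumed bias-free)
    pos:       (1, N + 2, D) position encoding table (assumed precomputed)
    """
    b, _, d = x_p.shape
    cls = np.broadcast_to(cls_token, (b, 1, d))      # expand along the batch dimension
    tgt = (state @ w_t)[:, None, :]                  # target token, shape (B, 1, D)
    seq = np.concatenate([cls, tgt, x_p], axis=1)    # (B, N + 2, D)
    return seq + pos                                 # element-wise position fusion

rng = np.random.default_rng(0)
B, N, D, S = 2, 30, 16, 3
x_p = rng.standard_normal((B, N, D))
cls_token = rng.standard_normal((1, 1, D))
state = rng.standard_normal((B, S))
w_t = rng.standard_normal((S, D))
pos = np.zeros((1, N + 2, D))
seq = build_input_sequence(x_p, cls_token, state, w_t, pos)
```

The two special tokens lengthen the sequence from N to N + 2, so the position table needs one entry per token including the category and target tokens.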

[0012] Furthermore, in S24, the workflow of the low-rank multi-head attention mechanism layer is as follows: A1. The input $X$ of the low-rank multi-head attention mechanism layer is mapped through three low-rank linear layers to the query $Q$, the key $K$, and the value $V$: $Q_h = X W_Q^{(h)}$, $K_h = X W_K^{(h)}$, $V_h = X W_V^{(h)}$, where $W_Q^{(h)}$, $W_K^{(h)}$, and $W_V^{(h)} \in \mathbb{R}^{D \times d_r}$ are the low-rank weight matrices used for each attention head, $\mathbb{R}$ denotes a real-valued matrix space, $D$ is the embedding dimension, $d_r$ is the reduced dimension of each head, and $N_h$ is the number of heads ($h = 1, \dots, N_h$); A2. Using the query $Q$, the key $K$, and the value $V$, perform scaled dot-product attention independently for each attention head: $\mathrm{head}_h = \mathrm{Softmax}\left(Q_h K_h^{\top} / \sqrt{d_r}\right) V_h$, where $\mathrm{Softmax}(\cdot)$ is the Softmax activation function and $\top$ is the transpose symbol. The outputs of all attention heads are concatenated and fused using a linear mapping to obtain the output $O$ of the low-rank multi-head attention mechanism layer: $O = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_{N_h})\, W_O$, where $\mathrm{Concat}(\cdot)$ is the concatenation operation, $W_O$ is the output fusion matrix, and $\mathrm{head}_h$ is the output of the $h$-th attention head. In S24, the workflow of the lightweight gated convolutional feedforward module is as follows: B1. The input tensor $x$ of the lightweight gated convolutional feedforward module passes through a 1×1 convolutional layer and is projected into two branch tensors $x_1$ and $x_2$ of equal channel dimension, which serve as the information stream and the gating stream, respectively: $[x_1, x_2] = \mathrm{Conv}_{1 \times 1}(x)$, where $\mathrm{Conv}_{1 \times 1}$ is a 1×1 convolutional layer. A gating mechanism multiplies the branch tensors element-wise to yield the gated tensor $g = x_1 \odot \sigma(x_2)$, where $\sigma$ is the Sigmoid function; B2. Perform a depthwise separable convolution operation on the gated tensor, followed by group normalization, to obtain the first tensor $y_1 = \mathrm{GN}(\mathrm{DWConv}(g))$, where $\mathrm{DWConv}$ is the depthwise separable convolution operation and $\mathrm{GN}$ is group normalization; B3. Perform global average pooling on the first tensor to obtain the channel description vector, use two convolutions to construct a squeeze-and-excitation network, and generate channel weights through Sigmoid activation. The first tensor is multiplied by the channel weights to obtain the channel-recalibrated second tensor; B4. The second tensor is added to the input tensor through a residual connection to obtain the output of the lightweight gated convolutional feedforward module.
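Steps A1 and A2 can be illustrated with a short NumPy sketch. It assumes one low-rank projection per head of shape (D, r) with r smaller than D divided by the number of heads, and it uses hypothetical names (`low_rank_mhsa`, `w_q`, `w_k`, `w_v`, `w_o`); the actual parameterization used in the patent may differ.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_mhsa(x, w_q, w_k, w_v, w_o):
    """x: (B, N, D); w_q, w_k, w_v: (heads, D, r); w_o: (heads * r, D)."""
    # A1: per-head low-rank projections of the input
    q = np.einsum("bnd,hdr->bhnr", x, w_q)
    k = np.einsum("bnd,hdr->bhnr", x, w_k)
    v = np.einsum("bnd,hdr->bhnr", x, w_v)
    r = q.shape[-1]
    # A2: scaled dot-product attention, computed independently per head
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(r))   # (B, heads, N, N)
    heads = attn @ v                                            # (B, heads, N, r)
    b, h, n, _ = heads.shape
    concat = heads.transpose(0, 2, 1, 3).reshape(b, n, h * r)   # concatenate the heads
    return concat @ w_o                                         # linear output fusion

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 32, 64))
w_q, w_k, w_v = (rng.standard_normal((4, 64, 8)) * 0.1 for _ in range(3))
w_o = rng.standard_normal((4 * 8, 64)) * 0.1
out = low_rank_mhsa(x, w_q, w_k, w_v, w_o)
```

In this sketch the projections shrink from D × D to D × r per head, which reduces the parameter count and the cost of the projection and score computations; the attention map itself is still N × N.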

[0013] Furthermore: S3 specifically refers to: S31. Input the dual-channel input image into the first convolutional layer and the ReLU activation function layer in sequence to obtain the first feature map; S32. Input the first feature map into the improved CBAM attention module to obtain the attention-weighted feature map; S33. Compress the attention-weighted feature map into a fixed-dimensional vector through global average pooling, and project it onto the same dimension as the global perceptual features using a fully connected layer to obtain local detail features.

[0014] Furthermore, in S32, the workflow of the improved CBAM attention module is as follows: S321. Input the first feature map into the channel attention submodule, compress each channel using global average pooling to generate a channel-level global description vector, and obtain the channel attention weighting result through the non-linear mapping of a 1×1 convolution and a ReLU activation function; S322. Input the first feature map into the spatial attention submodule, perform max pooling and average pooling along the channel dimension to obtain two single-channel feature maps, concatenate the two single-channel feature maps, and then perform a 3×3 convolution to obtain the spatial attention weighting result. The spatial attention submodule sets a spatial weight function according to the difference between the neighborhood mean and the global minimum in the feature map, which improves the attention response of the convolutional neural network feature extraction branch near obstacle regions. The weight function depends on the neighborhood mean depth of each pixel, the global minimum depth, and an adjustable hyperparameter; S323. Multiply the channel attention weighting result and the spatial attention weighting result with the first feature map to obtain the attention-weighted feature map.

[0015] Furthermore, S4 includes the following sub-steps: S41. Extract the category token features from the global perception features; S42. Standardize the category token features by layer normalization, then project them onto the same feature dimension as the local detail features using a fully connected layer to obtain global semantic features; S43. Use a residual fusion strategy to fuse the global semantic features and the local detail features by weighting, obtaining the fused features; S44. Concatenate the fused features with the target state vector at the feature level to obtain the overall perception feature vector, which is used for action decision-making or value function estimation.
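Sub-steps S41 to S44 can be sketched as follows. The text does not give the exact form of the weighted residual fusion, so this example assumes `local + lam * global` with a scalar weight `lam`; that form, the function names, and the parameter names are illustrative assumptions only.

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    mu = v.mean(axis=-1, keepdims=True)
    var = v.var(axis=-1, keepdims=True)
    return (v - mu) / np.sqrt(var + eps)

def fuse_features(global_feats, local_feat, state, w_g, b_g, lam=0.5):
    """global_feats: (B, N + 2, D) Transformer output with the category token first.
    local_feat: (B, D_l) CNN branch feature; state: (B, S) target state vector.
    w_g: (D, D_l), b_g: (D_l,) fully connected projection (hypothetical names)."""
    cls = global_feats[:, 0, :]                  # S41: category token features
    g = layer_norm(cls) @ w_g + b_g              # S42: normalize, then project
    fused = local_feat + lam * g                 # S43: assumed weighted residual fusion
    return np.concatenate([fused, state], axis=1)  # S44: append the target state

rng = np.random.default_rng(0)
global_feats = rng.standard_normal((2, 32, 128))
local_feat = rng.standard_normal((2, 64))
state = rng.standard_normal((2, 3))
w_g = rng.standard_normal((128, 64)) * 0.1
b_g = np.zeros(64)
feature = fuse_features(global_feats, local_feat, state, w_g, b_g)
```

The resulting vector has dimension D_l + S per sample and would be the input to the policy or value network.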

[0016] The beneficial effects of this invention are as follows: By introducing a visual Transformer structure and a target state-guided Token mechanism, this invention significantly improves the global modeling capability and target perception capability of the feature extraction module; by introducing a low-rank multi-head attention mechanism layer and a lightweight gated convolutional feedforward module, it effectively reduces computational complexity and adapts to embedded UAV platforms; by utilizing an improved CBAM attention module to enhance the local structure perception capability of the convolutional neural network feature extraction branch, and by using a multi-branch residual fusion method, it achieves effective collaboration between local and global features; the final joint feature integrates visual and task state information, which can directly serve end-to-end navigation strategy learning, improving the autonomous decision-making and navigation performance of UAVs in complex environments. Compared with the prior art, this invention has the following advantages.

[0017] (1) Fusion feature extraction architecture design: This invention significantly improves the global modeling capability and target perception capability of the feature extraction module by introducing a visual Transformer structure and a target state guidance Token mechanism; it introduces a low-rank multi-head attention mechanism layer and a lightweight gated convolutional feedforward module to effectively reduce computational complexity and adapt to embedded UAV platforms; it uses an improved CBAM attention module to enhance the local structure perception capability of the convolutional neural network feature extraction branch, and achieves effective collaboration between local and global features through multi-branch residual fusion; the final joint feature integrates visual and task state information, which can directly serve end-to-end navigation strategy learning and improve the autonomous decision-making and navigation performance of UAVs in complex environments.

[0018] (2) Introducing a target-guided token to enhance task perception: Addressing the lack of focus on task targets in traditional Transformer models, this invention innovatively introduces a "target-guided token" mechanism. This mechanism encodes the relative distance and orientation angle between the UAV's current position and the target into a target state vector, generating a target token embedded in the Transformer input sequence. This token plays a guiding role in the multi-head attention mechanism, enabling the model to explicitly focus on image regions related to the target navigation task, thereby improving the accuracy and relevance of decision-making.

[0019] (3) Low-rank multi-head attention mechanism for efficient modeling: To adapt to UAV platforms with limited computing resources, this invention introduces a low-rank multi-head attention mechanism layer in the visual Transformer structure. By performing low-rank linear transformations on queries, keys, and values, the complexity of the attention matrix multiplication operation is reduced from the original O(N^2) to an approximately linear level. While ensuring global modeling capabilities, this significantly reduces the computational burden and improves model deployment efficiency.

[0020] (4) Lightweight Gated Convolutional Feedforward Module Replaces Traditional MLP: This invention proposes a lightweight gated convolutional feedforward module that integrates a lightweight convolutional gating mechanism and a channel attention mechanism, suitable for visual feature extraction in depth map-driven UAV navigation tasks. Structurally, this module replaces the feed-forward network (FFN) in the traditional Transformer by introducing a lightweight convolutional gating unit to more efficiently model the local spatial information of the image. Specifically, the module first achieves a linear transformation in the channel dimension through 1×1 convolution, and divides the output features into two parts in the channel dimension. One part is processed by depthwise separable convolution to extract local features, and the other part is used as a gating weight for channel-wise multiplication, achieving efficient nonlinear modeling.

[0021] Building upon this foundation, the module further integrates a channel attention mechanism (Squeeze-and-Excitation, SE). Channel descriptors are obtained by global average pooling of intermediate feature maps, and after nonlinear transformation, channel attention weights are generated. These weighted weights are then used to adjust the feature maps, thereby enhancing the model's responsiveness to key semantic channels. This mechanism is particularly suitable for depth images, improving the model's target perception capability with low computational cost by relying solely on a single-channel input. The entire module design maintains lightweight characteristics, employing residual connections to ensure gradient flow and feature information integrity, making it suitable for deployment on computationally sensitive embedded UAV platforms.
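The gated convolutional feedforward module with channel recalibration described above can be sketched for a single feature map of shape (C, H, W). Several details are assumptions rather than statements from the patent: the 3×3 depthwise kernel, the ReLU between the two squeeze-and-excitation projections, the reduction ratio, the number of groups, and all parameter names in the dictionary `p`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(x, w):
    """x: (C_in, H, W); w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def depthwise3x3(x, k):
    """x: (C, H, W); k: (C, 3, 3). Zero padding keeps the spatial size."""
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[:, i, j, None, None] * xp[:, i:i + h, j:j + w]
    return out

def group_norm(x, groups, eps=1e-5):
    c, h, w = x.shape
    g = x.reshape(groups, -1)                    # each row holds one group of channels
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(c, h, w)

def gated_conv_ffn(x, p, groups=4):
    a, b = np.split(conv1x1(x, p["w_in"]), 2, axis=0)    # information and gating streams
    g = a * sigmoid(b)                                   # element-wise gating
    dw = conv1x1(depthwise3x3(g, p["k_dw"]), p["w_pw"])  # depthwise then pointwise conv
    y1 = group_norm(dw, groups)
    s = y1.mean(axis=(1, 2))                             # squeeze: global average pooling
    w = sigmoid(p["w_up"] @ np.maximum(p["w_down"] @ s, 0.0))  # excite (ReLU assumed)
    y2 = y1 * w[:, None, None]                           # channel recalibration
    return y2 + x                                        # residual connection

rng = np.random.default_rng(0)
C = 8
p = {
    "w_in": rng.standard_normal((2 * C, C)) * 0.1,
    "k_dw": rng.standard_normal((C, 3, 3)) * 0.1,
    "w_pw": rng.standard_normal((C, C)) * 0.1,
    "w_down": rng.standard_normal((C // 4, C)) * 0.1,
    "w_up": rng.standard_normal((C, C // 4)) * 0.1,
}
x = rng.standard_normal((C, 5, 6))
y = gated_conv_ffn(x, p)
```

Because the module ends with a residual addition, the output keeps the input shape, and a module whose input projection is zero passes the input through unchanged.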

[0022] (5) Convolutional Neural Network Feature Extraction Branch Combining Improved CBAM Attention Module and Obstacle Perception Mechanism: This invention integrates an improved CBAM attention module into the convolutional neural network branch for visual feature extraction. This module consists of a channel attention module and a spatial attention module, which are used to significantly enhance the shallow convolutional feature maps, thereby improving the model's ability to focus on key visual regions and its response strength in complex environments.

[0023] In the channel attention submodule, this invention employs a strategy that integrates global average pooling and global max pooling (GMP) to model the importance of each channel. Specifically, the input feature map is subjected to average pooling and max pooling along its spatial dimensions to obtain two channel descriptors. These descriptors are then nonlinearly modeled and summed using a two-layer fully connected network with shared weights to generate the final channel attention weights. This mechanism effectively integrates the information responses of different channels at different scales, significantly improving the selective activation capability of key semantic channels.
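The channel attention described in this paragraph can be sketched as below. The Sigmoid at the end follows the standard CBAM design and is an assumption here, as are the ReLU inside the shared network and the names `w1` and `w2`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f, w1, w2):
    """f: (C, H, W) feature map; w1: (C_r, C) and w2: (C, C_r) are shared weights."""
    avg = f.mean(axis=(1, 2))                     # global average pooling descriptor
    mx = f.max(axis=(1, 2))                       # global max pooling descriptor

    def shared_mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)       # two layers with ReLU (assumed)

    return sigmoid(shared_mlp(avg) + shared_mlp(mx))   # per-channel weights in (0, 1)

rng = np.random.default_rng(0)
f = rng.standard_normal((16, 10, 12))
w1 = rng.standard_normal((4, 16)) * 0.1
w2 = rng.standard_normal((16, 4)) * 0.1
weights = channel_attention(f, w1, w2)
reweighted = f * weights[:, None, None]           # apply the weights channel by channel
```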

[0024] In the spatial attention submodule, this invention introduces an enhancement mechanism based on depth map characteristics to improve the model's attention response to important regions such as obstacles. Specifically, considering that obstacles in the depth map typically correspond to smaller depth values, to highlight the importance of such regions, the mean depth within the local window of each pixel location is first calculated, and a response enhancement factor is constructed by combining this with the minimum depth value of the entire image. This factor is modeled using an inter-weighting function, allowing regions closer to obstacles to receive greater attention weight, thereby enhancing them in the spatial attention map and significantly improving the model's sensitivity and discrimination ability to potential collision regions.
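The response enhancement factor can be illustrated with a hypothetical weighting function. The text only says the factor combines the local-window mean depth with the global minimum depth, so the exponential decay `exp(-alpha * (local_mean - d_min))` used here is an assumption chosen for illustration, as are the window size and the name `obstacle_weight`.

```python
import numpy as np

def obstacle_weight(depth, window=3, alpha=0.5):
    """Return a weight map in (0, 1] that is larger where the local mean depth
    is close to the global minimum depth (assumed exponential form)."""
    pad = window // 2
    dp = np.pad(depth, pad, mode="edge")          # replicate borders for the window mean
    h, w = depth.shape
    local_mean = np.zeros((h, w), dtype=np.float64)
    for i in range(window):
        for j in range(window):
            local_mean += dp[i:i + h, j:j + w]
    local_mean /= window * window
    return np.exp(-alpha * (local_mean - depth.min()))

# Left half of the scene is near (1 m), right half is far (10 m).
depth = np.full((6, 6), 10.0)
depth[:, :3] = 1.0
weight = obstacle_weight(depth)
```

With this choice, pixels whose neighborhood lies entirely at the minimum depth receive weight 1, and the weight decays as the neighborhood gets farther away.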

[0025] The improved CBAM attention module, without significantly increasing computational complexity, fully combines the physical semantic information of the depth map with the hierarchical structure of convolutional features, effectively improving the model's performance in key navigation tasks such as target detection and obstacle avoidance. It has good engineering practicality and deployment adaptability.

[0026] (6) High-dimensional joint feature fusion strategy: After extracting visual features in the visual Transformer feature extraction branch and the convolutional neural network feature extraction branch respectively, this invention concatenates and fuses them with the UAV target state vector to form a joint feature representation. This representation simultaneously possesses three information dimensions: global image context, local image details and attention response region, and navigation target semantics. Finally, the fused features are used as input to the reinforcement learning policy network, significantly improving the policy learning effect.

[0027] (7) Adaptable to various reinforcement learning algorithms: The feature extractor structure designed in this invention has strong versatility and can be seamlessly integrated with mainstream reinforcement learning algorithms (such as SAC, TD3, TQC, etc.). Experimental verification shows that in navigation tasks in multiple simulation environments (such as AirSim), this feature extraction architecture can significantly improve the convergence speed, path stability and obstacle avoidance success rate of the model, and has broad practical value and engineering promotion prospects.

Attached Figure Description

[0028] Figure 1 is a flowchart of the UAV navigation feature extraction method based on target perception and multi-branch fusion according to the present invention.

[0029] Figure 2 is a diagram of the Transformer feature encoding structure that incorporates UAV position information.

[0030] Figure 3 is the network structure for extracting visual features with two branches.

[0031] Figure 4 is a structural diagram of the lightweight gated convolutional feedforward module.

[0032] Figure 5 shows the average reward curves of the improved feature extraction model and the CNN model during SAC training.

[0033] Figure 6 shows the average reward curves of the improved feature extraction model and the CNN model during TD3 training.

[0034] Figure 7 shows the average reward curves of the improved feature extraction model and the CNN model during TQC training.

[0035] Figure 8 compares the navigation paths of the model trained using the SAC algorithm with those of the CNN model in typical obstacle avoidance scenarios.

[0036] Figure 9 compares the navigation paths of the model trained using the TD3 algorithm with those of the CNN model in typical obstacle avoidance scenarios.

[0037] Figure 10 compares the navigation paths of the model trained using the TQC algorithm with those of the CNN model in typical obstacle avoidance scenarios.

Detailed Implementation

[0038] The specific embodiments of the present invention are described below to enable those skilled in the art to understand the present invention. However, it should be understood that the present invention is not limited to the scope of the specific embodiments. For those skilled in the art, various changes are obvious as long as they are within the spirit and scope of the present invention as defined and determined by the appended claims. All inventions utilizing the concept of the present invention are protected.

[0039] As shown in Figure 1, in one embodiment of the present invention, the UAV navigation feature extraction method based on target perception and multi-branch fusion includes the following steps: S1. Acquire depth images of the UAV, perform preprocessing, and construct a dual-channel input image; S2. Construct the final input sequence from the depth image through image patch embedding and position encoding, and input the final input sequence into the visual Transformer feature extraction branch to obtain global perception features; S3. Input the dual-channel input image into the convolutional neural network feature extraction branch to obtain local detail features; S4. Fuse the global perception features with the local detail features to obtain the overall perception feature vector; S5. Input the overall perception feature vector into the policy network, establish a navigation action decision model through end-to-end training, and output navigation actions based on the current perception state.

[0040] In this embodiment, the depth image of the UAV is in 32-bit floating-point format, and each pixel value represents the spatial distance from the camera to the nearest obstacle. The original image size is then uniformly resampled to a fixed resolution (e.g., 80×100 pixels) to meet the input requirements of the subsequent neural network structure.

[0041] S1 includes the following steps: S11. Acquire the depth image of the UAV and standardize it to obtain a standardized depth grayscale image. In S11, the standardization uses an image scaling formula of the form $I(i,j) = 255 - 255 \cdot \min\left(d(i,j),\, d_{\max}\right) / d_{\max}$, where $d_{\max}$ is the maximum depth range, $\min(\cdot)$ takes the minimum value, and $d(i,j)$ is the depth value at pixel $(i,j)$. The inverse mapping of "255 minus the depth value" makes obstacle areas appear brighter in the image, which helps to enhance the local attention response of the neural network and improves the sensitivity of obstacle recognition.

[0042] In this embodiment, all pixel values in the depth image are limited to the range $[0, d_{\max}]$ and linearly mapped to the grayscale range $[0, 255]$, so that near regions appear brighter and distant regions appear darker.
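The clipping and inverse mapping described in S11 can be sketched as follows. The default maximum depth of 20 m and the function name `standardize_depth` are hypothetical; the text does not state a value for the maximum depth range.

```python
import numpy as np

def standardize_depth(depth, d_max=20.0):
    """Clip depth to [0, d_max], then invert so near obstacles map to bright pixels."""
    clipped = np.clip(depth, 0.0, d_max)
    gray = 255.0 - 255.0 * clipped / d_max
    return gray.astype(np.float32)

# Depths in meters: touching, near, exactly at the limit, beyond the limit.
depth = np.array([[0.0, 5.0], [20.0, 35.0]], dtype=np.float32)
gray = standardize_depth(depth)
```

Pixels at zero depth map to 255, and every pixel at or beyond the maximum depth maps to 0, which gives the bright-near, dark-far effect described above.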

[0043] S12. Create a state information matrix based on the depth image; In S12, the created state information matrix has the same size as the depth image and is used to embed the flight state information of the current navigation task. It mainly includes the following three physical quantities: horizontal distance from the target point, vertical height difference, and yaw angle error.

[0044] S13. The standardized depth grayscale image and the state information matrix are concatenated along the channel dimension to generate a dual-channel input image.

[0045] In the dual-channel input image, the first channel is a grayscale depth image, providing scene spatial information; the second channel is embedded state information, providing navigation task guidance. Through the dual-channel design, spatial perception and target perception are integrated into a unified visual input, providing target guidance support for attention-based neural networks and improving the learning efficiency and generalization ability of reinforcement learning policy networks in navigation tasks.
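A minimal sketch of the dual-channel construction in S12 and S13 is given below. The patent does not specify how the three state quantities are laid out inside the state information matrix, so writing them into the first entries of an otherwise zero matrix is purely an assumption, as are the function name and the example values.

```python
import numpy as np

def build_dual_channel(gray, state):
    """gray: (H, W) standardized depth image; state: three task-state values.
    Returns a (2, H, W) array: channel 0 is depth, channel 1 is the state matrix."""
    h, w = gray.shape
    state_map = np.zeros((h, w), dtype=np.float32)
    state_map[0, :len(state)] = state             # assumed layout of the state values
    return np.stack([gray, state_map], axis=0)    # concatenate along the channel axis

gray = np.zeros((80, 100), dtype=np.float32)
# Hypothetical values: horizontal distance, vertical height difference, yaw angle error.
state = np.array([12.5, -1.0, 0.3], dtype=np.float32)
dual = build_dual_channel(gray, state)
```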

[0046] As shown in Figures 2-3, S2 includes the following sub-steps: S21. Obtain the Patch embedding vectors by performing a two-dimensional convolution operation on the depth image, and flatten them to obtain the Patch embedding sequence; S22. Add special token information to the Patch embedding sequence to obtain the initial input sequence of the Transformer; S23. Add position encoding to the initial input sequence to obtain the final input sequence with fused position information; S24. Input the final input sequence into the visual Transformer feature extraction branch to obtain the global perception features output by that branch. The visual Transformer feature extraction branch has a multi-layer structure. Each layer includes a first Group Norm layer, a low-rank multi-head self-attention (LR-MHSA) layer, a second Group Norm layer, and a lightweight gated convolutional feedforward module (CGLU-SE) connected in sequence. Within each layer, the output of the low-rank multi-head attention mechanism layer is added to the input of the first Group Norm layer through a residual connection, and the sum is fed into the second Group Norm layer. The output of the lightweight gated convolutional feedforward module is added to the input of the second Group Norm layer through a residual connection to obtain the output of the layer. The structure comprises L×2 layers, with the input of each layer connected to the output of the preceding layer. The input of the first layer is the final input sequence, and the output of the last layer is the global perception feature.

[0047] As shown in Figure 2, in S21 the input single-channel depth image $X \in \mathbb{R}^{B \times 1 \times H \times W}$ is split into patches and linearly projected by a two-dimensional convolution operation, Conv2d, where $\mathbb{R}$ denotes a real-valued matrix, $W$ is the width, $H$ is the height, and $B$ is the batch size. The kernel size is set to $k = P \times P$ and the stride to $s = P$, where $P$ is the set patch size, for example 16 pixels. The number of convolution kernels (output channels) is $C = D$, meaning each patch is mapped to a $D$-dimensional vector (e.g., 128-dimensional), forming $N = \frac{H}{P} \cdot \frac{W}{P}$ Patch embedding vectors.

[0048] In the specific implementation, embedding is accomplished through the following transformations:

$$X_p = \mathrm{Conv2d}(X) \in \mathbb{R}^{B \times D \times \frac{H}{P} \times \frac{W}{P}}$$

where $X_p$ denotes the Patch embedding vectors. Flattening the two-dimensional spatial structure into a sequence yields the Patch embedding sequence:

$$X_{seq} = \mathrm{Flatten}(X_p) \in \mathbb{R}^{B \times N \times D}, \quad N = \frac{H}{P} \cdot \frac{W}{P}$$

where $D$ represents the embedding dimension. In this way, the depth image is transformed into a sequence of fixed-length vectors suitable for subsequent Transformer processing. Two-dimensional spatial image information is converted into a one-dimensional token sequence, and a convolutional embedding unit learned end-to-end automatically extracts local spatial features while keeping the output dimensionality consistent, providing a foundation for modeling global image dependencies with the subsequent attention mechanisms.
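The patch embedding of S21 can be illustrated with a minimal NumPy sketch. It assumes non-overlapping patches (stride equal to the patch size), in which case the convolution reduces to cutting the image into tiles and applying one shared linear projection. The function name `patch_embed`, the 64×64 image size, and the random weights are illustrative assumptions, not values from the patent.

```python
import numpy as np

def patch_embed(depth, weight, bias, patch):
    """Non-overlapping patch embedding for single-channel images.

    Equivalent to a 2-D convolution whose kernel size and stride both
    equal `patch` (an assumption of this sketch).

    depth:  (B, H, W) depth images
    weight: (D, patch * patch) flattened kernels, one row per output channel
    bias:   (D,)
    returns (B, N, D) with N = (H // patch) * (W // patch)
    """
    b, h, w = depth.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of patch"
    gh, gw = h // patch, w // patch
    # Cut every image into a (gh x gw) grid of patch x patch tiles.
    tiles = depth.reshape(b, gh, patch, gw, patch).transpose(0, 1, 3, 2, 4)
    # Flatten the grid into a sequence and each tile into a vector.
    tiles = tiles.reshape(b, gh * gw, patch * patch)
    # Project every tile to D dimensions with the shared kernel weights.
    return tiles @ weight.T + bias

rng = np.random.default_rng(0)
P, D = 16, 128                      # example values mentioned in the text
depth = rng.random((2, 64, 64))     # hypothetical 64 x 64 depth images
tokens = patch_embed(depth, rng.normal(size=(D, P * P)) * 0.02, np.zeros(D), P)
print(tokens.shape)                 # (2, 16, 128)
```

Each row of the result is one patch token, so the sequence length grows with image area and shrinks with the square of the patch size.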

[0049] In S22, to enhance the semantic representation and target guidance capabilities of the model, special token information is added before the Patch embedding sequence. The special token information includes the category token (CLS token) and the target token (Target token). The category token is added as follows: expand the category token along the batch dimension and concatenate it at the beginning of the Patch embedding sequence to obtain the first sequence form:

$$Z_1 = \mathrm{Concat}(x_{cls}, X_{seq}) \in \mathbb{R}^{B \times (N+1) \times D}$$

where $x_{cls}$ is the category token parameter, $X_{seq}$ is the Patch embedding sequence, and $\mathrm{Concat}$ is the concatenation operation. In this embodiment, the category token is learned through backpropagation during training and is used to aggregate global feature information of the entire image. The model ultimately uses the output corresponding to this token as the image representation vector and inputs it into the downstream decision network.

[0050] The target token is added to the Patch embedding sequence as follows: the target state vector of the UAV relative to the target, $g$, is converted into the target token $x_{tgt}$ by a linear projection layer:

$$x_{tgt} = \mathrm{Linear}(g)$$

The target state vector includes horizontal distance, vertical distance, and yaw angle difference. The target token is then concatenated with the category token to obtain the second sequence form:

$$Z = \mathrm{Concat}(x_{cls}, x_{tgt}, X_{seq}) \in \mathbb{R}^{B \times (N+2) \times D}$$

where $x_{cls}$ is the category token and $X_{seq}$ is the Patch embedding sequence.
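A short sketch of how the two special tokens might be prepended to the patch sequence. The ordering (category token, then target token, then patches), the function name `build_input_sequence`, and the weight shapes are assumptions made for illustration; the text only states that the target token is concatenated with the category token.

```python
import numpy as np

def build_input_sequence(patch_seq, cls_token, target_state, w_tgt, b_tgt):
    """Prepend a category token and a target token to the patch sequence.

    patch_seq:    (B, N, D) patch embedding sequence
    cls_token:    (1, 1, D) learnable category token, shared across the batch
    target_state: (B, 3) horizontal distance, vertical distance, yaw difference
    w_tgt, b_tgt: (3, D) and (D,) linear projection producing the target token
    returns       (B, N + 2, D)
    """
    batch, _, dim = patch_seq.shape
    # Expand the category token along the batch dimension.
    cls = np.broadcast_to(cls_token, (batch, 1, dim))
    # Project the 3-D target state to one D-dimensional token per sample.
    tgt = (target_state @ w_tgt + b_tgt)[:, None, :]
    # Assumed order: [category token, target token, patch tokens].
    return np.concatenate([cls, tgt, patch_seq], axis=1)

rng = np.random.default_rng(0)
z = build_input_sequence(
    rng.normal(size=(2, 16, 128)),          # hypothetical patch sequence
    np.zeros((1, 1, 128)),
    np.array([[4.0, -1.5, 0.3], [12.0, 2.0, -0.8]]),
    rng.normal(size=(3, 128)) * 0.02,
    np.zeros(128),
)
print(z.shape)  # (2, 18, 128)
```

Because the target token sits in the same sequence as the patch tokens, every attention layer can relate image regions to the navigation goal.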

[0051] In this embodiment, introducing the category token lets the model extract global features from the entire sequence as a unified representation of the image. Embedding the target state information into the target guidance token makes the Transformer "aware of the target" when computing attention, guiding visual attention toward the area related to the navigation target and thereby improving the learning efficiency and convergence speed of the navigation strategy.

[0052] S23 specifically refers to: For each token in the initial input sequence of the Transformer, construct a corresponding position encoding vector. Add the initial input sequence of the Transformer to the corresponding position encoding vector element by element to obtain the final input sequence with fused position information.

[0053] Since the Transformer structure itself does not have the ability to process the positional information of elements in a sequence, a "positional encoding" mechanism must be introduced to enable the model to perceive the arrangement order of input features in the original space. In this embodiment, in order for the visual Transformer to effectively model the spatial structural relationship between each patch of the image, a learnable positional embedding method is adopted. The position of each image patch is explicitly encoded into a trainable vector, which is then added element-wise to its corresponding patch embedding to achieve spatial information injection.

[0054] Let the depth image size be $H \times W$ and the patch size be $P \times P$. The image is then divided into $N = \frac{H}{P} \cdot \frac{W}{P}$ patches. After adding 1 category token and 1 target token, the total input sequence length of the Transformer is $N + 2$.

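[0055] For each token in the sequence, construct a corresponding position encoding vector; together these vectors are denoted $E_{pos} \in \mathbb{R}^{(N+2) \times D}$, where $N$ is the number of patches and $D$ is the embedding dimension. The positional encoding vectors serve as model parameters, which are learned automatically during training without requiring manual design of a functional form.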

[0056] The initial input sequence of the Transformer, $Z$, is added element by element to the corresponding position encoding vectors to obtain the final input sequence with fused position information: $Z_0 = Z + E_{pos}$, where $E_{pos}$ denotes the position encoding vectors.

[0057] The addition is element-wise, with the positional encoding broadcast across the batch dimension. This step introduces learnable positional encoding vectors, enabling the Transformer model to distinguish the spatial relationships of the patches in the image and thereby better learn the structural features and target distribution of the image. Compared to fixed sinusoidal encoding, learned positional encoding can adapt to different resolutions and task characteristics, improving the model's sensitivity to spatial structure and its generalization ability.
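The sequence-length bookkeeping and the broadcast addition of S23 can be checked with a few lines of NumPy. The 64×64 image size and the helper names `sequence_length` and `add_position_encoding` are assumptions for illustration; in a trained model the positional table would be a learned parameter rather than random values.

```python
import numpy as np

def sequence_length(height, width, patch):
    """Number of patch tokens plus the category token and the target token."""
    return (height // patch) * (width // patch) + 2

def add_position_encoding(z, pos):
    """Add one positional vector per token, broadcast over the batch.

    z:   (B, L, D) initial input sequence
    pos: (L, D) positional encoding table (learned in a real model)
    """
    assert z.shape[1:] == pos.shape, "one positional vector per token is required"
    return z + pos

length = sequence_length(64, 64, 16)       # hypothetical 64 x 64 depth image
print(length)                              # 18

rng = np.random.default_rng(0)
z = rng.normal(size=(2, length, 128))
pos = rng.normal(size=(length, 128)) * 0.02
print(add_position_encoding(z, pos).shape)  # (2, 18, 128)
```

Note that the table has a fixed number of rows, so a sketch like this ties the model to one input resolution unless the table is resized or interpolated.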

[0058] In S24, a low-rank multi-head attention mechanism layer replaces the traditional high-complexity fully connected attention computation module, reducing the amount of computation and improving deployment efficiency. It is particularly suitable for edge computing platforms and embedded drone systems. The specific workflow of the low-rank multi-head attention mechanism layer is as follows:
A1. The input $X$ of the low-rank multi-head attention mechanism layer is mapped through three low-rank linear layers to the query $Q$, key $K$, and value $V$:

$$Q_h = X W_Q^h, \quad K_h = X W_K^h, \quad V_h = X W_V^h$$

where $W_Q^h$, $W_K^h$, and $W_V^h \in \mathbb{R}^{D \times r}$ are the low-rank weight matrices used for each attention head, $D$ is the embedding dimension, $r$ is the reduced dimension (rank) for each head, and $n_h$ is the number of attention heads. Each attention head corresponds to a subspace of rank $r$, which lowers the complexity relative to traditional multi-head attention and significantly reduces computing costs.

[0059] A2. According to the query $Q$, key $K$, and value $V$, Scaled Dot-Product Attention is computed independently for each attention head:

$$\mathrm{head}_h = \mathrm{Softmax}\left(\frac{Q_h K_h^{\top}}{\sqrt{r}}\right) V_h$$

where $\mathrm{Softmax}$ is the Softmax activation function, $\top$ is the transpose symbol, and $r$ is the reduced dimension of each head. The outputs of all attention heads are concatenated and fused using a linear mapping to obtain the output $O$ of the low-rank multi-head attention mechanism layer:

$$O = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_{n_h}) W_O$$

where $\mathrm{Concat}$ is the concatenation operation, $W_O$ is the output fusion matrix, $\mathrm{head}_h$ is the output of the $h$-th attention head, and $n_h$ is the number of attention heads.

The low-rank multi-head attention mechanism layer of this invention significantly reduces the number of parameters and the computational complexity compared to traditional multi-head attention mechanisms, making it particularly suitable for image embedding scenarios with long input sequences (such as dividing a depth map into multiple patches). Furthermore, reducing the ranks of Q, K, and V also reduces storage footprint and memory access, enabling lightweight deployment without significant loss of expressive power and thus providing faster response times and longer endurance for UAV systems.
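Steps A1 and A2 can be sketched in NumPy as follows. This is an illustrative reading of the text, assuming the scaling factor is the square root of the per-head rank and that no bias terms are used; the names `low_rank_mhsa` and `projection_params` and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_mhsa(x, w_q, w_k, w_v, w_o):
    """Low-rank multi-head self-attention for one sequence.

    x:             (L, D) input tokens
    w_q, w_k, w_v: (n_heads, D, r) per-head projections with r < D
    w_o:           (n_heads * r, D) output fusion matrix
    returns        (L, D)
    """
    q = np.einsum("ld,hdr->hlr", x, w_q)
    k = np.einsum("ld,hdr->hlr", x, w_k)
    v = np.einsum("ld,hdr->hlr", x, w_v)
    r = q.shape[-1]
    # Scaled dot-product attention, computed independently per head.
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(r))   # (h, L, L)
    heads = weights @ v                                        # (h, L, r)
    # Concatenate the heads along the feature axis, then fuse linearly.
    concat = heads.transpose(1, 0, 2).reshape(x.shape[0], -1)  # (L, h * r)
    return concat @ w_o

def projection_params(dim, n_heads, rank):
    """Weights in the Q, K, V and output projections (biases ignored)."""
    return 4 * n_heads * dim * rank

rng = np.random.default_rng(0)
L, D, H, R = 18, 128, 4, 16          # hypothetical sizes
x = rng.normal(size=(L, D))
w = [rng.normal(size=(H, D, R)) * 0.05 for _ in range(3)]
out = low_rank_mhsa(x, *w, rng.normal(size=(H * R, D)) * 0.05)
print(out.shape)                      # (18, 128)
print(projection_params(D, H, R))     # 32768
print(projection_params(D, H, D // H))  # 65536, the full-rank case
```

With these example sizes the projection weights are halved relative to the full-rank case where the heads together span all D dimensions; the actual saving depends on the rank chosen.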

[0060] As shown in Figure 4, to further enhance the model's nonlinear expressive power and local feature extraction capabilities, a lightweight gated convolutional feedforward module is introduced after the low-rank multi-head attention mechanism layer in each layer, replacing the fully connected MLP structure of the traditional Transformer. This module integrates gating mechanisms, depthwise convolution (DVC), channel recalibration mechanisms (SE Attention), and residual connection strategies, aiming to improve the model's feature selectivity and channel interaction capabilities while maintaining low computational complexity.

[0061] In S24, the workflow of the lightweight gated convolutional feedforward module is as follows:
B1. The input tensor $x$ of the lightweight gated convolutional feedforward module passes through a 1×1 convolutional layer and is projected into two branch tensors of equal channel dimension, $x_1$ and $x_2$, which serve as the information flow and the gating flow, respectively:

$$[x_1, x_2] = \mathrm{Conv}_{1 \times 1}(x)$$

where $\mathrm{Conv}_{1 \times 1}$ is a 1×1 convolutional layer. The branch tensors are combined by a gating mechanism: element-wise multiplication yields the gated tensor $x_g$, achieving selective control of information:

$$x_g = x_1 \odot \sigma(x_2)$$

where $\sigma$ is the Sigmoid function.
B2. Perform a depthwise separable convolution operation on the gated tensor to capture local spatial features while significantly reducing the number of parameters and the computational cost, then perform group normalization to obtain the first tensor $y_1$:

$$y_1 = \mathrm{GN}(\mathrm{DWConv}(x_g))$$

where $\mathrm{DWConv}$ is the depthwise separable convolution operation and $\mathrm{GN}$ is group normalization.
B3. To enhance modeling capabilities at the channel level, the SE module is introduced to dynamically adjust the response level of each channel. Global average pooling is performed on the first tensor to obtain the channel description vector, which is processed through two ... convolutional layers forming a squeeze-and-excitation (compression and expansion) network and then activated by a Sigmoid function to generate channel weights. The first tensor is multiplied by the channel weights to obtain the second tensor $y_2$ for channel recalibration.
B4. Add the second tensor to the input tensor through a residual connection to obtain the output $y$ of the lightweight gated convolutional feedforward module: $y = x + y_2$.
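Steps B1 to B4 can be traced with a compact NumPy sketch. Several details are assumptions of this sketch rather than statements of the patent: a 3×3 depthwise kernel with zero padding, a pointwise 1×1 convolution completing the depthwise separable convolution, ReLU inside the squeeze-and-excitation block, and plain matrices standing in for the two SE layers (on a pooled vector, a 1×1 convolution is a matrix product). All names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, w, b):
    # x: (C_in, H, W), w: (C_out, C_in), b: (C_out,)
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

def depthwise_conv3x3(x, k):
    # One 3x3 filter per channel, zero padding, stride 1. k: (C, 3, 3)
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out

def group_norm(x, groups, eps=1e-5):
    # Normalise each group of channels over channels and spatial positions.
    c, h, w = x.shape
    g = x.reshape(groups, -1)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(c, h, w)

def cglu_se(x, p):
    """Gated convolutional feedforward block with SE recalibration.

    x: (C, H, W) feature map; p: dict of weights (see the usage below).
    """
    c = x.shape[0]
    # B1: 1x1 convolution into an information branch and a gating branch.
    branches = conv1x1(x, p["w_in"], p["b_in"])            # (2C, H, W)
    gated = branches[:c] * sigmoid(branches[c:])
    # B2: depthwise + pointwise convolution, then group normalisation.
    dw = depthwise_conv3x3(gated, p["k_dw"])
    y1 = group_norm(conv1x1(dw, p["w_pw"], p["b_pw"]), p["groups"])
    # B3: squeeze (global average pooling) and excitation (two layers).
    s = y1.mean(axis=(1, 2))
    s = sigmoid(p["w_se2"] @ np.maximum(p["w_se1"] @ s, 0.0))
    y2 = y1 * s[:, None, None]
    # B4: residual connection back to the block input.
    return x + y2

rng = np.random.default_rng(0)
C = 8
params = {
    "w_in": rng.normal(size=(2 * C, C)) * 0.1, "b_in": np.zeros(2 * C),
    "k_dw": rng.normal(size=(C, 3, 3)) * 0.1,
    "w_pw": rng.normal(size=(C, C)) * 0.1, "b_pw": np.zeros(C),
    "w_se1": rng.normal(size=(C // 4, C)), "w_se2": rng.normal(size=(C, C // 4)),
    "groups": 2,
}
print(cglu_se(rng.normal(size=(C, 6, 6)), params).shape)   # (8, 6, 6)
```

The residual connection requires the block to return as many channels as it receives, which is why the gated branch keeps C channels in this sketch.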

[0062] In this embodiment, to prevent information loss and vanishing gradients, a residual connection adds the second tensor to the input tensor. The resulting output retains the original information flow while integrating local features guided by spatial-channel attention, and therefore has stronger expressive and perceptual capabilities.

[0063] The lightweight gated convolutional feedforward module integrates a gating mechanism to improve feature selectivity, a depthwise convolution structure to reduce complexity, channel recalibration for dynamic channel weighting, and residual connections to prevent vanishing gradients. Compared with the feedforward MLP structure in the standard Transformer, the lightweight gated convolutional feedforward module proposed in this invention is better suited to processing image patches or feature maps, especially for fast response and fine modeling of local target regions in UAV vision tasks. While maintaining a compact structure, it improves the deployability and response speed of the model on embedded devices (such as UAV platforms).

[0064] As shown in Figure 3, in this embodiment the convolutional neural network feature extraction branch and the visual Transformer feature extraction branch run in parallel within the dual-branch visual feature extraction network. This makes full use of the strengths of convolutional networks in local feature capture and spatially invariant modeling, enhancing the overall model's ability to perceive environmental details and improving the accuracy and robustness of the navigation strategy.

[0065] S3 specifically refers to: S31. Input the dual-channel input image into the first convolutional layer and the ReLU activation function layer in sequence to obtain the first feature map; S32. Input the first feature map into the improved CBAM attention module to obtain the attention-weighted feature map; S33. Compress the attention-weighted feature map into a fixed-dimensional vector through global average pooling, and project it onto the same dimension as the global perceptual features using a fully connected layer to obtain local detail features.

[0066] In this embodiment, the feature extraction branch of the convolutional neural network adopts a shallow convolutional structure to quickly extract low-level visual patterns such as edge information and texture gradients from the input depth image.

[0067] In S32, to further improve the CNN branch's response to key regions and important channels, this invention introduces an improved CBAM attention module adapted to the depth map. The specific workflow of the improved CBAM attention module is as follows: S321. Input the first feature map into the channel attention submodule, compress each channel using global average pooling to generate a channel-level global description vector, and obtain the channel attention weighted result through the non-linear mapping of a 1×1 convolution and a ReLU activation function. In this embodiment, the channel attention submodule suppresses redundant information and selectively enhances channels, making it particularly suitable for multi-scale depth maps with uneven channel responses.

[0068] S322. Input the first feature map into the spatial attention submodule, perform max pooling and average pooling along the channel dimension to obtain two single-channel feature maps, concatenate the two single-channel feature maps, and apply a 3×3 convolution to obtain the spatial attention weighted result. The spatial attention submodule sets a spatial weight function according to the difference between the neighborhood mean and the global minimum in the feature map, which improves the attention response of the convolutional neural network feature extraction branch near obstacle regions. The spatial weight function is expressed in terms of the average depth of the neighborhood of each pixel, the minimum global depth, and an adjustable hyperparameter. In this embodiment, the spatial attention submodule can significantly improve the attention response near obstacle regions, making the model more discriminative in obstacle avoidance tasks.

[0069] S323. Multiply the channel attention weighted result $M_c$ and the spatial attention weighted result $M_s$ with the first feature map $F_1$ to obtain the attention-weighted feature map $F_{att}$: $F_{att} = M_c \odot M_s \odot F_1$.

[0070] In the formula, ⊙ represents element-wise multiplication.
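The channel and spatial weighting of S321 to S323 can be sketched as below. This covers only the generic CBAM-style part: the depth-dependent spatial weight function (neighborhood mean versus global minimum) is not included because its expression is not given in the text. The function names, the zero padding, and the use of a plain matrix for the 1×1 convolution on the pooled vector are assumptions of this sketch.

```python
import numpy as np

def channel_attention(f, w, b):
    """Global average pooling, 1x1 convolution, ReLU. f: (C, H, W); w: (C, C)."""
    descriptor = f.mean(axis=(1, 2))              # channel-level description vector
    return np.maximum(w @ descriptor + b, 0.0)    # (C,)

def spatial_attention(f, kernel):
    """Channel-wise max and mean maps, stacked, then a 3x3 convolution.

    f: (C, H, W); kernel: (2, 3, 3), one 3x3 slice per pooled map.
    """
    pooled = np.stack([f.max(axis=0), f.mean(axis=0)])     # (2, H, W)
    h, w = pooled.shape[1:]
    padded = np.pad(pooled, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            window = padded[:, i:i + h, j:j + w]
            out += (kernel[:, i, j][:, None, None] * window).sum(axis=0)
    return out                                             # (H, W)

def cbam_weighting(f, w_c, b_c, k_s):
    """Element-wise product of channel weights, spatial weights, and features."""
    m_c = channel_attention(f, w_c, b_c)
    m_s = spatial_attention(f, k_s)
    return m_c[:, None, None] * m_s[None, :, :] * f

rng = np.random.default_rng(0)
feat = rng.random((8, 16, 16))                  # hypothetical first feature map
weighted = cbam_weighting(feat, rng.normal(size=(8, 8)), np.ones(8),
                          rng.normal(size=(2, 3, 3)))
print(weighted.shape)                           # (8, 16, 16)
```

The channel weights broadcast over spatial positions and the spatial weights broadcast over channels, so the product keeps the shape of the input feature map.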

[0071] In S33, the attention-weighted feature map $F_{att}$ is compressed into a fixed-dimensional vector $v$ through a global average pooling (GAP) operation:

$$v = \mathrm{GAP}(F_{att})$$

Then, a fully connected layer projects it to the same dimension as the Transformer branch:

$$f_{local} = \mathrm{LN}(\mathrm{FC}(v)) \in \mathbb{R}^{d}$$

where $f_{local}$ denotes the local detail features, $\mathrm{LN}$ is the normalization layer, $\mathrm{FC}$ is a fully connected layer, and $d$ is the target dimension, such as 28, 64, or 128. This branch offers several advantages:
Strong local sensitivity: convolutional kernels respond to local regions, enhancing the model's perception of texture edges and obstacle boundaries;
Enhanced spatial attention: spatial attention improves the focus on key obstacle regions, and channel attention optimizes feature utilization;
Fusion-friendly: the output shares the same dimension as the ViT branch, allowing direct residual fusion and concatenation;
Lightweight and efficient: the parameter count is low, making the branch suitable for deployment on embedded platforms.

[0072] In this embodiment, in order to obtain a global feature representation representing the entire input depth map, the CLS Token output extraction and fusion mechanism is adopted to perform residual fusion of the global perception features extracted by the visual Transformer feature extraction branch and the local detail features extracted by the convolutional neural network feature extraction branch, and finally splice the target state information to construct the overall perception feature vector.

[0073] S4 includes the following steps:
S41. Extract the category token features from the global perception features. In this embodiment, a category token prepended to the input sequence aggregates global context information; after passing through several Transformer encoder layers, this token remains at the first position of the output sequence.
S42. Standardize the category token features by layer normalization (LayerNorm), then project them onto the same feature dimension as the local detail features using a fully connected layer (Linear) to obtain the global semantic features.
S43. Use a residual fusion strategy to perform a weighted fusion of the global semantic features and the local detail features to obtain the fused features. In this embodiment, the residual connection of the residual fusion strategy eases model convergence, while fusing global semantic features with local detail features improves the perception of complex obstacle scenes.

[0074] S44. The fused features are concatenated with the target state vector at the feature level to obtain the overall perceptual feature vector, which is used for action decision-making or value function estimation.
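Steps S41 to S44 can be read as the following NumPy sketch. The exact form of the weighted residual fusion is not specified in the text, so the rule `local + alpha * global` with a scalar weight `alpha` is an assumption, as are the function names and all sizes.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each feature vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def fuse_features(global_feats, local_feat, target_state, w, b, alpha=0.5):
    """Build the overall perception feature vector.

    global_feats: (B, L, D) Transformer output; token 0 is the category token
    local_feat:   (B, d) local detail features from the CNN branch
    target_state: (B, 3) target state vector
    w, b:         (D, d) and (d,) fully connected projection
    alpha:        hypothetical fusion weight (assumption of this sketch)
    returns       (B, d + 3)
    """
    cls = global_feats[:, 0, :]                       # S41: category token feature
    global_sem = layer_norm(cls) @ w + b              # S42: LayerNorm, then Linear
    fused = local_feat + alpha * global_sem           # S43: weighted residual fusion
    return np.concatenate([fused, target_state], axis=-1)   # S44: append target state

rng = np.random.default_rng(0)
state_vec = fuse_features(
    rng.normal(size=(2, 18, 128)),       # hypothetical global perception features
    rng.normal(size=(2, 64)),            # hypothetical local detail features
    rng.normal(size=(2, 3)),
    rng.normal(size=(128, 64)) * 0.05,
    np.zeros(64),
)
print(state_vec.shape)                   # (2, 67)
```

The resulting vector is what a policy or value network would receive as its state input in S5.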

[0075] In S5, the overall perception feature vector serves as the state input, which is fed into the policy network constructed using a reinforcement learning algorithm for training or inference. The policy network, through end-to-end training, learns how to perform navigation actions based on the current perception state, such as forward flight, turning, ascent / descent, etc., thereby achieving autonomous obstacle avoidance, target arrival, path optimization, and other flight tasks. This feature extraction scheme is compatible with mainstream reinforcement learning frameworks and is suitable for deployment on simulators (such as AirSim and Gazebo) and real drone platforms.

[0076] In this embodiment, to verify the effectiveness of the proposed feature extraction network, two model structures were designed and compared: one is an improved model integrating a low-rank multi-head attention mechanism, and the other is a version that replaces the attention mechanism with standard multi-head attention. The two differ mainly in the attention calculation method of the visual Transformer, while maintaining the same overall structure. Through comparative analysis, the aim is to evaluate the contribution of the low-rank attention mechanism to parameter compression while maintaining model performance.

[0077] Table 1. Comparison of Model Parameter Quantities

Experimental results show that the improved model employing the low-rank multi-head attention mechanism has markedly fewer parameters overall than the standard multi-head structure. Specifically, the improved model has 395,123 parameters, while the standard multi-head model has 461,043, a reduction of approximately 14.3%. This is mainly attributed to the use of a lower-dimensional projection space in each attention head of the visual Transformer, which reduces parameter redundancy during attention computation by compressing the dimensions of the Q, K, and V vectors. While maintaining the same embedding dimension, this method improves the model's computational efficiency and ease of deployment, making it particularly suitable for computationally limited UAV platforms.
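The reported reduction can be verified directly from the two parameter counts quoted above. The helper name `reduction_percent` is illustrative.

```python
def reduction_percent(baseline, improved):
    """Relative parameter reduction of `improved` versus `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline

standard = 461_043   # standard multi-head attention model
low_rank = 395_123   # low-rank multi-head attention model

print(standard - low_rank)                              # 65920
print(round(reduction_percent(standard, low_rank), 1))  # 14.3
```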

[0078] From the perspective of module parameter distribution, the ViT backbone accounts for over 99.6% of the parameters in both models, while the CNN branch with the improved CBAM attention module is relatively lightweight, accounting for only about 0.4%. This indicates that the core computational load of the model is concentrated in the visual Transformer structure, and that the CNN branch with the improved CBAM attention module mainly supplements local details and guides attention, enhancing the response to obstacle regions. This structural design balances global modeling capabilities with local discrimination capabilities, forming an image perception feature system suitable for complex scenes.

[0079] In this experiment, the navigation performance of three feature extraction models (this model, the traditional CNN model, and the SAC basic model) under different reinforcement learning algorithms (SAC, TD3, and TQC) was compared. The training results are shown in Figures 5 to 7, and the test results are shown in Table 2. The main evaluation indicators are reward value, success rate, and path length efficiency (SPL). The experimental results show that the model performs well on multiple indicators, demonstrating good overall performance.

[0080] Table 2. Comparison of Model Effects in Experiments

In terms of reward value, the model of this invention, paired with the TQC algorithm, achieved the highest score of 27.46, significantly outperforming other combinations, indicating that the model can obtain more positive feedback during execution. Among all models, the rewards of traditional CNN models are generally low, especially under the TD3 algorithm, where the reward reaches only 15.33, reflecting their weak ability to extract effective features in complex environments and poor decision quality.

[0081] In terms of success rate, all models achieved a task completion rate of over 96% under different algorithms, demonstrating the strong robustness of reinforcement learning strategies in basic path planning tasks. Among them, this model achieved the highest success rate (0.996) under the TD3 algorithm, indicating that the model not only perceives target and obstacle information but also possesses stronger path planning and control capabilities.

[0082] In the SPL (Success weighted by Path Length) metric, the CNN model achieved the highest efficiency, reaching 0.9536 under the TD3 algorithm. This indicates that its path is shorter and closer to the optimal path. However, combined with its lower reward value, it can be inferred that although the path is short, the decision-making process may lack strategy or be slightly aggressive. Our model, on the other hand, achieved an SPL of 0.9264 under TD3. While slightly lower than the CNN model, it still achieved good path efficiency while maintaining a high success rate and a relatively high reward, demonstrating a good balance.
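For reference, SPL is commonly computed in the embodied navigation literature as the mean over episodes of success times the ratio of the shortest path length to the larger of the actual and shortest path lengths. The text does not state the formula it used, so the sketch below assumes this common definition; the episode values are made up for illustration.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes:        1 if the episode reached the target, else 0
    shortest_lengths: shortest start-to-goal path length per episode
    path_lengths:     length of the path actually flown per episode
    """
    total = 0.0
    for s, shortest, actual in zip(successes, shortest_lengths, path_lengths):
        total += s * shortest / max(actual, shortest)
    return total / len(successes)

# Three hypothetical episodes: an optimal success, a success with a path
# twice as long as necessary, and a failure.
print(spl([1, 1, 0], [10.0, 10.0, 10.0], [10.0, 20.0, 15.0]))  # 0.5
```

Under this definition a failed episode contributes zero regardless of path length, which is why a model can have a high success rate and still a lower SPL if its successful paths are longer.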

[0083] As shown in Figures 8 to 10, with the CNN model, when the target point lies close behind an obstacle, the SAC and TD3 algorithms cannot reach the target point. With the TQC algorithm, both models reach the target point, but the improved model takes a shorter route.

[0084] In summary, this model effectively enhances the modeling ability of target and environment features by introducing a lightweight visual Transformer feature extraction branch structure, an improved CBAM attention module, and a spatial attention mechanism, achieving a good balance between reward value, success rate, and path efficiency. Compared with traditional CNN models, this model has stronger generalization ability and policy stability in complex environments, validating the advantages of its structural design.

[0085] In the description of this invention, it should be understood that the terms "center," "thickness," "upper," "lower," "horizontal," "top," "bottom," "inner," "outer," and "radial," etc., indicating orientation or positional relationships based on the orientation or positional relationships shown in the accompanying drawings, are only for the convenience of describing the invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of the invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying the relative importance or the number of technical features implicitly specified. Therefore, a feature defined by "first," "second," and "third" may explicitly or implicitly include one or more of that feature.

Claims

1. A method for unmanned aerial vehicle navigation feature extraction based on target perception and multi-branch fusion, characterized in that it includes the following steps:
S1. Acquire depth images of the UAV, perform preprocessing, and construct a dual-channel input image;
S2. Construct the final input sequence from the depth image through image patch embedding and position encoding, and input the final input sequence into the visual Transformer feature extraction branch to obtain global perception features;
S3. Input the dual-channel input image into the feature extraction branch of the convolutional neural network to obtain local detail features;
S4. Fuse the global perception features with the local detail features to obtain the overall perception feature vector;
S5. Input the overall perception feature vector into the policy network, establish a navigation action decision model through end-to-end training, and output navigation actions based on the current perception state.

2. The UAV navigation feature extraction method based on target perception and multi-branch fusion according to claim 1, characterized in that S1 includes the following steps:
S11. Acquire the depth image of the UAV and standardize it to obtain a standardized depth grayscale image. In S11, an image scaling formula is used for standardization; the formula is expressed in terms of the maximum depth range, a minimum-value operation, and the depth value at each pixel;
S12. Create a state information matrix based on the depth image;
S13. Concatenate the standardized depth grayscale image and the state information matrix along the channel dimension to generate a dual-channel input image.

3. The UAV navigation feature extraction method based on target perception and multi-branch fusion according to claim 2, characterized in that, S2 includes the following steps: S21. Obtain the Patch embedding vector by performing a two-dimensional convolution operation on the depth image, and flatten the Patch embedding vector to obtain the Patch embedding sequence. S22. Add special token information to the Patch embedded sequence to obtain the initial input sequence of the Transformer; S23. Embed the initial input sequence with position encoding to obtain the final input sequence with fused position information; S24. Input the final input sequence into the visual Transformer feature extraction branch to obtain the global perception features output by the visual Transformer feature extraction branch. The visual Transformer feature extraction branch is a multi-layer structure. Each layer includes a first Group Norm layer, a low-rank multi-head attention mechanism layer, a second Group Norm layer, and a lightweight gated convolutional feedforward module connected in sequence. In this process, the input of the first Group Norm layer is used as the input of the second Group Norm layer. The output of the low-rank multi-head attention mechanism layer is connected to the input of the first Group Norm layer through residuals and then input to the second Group Norm layer. The output of the lightweight gated convolutional feedforward module is connected to the input of the second Group Norm layer through residuals to obtain the output of the second Group Norm layer.

4. The UAV navigation feature extraction method based on target perception and multi-branch fusion according to claim 3, characterized in that, in S22, the special token information includes the category token and the target token. The category token is added to the Patch embedding sequence as follows: expand the category token along the batch dimension and concatenate it at the beginning of the Patch embedding sequence to obtain the first sequence form:

$$Z_1 = \mathrm{Concat}(x_{cls}, X_{seq})$$

where $x_{cls}$ is the category token parameter, $X_{seq}$ is the Patch embedding sequence, and $\mathrm{Concat}$ is the concatenation operation. The target token is added to the Patch embedding sequence as follows: the target state vector of the UAV relative to the target, $g$, is converted into the target token $x_{tgt}$ by a linear projection layer, $x_{tgt} = \mathrm{Linear}(g)$; the target token is then concatenated with the category token to obtain the second sequence form:

$$Z = \mathrm{Concat}(x_{cls}, x_{tgt}, X_{seq})$$

5. The UAV navigation feature extraction method based on target perception and multi-branch fusion according to claim 3, characterized in that S23 specifically refers to: for each token in the initial input sequence of the Transformer, construct a corresponding position encoding vector, and add the initial input sequence of the Transformer to the corresponding position encoding vectors element by element to obtain the final input sequence with fused position information.

6. The UAV navigation feature extraction method based on target perception and multi-branch fusion according to claim 3, characterized in that, in S24, the specific workflow of the low-rank multi-head attention mechanism layer is as follows:
A1. The input $X$ of the low-rank multi-head attention mechanism layer is mapped through three low-rank linear layers to the query $Q$, key $K$, and value $V$:

$$Q_h = X W_Q^h, \quad K_h = X W_K^h, \quad V_h = X W_V^h$$

where $W_Q^h$, $W_K^h$, and $W_V^h \in \mathbb{R}^{D \times r}$ are the low-rank weight matrices used for each attention head, $\mathbb{R}$ denotes a real-valued matrix, $D$ is the embedding dimension, $r$ is the reduced dimension for each head, and $n_h$ is the number of heads;
A2. According to the query $Q$, key $K$, and value $V$, scaled dot-product attention is computed independently for each attention head:

$$\mathrm{head}_h = \mathrm{Softmax}\left(\frac{Q_h K_h^{\top}}{\sqrt{r}}\right) V_h$$

where $\mathrm{Softmax}$ is the Softmax activation function and $\top$ is the transpose symbol. The outputs of all attention heads are concatenated and fused using a linear mapping to obtain the output $O$ of the low-rank multi-head attention mechanism layer:

$$O = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_{n_h}) W_O$$

where $\mathrm{Concat}$ is the concatenation operation, $W_O$ is the output fusion matrix, and $\mathrm{head}_h$ is the output of the $h$-th attention head.
In S24, the workflow of the lightweight gated convolutional feedforward module is as follows:
B1. The input tensor $x$ of the lightweight gated convolutional feedforward module passes through a 1×1 convolutional layer and is projected into two branch tensors of equal channel dimension, $x_1$ and $x_2$, which serve as the information flow and the gating flow, respectively: $[x_1, x_2] = \mathrm{Conv}_{1 \times 1}(x)$, where $\mathrm{Conv}_{1 \times 1}$ is a 1×1 convolutional layer. The branch tensors are combined by a gating mechanism, and element-wise multiplication yields the gated tensor $x_g = x_1 \odot \sigma(x_2)$, where $\sigma$ is the Sigmoid function;
B2. Perform a depthwise separable convolution operation on the gated tensor, followed by group normalization, to obtain the first tensor $y_1 = \mathrm{GN}(\mathrm{DWConv}(x_g))$, where $\mathrm{DWConv}$ is the depthwise separable convolution operation and $\mathrm{GN}$ is group normalization;
B3. Perform global average pooling on the first tensor to obtain the channel description vector, process it through two ... convolutional layers forming a squeeze-and-excitation (compression and expansion) network, and generate channel weights through Sigmoid activation. The first tensor is multiplied by the channel weights to obtain the second tensor $y_2$ for channel recalibration;
B4. The second tensor is added to the input tensor through a residual connection to obtain the output of the lightweight gated convolutional feedforward module, $y = x + y_2$.

7. The UAV navigation feature extraction method based on target perception and multi-branch fusion according to claim 1, characterized in that, S3 specifically refers to: S31. Input the dual-channel input image into the first convolutional layer and the ReLU activation function layer in sequence to obtain the first feature map; S32. Input the first feature map into the improved CBAM attention module to obtain the attention-weighted feature map; S33. Compress the attention-weighted feature map into a fixed-dimensional vector through global average pooling, and project it onto the same dimension as the global perceptual features using a fully connected layer to obtain local detail features.

8. The UAV navigation feature extraction method based on target perception and multi-branch fusion according to claim 7, characterized in that, in S32, the workflow of the improved CBAM attention module is as follows:
S321. Input the first feature map into the channel attention submodule, compress each channel using global average pooling to generate a channel-level global description vector, and obtain the channel attention weighted result through the non-linear mapping of a 1×1 convolution and a ReLU activation function;
S322. Input the first feature map into the spatial attention submodule, perform max pooling and average pooling along the channel dimension to obtain two single-channel feature maps, concatenate the two single-channel feature maps, and apply a 3×3 convolution to obtain the spatial attention weighted result; wherein the spatial attention submodule sets a spatial weight function according to the difference between the neighborhood mean and the global minimum in the feature map, which improves the attention response of the convolutional neural network feature extraction branch near obstacle regions, the spatial weight function being expressed in terms of the average depth of the neighborhood of each pixel, the minimum global depth, and an adjustable hyperparameter;
S323. Multiply the channel attention weighted result and the spatial attention weighted result with the first feature map to obtain the attention-weighted feature map.

9. The UAV navigation feature extraction method based on target perception and multi-branch fusion according to claim 1, characterized in that S4 includes the following steps:
S41. Extract the category token features from the global perception features;
S42. Standardize the category token features by layer normalization, then project them onto the same feature dimension as the local detail features using a fully connected layer to obtain the global semantic features;
S43. Use a residual fusion strategy to perform a weighted fusion of the global semantic features and the local detail features to obtain the fused features;
S44. Concatenate the fused features with the target state vector at the feature level to obtain the overall perception feature vector, which is used for action decision-making or value function estimation.