Unmanned aerial vehicle target tracking method and system based on eagle-eye dual-foveal vision mechanism

By constructing a UAV target tracking method based on the eagle-eye dual-foveal vision mechanism, utilizing a dual-foveal interactive attention network to process fine-grained and coarse-grained information interaction, and combining dynamic and static template updates, the problems of insufficient resolution and limited feature information in UAV target tracking are solved, thereby improving tracking accuracy and robustness.

CN122265892A · Pending · Publication Date: 2026-06-23 · QUFU NORMAL UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: QUFU NORMAL UNIV
Filing Date: 2026-03-30
Publication Date: 2026-06-23


IPC: G06V20/17; G06V10/25; G06T7/246; G06V10/44; G06V10/82; G06V10/80; G06N3/045; G06N3/0464; G06N3/082

AI Tagging

Application Domain

Image analysis; Character and pattern recognition

Technology Topics

Pattern recognition; Uncrewed vehicle


Abstract

This invention discloses a method and system for UAV target tracking based on an eagle-eye dual-foveal vision mechanism, relating to the fields of computer vision and UAV technology. The method includes: acquiring video frames and dividing them into template images and search images, extracting template features and search features; constructing a dual-foveal interactive attention network, capturing coarse-grained and fine-grained feature associations through shallow and deep foveal interactive attention modules respectively, and fusing them to obtain enhanced search features; concatenating the original search features and the enhanced search features and inputting them into a prediction network to obtain target information; and based on a preset update strategy, determining whether to generate dynamic template features and replace the original template features to achieve template update. This invention, by simulating the eagle-eye dual-foveal vision mechanism, takes into account both global semantics and local detail features, effectively improving the accuracy and robustness of UAV target tracking in complex scenarios.


Description

Technical Field

[0001] This invention relates to the fields of computer vision and unmanned aerial vehicle (UAV) technology, and in particular to a UAV target tracking method and system based on the eagle-eye dual-foveal vision mechanism.

Background Technology

[0002] Visual target tracking is the core technology for UAV target tracking. While general target tracking algorithms can be applied to UAV-based target tracking, factors such as significant changes in the UAV's viewpoint, target deformation, and scale variations lead to reduced tracking accuracy and precision. Furthermore, the limited battery capacity and computing resources of UAVs place stringent demands on the efficiency of UAV tracking algorithms. Therefore, designing UAV tracking algorithms that balance tracking accuracy and efficiency is a challenging task.

[0003] In recent years, the popularity of Transformers has driven the further development of UAV target tracking. Because Transformers can model long-range relationships, Transformer-based UAV tracking algorithms have become the mainstream approach. These methods are single-stream, meaning that both the template and search images are fed into a pre-trained visual Transformer for feature extraction and fusion. Single-stream tracking employs cross-attention operations to explore point-to-point relationships between the template and search images. Recent studies have explored more efficient template-search relationships, for example by introducing a token partitioning module that dynamically selects appropriate search tokens to interact with the template, or by proposing a strategy that maximizes mutual information between the template image and its features. These methods improve the feature fusion performance of single-stream tracking pipelines. While single-stream tracking is concise, it neglects local contextual information and global awareness, making it difficult for the tracker to capture finer details during feature fusion, especially for small targets and partial occlusion in UAV scenarios. Because resolution is insufficient and feature information is limited in such scenarios, relying solely on point-to-point correlation can easily lead to tracking failure.

Summary of the Invention

[0004] This invention provides a UAV target tracking method and system based on the eagle-eye dual-foveal vision mechanism, aiming to solve the problems of insufficient tracking accuracy and robustness of existing UAV target tracking methods in scenarios with insufficient resolution and limited feature information.

[0005] To achieve the above objectives, the present invention adopts the following technical solution. Firstly, a UAV target tracking method based on the eagle-eye dual-foveal vision mechanism is provided, including:
acquiring video images and dividing the current video frame into template images and search images; in the first frame of the video, cropping an image centered on the target as a fixed template based on the given initial bounding box of the target;
inputting the template image and the search image into a feature extraction network to obtain template features and search features respectively;
constructing a dual-foveal interactive attention network, and inputting the template features and search features into the dual-foveal interactive attention network to obtain enhanced search features;
concatenating the search features with the enhanced search features and inputting them into the prediction network to obtain the target information in the current frame;
determining, based on a preset update strategy, whether the current frame needs a template update; if an update is required, generating a refined foreground mask based on the prediction result of the current frame, extracting dynamic template features from the current frame based on the mask, and replacing the template features with the dynamic template features; and inputting the dynamic template features together with the fixed template into the dual-foveal interactive attention network for tracking of the next frame.

[0006] Secondly, a UAV target tracking system based on an eagle-eye dual-foveal vision mechanism is provided, including:
an image acquisition module, configured to acquire video images and divide the current video frame into template images and search images, and, in the first frame of the video, to crop an image centered on the target as a fixed template based on the given initial bounding box of the target;
a feature extraction network module, configured to input the template image and the search image into the feature extraction network to obtain template features and search features respectively;
a dual-foveal interactive attention network module, configured to construct a dual-foveal interactive attention network and input the template features and search features into it to obtain enhanced search features;
a prediction network module, configured to concatenate the search features with the enhanced search features and input them into the prediction network to obtain information about the target in the current frame;
a template update module, configured to determine, based on a preset update strategy, whether the current frame needs a template update; if an update is required, a refined foreground mask is generated based on the prediction result of the current frame, dynamic template features are extracted from the current frame based on the mask, and the template features are replaced with the dynamic template features; the dynamic template features and the fixed template are input together into the dual-foveal interactive attention network for tracking of the next frame.

[0007] Thirdly, an electronic device is also provided, comprising: a memory for non-transitory storage of computer-readable instructions; and a processor for executing the computer-readable instructions. When the computer-readable instructions are executed by the processor, they perform the method described in the first aspect above.

[0008] Fourthly, a computer-readable storage medium is provided having a program stored thereon that, when executed by a processor, implements the method described in the first aspect above.

[0009] The above technical solution has the following advantages or beneficial effects: (1) This invention constructs a tracking network based on the eagle-eye dual-foveal vision mechanism. Through the dual-foveal interactive attention module, fine-grained and coarse-grained information interactions are processed separately, simulating the eagle eye's ability to attend to both details and the global scene and significantly improving target discrimination in complex backgrounds. The invention designs a shallow foveal interactive attention module and a deep foveal interactive attention module to establish global coarse-grained and local fine-grained correlations respectively. Combined with adaptive fusion, this effectively addresses tracking loss caused by large target scale changes and partial occlusion from the UAV's perspective.

[0010] (2) The present invention adopts a template update strategy that combines static and dynamic templates, which not only retains the reliable information of the initial frame but also generates a dynamic template that adapts to real-time changes in the target's appearance, thereby improving the robustness of long-term tracking.

Description of the Drawings

[0011] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0012] The embodiments are further described below with reference to the accompanying drawings, wherein:
Figure 1 is a flowchart of the UAV target tracking method based on the eagle-eye dual-foveal vision mechanism in an embodiment of the present invention;
Figure 2 is a structural diagram of the UAV target tracking method based on the eagle-eye dual-foveal vision mechanism in an embodiment of the present invention;
Figure 3 is a schematic diagram of the structure of the dual-foveal interactive attention network in an embodiment of the present invention;
Figure 4 is a schematic diagram of the template update process in an embodiment of the present invention.

Detailed Implementation

[0013] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0014] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments of the invention. The terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0015] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.

[0016] In this embodiment, all data acquisition is carried out in accordance with laws and regulations and with user consent, and the data is used legally.

[0017] Embodiment 1

As shown in Figure 1 and Figure 2, this embodiment provides a UAV target tracking method based on the eagle-eye dual-foveal vision mechanism, including:
S1: Acquire video images and divide the current video frame into template images and search images; in the first frame of the video, crop an image centered on the target as a fixed template based on the given initial bounding box of the target.
S2: Input the template image and the search image into the feature extraction network to obtain the template features and search features respectively.
S3: Construct a dual-foveal interactive attention network, and input the template features and search features into the dual-foveal interactive attention network to obtain enhanced search features.
S4: Concatenate the search features with the enhanced search features and input them into the prediction network to obtain the target information in the current frame.
S5: Based on the preset update strategy, determine whether the current frame needs a template update; if an update is needed, generate a refined foreground mask based on the prediction result of the current frame, extract dynamic template features from the current frame based on the mask, and replace the template features with the dynamic template features; input the dynamic template features and the fixed template together into the dual-foveal interactive attention network for tracking of the next frame.

[0018] The specific steps of S1 are as follows: The system acquires video images captured by the drone and divides the current video frame into a template image and a search image. The template image is used to represent the initial features of the target, while the search image is used for target search and localization in subsequent frames. The template image and search image are resized; in this embodiment, the template image is resized to 192×192 pixels, and the search image is resized to 384×384 pixels.

[0019] In the first frame of the video, an image centered on the target is cropped based on the given initial bounding box of the target as a fixed template, which remains unchanged during subsequent tracking. In this embodiment, the size of the cropped fixed template is 192×192 pixels.
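The cropping and resizing in S1 can be illustrated with a minimal NumPy sketch. The embodiment fixes only the output sizes (192×192 for the template, 384×384 for the search image); the helper name `crop_and_resize`, the `context` factor that sets how much surrounding area is cropped, and the nearest-neighbor resampling are assumptions made here for illustration, not details from the patent.

```python
import numpy as np

def crop_and_resize(frame, box, out_size, context=2.0):
    """Crop a square region centered on `box` and resize it to out_size x out_size.

    frame: H x W x C array; box: (cx, cy, w, h) in pixels.
    `context` (crop side as a multiple of sqrt(w*h)) is an assumed parameter.
    """
    cx, cy, w, h = box
    side = max(1, int(round(context * np.sqrt(w * h))))
    # Positions of the output pixel centers expressed in frame coordinates.
    steps = (np.arange(out_size) + 0.5) * side / out_size
    xs = cx - side / 2 + steps
    ys = cy - side / 2 + steps
    # Nearest-neighbor sampling; positions outside the frame are clamped to the border.
    xi = np.clip(np.floor(xs).astype(int), 0, frame.shape[1] - 1)
    yi = np.clip(np.floor(ys).astype(int), 0, frame.shape[0] - 1)
    return frame[yi[:, None], xi[None, :]]

frame = np.zeros((480, 640, 3))
frame[240, 320] = 1.0                      # mark the target center
init_box = (320, 240, 40, 30)              # given initial bounding box (cx, cy, w, h)
template = crop_and_resize(frame, init_box, 192, context=2.0)   # fixed template
search = crop_and_resize(frame, init_box, 384, context=4.0)     # search image
```

The search crop uses a larger assumed context factor than the template so that the target can move between frames and still fall inside the search region.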

[0020] The specific steps of S2 are as follows: S2.1: First, perform patch embedding on the resized template image and search image. That is, use a convolution to cut the resized template image and search image into fixed-size image patches, and then flatten the patches to obtain the one-dimensional feature sequences of the template image and search image respectively.

[0021] S2.2: Introduce learnable positional encodings for each position in the one-dimensional feature sequences of the template image and the search image.

S2.3: Construct the feature extraction network. In this embodiment, the feature extraction network adopts an existing Transformer model. An early candidate region elimination module is introduced into the feature extraction network. This module is deployed after the template image tokens and the search image tokens are concatenated and before they enter the feature extraction network (i.e., before the first Transformer encoder block). It directly filters the concatenated joint token sequence, removing redundant candidate region tokens, and then sends the reduced token sequence into the deep network for feature interaction. The early candidate region elimination module suppresses background interference and highlights potential target regions in the early stages of feature extraction. In UAV target tracking tasks, search images often contain a large amount of irrelevant background. This module quickly filters the tokens and retains the regions that may contain the target, thereby reducing subsequent computation and improving the model's ability to discriminate targets.

[0022] S2.4: The one-dimensional feature sequences of the template image and the search image after introducing learnable position encoding are input into the feature extraction network, and the feature extraction network outputs template features and search features.
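Steps S2.1 to S2.3 can be sketched as follows. A convolution whose kernel size equals its stride is equivalent to a linear projection of each flattened patch, which is what `patch_embed` does. The patch size of 16, the embedding width of 64, the random stand-ins for learned weights and positional encodings, and in particular the scoring rule inside `eliminate_candidates` (cosine similarity to the mean template token with a fixed `keep_ratio`) are all assumptions for illustration; the patent does not state how redundant candidate tokens are selected.

```python
import numpy as np

def patch_embed(image, weight, patch=16):
    """Cut an H x W x C image into patch x patch blocks and project each block.

    `weight` has shape (patch*patch*C, D) and stands in for the learned conv kernel.
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    blocks = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    return blocks @ weight                         # (gh*gw, D) token sequence

def eliminate_candidates(template_tokens, search_tokens, keep_ratio=0.7):
    """Keep the search tokens most similar to the mean template token.

    Hypothetical selection rule; returns the joint sequence and kept indices.
    """
    ref = template_tokens.mean(axis=0)
    ref = ref / (np.linalg.norm(ref) + 1e-8)
    norms = np.linalg.norm(search_tokens, axis=1) + 1e-8
    scores = (search_tokens @ ref) / norms         # cosine similarity per search token
    k = max(1, int(round(keep_ratio * len(search_tokens))))
    keep = np.sort(np.argsort(-scores)[:k])        # top-k, restored to spatial order
    joint = np.concatenate([template_tokens, search_tokens[keep]], axis=0)
    return joint, keep

rng = np.random.default_rng(0)
dim, patch = 64, 16
weight = rng.standard_normal((patch * patch * 3, dim)) * 0.02
z = patch_embed(rng.random((192, 192, 3)), weight, patch)   # 12 x 12 = 144 template tokens
x = patch_embed(rng.random((384, 384, 3)), weight, patch)   # 24 x 24 = 576 search tokens
z = z + rng.standard_normal(z.shape) * 0.02   # stand-in for learnable positional encodings
x = x + rng.standard_normal(x.shape) * 0.02
joint, keep = eliminate_candidates(z, x)      # sequence passed to the first encoder block
```

Returning the kept indices alongside the joint sequence lets later stages restore the surviving search tokens to their positions on the feature map.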

[0023] The specific steps of S3 are as follows: As shown in Figure 3, a dual-foveal interactive attention network (DIAM) is constructed, comprising two parallel sub-modules and a fusion unit. The shallow foveal interactive attention module (SFIA) simulates the shallow foveal function of an eagle eye, capturing coarse-grained semantic relationships between template features and search features. The deep foveal interactive attention module (DFIA) simulates the deep foveal function of an eagle eye, mining fine-grained semantic relationships between template features and search features. The feature fusion unit fuses the outputs of the shallow and deep foveal interactive attention modules to generate the final interactive features, i.e., the enhanced search features.

[0024] The shallow foveal interactive attention module uses global representations as keys and values, integrating them into point queries to achieve global context updates and establish coarse-grained semantic relevance.

[0025] The working process of the shallow foveal interactive attention module is as follows: Define $G(\cdot)$ as the pixel set obtained after global average pooling of a feature map, and use average pooling on the template branch to obtain the global context $G(Z)$ of the template features $Z$. To build the computational foundation for the attention mechanism, the extracted global context is projected through linear mappings to generate the global key $K_g$ and the global value $V_g$. For the search image feature map $X$, its point-level query vector $Q$ is extracted and added element-wise to the learnable query embedding (QE) to enhance the feature representation of the search region. The enhanced query features are then matrix-multiplied with the global key $K_g$ of the template to calculate a preliminary global relevance, injecting coarse-grained information from the template into each pixel of the search region. The result is the similarity score $S_s$ output by the shallow foveal interactive attention module. This process can be defined as: $S_s = (Q + QE)\,K_g^{\top}$.
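The shallow path can be sketched in a few lines of NumPy, reading the description as average pooling of the template, two linear projections, and a dot product with the QE-enhanced queries. The function name, the token layout (one row per pixel), and the projection matrices `w_q`, `w_k`, `w_v` are illustrative assumptions; any scaling of the scores is not stated in the source and is omitted.

```python
import numpy as np

def shallow_foveal_scores(search, template, w_q, w_k, w_v, qe):
    """search: (N, C) search tokens; template: (M, C) template tokens.

    Returns the global similarity scores (N, 1) and the global value (1, C).
    """
    g = template.mean(axis=0, keepdims=True)   # global context by average pooling, (1, C)
    k_g = g @ w_k                              # global key
    v_g = g @ w_v                              # global value
    q = search @ w_q + qe                      # point queries plus learnable query embedding
    s_s = q @ k_g.T                            # one coarse-grained score per search pixel
    return s_s, v_g

rng = np.random.default_rng(0)
C = 8
search_tokens = rng.standard_normal((36, C))
template_tokens = rng.standard_normal((9, C))
identity = np.eye(C)
s_s, v_g = shallow_foveal_scores(search_tokens, template_tokens,
                                 identity, identity, identity, np.zeros(C))
```

With identity projections and a zero query embedding, each score reduces to the dot product between a search token and the mean template token, which makes the coarse-grained nature of this path easy to see.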

[0026] The deep foveal interactive attention module uses local detail information as keys and values, injects it into point queries, and updates it within the context region. Regions with strong correlation are given high weights, while regions with weak correlation are given low weights, thus establishing fine-grained relevance.

[0027] The working process of the deep foveal interactive attention module is as follows: The template features $Z$ and search features $X$ are input. For the search features $X$, a single pixel $X(i,j)$ at position $(i,j)$ generates a single-pixel-level query feature $Q(i,j)$ through a linear mapping. Let $\rho(i,j)$ denote the set of pixels within the $k \times k$ sliding window centered at $(i,j)$; $Z[\rho(i,j)]$ then corresponds to the $k \times k$ sub-region of the template feature map centered at $(i,j)$. After the features of this local sub-region are obtained, they are projected through linear mappings to generate the local key $K_l$ and the local value $V_l$ for local interaction. Subsequently, the point query is dynamically adjusted by introducing the learnable query embedding (QE), and the enhanced query features are matrix-multiplied with the local key $K_l$ of the template to calculate a fine-grained preliminary similarity score $S_d$, which is the similarity score output by the deep foveal interactive attention module and supports the learning of discriminative features. The calculation formula is defined as follows: $S_d(i,j) = \big(Q(i,j) + QE\big)\,K_l^{\top}$.
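A minimal sketch of the deep path follows, written as an explicit loop over pixels so the sliding window is visible. It assumes that the template feature map has already been brought to the same spatial size as the search feature map, that the window is zero-padded at the borders, and that k = 3; none of these choices is specified in the source, and the function and parameter names are hypothetical.

```python
import numpy as np

def deep_foveal_scores(search_map, template_map, w_q, w_k, w_v, qe, k=3):
    """search_map, template_map: (H, W, C), assumed to share the same spatial size.

    Returns local scores (H, W, k*k) and local values (H, W, k*k, C).
    """
    h, w, c = search_map.shape
    pad = k // 2
    # Project the template once, then zero-pad so every pixel has a full k x k window.
    keys = np.pad(template_map @ w_k, ((pad, pad), (pad, pad), (0, 0)))
    vals = np.pad(template_map @ w_v, ((pad, pad), (pad, pad), (0, 0)))
    q = search_map @ w_q + qe                  # per-pixel queries plus query embedding
    scores = np.empty((h, w, k * k))
    local_v = np.empty((h, w, k * k, c))
    for i in range(h):
        for j in range(w):
            win_k = keys[i:i + k, j:j + k].reshape(k * k, c)   # local keys around (i, j)
            local_v[i, j] = vals[i:i + k, j:j + k].reshape(k * k, c)
            scores[i, j] = win_k @ q[i, j]     # fine-grained similarity within the window
    return scores, local_v

rng = np.random.default_rng(0)
C = 4
search_map = rng.standard_normal((5, 5, C))
template_map = rng.standard_normal((5, 5, C))
identity = np.eye(C)
s_d, v_l = deep_foveal_scores(search_map, template_map,
                              identity, identity, identity, np.zeros(C))
```

Padded positions contribute zero keys, so their scores are zero before any normalization; a practical implementation would more likely mask them out.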

[0028] The specific calculation process of the feature fusion unit is as follows: To measure the relative importance of the fine-grained and coarse-grained relationships, the similarity score $S_d$ output by the deep foveal interactive attention module and the similarity score $S_s$ output by the shallow foveal interactive attention module are combined, and the importance weights $A_d$ and $A_s$ of the two paths are calculated using Softmax: $[A_d, A_s] = \mathrm{Softmax}\big([S_d, S_s] + PE\big)$

[0029] Here, PE stands for Learnable Position Embedding.
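The two-path weighting can be sketched as a single softmax taken jointly over the local window scores and the global score, so that the two paths compete for the same probability mass. The sketch also carries the weights through to the local and global values and a 1×1 convolution, which is a per-pixel linear map over channels. Where exactly PE enters the softmax, and all names below, are assumptions for illustration rather than the patent's stated formulation.

```python
import numpy as np

def fuse_paths(s_d, s_s, v_l, v_g, pe, w_out):
    """s_d: (N, L) local scores; s_s: (N, 1) global scores;
    v_l: (N, L, C) local values; v_g: (1, C) global value;
    pe: (L + 1,) learnable position embedding (placement assumed);
    w_out: (C, C) weights of a 1x1 convolution, i.e. a per-pixel linear map.
    """
    logits = np.concatenate([s_d, s_s], axis=1) + pe
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)      # joint softmax over both paths
    a_d, a_s = weights[:, :-1], weights[:, -1:]
    fused = np.einsum("nl,nlc->nc", a_d, v_l) + a_s * v_g
    return fused @ w_out, a_d, a_s

rng = np.random.default_rng(0)
N, L, C = 10, 9, 4
enhanced, a_d, a_s = fuse_paths(rng.standard_normal((N, L)),
                                rng.standard_normal((N, 1)),
                                rng.standard_normal((N, L, C)),
                                rng.standard_normal((1, C)),
                                np.zeros(L + 1),
                                np.eye(C))
```

Because the weights of both paths sum to one at every pixel, a pixel with a strong local match draws mostly on local detail, while a pixel with no good local match falls back on the global template context.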

[0030] Then, the attention weights $A_d$ and $A_s$ are applied to the local values $V_l$ and the global values $V_g$ respectively. Finally, a 1×1 convolutional layer adjusts the contribution of each pixel in the spatial dimension, resulting in the output of the dual-foveal interactive attention module, which is the enhanced search feature. The specific calculation formula is as follows: $\mathrm{Output} = \mathrm{Conv}(A_d V_l + A_s V_g)$

The specific steps of S4 are as follows: S4.1: The search features obtained by the feature extraction network are concatenated with the search features enhanced by the dual-foveal interactive attention network.

[0031] S4.2: Construct the prediction network, which includes three fully convolutional network (FCN) branches. Each FCN branch consists of multiple Conv-BN-ReLU layers; the specific number of layers is not limited in this embodiment. The three branches are: a classification prediction branch that outputs the probability of target versus background; a center offset prediction branch that compensates for center point quantization errors; and a regression prediction branch that outputs the bounding box regression parameters that determine the target scale. The corresponding outputs are the classification score, the offset score, and the regression score.

[0032] S4.3: The concatenated features are fed into the three fully convolutional network branches to output the target information in the current frame. That is, the target information in the current frame includes: output classification score, offset score, and regression score.
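The patent lists the three outputs but does not describe how they are combined into a bounding box. One common way to decode such a center-based head, shown here purely as an assumption, is to take the peak of the classification map, refine it with the offset at that location, and read the box size from the regression map at the same location.

```python
import numpy as np

def decode_box(cls_map, offset_map, size_map):
    """cls_map: (H, W) target probabilities; offset_map, size_map: (2, H, W).

    Returns (cx, cy, w, h) normalized to [0, 1] and the peak confidence.
    The decoding rule is an illustrative assumption, not taken from the patent.
    """
    h, w = cls_map.shape
    iy, ix = np.unravel_index(np.argmax(cls_map), cls_map.shape)
    cx = (ix + offset_map[0, iy, ix]) / w      # offset corrects the grid quantization
    cy = (iy + offset_map[1, iy, ix]) / h
    bw, bh = size_map[0, iy, ix], size_map[1, iy, ix]
    return (cx, cy, bw, bh), float(cls_map[iy, ix])

cls_map = np.zeros((24, 24))
cls_map[10, 5] = 0.95
offset_map = np.zeros((2, 24, 24))
offset_map[:, 10, 5] = (0.5, 0.25)
size_map = np.zeros((2, 24, 24))
size_map[:, 10, 5] = (0.2, 0.1)
box, confidence = decode_box(cls_map, offset_map, size_map)
```

The peak confidence returned here is the kind of classification score that a template update rule can later compare against a threshold.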

[0033] The specific steps of S5 are as follows: S5.1: Based on a preset update strategy, determine whether the current frame needs a template update. Under the preset update strategy, a template update is deemed necessary when the classification confidence score exceeds a set value or when a set frame interval threshold is reached. In this embodiment, a template update is deemed necessary when the classification confidence score exceeds 0.9 or when the 30-frame interval threshold is reached.

[0034] S5.2: As shown in Figure 4, a dynamic template is generated if an update is needed. First, a feature-level filtering mask $M_f$ is generated based on the search frame features. Simultaneously, an image-level filtering mask $M_b$ is generated based on the predicted bounding box of the search frame. Subsequently, a bitwise AND operation (&) is performed on $M_f$ and $M_b$, and their intersection is used as the refined mask $M = M_f \,\&\, M_b$, thereby effectively removing background noise. Finally, the Hadamard product of this refined mask and the search image is computed to accurately extract the target region, ultimately obtaining dynamic template features for subsequent tracking.

[0035] S5.3: Replace the template features with dynamic template features; input the dynamic template features and the fixed template together into the dual-foveal interactive attention network for tracking in the next frame.
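The update decision in S5.1 and the mask refinement in S5.2 can be sketched together. The 0.9 confidence threshold and the 30-frame interval come from the embodiment. How the feature-level mask is derived is not stated, so thresholding a per-pixel foreground response at 0.5 is an assumption, as are the function names and the requirement that the response already be at image resolution. The sketch stops at the masked image; turning it into template features would reuse the feature extraction network.

```python
import numpy as np

def needs_template_update(confidence, frames_since_update,
                          conf_threshold=0.9, frame_interval=30):
    """Update when confidence exceeds the threshold or the frame interval is reached."""
    return confidence > conf_threshold or frames_since_update >= frame_interval

def dynamic_template(search_image, response, box, threshold=0.5):
    """search_image: (H, W, C); response: (H, W) foreground response (assumed input);
    box: (x0, y0, x1, y1) predicted bounding box in pixels.
    """
    feature_mask = response > threshold                 # feature-level filtering mask
    box_mask = np.zeros(response.shape, dtype=bool)     # image-level filtering mask
    x0, y0, x1, y1 = box
    box_mask[y0:y1, x0:x1] = True
    refined = feature_mask & box_mask                   # bitwise AND keeps the intersection
    masked = search_image * refined[..., None]          # Hadamard product with the image
    return masked, refined

response = np.zeros((8, 8))
response[2:6, 2:6] = 1.0
image = np.ones((8, 8, 3))
masked, refined = dynamic_template(image, response, (3, 3, 8, 8))
```

In this toy case the response covers rows and columns 2 to 5 and the box covers 3 to 7, so only the 3×3 overlap survives in the refined mask.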

[0036] Embodiment 2

This embodiment provides a UAV target tracking system based on the eagle-eye dual-foveal vision mechanism, including:
an image acquisition module, configured to acquire video images and divide the current video frame into template images and search images, and, in the first frame of the video, to crop an image centered on the target as a fixed template based on the given initial bounding box of the target;
a feature extraction network module, configured to input the template image and the search image into the feature extraction network to obtain template features and search features respectively;
a dual-foveal interactive attention network module, configured to construct a dual-foveal interactive attention network and input the template features and search features into it to obtain enhanced search features;
a prediction network module, configured to concatenate the search features with the enhanced search features and input them into the prediction network to obtain information about the target in the current frame;
a template update module, configured to determine, based on a preset update strategy, whether the current frame needs a template update; if an update is required, a refined foreground mask is generated based on the prediction result of the current frame, dynamic template features are extracted from the current frame based on the mask, and the template features are replaced with the dynamic template features; the dynamic template features and the fixed template are input together into the dual-foveal interactive attention network for tracking of the next frame.

[0037] It should be noted that the image acquisition module, feature extraction network module, dual foveal interactive attention network module, prediction network module, and template update module described above correspond to steps S1 to S5 in Embodiment 1. The examples and application scenarios implemented by these modules and their corresponding steps are the same, but they are not limited to the content disclosed in Embodiment 1. It should also be noted that these modules, as part of the system, can be executed in a computer system, such as a set of computer-executable instructions.

[0038] The descriptions of each embodiment in the above embodiments have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.

[0039] The proposed system can be implemented in other ways. For example, the system embodiments described above are merely illustrative, and the division of modules described above is only a logical functional division. In actual implementation, there may be other division methods. For example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed.

[0040] Embodiment 3

This embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein the processor is connected to the memory, and the one or more computer programs are stored in the memory. When the electronic device is running, the processor executes the one or more computer programs stored in the memory to cause the electronic device to perform the method described in Embodiment 1.

[0041] It should be understood that in this embodiment, the processor can be a central processing unit (CPU), or it can be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor can be a microprocessor or any conventional processor.

[0042] Memory may include read-only memory and random access memory, and provides instructions and data to the processor. A portion of memory may also include non-volatile random access memory. For example, memory may also store information about the device type.

[0043] In the implementation process, each step of the above method can be completed by the integrated logic circuits in the processor hardware or by software instructions.

[0044] The method in Embodiment 1 can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules within the processor. The software modules can reside in readily available storage media in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory; the processor reads information from the memory and, in conjunction with its hardware, completes the steps of the above method. To avoid repetition, a detailed description is not provided here.

[0045] Those skilled in the art will recognize that the units and algorithm steps described in connection with the various examples of this embodiment can be implemented in electronic hardware or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention.

[0046] Embodiment 4

Embodiment 4 of the present invention provides a computer-readable storage medium.

[0047] The computer-readable storage medium has a program stored thereon that, when executed by a processor, implements the steps of the method described in Embodiment 1 of the present invention.

[0048] The detailed steps are the same as those provided in Embodiment 1, and will not be repeated here.

[0049] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A UAV target tracking method based on an eagle-eye dual-foveal vision mechanism, characterized in that it comprises:
acquiring video images and dividing the current video frame into template images and search images; in the first frame of the video, cropping an image centered on the target as a fixed template based on the given initial bounding box of the target;
inputting the template image and the search image into a feature extraction network to obtain template features and search features respectively;
constructing a dual-foveal interactive attention network, and inputting the template features and search features into the dual-foveal interactive attention network to obtain enhanced search features;
concatenating the search features with the enhanced search features and inputting them into the prediction network to obtain the target information in the current frame;
determining, based on the preset update strategy, whether the current frame needs a template update; if an update is required, generating a refined foreground mask based on the prediction result of the current frame, extracting dynamic template features from the current frame based on the mask, and replacing the template features with the dynamic template features; and inputting the dynamic template features together with the fixed template into the dual-foveal interactive attention network for tracking in the next frame.

2. The UAV target tracking method based on the eagle-eye dual-foveal vision mechanism as described in claim 1, characterized in that the step of inputting the template image and the search image into the feature extraction network to obtain the corresponding template features and search features is as follows:
patch embedding is performed on the template image and the search image to obtain the one-dimensional feature sequences of the template image and the search image;
learnable positional encodings are introduced for each position in the one-dimensional feature sequences of the template image and the search image;
a feature extraction network is constructed, and an early candidate region elimination module is introduced into the feature extraction network;
the one-dimensional feature sequences of the template image and the search image, after incorporating the learnable positional encodings, are input into the feature extraction network, and the feature extraction network outputs template features and search features.

3. The UAV target tracking method based on the eagle-eye dual-foveal vision mechanism as described in claim 1, characterized in that the feature extraction network uses a Transformer model.

4. The UAV target tracking method based on the eagle-eye dual-foveal vision mechanism as described in claim 1, characterized in that the dual-foveal interactive attention network comprises a shallow foveal interactive attention module, a deep foveal interactive attention module, and a feature fusion unit; the shallow foveal interactive attention module captures coarse-grained semantic associations between the template image features and the search image features; the deep foveal interactive attention module mines fine-grained semantic associations between the template image features and the search image features; and the feature fusion unit fuses the outputs of the shallow foveal interactive attention module and the deep foveal interactive attention module to obtain the enhanced search features.
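The claims do not specify how the two attention modules or the fusion unit are built internally. The following NumPy sketch is one possible reading, under three stated assumptions: the coarse-grained branch lets the search tokens attend to an average-pooled template, the fine-grained branch attends to the full-resolution template, and fusion is a residual sum. All function names and the pooling factor are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Scaled dot-product attention: query tokens attend to context tokens."""
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d))
    return weights @ context

def pool_tokens(tokens, factor):
    """Average every `factor` consecutive tokens (assumed coarse-graining)."""
    n, d = tokens.shape
    return tokens[: n - n % factor].reshape(-1, factor, d).mean(axis=1)

def dual_foveal_attention(template_feat, search_feat, factor=4):
    # Assumed coarse-grained branch: attend to a pooled template.
    coarse = cross_attention(search_feat, pool_tokens(template_feat, factor))
    # Assumed fine-grained branch: attend to the full-resolution template.
    fine = cross_attention(search_feat, template_feat)
    # Assumed fusion: residual sum of both branches.
    return search_feat + coarse + fine

rng = np.random.default_rng(0)
template_feat = rng.normal(size=(64, 32))
search_feat = rng.normal(size=(256, 32))
enhanced = dual_foveal_attention(template_feat, search_feat)
# Claim 1: the original and enhanced search features are concatenated for the prediction network.
head_input = np.concatenate([search_feat, enhanced], axis=-1)
```

The enhanced features keep the shape of the search features, so the concatenation doubles the channel dimension seen by the prediction network.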

5. The UAV target tracking method based on the eagle-eye dual-foveal vision mechanism as described in claim 1, characterized in that the prediction network comprises three fully convolutional network branches, namely: a classification prediction branch that predicts the probability of each position belonging to the target or the background; an offset prediction branch that compensates for the center offset error introduced by quantizing the center point; and a regression prediction branch that determines the bounding box regression parameters and thereby the target scale; the corresponding outputs are a classification score, an offset score, and a regression score.
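One common way to turn three such outputs into a bounding box is to take the peak of the classification map, refine it with the offset at that cell, and read the size at the same cell. The sketch below follows that convention; the map layouts, the normalization of sizes to the search-image size, and the function name decode_box are assumptions, not details given in the claims.

```python
import numpy as np

def decode_box(cls_map, offset_map, size_map, search_size):
    """Decode (cx, cy, w, h) in pixels from the three head outputs.

    cls_map:    (G, G) target-vs-background scores
    offset_map: (2, G, G) sub-cell x/y offsets in [0, 1), undoing grid quantization
    size_map:   (2, G, G) width/height normalized by the search-image size
    """
    g = cls_map.shape[0]
    iy, ix = np.unravel_index(np.argmax(cls_map), cls_map.shape)
    score = float(cls_map[iy, ix])
    cx = (ix + offset_map[0, iy, ix]) / g * search_size
    cy = (iy + offset_map[1, iy, ix]) / g * search_size
    w = size_map[0, iy, ix] * search_size
    h = size_map[1, iy, ix] * search_size
    return (float(cx), float(cy), float(w), float(h)), score

# Toy maps on a 16 x 16 grid for a 256-pixel search image (assumed sizes).
cls_map = np.zeros((16, 16))
cls_map[5, 9] = 0.9
offset_map = np.zeros((2, 16, 16))
offset_map[:, 5, 9] = (0.5, 0.25)
size_map = np.zeros((2, 16, 16))
size_map[:, 5, 9] = (0.25, 0.125)

box, score = decode_box(cls_map, offset_map, size_map, search_size=256)
# box is (152.0, 84.0, 64.0, 32.0) and score is 0.9
```

Without the offset branch the center could only land on multiples of the cell size (16 pixels here); the offset recovers the sub-cell position.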

6. The UAV target tracking method based on the eagle-eye dual-foveal vision mechanism as described in claim 1, characterized in that the preset update strategy is as follows: when the classification confidence score exceeds a set value, or when a set frame interval threshold is reached, it is determined that a template update is required.
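The update rule of claim 6 reduces to a two-condition check. In the sketch below the threshold values, the function name, and the interpretation of the frame interval as "frames elapsed since the last update" are assumptions made for illustration.

```python
def needs_template_update(confidence, frames_since_update,
                          conf_threshold=0.7, frame_interval=25):
    """Return True when either trigger fires (threshold values are assumed)."""
    high_confidence = confidence > conf_threshold
    interval_reached = frames_since_update >= frame_interval
    return high_confidence or interval_reached

# Toy run: per-frame classification scores, with an assumed interval of 3 frames.
scores = [0.4, 0.9, 0.3, 0.2, 0.1]
frames_since_update = 0
update_frames = []
for frame_idx, score in enumerate(scores, start=1):
    frames_since_update += 1
    if needs_template_update(score, frames_since_update,
                             conf_threshold=0.7, frame_interval=3):
        update_frames.append(frame_idx)
        frames_since_update = 0
# update_frames is [2, 5]: frame 2 by confidence, frame 5 by interval
```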

7. The UAV target tracking method based on the eagle-eye dual-foveal vision mechanism as described in claim 1, characterized in that the process of generating a refined foreground mask based on the prediction results of the current frame and extracting dynamic template features from the current frame based on the mask specifically comprises: generating a feature-level filtering mask based on the search frame features, and generating an image-level filtering mask based on the predicted bounding box of the search frame; performing a bitwise AND operation on the feature-level filtering mask and the image-level filtering mask, so that their intersection forms the refined mask; and performing a Hadamard product operation on the refined mask and the search image to extract the target region and obtain the dynamic template features.
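The mask refinement in claim 7 can be sketched with NumPy as follows. How the feature-level mask is derived from the search features is not specified in the claim; here it is assumed to come from thresholding a per-token score map and upsampling it to image resolution. The function names, the threshold, and the toy sizes are hypothetical.

```python
import numpy as np

def box_mask(shape, box):
    """Image-level mask: True inside the predicted box (x0, y0, x1, y1)."""
    h, w = shape
    x0, y0, x1, y1 = box
    mask = np.zeros((h, w), dtype=bool)
    mask[max(y0, 0):min(y1, h), max(x0, 0):min(x1, w)] = True
    return mask

def feature_mask(score_map, image_shape, threshold=0.5):
    """Feature-level mask: threshold a token score map, then upsample (assumed)."""
    h, w = image_shape
    gh, gw = score_map.shape
    coarse = score_map > threshold
    return np.repeat(np.repeat(coarse, h // gh, axis=0), w // gw, axis=1)

# Toy 8 x 8 search image with a 4 x 4 token score map.
search_image = np.ones((8, 8, 3))
score_map = np.zeros((4, 4))
score_map[1:3, 1:3] = 0.9                      # tokens judged to be foreground
feat_mask = feature_mask(score_map, (8, 8))    # True on pixels [2:6, 2:6]
img_mask = box_mask((8, 8), (3, 0, 8, 5))      # True on rows 0..4, cols 3..7

refined = feat_mask & img_mask                 # bitwise AND = intersection
dynamic_template = search_image * refined[..., None]  # Hadamard product per channel
# refined keeps 9 pixels (rows 2..4, cols 3..5)
```

Taking the intersection removes background that falls inside the box as well as high-scoring regions that fall outside it.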

8. A UAV target tracking system based on an eagle-eye dual-foveal vision mechanism, characterized in that it comprises: an image acquisition module configured to acquire video images and divide the current video frame into a template image and a search image, and, in the first frame of the video, to crop an image centered on the target as a fixed template based on the given initial bounding box of the target; a feature extraction network module configured to input the template image and the search image into a feature extraction network to obtain template features and search features, respectively; a dual-foveal interactive attention network module configured to construct a dual-foveal interactive attention network and input the template features and the search features into the dual-foveal interactive attention network to obtain enhanced search features; a prediction network module configured to concatenate the search features with the enhanced search features and input the result into a prediction network to obtain the target information in the current frame; and a template update module configured to determine, based on a preset update strategy, whether the current frame requires a template update, wherein, if an update is required, a refined foreground mask is generated based on the prediction results of the current frame, dynamic template features are extracted from the current frame based on the mask, the template features are replaced with the dynamic template features, and the dynamic template features are input together with the fixed template into the dual-foveal interactive attention network for tracking in the next frame.

9. An electronic device, characterized in that it comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the UAV target tracking method based on the eagle-eye dual-foveal vision mechanism as described in any one of claims 1-7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps of the UAV target tracking method based on the eagle-eye dual-foveal vision mechanism as described in any one of claims 1-7.