An unmanned aerial vehicle-based traffic intersection rotation multi-target tracking method and system

By combining the rotation target detection and rotation attention similarity learning modules with a ternary quadratic cascaded matching strategy, the multi-target tracking of UAV aerial photography at traffic intersections is optimized, solving the tracking accuracy problem in UAV aerial photography at traffic intersections and improving the tracking accuracy and robustness of vehicle and pedestrian targets.

CN117372900B · Active · Publication Date: 2026-06-16 · SHANDONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANDONG UNIV
Filing Date: 2023-09-28
Publication Date: 2026-06-16

IPC: G06V20/17; G06V20/54; G06V10/74; G06V10/82; G06N3/045; G06N3/08

CPC: G06V20/17; G06V20/54; G06V10/761; G06V10/82; G06N3/045; G06N3/08; G06V2201/07

AI Tagging

Application Domain

Character and pattern recognition; Neural learning methods

AI Technical Summary

⚠Technical Problem

In drone aerial photography of traffic intersections, existing multi-target tracking algorithms have poor performance over long periods of time. Vehicles and pedestrians are densely packed with few features, and obstructions such as trees have a significant impact, resulting in low tracking accuracy.

⚗Method used

A rotating target detector and a rotating attention similarity learning module are used to extract features. A Kalman filter is used to predict bounding boxes. A ternary quadratic cascaded matching strategy based on Euclidean distance, area intersection-union ratio, and cosine distance is used to optimize target tracking, and the Hungarian algorithm is used for trajectory matching.

🎯Benefits of technology

It improves the accuracy of vehicle and pedestrian target tracking in drone aerial photography of traffic intersections, reduces the impact of obstructions, and enhances the robustness of long-term tracking.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a UAV-based rotating multi-target tracking method and system for traffic intersections. A traffic video stream is input and rotating target detection is performed frame by frame; a rotating attention similarity learning module is trained and used to extract the features of each target object. The cosine distance between the features of the target detection object in frame t and the features of the target tracking object in frame (t-1) is calculated. The Euclidean distance and the area intersection-union ratio between the detection box of the target detection object in frame t and the prediction box of the target tracking object in frame t are calculated. According to the Euclidean distance, the detection boxes are divided into neighboring detection boxes and distant detection boxes. For the neighboring detection boxes, the first cost matrix is input into the Hungarian algorithm to obtain the first tracking trajectory. For the distant detection boxes and the target tracking objects that failed to match the first time, the second cost matrix is input into the Hungarian algorithm to obtain the second tracking trajectory. The first and second tracking trajectories are merged to obtain the tracking trajectory of the target tracking object in the t-th frame image, and the target tracking object list is updated.

Description

Technical Field

[0001] This invention relates to the field of multi-target tracking technology in complex traffic scenarios, and in particular to a method and system for multi-target tracking at traffic intersections based on unmanned aerial vehicles (UAVs).

Background Technology

[0002] The statements in this section merely refer to the background art related to this invention and do not necessarily constitute prior art.

[0003] Vision-based multi-target tracking plays an increasingly important role in intelligent transportation systems, and drones, with their high mobility and flexibility, are widely used in intelligent traffic monitoring systems. Drone aerial videos contain rich traffic information, and processing these videos to automatically extract useful information is of great significance. Multi-target tracking based on drone aerial videos is an important component of traffic monitoring systems, playing a crucial role in traffic management, security monitoring, and autonomous driving.

[0004] Drone-based multi-target rotation tracking at traffic intersections presents some unique challenges.

[0005] 1) In traffic intersection scenarios, traffic participants engage in a large number of turning behaviors, and the orientation of vehicles and the posture of pedestrians change significantly from the perspective of the drone.

[0006] 2) Drones can move and track targets in three-dimensional space, but traffic intersections are densely packed with targets, and the features of vehicle and pedestrian targets are relatively few, which brings difficulties to target detection and ReID.

[0007] 3) From the perspective of drones, trees, traffic lights, overpasses, etc. in traffic intersections can cause long-term obstruction of vehicles and pedestrians, posing a challenge to multi-target tracking.

[0008] Regardless of the implementation platform, the core idea of multi-object tracking for traffic participants is to effectively associate targets in adjacent frames through object detection and ReID. The BoT-SORT algorithm proposes a camera motion compensation strategy to effectively alleviate the camera motion problem; the SMILEtrack algorithm proposes a Transformer-based similarity learning module to optimize the multi-object tracker. Although multi-object tracking algorithms based on horizontal detection boxes are relatively mature, in drone aerial photography of traffic intersections, due to the dense concentration of vehicles and pedestrians, there is significant redundancy and overlap between horizontal object detection boxes. Furthermore, traditional multi-object tracking methods are prone to failure during long-term tracking, often failing to achieve good performance when addressing the unique challenges of multi-object tracking at drone aerial traffic intersections.

Summary of the Invention

[0009] To address the shortcomings of existing technologies, this invention provides a method and system for rotating multi-target tracking at traffic intersections based on unmanned aerial vehicles (UAVs). This solves the problem of poor long-term tracking performance of existing technologies in UAV aerial photography scenarios at traffic intersections, and improves the tracking accuracy of traffic participants such as vehicles and pedestrians.

[0010] On the one hand, a method for multi-target rotation tracking at traffic intersections based on unmanned aerial vehicles (UAVs) is provided, including:

[0011] The traffic video stream is input frame by frame into the rotating target detector. All traffic participants in each frame are considered as target objects, resulting in a rotating detection box for each target object in each frame. The target object image in each detection box is rotated and aligned. The rotated and aligned target objects are then input into a trained rotation attention similarity learning network to extract the features of the target objects. The historical trajectory of the target tracking object in frame t-1 is input into the Kalman filter algorithm to obtain the predicted box of the target tracking object in frame t.

[0012] Calculate the cosine distance between the features of the target detected object in frame t and the features of the target tracked object in frame (t-1); calculate the Euclidean distance between the detection bounding box of the target detected object in frame t and the predicted bounding box of the target tracked object in frame t; calculate the area intersection-over-union ratio between the detection bounding box of the target detected object in frame t and the predicted bounding box of the target tracked object in frame t.

[0013] When the Euclidean distance is less than the set threshold, the detection box of the target object in the t-th frame image is identified as a neighboring detection box; when the Euclidean distance is greater than the set threshold, the detection box of the target object in the t-th frame image is identified as a distant detection box.

[0014] Based on the Euclidean distance and the area intersection-union ratio (IoU), a first cost matrix is determined; for neighboring detection boxes, the first cost matrix is input into the Hungarian algorithm to obtain the first tracking trajectory; based on the area intersection-union ratio and the cosine distance, a second cost matrix is determined; for distant detection boxes and target tracking objects that failed to match in the first attempt, the second cost matrix is input into the Hungarian algorithm to obtain the second tracking trajectory; the first and second tracking trajectories are merged to obtain the tracking trajectory of the target tracking object in the t-th frame image, and the target tracking object list is updated.

[0015] On the other hand, a UAV-based multi-target rotation tracking system for traffic intersections is provided, including:

[0016] The detection box and prediction box acquisition module is configured to: input the traffic video stream frame by frame into the rotating target detector, treat all traffic participants in each frame as target detection objects, and obtain the detection box of the target detection object in each frame; rotate and align the target detection object image in each detection box, input the aligned target detection object into the trained rotation attention similarity learning network, and extract the features of the target detection object; input the historical trajectory of the target tracking object in the (t-1)th frame into the Kalman filter algorithm to obtain the prediction box of the target tracking object in the tth frame;

[0017] The calculation module is configured to: calculate the cosine distance between the features of the target detected object in frame t and the features of the target tracked object in frame (t-1); calculate the Euclidean distance between the detection box of the target detected object in frame t and the prediction box of the target tracked object in frame t; and calculate the area intersection-union ratio between the detection box of the target detected object in frame t and the prediction box of the target tracked object in frame t.

[0018] The comparison module is configured to: when the Euclidean distance is less than a set threshold, identify the detection box of the target object in the t-th frame image as a neighboring detection box; when the Euclidean distance is greater than the set threshold, identify the detection box of the target object in the t-th frame image as a distant detection box.

[0019] The output module is configured to: determine a first cost matrix based on the Euclidean distance and the area intersection-union ratio; input the first cost matrix into the Hungarian algorithm for neighboring detection boxes to obtain the first tracking trajectory; determine a second cost matrix based on the area intersection-union ratio and the cosine distance; input the second cost matrix into the Hungarian algorithm for distant detection boxes and target tracking objects that failed to match in the first attempt to obtain the second tracking trajectory; merge the first tracking trajectory and the second tracking trajectory to obtain the tracking trajectory of the target tracking object in the t-th frame image, and update the target tracking object list.

[0020] Furthermore, an electronic device is also provided, including:

[0021] Memory, used for non-transitory storage of computer-readable instructions; and

[0022] Processor, for executing the computer-readable instructions,

[0023] When the computer-readable instructions are executed by the processor, they perform the method described in the first aspect above.

[0024] In another aspect, a storage medium is also provided for non-transitory storage of computer-readable instructions, wherein when the non-transitory computer-readable instructions are executed by a computer, the instructions of the method described in the first aspect are executed.

[0025] In another aspect, a computer program product is also provided, including a computer program that, when run on one or more processors, is used to implement the method described in the first aspect above.

[0026] One of the above technical solutions has the following advantages or beneficial effects:

[0027] To reduce the impact of frequent turns by traffic participants and drone photography, an attitude correction module was designed to unify the attitudes of traffic participants, thereby reducing the influence of different vehicle orientations and different pedestrian attitudes, and a more accurate rotation detection box was used to represent the tracking results.

[0028] To address the problem of dense targets and few target features at traffic intersections, a similarity learning module based on rotational self-attention mechanism (RA-SLM) was designed, which effectively improves the accuracy of vehicle and pedestrian re-identification in traffic intersection scenarios.

[0029] To alleviate the problem of long-term occlusion in complex traffic scenarios, a ternary quadratic cascaded matching strategy (TCM) based on Euclidean distance, area intersection-union ratio, and cosine distance is proposed, which improves the long-term tracking performance of vehicles and pedestrians in traffic intersection scenarios.

Description of the Drawings

[0030] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0031] Figure 1 is an example of a UAV multi-target tracking network based on rotational self-attention and ternary quadratic cascaded matching;

[0032] Figure 2 is the target detection box pose correction diagram of Example 1;

[0033] Figure 3 shows the overall structure of the similarity learning module (RA-SLM) network based on the rotational self-attention mechanism in Example 1;

[0034] Figure 4 is a network structure diagram of the CSA feature extraction module in Example 1;

[0035] Figure 5 is a flowchart of the ternary quadratic cascaded matching process in Example 1;

[0036] Figures 6(a) and 6(b) are schematic diagrams of two cascaded matchings in the ternary quadratic cascaded matching strategy TCM of Example 1;

[0037] Figure 7 is a flowchart of the two matching processes in Example 1;

[0038] Figures 8(a) and 8(b) show the results of multi-target tracking in Example 1;

[0039] Wherein, ① represents the first target object to be tracked; ② represents the second target object to be tracked.

Detailed Implementation

[0040] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used in this invention have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0041] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.

[0042] All data acquisition in this embodiment is carried out in accordance with laws and regulations and with user consent, and the data is used legally.

[0043] As shown in Figure 1, this invention proposes a novel UAV-based multi-target rotation tracking structure for traffic intersections, which features accurate bounding boxes and enhanced robustness.

[0044] First, to reduce tracking loss caused by turning traffic participants and by the drone's own shooting motion, an attitude correction module based on the rotating target detection results is designed to unify the vehicle attitude, and a similarity learning module based on rotational self-attention (RA-SLM) is proposed to extract the appearance features of the target.

[0045] Secondly, to alleviate the problem of long-term occlusion in complex traffic scenarios, a ternary quadratic cascade matching strategy (TCM) based on Euclidean distance, area intersection-union ratio, and cosine distance is proposed. First, the target rotation detection box is divided into neighboring detection boxes and distant detection boxes according to the Euclidean distance with the estimated state. Then, two cascade matchings are performed based on the fused cost matrix.

[0046] Example 1

[0047] This embodiment provides a method for multi-target rotation tracking at traffic intersections based on unmanned aerial vehicles (UAVs);

[0048] As shown in Figure 7, a method for multi-target rotation tracking at traffic intersections based on unmanned aerial vehicles (UAVs) includes:

[0049] S101: Input the traffic video stream frame by frame into the rotating target detector, treat all traffic participants in each frame as target detection objects, and obtain the rotating detection box for each target detection object in each frame; perform a rotation alignment operation on the target detection object image in each detection box, and input the rotated and aligned target detection object into the trained rotation attention similarity learning network to extract the features of the target detection object; input the historical trajectory of the target tracking object in the (t-1)th frame into the Kalman filter algorithm to obtain the predicted box of the target tracking object in the tth frame image;

[0050] S102: Calculate the cosine distance between the features of the target detected object in frame t and the features of the target tracked object in frame (t-1); calculate the Euclidean distance between the detection box of the target detected object in frame t and the prediction box of the target tracked object in frame t; calculate the area intersection-union ratio between the detection box of the target detected object in frame t and the prediction box of the target tracked object in frame t.

[0051] S103: When the Euclidean distance is less than the set threshold, the detection box of the target object in the t-th frame image is identified as a neighboring detection box; when the Euclidean distance is greater than the set threshold, the detection box of the target object in the t-th frame image is identified as a distant detection box.

[0052] S104: Determine the first cost matrix based on the Euclidean distance and the area intersection-union ratio; for the neighboring detection boxes, input the first cost matrix into the Hungarian algorithm to obtain the first tracking trajectory; determine the second cost matrix based on the area intersection-union ratio and the cosine distance; for the distant detection boxes and the target tracking objects that failed the first match, input the second cost matrix into the Hungarian algorithm to obtain the second tracking trajectory; merge the first tracking trajectory and the second tracking trajectory to obtain the tracking trajectory of the target tracking object in the t-th frame image, and update the target tracking object list.

[0053] It should be understood that the target detection object refers to the image within the detection box of all traffic participants in each frame of the image; the target tracking object refers to the object in the target tracking object list.
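
The two-stage association of S104 can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names `assign` and `cascade_match`, the gate `max_cost`, and the toy cost functions are all assumptions, and a brute-force search over permutations stands in for the Hungarian algorithm (it returns the same minimum-cost assignment for small inputs). The nearby/distant index lists are assumed to come from S103.

```python
import math
from itertools import permutations


def assign(cost, max_cost):
    """Minimum-cost assignment by brute force (stand-in for the Hungarian algorithm).

    cost[r][c] is the cost of pairing row r with column c. Pairs whose cost
    exceeds max_cost (a hypothetical gate, not from the patent) are dropped.
    """
    n_rows = len(cost)
    n_cols = len(cost[0]) if cost else 0
    if n_rows == 0 or n_cols == 0:
        return []
    if n_rows > n_cols:  # transpose so that rows <= columns
        flipped = assign([list(col) for col in zip(*cost)], max_cost)
        return sorted((r, c) for c, r in flipped)
    best = min(permutations(range(n_cols), n_rows),
               key=lambda cols: sum(cost[r][c] for r, c in enumerate(cols)))
    return [(r, c) for r, c in enumerate(best) if cost[r][c] <= max_cost]


def cascade_match(dets, tracks, nearby, distant, cost_near, cost_far, max_cost=0.5):
    """Return (matches, unmatched_dets, unmatched_tracks) as index lists."""
    track_ids = list(range(len(tracks)))
    # Stage 1: neighboring detections against every track, first cost matrix.
    c1 = [[cost_near(dets[d], tracks[t]) for t in track_ids] for d in nearby]
    m1 = [(nearby[r], track_ids[c]) for r, c in assign(c1, max_cost)]
    matched_d1 = {d for d, _ in m1}
    matched_t1 = {t for _, t in m1}
    d_remain = [d for d in nearby if d not in matched_d1]
    tl_remain = [t for t in track_ids if t not in matched_t1]
    # Stage 2: distant detections against the tracks left over from stage 1.
    c2 = [[cost_far(dets[d], tracks[t]) for t in tl_remain] for d in distant]
    m2 = [(distant[r], tl_remain[c]) for r, c in assign(c2, max_cost)]
    matched_d2 = {d for d, _ in m2}
    matched_t2 = {t for _, t in m2}
    d_rremain = [d for d in distant if d not in matched_d2]
    tl_rremain = [t for t in tl_remain if t not in matched_t2]
    return m1 + m2, d_remain + d_rremain, tl_rremain


# Toy data: (x, y, scalar appearance feature). Both cost functions are illustrative only.
tracks = [(0.0, 0.0, 0.10), (100.0, 100.0, 0.90)]
dets = [(3.0, 4.0, 0.10), (300.0, 300.0, 0.88)]
cost_near = lambda d, t: math.dist(d[:2], t[:2]) / 25.0
cost_far = lambda d, t: abs(d[2] - t[2])
matches, lost_dets, lost_tracks = cascade_match(
    dets, tracks, nearby=[0], distant=[1], cost_near=cost_near, cost_far=cost_far)
```

In this toy run the first detection is matched by position in stage 1, and the second detection, which is far from every prediction, is recovered in stage 2 through the appearance cost. For realistic numbers of targets, a polynomial-time solver such as SciPy's `linear_sum_assignment` would be the usual replacement for `assign`.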

[0054] Furthermore, as shown in Figure 2, in S101, inputting the traffic video stream frame by frame into the rotating target detector, treating all traffic participants in each frame as target detection objects, and obtaining a rotating detection box for each target detection object in each frame includes:

[0055] Multiple targets in a traffic video stream are identified by rotation detection boxes. One rotation detection box is identified for each target, and several rotation detection boxes are identified for each frame.

[0056] For example, if there are P traffic participants in a certain frame of an image, then there will be P rotating detection boxes.

[0057] In this embodiment of the invention, the recognition of the rotating detection box is achieved using the open-source rotating target detection model GlidingVertex.

[0058] Furthermore, the traffic video stream in S101 is collected by a drone.

[0059] Further, S101: Rotating and aligning the target detection object image in each detection box specifically includes:

[0060] Each rotating detection frame is cut out, and the orientation of each cut-out detection frame is corrected so that the long side of the detection frame is parallel or perpendicular to the horizontal plane.
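
The cut-out and orientation correction described above can be illustrated with a small nearest-neighbor resampler. This is a sketch under assumptions: the box parameterization `(cx, cy, w, h, theta)` with `theta` in radians, the function name `align_rotated_crop`, and the choice to lay the long side out horizontally are all hypothetical, and the image is a plain list of rows rather than a real video frame.

```python
import math


def align_rotated_crop(image, cx, cy, w, h, theta):
    """Crop a rotated box and return it with its long side laid out horizontally.

    image is a list of rows; (cx, cy) is the box center in pixels, w is the
    side length along direction theta, h the side length perpendicular to it.
    """
    if h > w:  # make w the long side by turning the box frame by 90 degrees
        w, h, theta = h, w, theta + math.pi / 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out_w, out_h = int(round(w)), int(round(h))
    rows, cols = len(image), len(image[0])
    crop = []
    for v in range(out_h):
        dy = v - (out_h - 1) / 2  # offset along the short side
        row = []
        for u in range(out_w):
            dx = u - (out_w - 1) / 2  # offset along the long side
            x = cx + dx * cos_t - dy * sin_t
            y = cy + dx * sin_t + dy * cos_t
            xi, yi = int(round(x)), int(round(y))
            inside = 0 <= yi < rows and 0 <= xi < cols
            row.append(image[yi][xi] if inside else 0)  # zero padding outside the frame
        crop.append(row)
    return crop


# 6x6 test image whose pixel value encodes its position as 10 * row + column.
image = [[10 * y + x for x in range(6)] for y in range(6)]
upright = align_rotated_crop(image, cx=2.5, cy=2.0, w=4, h=3, theta=0.0)
turned = align_rotated_crop(image, cx=2.0, cy=2.5, w=3, h=4, theta=0.0)
```

The second call passes a box whose long side is vertical, so the crop comes back turned by 90 degrees with the long side horizontal. The later resize to 32*64 is not part of this sketch.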

[0061] Furthermore, as shown in Figure 3, in S101 the aligned target detection object is input into the trained rotational attention similarity learning network to extract the features of the target detection object. The trained rotational attention similarity learning network is used as follows:

[0062] First, the aligned target detection objects in frame t are resized to the same size 32*64, and then input into OR-ResNet18 built with rotated convolution kernels to obtain feature maps. The feature maps output by the feature extraction network are then split into several slices according to channels.

[0063] For each slice, add a positional code to obtain a sequence of feature maps with positional codes;

[0064] The sequence of feature maps with positional encoding is input into the attention mechanism module;

[0065] The output values of the attention mechanism module are concatenated to obtain the concatenated features;

[0066] The concatenated features are input into the fully connected layer to obtain the feature vector of the current target object in the t-th frame image; thus, the features of the target detection object are obtained.

[0067] Similarly, feature vectors are extracted from all target objects in the t-th frame image to obtain the feature vectors corresponding to all target objects in the t-th frame image.

[0068] Furthermore, the rotational attention similarity learning network includes:

[0069] When two different images are input simultaneously, they first pass through a CSA feature extraction module with shared weights, and then the features are aggregated using a fully connected layer to obtain a feature vector of length 1024. Then, the cosine similarity between the two images is calculated. The smaller the cosine distance, the higher the correlation between the target features.

[0070] Furthermore, as shown in Figure 4, the CSA feature extraction module includes:

[0071] The OR-ResNet18 module, channel splitting module, position encoding module, self-attention mechanism module, concatenation module, and fully connected layer are connected sequentially.

[0072] OR-ResNet18 is used to extract feature maps;

[0073] The channel splitting module is used to split the extracted feature map into several slices according to channels;

[0074] The position encoding module is used to add position encoding to each slice to obtain a sequence of feature maps with position encoding.

[0075] The self-attention mechanism module is used to process feature map sequences with positional encoding;

[0076] The concatenation module is used to concatenate the output values of the attention mechanism module to obtain concatenated features;

[0077] The fully connected layer is used to process the concatenated features to obtain the feature vector of the current target object in the t-th frame image; thus, the features of the target detection object are obtained.

[0078] Furthermore, the OR-ResNet18 built with rotated convolutional kernels replaces the original ResNet18's 7*7 convolutional kernels with 3*3 convolutional kernels, and then replaces all convolutional kernels with rotated convolutional kernels.

[0079] Furthermore, the OR-ResNet18, built with rotated convolutional kernels, replaces the original ResNet18's 7x7 convolutional kernels with 3x3 convolutional kernels to reduce the number of parameters and improve network speed. Then, the convolutional kernels are replaced with rotated convolutional kernels to accommodate ReID of target images with multiple orientations after rotation and alignment. The number of learnable channels is reduced by a factor of four compared to ResNet-18, and the number of parameters is also reduced by a factor of four. This invention refers to the improved ResNet18 as OR-ResNet18, and the overall network structure of OR-ResNet18 is shown in Table 1. In practical applications, the number of convolutional layer channels can be increased according to the complexity of the dataset to obtain better feature extraction results and achieve higher similarity for the same target in different poses.

[0080] Table 1 OR-ResNet-18 Network Structure Table

[0081]

[0082] To fully integrate the relationships of the same target from multiple orientations, after the last layer of OR-ResNet18, the 512-channel feature map is split into four 128-channel patches according to the order of the rotating convolution kernels. Then, for each patch, a position embedding is added. Each patch can be represented by the following equation:

[0083] $$S_i = S_i + E_p, \quad i = A, B, C, D, \quad E_p = 1, 2, 3, 4$$

[0084] Finally, the application yields a feature map sequence $S = \{S_A, \dots, S_D\}$ containing location information, which serves as the input to the Attention Block.
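
The channel split and position embedding can be sketched in a few lines. The sketch reads the values 1 to 4 literally as scalars added to each patch, which is an assumption; a learned embedding vector would be the more common choice. The function name `split_and_encode` and the representation of a feature map as a list of flattened channels are hypothetical.

```python
def split_and_encode(feature_map, n_patches=4):
    """Split channels into equal patches and add a scalar position embedding.

    feature_map is a list of channels, each channel a flat list of activations.
    Patch p (0-based) receives the embedding value p + 1.
    """
    n_channels = len(feature_map)
    if n_channels % n_patches:
        raise ValueError("channel count must be divisible by the number of patches")
    size = n_channels // n_patches
    patches = []
    for p in range(n_patches):
        chunk = feature_map[p * size:(p + 1) * size]
        e_p = p + 1  # position embedding E_p for this patch
        patches.append([[value + e_p for value in channel] for channel in chunk])
    return patches


# A dummy 512-channel feature map with two spatial positions per channel.
feature_map = [[0.0, 0.0] for _ in range(512)]
patches = split_and_encode(feature_map)
```

With 512 input channels this yields four patches of 128 channels each, matching the split described for the last layer of OR-ResNet18.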

[0085] The Transformer computes the attention function by packing queries into matrix Q, and also packing keys and values into matrices K and V. The attention computation is represented as follows:

[0086] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

[0087] Where $d_k$ is the dimension of the key vector. To generate the queries, keys, and values of the attention block, this invention applies a fully connected layer to each patch. Each patch has an output $S_i$ after passing through the QKV attention block. The outputs $S = \{S_A, \dots, S_D\}$ of the patches can be expressed by the following equations:

[0088] $$S_A = \mathrm{SA}(Q_{S_1}, K_{S_1}, V_{S_1}) + \mathrm{CA}(Q_{S_1}, K_{S_2}, V_{S_2}) + \mathrm{CA}(Q_{S_1}, K_{S_3}, V_{S_3}) + \mathrm{CA}(Q_{S_1}, K_{S_4}, V_{S_4})$$

[0089] $$S_B = \mathrm{SA}(Q_{S_2}, K_{S_2}, V_{S_2}) + \mathrm{CA}(Q_{S_2}, K_{S_1}, V_{S_1}) + \mathrm{CA}(Q_{S_2}, K_{S_3}, V_{S_3}) + \mathrm{CA}(Q_{S_2}, K_{S_4}, V_{S_4})$$

[0090] $$S_C = \mathrm{SA}(Q_{S_3}, K_{S_3}, V_{S_3}) + \mathrm{CA}(Q_{S_3}, K_{S_1}, V_{S_1}) + \mathrm{CA}(Q_{S_3}, K_{S_2}, V_{S_2}) + \mathrm{CA}(Q_{S_3}, K_{S_4}, V_{S_4})$$

[0091] $$S_D = \mathrm{SA}(Q_{S_4}, K_{S_4}, V_{S_4}) + \mathrm{CA}(Q_{S_4}, K_{S_1}, V_{S_1}) + \mathrm{CA}(Q_{S_4}, K_{S_2}, V_{S_2}) + \mathrm{CA}(Q_{S_4}, K_{S_3}, V_{S_3})$$

[0092] $Q_{S_i}$ represents the query matrix of $S_i$, $K_{S_i}$ represents the key matrix of $S_i$, $V_{S_i}$ represents the value matrix of $S_i$, SA represents self-attention, and CA represents cross-attention. $S_A$, $S_B$, $S_C$, $S_D$ represent the feature vectors of the four slices respectively.

[0093] The image is input into the OR-ResNet18 network to obtain feature maps, which are then split into four patches according to channel order. Each feature map patch undergoes attention calculation via an attention block, and the output is concatenated to obtain the image's attention features.
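
The sum of one self-attention term and three cross-attention terms per patch can be sketched in plain Python. Several simplifications are assumptions made for illustration: the fully connected layers that produce Q, K, and V are replaced by identity projections, the attention is the standard scaled dot-product form, each patch is a list of token vectors, and the names `attention` and `fuse_patches` are hypothetical.

```python
import math


def attention(q, k, v):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v_row[c] for w, v_row in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def fuse_patches(patches):
    """For patch i, add self-attention (j == i) and cross-attention (j != i) outputs."""
    fused = []
    for s_i in patches:
        acc = [[0.0] * len(s_i[0]) for _ in s_i]
        for s_j in patches:
            # Queries come from patch i, keys and values from patch j.
            term = attention(s_i, s_j, s_j)
            acc = [[a + t for a, t in zip(acc_row, term_row)]
                   for acc_row, term_row in zip(acc, term)]
        fused.append(acc)
    return fused


# Four identical single-token patches: every attention term returns the token itself.
patches = [[[1.0, 2.0]] for _ in range(4)]
fused = fuse_patches(patches)
```

Because each patch here has a single token, every softmax weight is 1 and each fused output is simply four times the input token, which makes the summation structure easy to check by hand.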

[0094] Furthermore, the training process of the trained rotational attention similarity learning network includes:

[0095] Construct a training set, which is a ReID dataset consisting of images of the target tracking object;

[0096] The training set is input into the rotational attention similarity learning network to train the network. Training is stopped when the network's loss function value no longer decreases or the number of iterations exceeds a set number, resulting in the trained rotational attention similarity learning network.
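
The stopping rule can be sketched as a loop around an arbitrary training step. The parameters `patience` and `min_delta`, which decide when the loss counts as no longer decreasing, are assumptions; the source only states the two stopping conditions. The function name `train_until_converged` and the callable `step` are hypothetical.

```python
def train_until_converged(step, max_iters=1000, patience=5, min_delta=1e-4):
    """Call step() until the loss plateaus or max_iters is reached.

    step is any callable that runs one training iteration and returns the loss.
    Returns (iterations_run, best_loss).
    """
    best = float("inf")
    stale = 0  # consecutive iterations without a meaningful improvement
    iterations = 0
    for iterations in range(1, max_iters + 1):
        loss = step()
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return iterations, best


# Replay a recorded loss curve in place of a real training step.
recorded = iter([1.0, 0.5, 0.4, 0.4, 0.4, 0.4])
result = train_until_converged(recorded.__next__, max_iters=10, patience=3)
```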

[0097] Further, S102: Calculate the cosine distance between the features of the target detection object in frame t and the features of the target tracking object in frame t-1, specifically including:

[0098]

[0099] Among them, $M_a(A, B)$ represents the cosine distance between the feature vectors of target object A and target object B in the t-th frame image, $A_i$ represents the i-th dimension of the feature vector of target object A, $B_i$ represents the i-th dimension of the feature vector of target object B, and n represents the dimension of the feature vector.
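
A cosine distance over two feature vectors can be computed as below. The form used here, one minus the cosine similarity, is an assumption: it is a common definition and agrees with the statement that a smaller distance means higher correlation, but the exact expression in the patent is not reproduced in this text. The function name `cosine_distance` is hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(a, b) for two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Under this definition the value ranges from 0 for vectors pointing the same way to 2 for opposite vectors, and it ignores vector length.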

[0100] It should be understood that cosine distance is a similarity metric that can be used to measure differences between individuals across dimensions. This invention generates feature vectors for corresponding vehicle or pedestrian targets during the matching phase and stores them after tracking the target's location information.

[0101] Further, S102: Calculate the Euclidean distance between the detection bounding box of the target object in the t-th frame image and the predicted bounding box of the target tracking object in the t-th frame image:

[0102] $$M_s(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

[0103] Among them, $M_s(x, y)$ represents the Euclidean distance, n represents the vector dimension, $x_i$ is the i-th dimension of the current observed coordinates of the target detection object x, and $y_i$ is the i-th dimension of the current observed coordinates of the target object y.

[0104] The smaller the Euclidean distance, the smaller the motion difference between the predicted bounding box and the detected bounding box of the corresponding tracked target.

[0105] It should be understood that Euclidean distance is a common distance metric used to measure the spatial distance between individuals; the greater the distance, the greater the difference between individuals. This invention uses Euclidean distance to calculate the distance between the predicted bounding box and the detection bounding box of the tracked target.

[0106] Further, S102: Calculate the area intersection-union ratio (IoU) between the detection bounding box of the target detection object in frame t and the predicted bounding box of the target tracking object in frame t:

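[0107] $$\mathrm{RIOU}(i, j) = \frac{\mathrm{Area}(I)}{\mathrm{Area}(I) + \mathrm{Area}(R_i) + \mathrm{Area}(R_j)}$$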

[0108] $$M_m(i, j) = 1 - \mathrm{RIOU}(i, j) \cdot \mathrm{Score}$$

[0109] Where RIOU(i, j) represents the area intersection-union ratio between the predicted bounding box and the detected bounding box, $M_m(i, j)$ represents the area intersection-union cost, Score represents the detection box confidence, Area(I) represents the overlap area between the predicted box and the detection box, Area($R_i$) represents the non-overlapping area of the predicted box, and Area($R_j$) represents the non-overlapping area of the detection box.
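
The rotated intersection-union ratio and the resulting cost can be sketched with polygon clipping. This is an illustration under assumptions: boxes are given as `(cx, cy, w, h, theta)` with `theta` in radians, the overlap polygon is found with Sutherland-Hodgman clipping, and the ratio is taken as the overlap area divided by the overlap plus the two non-overlapping areas, following the symbol definitions above. All function names are hypothetical.

```python
import math


def corners(box):
    """Corner points of a rotated box (cx, cy, w, h, theta in radians)."""
    cx, cy, w, h, theta = box
    c, s = math.cos(theta), math.sin(theta)
    offsets = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in offsets]


def area(poly):
    """Shoelace area of a simple polygon."""
    pairs = zip(poly, poly[1:] + poly[:1])
    return 0.5 * abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs))


def clip(subject, clipper):
    """Sutherland-Hodgman clipping of subject against a convex clipper polygon."""
    result = subject
    for (ax, ay), (bx, by) in zip(clipper, clipper[1:] + clipper[:1]):
        if not result:
            break
        points, result = result, []
        for p, q in zip(points, points[1:] + points[:1]):
            # Signed side of p and q relative to the clip edge a->b (>= 0 is inside).
            sp = (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)
            sq = (bx - ax) * (q[1] - ay) - (by - ay) * (q[0] - ax)
            if sp >= 0:
                result.append(p)
            if sp * sq < 0:  # segment p->q crosses the clip line
                t = sp / (sp - sq)
                result.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return result


def rotated_iou(box_i, box_j):
    """Overlap area over (overlap + non-overlapping parts of both boxes)."""
    poly_i, poly_j = corners(box_i), corners(box_j)
    overlap_poly = clip(poly_i, poly_j)
    overlap = area(overlap_poly) if len(overlap_poly) >= 3 else 0.0
    rest_i = area(poly_i) - overlap  # non-overlapping part of the predicted box
    rest_j = area(poly_j) - overlap  # non-overlapping part of the detection box
    union = overlap + rest_i + rest_j
    return overlap / union if union > 0 else 0.0


def riou_cost(pred_box, det_box, score):
    """Cost of the form 1 - RIOU * Score."""
    return 1.0 - rotated_iou(pred_box, det_box) * score


shifted = rotated_iou((0, 0, 2, 2, 0.0), (1, 0, 2, 2, 0.0))
turned = rotated_iou((0, 0, 2, 2, 0.0), (0, 0, 2, 2, math.pi / 4))
cost = riou_cost((0, 0, 2, 2, 0.0), (1, 0, 2, 2, 0.0), score=0.9)
```

Multiplying by the detection confidence means that a low-confidence detection pays a higher cost even when it overlaps the prediction well.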

[0110] Figure 5 illustrates the cascaded matching strategy. First, the target detection boxes (Rbbox) in frame t are obtained. Then, based on their Euclidean distance from the predicted trajectories, they are divided into two parts: Nearby Rbbox and Distant Rbbox. Finally, the target detection objects and the target tracking objects are matched in two cascaded stages according to the cost matrices.

[0111] Further, in S103: when the Euclidean distance is less than a set threshold, the detection box of the target object in the t-th frame image is identified as a neighboring detection box; when the Euclidean distance is greater than the set threshold, the detection box of the target object in the t-th frame image is identified as a distant detection box, wherein the set threshold specifically refers to a pixel distance of 25.
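
The split into neighboring and distant detection boxes can be sketched as a distance test against the predicted boxes. Two points are assumptions: the distance is measured between box centers, and a detection counts as neighboring when at least one prediction lies within the threshold. A distance exactly equal to the threshold is treated as distant here, since the source does not cover that case. The function name `split_by_distance` is hypothetical.

```python
import math


def split_by_distance(det_centers, pred_centers, threshold=25.0):
    """Return (nearby, distant) lists of detection indices.

    A detection is nearby when its center is closer than threshold pixels
    to the center of at least one predicted box.
    """
    nearby, distant = [], []
    for index, det in enumerate(det_centers):
        closest = min((math.dist(det, pred) for pred in pred_centers), default=math.inf)
        if closest < threshold:
            nearby.append(index)
        else:
            distant.append(index)
    return nearby, distant


pred_centers = [(0.0, 0.0), (100.0, 100.0)]
det_centers = [(3.0, 4.0), (300.0, 300.0), (110.0, 100.0)]
nearby, distant = split_by_distance(det_centers, pred_centers)
```

With no predicted boxes at all, every detection falls into the distant list.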

[0112] In Figures 6(a) and 6(b), the dashed boxes represent predicted boxes, and the solid boxes represent detected boxes. In Figure 6(a), the predicted box of the first target object ① has neighboring detected boxes, so the first cascade matching is performed, and the first detected box is successfully matched within the neighboring detected boxes. In Figure 6(b), the predicted box of the second target object ② has no neighboring detected boxes, so the second cascade matching is performed directly, and the second detected box is successfully matched within the distant detected boxes.

[0113] Furthermore, S104 also includes:

[0114] Stage 1: Calculate the first cost matrix $C_{nearby}$ and use the Hungarian algorithm with $C_{nearby}$ to complete the first (priority) matching. Targets in the neighboring detection boxes $D_{nearby}$ that fail to match and trajectories in the tracking list TL that fail to match are placed into the first unmatched detection set $D_{remain}$ and the first unmatched tracking object set $TL_{remain}$, respectively.

[0115] Stage 2: The second stage first calculates the second cost matrix $C_{distant}$ between the distant detection boxes $D_{distant}$ and the first unmatched tracking objects $TL_{remain}$, and then performs the second cascaded matching.

[0116] The next step is the same as in the first stage: targets in the distant detection boxes $D_{distant}$ that fail to match and trajectories in $TL_{remain}$ that fail to match are placed into the second unmatched detection set $D_{rremain}$ and the second unmatched tracking object set $TL_{rremain}$, respectively.

[0117] After completing the target association phase, this invention sets a threshold H = 0.7 for initializing new trajectories: unmatched detection boxes in $D_{remain}$ and $D_{rremain}$ with a confidence level higher than 0.7 are used to initialize new trajectories. A tracking object in $TL_{rremain}$ that fails to match for more than 100 consecutive frames is deleted. The tracking result is finally obtained in the form of rotated bounding boxes.
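The two stages and the new-trajectory threshold can be sketched in Python as below. This is an illustrative outline, not the patented implementation: `assign` is an exhaustive-search stand-in for the Hungarian algorithm that is practical only for very small matrices, and the `max_cost` gate, the function names, and the toy cost values are all assumptions.

```python
from itertools import permutations

H = 0.7  # confidence threshold for starting a new trajectory (from the text)

def assign(cost, max_cost):
    """Minimum-cost one-to-one assignment by exhaustive search.

    Stand-in for the Hungarian algorithm; cost[i][j] pairs track i with
    detection j. Pairs costing more than max_cost (assumed gate) are dropped.
    """
    n_rows = len(cost)
    n_cols = len(cost[0]) if n_rows else 0
    if n_rows == 0 or n_cols == 0:
        return []
    if n_rows <= n_cols:
        candidates = ([(i, p[i]) for i in range(n_rows)]
                      for p in permutations(range(n_cols), n_rows))
    else:
        candidates = ([(p[j], j) for j in range(n_cols)]
                      for p in permutations(range(n_rows), n_cols))
    best = min(candidates, key=lambda pairs: sum(cost[i][j] for i, j in pairs))
    return [(i, j) for i, j in best if cost[i][j] <= max_cost]

def cascade_match(tracks, d_nearby, d_distant, cost_nearby, cost_distant,
                  max_cost=1.0):
    """Return (matches, D_remain, D_rremain, TL_rremain)."""
    def stage(tl, dets, cost_fn):
        cost = [[cost_fn(t, d) for d in dets] for t in tl]
        pairs = [(tl[i], dets[j]) for i, j in assign(cost, max_cost)]
        used_t = {t for t, _ in pairs}
        used_d = {d for _, d in pairs}
        return (pairs,
                [t for t in tl if t not in used_t],
                [d for d in dets if d not in used_d])

    # Stage 1: all tracks against neighboring detections.
    m1, tl_remain, d_remain = stage(tracks, d_nearby, cost_nearby)
    # Stage 2: tracks left over from stage 1 against distant detections.
    m2, tl_rremain, d_rremain = stage(tl_remain, d_distant, cost_distant)
    return m1 + m2, d_remain, d_rremain, tl_rremain

# Toy costs and confidences (illustrative values only)
near_cost = {("T1", "a"): 0.2, ("T2", "a"): 0.9}
far_cost = {("T2", "b"): 0.4, ("T2", "c"): 1.5}
conf = {"a": 0.95, "b": 0.9, "c": 0.8}

matches, d_remain, d_rremain, tl_rremain = cascade_match(
    ["T1", "T2"], ["a"], ["b", "c"],
    lambda t, d: near_cost[(t, d)], lambda t, d: far_cost[(t, d)])
new_tracks = [d for d in d_remain + d_rremain if conf[d] > H]
print(matches)     # [('T1', 'a'), ('T2', 'b')]
print(new_tracks)  # ['c']
```

In the toy run, track T1 is matched in the first stage, T2 falls through to the second stage, and the leftover detection `c` starts a new trajectory because its confidence exceeds H.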

[0118] Furthermore, S104 also includes:

[0119] If the first match is successful, the neighboring detection box of the target detection object in frame t is connected to the target tracking object in frame t-1 to obtain the first tracking trajectory; if the first match fails, the neighboring detection box with a confidence level higher than the set threshold is considered a new candidate tracking object, and the new candidate tracking object is stored in the temporary table.

[0120] If the second match is successful, the distant detection box of the target detection object in frame t is connected to the target tracking object in frame t-1 to obtain the second tracking trajectory; if the second match fails, the distant detection box with a confidence level higher than the set threshold is considered a new candidate tracking object, and the new candidate tracking object is stored in the temporary table.

[0121] Further, S104: Determine the first cost matrix based on the Euclidean distance and the area intersection-union ratio; wherein, the calculation process of the first cost matrix is as follows:

[0122]

[0123]

[0124] $C_{nearby} = M_m + M_s$

[0125] Here, $M_s$ represents the Euclidean distance matrix, $M_m$ represents the area intersection-union ratio cost matrix, $\alpha$ represents the balance coefficient, set to 1/25, $\theta_{seu}$ represents the Euclidean distance threshold, set to 25, $\theta_{iou}$ represents the area intersection-union ratio threshold, set to 0.5, and $C_{nearby}$ represents the first cost matrix.
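A minimal Python sketch of the sum $C_{nearby} = M_m + M_s$ follows. The exact definition of $M_s$ is not reproduced in this text, so the sketch assumes $M_s(i,j) = \alpha \cdot d(i,j)$, where $d$ is the Euclidean distance; with $\alpha = 1/25$, distances below the 25-pixel threshold scale to values below 1, comparable to the range of $M_m$. How the thresholds $\theta_{seu}$ and $\theta_{iou}$ enter the calculation is likewise not shown, so no gating is applied here. The function name and input values are hypothetical.

```python
ALPHA = 1.0 / 25.0  # balance coefficient stated in the text

def first_cost_matrix(dist, iou_cost, alpha=ALPHA):
    """C_nearby = M_m + M_s, with M_s ASSUMED to be alpha * distance.

    dist[i][j]     : Euclidean distance between track i and detection j
    iou_cost[i][j] : M_m(i, j) = 1 - RIOU(i, j) * Score
    """
    return [
        [m + alpha * d for d, m in zip(dist_row, iou_row)]
        for dist_row, iou_row in zip(dist, iou_cost)
    ]

dist = [[5.0, 20.0], [15.0, 10.0]]
iou_cost = [[0.25, 0.75], [0.5, 0.5]]
c_nearby = first_cost_matrix(dist, iou_cost)
print([[round(v, 2) for v in row] for row in c_nearby])
```

In this toy input the lowest cost in each row sits on the diagonal, so an assignment solver would pair track 0 with detection 0 and track 1 with detection 1.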

[0126] Further, the second cost matrix is determined based on the area intersection-union ratio and the cosine distance, wherein the calculation process of the second cost matrix is as follows:

[0127] The smaller of $M_a(i,j)$ and $M_m(i,j)$ is taken to form the second cost matrix:

[0128] $C_{distant}(i,j) = \min\{M_a(i,j), M_m(i,j)\}$

[0129] Here, $M_m$ represents the area intersection-union ratio cost matrix, and $M_a$ represents the cosine distance matrix. Figures 8(a) and 8(b) show the multi-target tracking results of Example 1.
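The element-wise minimum can be sketched in Python as follows. The sketch assumes the common definition of cosine distance as one minus cosine similarity, which the text does not spell out, and the function names and feature vectors are illustrative only.

```python
import math

def cosine_distance(u, v):
    """Assumed definition: 1 - cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0

def second_cost_matrix(track_feats, det_feats, iou_cost):
    """C_distant(i, j) = min(M_a(i, j), M_m(i, j))."""
    return [
        [min(cosine_distance(tf, df), iou_cost[i][j])
         for j, df in enumerate(det_feats)]
        for i, tf in enumerate(track_feats)
    ]

track_feats = [[1.0, 0.0]]
det_feats = [[1.0, 0.0], [0.0, 1.0]]
iou_cost = [[0.6, 0.3]]
c_distant = second_cost_matrix(track_feats, det_feats, iou_cost)
print(c_distant)  # [[0.0, 0.3]]
```

Taking the minimum lets either cue rescue a match: the first detection is accepted on appearance despite weak overlap, while the second is scored by overlap because its appearance differs.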

[0130] Further, S104: updating the target tracking object list specifically includes:

[0131] For the first frame, all detected objects are considered as candidate tracking objects and stored in a temporary table. It is then determined whether the candidate tracking objects in the temporary table reappear in the next two consecutive frames. If they do reappear, the candidate tracking objects are identified as target tracking objects, and their features and numbers are stored in the target tracking object list. Otherwise, the candidate tracking objects are deleted from the temporary table.

[0132] For frames other than the first, determine whether each target detection object has been successfully matched to a tracking trajectory. If so, the current target detection object is already a target tracking object, and its detection box and features are added to the target tracking object list. If not, the current target detection object is a new candidate tracking object, and it is stored in the temporary table. It is then determined whether the new candidate tracking object in the temporary table reappears in the next two consecutive frames. If it reappears, the new candidate tracking object is identified as a target tracking object, and the historical trajectory, features, and number of the target tracking object are stored in the target tracking object list. If it does not reappear, the new candidate tracking object is deleted from the temporary table.

[0133] If a target object in the target tracking object list does not appear for 100 consecutive frames, it is removed from the target tracking object list.

[0134] It should be understood that the determination of whether the candidate tracking object in the temporary table reappears in the next two consecutive frames is based on whether the candidate tracking object successfully matches the target detection object. If the candidate tracking object is successfully matched in three consecutive frames, it means that the candidate tracking object reappears in the next two consecutive frames.
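The list-update rules above can be sketched as a small Python state machine. The three-frame confirmation and the 100-frame removal follow the text; the `Track` class, the `update_tables` function, and the use of integer identifiers are hypothetical names introduced for illustration.

```python
CONFIRM_FRAMES = 3  # first appearance plus the next two consecutive frames
MAX_MISSES = 100    # consecutive unmatched frames before removal

class Track:
    def __init__(self, track_id):
        self.track_id = track_id
        self.hits = 1    # consecutive matched frames, counting the first
        self.misses = 0  # consecutive unmatched frames once confirmed

def update_tables(temp_table, track_list, matched_ids):
    """Advance both tables by one frame.

    matched_ids: set of ids successfully matched to a detection this frame.
    """
    for track in list(temp_table):
        if track.track_id in matched_ids:
            track.hits += 1
            if track.hits >= CONFIRM_FRAMES:
                # Reappeared in two consecutive frames: promote to the list.
                temp_table.remove(track)
                track_list.append(track)
        else:
            # Candidate did not reappear: delete it from the temporary table.
            temp_table.remove(track)
    for track in list(track_list):
        if track.track_id in matched_ids:
            track.misses = 0
        else:
            track.misses += 1
            if track.misses >= MAX_MISSES:
                track_list.remove(track)

temp, tracks = [Track(7), Track(8)], []
update_tables(temp, tracks, {7})  # frame 2: candidate 8 is dropped
update_tables(temp, tracks, {7})  # frame 3: candidate 7 is confirmed
print([t.track_id for t in tracks], [t.track_id for t in temp])  # [7] []
```

Requiring three consecutive matches before confirmation filters out one-off false detections, while the long removal window lets a confirmed target survive extended occlusion, for example by trees.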

[0135] Example 2

[0136] This embodiment provides a UAV-based multi-target rotation tracking system for traffic intersections, including:

[0137] The detection box and prediction box acquisition module is configured to: input the traffic video stream frame by frame into the rotating target detector, treat all traffic participants in each frame as target detection objects, and obtain the detection box of the target detection object in each frame; rotate and align the target detection object image in each detection box, input the aligned target detection object into the trained rotation attention similarity learning network, and extract the features of the target detection object; input the historical trajectory of the target tracking object in the (t-1)th frame into the Kalman filter algorithm to obtain the prediction box of the target tracking object in the tth frame;

[0138] The calculation module is configured to: calculate the cosine distance between the features of the target detected object in frame t and the features of the target tracked object in frame (t-1); calculate the Euclidean distance between the detection box of the target detected object in frame t and the prediction box of the target tracked object in frame t; and calculate the area intersection-union ratio between the detection box of the target detected object in frame t and the prediction box of the target tracked object in frame t.

[0139] The comparison module is configured to: when the Euclidean distance is less than a set threshold, identify the detection box of the target object in the t-th frame image as a neighboring detection box; when the Euclidean distance is greater than the set threshold, identify the detection box of the target object in the t-th frame image as a distant detection box.

[0140] The output module is configured to: determine a first cost matrix based on the Euclidean distance and the area intersection-union ratio; input the first cost matrix into the Hungarian algorithm for neighboring detection boxes to obtain the first tracking trajectory; determine a second cost matrix based on the area intersection-union ratio and the cosine distance; input the second cost matrix into the Hungarian algorithm for distant detection boxes and target tracking objects that failed to match in the first attempt to obtain the second tracking trajectory; merge the first tracking trajectory and the second tracking trajectory to obtain the tracking trajectory of the target tracking object in the t-th frame image, and update the target tracking object list.

[0141] It should be noted that the detection box and prediction box acquisition module, calculation module, comparison module, and output module mentioned above correspond to steps S101 to S104 in Embodiment 1. The examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the content disclosed in Embodiment 1. It should also be noted that the above modules, as part of the system, can be executed in a computer system, for example as a set of computer-executable instructions.

[0142] The descriptions of each embodiment in the above embodiments have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.

[0143] The proposed system can be implemented in other ways. For example, the system embodiments described above are merely illustrative, and the division of modules described above is only a logical functional division. In actual implementation, there may be other division methods. For example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed.

[0144] Example 3

[0145] This embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, the processor is connected to the memory, and the one or more computer programs are stored in the memory. When the electronic device is running, the processor executes the one or more computer programs stored in the memory to cause the electronic device to perform the method described in Embodiment 1.

[0146] It should be understood that in this embodiment, the processor can be a central processing unit (CPU), or it can be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor can be a microprocessor or any conventional processor.

[0147] Memory may include read-only memory and random access memory, and provides instructions and data to the processor. A portion of memory may also include non-volatile random access memory. For example, memory may also store information about the device type.

[0148] In the implementation process, each step of the above method can be completed by the integrated logic circuits in the processor hardware or by software instructions.

[0149] The method in Embodiment 1 can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules within the processor. The software modules can reside in readily available storage media in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory; the processor reads information from the memory and, in conjunction with its hardware, completes the steps of the above method. To avoid repetition, a detailed description is not provided here.

[0150] Those skilled in the art will recognize that the units and algorithm steps described in connection with the various examples of this embodiment can be implemented in electronic hardware or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention.

[0151] Example 4

[0152] This embodiment also provides a computer-readable storage medium for storing computer instructions, which, when executed by a processor, complete the method described in Embodiment 1.

[0153] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for multi-target rotation tracking at traffic intersections based on unmanned aerial vehicles (UAVs), characterized in that it includes: calculating the cosine distance between the features of the target detection object in the t-th frame image and the features of the target tracking object in the (t-1)-th frame image; calculating the Euclidean distance between the detection box of the target detection object in the t-th frame image and the prediction box of the target tracking object in the t-th frame image; calculating the area intersection-union ratio (IoU) between the detection box of the target detection object in the t-th frame image and the prediction box of the target tracking object in the t-th frame image; when the Euclidean distance is less than a set threshold, identifying the detection box of the target detection object in the t-th frame image as a neighboring detection box; when the Euclidean distance is greater than the set threshold, identifying the detection box of the target detection object in the t-th frame image as a distant detection box; determining a first cost matrix based on the Euclidean distance and the area intersection-union ratio, wherein the calculation process of the first cost matrix is as follows: $C_{nearby} = M_m + M_s$, where $M_s$ represents the Euclidean distance matrix, $M_m$ represents the area intersection-union ratio cost matrix, $\alpha$ represents the balance coefficient, $\theta_{seu}$ represents the Euclidean distance threshold, $\theta_{iou}$ represents the area intersection-union ratio threshold, and $C_{nearby}$ represents the first cost matrix; for the neighboring detection boxes, inputting the first cost matrix into the Hungarian algorithm to obtain a first tracking trajectory; determining a second cost matrix based on the area intersection-union ratio and the cosine distance, wherein the calculation process of the second cost matrix is as follows: the smaller of $M_a(i,j)$ and $M_m(i,j)$ is taken to form the second cost matrix, $C_{distant} = \min\{M_a, M_m\}$, where $M_m$ represents the area intersection-union ratio cost matrix and $M_a$ represents the cosine distance matrix; for the distant detection boxes and the target tracking objects that failed the first match, inputting the second cost matrix into the Hungarian algorithm to obtain a second tracking trajectory; and merging the first tracking trajectory and the second tracking trajectory to obtain the tracking trajectory of the target tracking object in the t-th frame image, and updating the target tracking object list.

2. The method for multi-target rotation tracking at traffic intersections based on unmanned aerial vehicles (UAVs) as described in claim 1, characterized in that, before calculating the cosine distance between the features of the target detection object in the t-th frame image and the features of the target tracking object in the (t-1)-th frame image, the method further includes: inputting the traffic video stream frame by frame into the rotating target detector, treating all traffic participants in each frame as target detection objects, and obtaining a rotated detection box for each target detection object; rotating and aligning the target detection object image in each detection box; inputting the aligned target detection objects into a trained rotation attention similarity learning network to extract the features of the target detection objects; and inputting the historical trajectory of the target tracking object in the (t-1)-th frame into the Kalman filter algorithm to obtain the prediction box of the target tracking object in the t-th frame.

3. The method for multi-target rotation tracking at traffic intersections based on unmanned aerial vehicles (UAVs) as described in claim 2, characterized in that inputting the aligned target detection objects into the trained rotation attention similarity learning network to extract the features of the target detection objects specifically includes: for each target detection object in the t-th frame image, obtaining a feature map through the feature extraction network; splitting the feature map output by the feature extraction network into several slices along the channel dimension; adding a positional encoding to each slice to obtain a sequence of feature maps with positional encoding; inputting the feature map sequence with positional encoding into the self-attention mechanism module; concatenating the output values of the self-attention mechanism module to obtain the concatenated features; inputting the concatenated features into the fully connected layer to obtain the feature vector of the current target detection object in the t-th frame image; and, in the same way, extracting feature vectors for all target detection objects in the t-th frame image to obtain the feature vectors corresponding to all target detection objects in the t-th frame image.

4. The method for multi-target rotation tracking at traffic intersections based on unmanned aerial vehicles (UAVs) as described in claim 1, characterized in that updating the target tracking object list specifically includes: for the first frame, all target detection objects are considered candidate tracking objects and stored in a temporary table; it is then determined whether the candidate tracking objects in the temporary table reappear in the next two consecutive frames; if they do reappear, the candidate tracking objects are identified as target tracking objects, and their features and numbers are stored in the target tracking object list; otherwise, the candidate tracking objects are deleted from the temporary table; for frames other than the first, it is determined whether each target detection object has been successfully matched; if so, the current target detection object is already a target tracking object, and its detection box and features are added to the target tracking object list; if not, the current target detection object is a new candidate tracking object, and the new candidate tracking object is stored in the temporary table; it is determined whether the new candidate tracking object in the temporary table reappears in the next two consecutive frames; if it reappears, the new candidate tracking object is identified as a target tracking object, and the historical trajectory, features, and number of the target tracking object are stored in the target tracking object list; if it does not reappear, the new candidate tracking object is deleted from the temporary table; and if a target tracking object in the target tracking object list does not appear for M consecutive frames, it is removed from the target tracking object list, where M is a positive integer.

5. The method for multi-target rotation tracking at traffic intersections based on unmanned aerial vehicles (UAVs) as described in claim 1, characterized in that the process of obtaining the first tracking trajectory further includes: if the first match is successful, connecting the neighboring detection box of the target detection object in frame t with the target tracking object in frame t-1 to obtain the first tracking trajectory; if the first match fails, considering the neighboring detection box with a confidence level higher than a set threshold as a new candidate tracking object, and storing the new candidate tracking object in a temporary table; and the process of obtaining the second tracking trajectory further includes: if the second match is successful, connecting the distant detection box of the target detection object in frame t with the target tracking object in frame t-1 to obtain the second tracking trajectory; if the second match fails, considering the distant detection box with a confidence level higher than a set threshold as a new candidate tracking object, and storing the new candidate tracking object in a temporary table.

6. A multi-target rotation tracking system for traffic intersections based on unmanned aerial vehicles (UAVs), characterized in that it includes: a detection box and prediction box acquisition module, configured to: input the traffic video stream frame by frame into the rotating target detector, treat all traffic participants in each frame as target detection objects, and obtain the detection box of the target detection object in each frame; rotate and align the target detection object image in each detection box, input the aligned target detection object into the trained rotation attention similarity learning network, and extract the features of the target detection object; and input the historical trajectory of the target tracking object in the (t-1)-th frame into the Kalman filter algorithm to obtain the prediction box of the target tracking object in the t-th frame; a calculation module, configured to: calculate the cosine distance between the features of the target detection object in the t-th frame image and the features of the target tracking object in the (t-1)-th frame image; calculate the Euclidean distance between the detection box of the target detection object in the t-th frame image and the prediction box of the target tracking object in the t-th frame image; and calculate the area intersection-union ratio (IoU) between the detection box of the target detection object in the t-th frame image and the prediction box of the target tracking object in the t-th frame image; a comparison module, configured to: when the Euclidean distance is less than a set threshold, identify the detection box of the target detection object in the t-th frame image as a neighboring detection box; when the Euclidean distance is greater than the set threshold, identify the detection box of the target detection object in the t-th frame image as a distant detection box; and an output module, configured to: determine a first cost matrix based on the Euclidean distance and the area intersection-union ratio, wherein the calculation process of the first cost matrix is as follows: $C_{nearby} = M_m + M_s$, where $M_s$ represents the Euclidean distance matrix, $M_m$ represents the area intersection-union ratio cost matrix, $\alpha$ represents the balance coefficient, $\theta_{seu}$ represents the Euclidean distance threshold, $\theta_{iou}$ represents the area intersection-union ratio threshold, and $C_{nearby}$ represents the first cost matrix; for the neighboring detection boxes, input the first cost matrix into the Hungarian algorithm to obtain a first tracking trajectory; determine a second cost matrix based on the area intersection-union ratio and the cosine distance, wherein the calculation process of the second cost matrix is as follows: the smaller of $M_a(i,j)$ and $M_m(i,j)$ is taken to form the second cost matrix, $C_{distant} = \min\{M_a, M_m\}$, where $M_m$ represents the area intersection-union ratio cost matrix and $M_a$ represents the cosine distance matrix; for the distant detection boxes and the target tracking objects that failed the first match, input the second cost matrix into the Hungarian algorithm to obtain a second tracking trajectory; and merge the first tracking trajectory and the second tracking trajectory to obtain the tracking trajectory of the target tracking object in the t-th frame image, and update the target tracking object list.

7. An electronic device, characterized in that it comprises: a memory for non-transitorily storing computer-readable instructions; and a processor for executing the computer-readable instructions, wherein, when the computer-readable instructions are executed by the processor, the method described in any one of claims 1-5 is performed.

8. A storage medium, characterized in that it non-transitorily stores computer-readable instructions, wherein, when the computer-readable instructions are executed by a computer, the method according to any one of claims 1-5 is performed.