A vehicle identification statistical method based on a multi-attention mechanism network
A vehicle recognition and statistics method based on a multi-attention mechanism network combines the LMNet detector, built on YOLOv5 and HRNet, with an improved Strong SORT network. It addresses missed detections and false detections in UAV traffic flow statistics and achieves efficient traffic flow counting.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- UNIV OF SCI & TECH BEIJING
- Filing Date
- 2022-07-01
- Publication Date
- 2026-06-12
AI Technical Summary
Existing drone-based online detection and tracking technologies are prone to missed detections and false detections in traffic flow statistics, resulting in poor tracking performance and affecting vehicle identification switching and statistical results.
A vehicle recognition and statistics method based on a multi-attention mechanism network is adopted. Vehicle detection is performed by combining YOLOv5 and HRNet in a stepped multi-attention network LMNet. The improved Strong SORT network is used for position tracking, and a virtual counting line is drawn in the drone aerial photography scene to count traffic flow.
The method improves detection accuracy, reduces false positive and false negative rates, enhances vehicle tracking performance, and enables end-to-end traffic flow statistics.
Figure CN115661683B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of vehicle detection, vehicle tracking, and vehicle counting technologies, and in particular to a vehicle recognition and statistical method based on a multi-attention mechanism network.
Background Technology
[0002] In intelligent transportation systems, real-time and accurate road traffic flow statistics enable traffic management departments to allocate resources rationally, improve road traffic efficiency, and effectively prevent and address urban traffic congestion. Video-based road traffic flow statistics generally include two parts: vehicle target detection and vehicle tracking. Traditional target detection algorithms typically utilize manually constructed target features and employ classification algorithms to determine the presence of targets. However, traditional algorithms have some significant drawbacks, such as low detection efficiency, high resource consumption, and low robustness and poor generalization of manually designed features, resulting in high false positive and false negative rates. With the emergence of deep learning-based detection algorithms, detection and tracking performance has been significantly improved.
[0003] Deep learning-based target tracking algorithms are primarily detection-based tracking. The detector pre-detects targets in each frame of the image, and then tracks the detected targets. Therefore, the quality of the detection significantly affects the tracking performance, thus impacting the counting results. Considering the real-time nature and flexibility of road traffic flow statistics, online detection and tracking based on unmanned aerial vehicles (UAVs) is a more practical method for traffic flow statistics.
[0004] However, in aerial traffic statistics, existing drone-based online detection and tracking technologies are prone to missed detections and false detections due to factors such as motion blur, vehicle occlusion, and large changes in target size. This degrades tracking, causes identity switches among tracked vehicles, and ultimately affects the statistical results.
Summary of the Invention
[0005] This invention provides a vehicle recognition and statistics method based on a multi-attention mechanism network to solve the technical problem that existing detection and tracking technologies are prone to missed detections and false detections, which in turn degrade tracking, cause identity switches among tracked vehicles, and ultimately affect the statistical results.
[0006] To solve the above-mentioned technical problems, the present invention provides the following technical solution:
[0007] In one aspect, the present invention provides a vehicle recognition and statistics method based on a multi-attention mechanism network, the method comprising:
[0008] Obtaining real-time video, captured by drone aerial photography, of the road for which traffic flow statistics are required;
[0009] A ladder-like multi-attention network LMNet, designed based on YOLOv5 and HRNet, is used as a detector to detect vehicle targets in the real-time video and obtain the position information of the vehicles.
[0010] By utilizing the improved Strong SORT network, the corresponding vehicles are tracked based on the acquired vehicle location information, thereby obtaining the movement trajectory of the corresponding vehicles.
[0011] Virtual counting lines are drawn in road scenes captured by drones. Based on the acquired vehicle movement trajectories passing through the virtual counting lines, the number of vehicles is counted, and the traffic flow statistics for the corresponding road are obtained.
[0012] Furthermore, the stepped multi-attention network LMNet adopts a one-stage structure. The LMNet network includes an input terminal, a backbone network, a neck network layer, and a head output terminal; wherein,
[0013] The backbone network uses HRNet-W40; the network input has a resolution of 512×512, and the head output has a total of four prediction output heads, which are 128×128, 64×64, 32×32, and 16×16 respectively; spatial attention mechanisms are added before stage2_1 and stage2_2 of HRNet-W40, and channel attention mechanisms are added before FPN1, FPN2, FPN3, and FPN4 of the Neck network layer.
[0014] Furthermore, the anchor box loss function in the LMNet network adopts the EIOU LOSS loss function.
[0015] Furthermore, the loss function consists of three parts: confidence loss, category loss, and bounding box coordinate loss.
[0016] Furthermore, the process by which the LMNet network detects vehicle targets in real-time video includes:
[0017] The input terminal performs Mosaic enhancement, adaptive anchor box calculation, and adaptive image scaling on the input data. Specifically, Mosaic enhancement stitches four training images into one new training image through random scaling, cropping, and arrangement. Adaptive anchor box calculation is used to adjust the size and proportion of the initial anchor boxes. Adaptive image scaling uniformly scales the original images to a standard size and obtains feature maps.
[0018] After processing at the Input stage, the 512×512×3 image is fed into the Backbone network for feature extraction. The Backbone network uses HRNet-W40 as the feature extraction network, and the specific processing steps of the HRNet-W40 network are as follows:
[0019] Stage 1: First, the input image is convolved twice using a 3×3 convolution with a stride of 2, so that the height H and width W of the image become H / 4 and W / 4 respectively; then, a dense convolution block is used to extract features; the output feature map of size [128, 128, 256] is passed through two SAM modules, each feeding one branch of stage 2.
[0020] Stage 2: First, a low-resolution branch is generated based on the previous stage. Then, features are extracted on each branch using a dense convolutional block. Finally, multi-scale fusion is performed to obtain the final output. The generated feature map sizes are [128, 128, 32] and [64, 64, 64], respectively, and these are passed to stage 3.
[0021] Stage 3: Based on the previous stage, a low-resolution branch is generated. Then, features are extracted on each branch using a dense convolutional block. Finally, multi-scale fusion is repeated, and the outputs, of sizes [128, 128, 32], [64, 64, 64], and [32, 32, 128], are input into stage 4.
[0022] Stage 4: Based on the previous stage, a low-resolution branch is generated. Then, features are extracted on each branch using a dense convolutional block. Finally, multi-scale fusion is repeated to obtain feature maps of sizes [128, 128, 32], [64, 64, 64], [32, 32, 128], and [16, 16, 256], respectively. These feature maps are then input into the Neck network layer through four channel attention mechanism modules.
[0023] The Neck network layer consists of an FPN structure, which is top-down. It passes and fuses feature information from higher levels through upsampling to obtain the feature map for prediction.
[0024] The Head output is used for the final detection part. Detection heads at different scales are used to detect target vehicles of different sizes, generate prediction boxes on the feature maps, and generate class probability and confidence information. The Head output receives the feature layer outputs of four different dimensions from the Neck network layer, and then uses the EIOU_LOSS loss function to predict the location information and confidence of the vehicle targets in the image to obtain the vehicle location information.
[0025] Furthermore, the improved Strong SORT network uses the appearance feature cosine distance to calculate the distance between two frames.
[0026] Furthermore, the appearance feature extractor of the appearance branch of the improved Strong SORT network replaces ResNeSt50 with a wide residual network (WRN) in the re-identification domain.
[0027] Furthermore, when marking virtual counting lines in road scenes captured by drones, a virtual counting line is marked for each road in different directions, and the virtual counting lines on roads in different directions remain parallel to each other.
[0028] In another aspect, the present invention also provides an electronic device comprising a processor and a memory; wherein the memory stores at least one instruction, which is loaded and executed by the processor to implement the above-described method.
[0029] In another aspect, the present invention also provides a computer-readable storage medium storing at least one instruction that is loaded and executed by a processor to implement the above-described method.
[0030] The beneficial effects of the technical solution provided by this invention include at least the following:
[0031] This invention draws inspiration from YOLOv5 and HRNet to design a ladder-type multi-attention network (LMNet) as a detector. Combined with an improved Strong SORT target tracking traffic flow statistics algorithm, it achieves end-to-end detection and statistics. Compared to existing technologies, this invention effectively improves detection accuracy, reduces false positive and false negative rates, and enhances tracking performance.
Attached Figure Description
[0032] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0033] Figure 1 is a schematic diagram of the execution flow of the vehicle recognition and statistics method based on a multi-attention mechanism network provided in an embodiment of the present invention;
[0034] Figure 2 is a schematic diagram of the LMNet network structure provided in an embodiment of the present invention;
[0035] Figure 3 is a schematic diagram of the SAM spatial attention mechanism module provided in an embodiment of the present invention;
[0036] Figure 4 is a schematic diagram of the SE_Block channel attention mechanism module provided in an embodiment of the present invention;
[0037] Figure 5 is a diagram of the Strong SORT structure;
[0038] Figure 6 is a diagram of the improved Strong SORT structure provided in an embodiment of the present invention.
Detailed Implementation
[0039] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
[0040] First Embodiment
[0041] This embodiment provides a vehicle recognition and statistics method based on a multi-attention mechanism network. This method uses drone aerial photography to detect and track vehicles and to count road vehicles. Drone-based road vehicle statistics offer high flexibility, enabling real-time assessment of traffic flow even at intersections that lack road monitoring. Furthermore, drone aerial video is easily transmitted over a network. This method can be implemented by an electronic device, such as a terminal or a server. The execution flow of this method is shown in Figure 1 and includes the following steps:
[0042] S1, acquiring real-time video, captured by a drone, of the road where traffic flow needs to be counted;
[0043] S2, using LMNet, a ladder-style multi-attention network designed based on YOLOv5 and HRNet, as a detector to detect vehicle targets in the real-time video and obtain the position information of the vehicles;
[0044] S3, using an improved Strong SORT network to track the corresponding vehicles based on the acquired vehicle position information, thereby obtaining the movement trajectories of the corresponding vehicles;
[0045] S4, drawing a virtual counting line in the road scene captured by the drone, and counting the number of vehicles based on the acquired vehicle movement trajectories passing through the virtual counting line, to obtain the traffic flow statistics for the road.
[0046] It should be noted that, in this embodiment, when the above steps are used to draw virtual counting lines in the road scene captured by the drone, a virtual counting line is drawn for each road in different directions to count the number of vehicles in all lanes, and the virtual counting lines for roads in different directions are kept parallel to each other.
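The counting step S4 can be illustrated with a minimal sketch. It assumes each trajectory is a list of detection-box center points per track ID and that the virtual counting line is a straight segment; the function names and the example coordinates are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def _on_positive_side(p: Point, a: Point, b: Point) -> bool:
    """True if p lies on the line through a->b or to one fixed side of it."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return cross >= 0


def crosses(prev: Point, cur: Point, a: Point, b: Point) -> bool:
    """True if the step prev->cur passes over the counting segment a->b."""
    changes_side = _on_positive_side(prev, a, b) != _on_positive_side(cur, a, b)
    within_extent = _on_positive_side(a, prev, cur) != _on_positive_side(b, prev, cur)
    return changes_side and within_extent


def count_crossings(tracks: Dict[int, List[Point]], line: Tuple[Point, Point]) -> int:
    """Count track IDs whose trajectory crosses the line; each ID counts once."""
    a, b = line
    counted = set()
    for track_id, centers in tracks.items():
        for prev, cur in zip(centers, centers[1:]):
            if crosses(prev, cur, a, b):
                counted.add(track_id)
                break
    return len(counted)


# Hypothetical trajectories (pixel coordinates of box centers).
tracks = {
    1: [(10, 0), (10, 40), (10, 80)],   # crosses the line y = 50
    2: [(30, 0), (30, 20), (30, 45)],   # stops before the line
    3: [(50, 90), (50, 60), (50, 30)],  # crosses in the opposite direction
}
line = ((0, 50), (100, 50))
print(count_crossings(tracks, line))
```

Counting each track ID only once keeps a vehicle that jitters around the line from being counted repeatedly; one such line per road direction gives per-direction counts.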
[0047] Specifically, with reference to Figures 2 to 4, the LMNet in this embodiment is obtained by improving the YOLOv5 network. It adopts a one-stage structure and consists of four parts: an input layer, a backbone network, a neck network layer, and a head output layer. The LMNet network is described below:
[0048] The backbone uses HRNet-W40. Compared with other backbone networks, HRNet retains more detailed information, which helps in detecting small targets such as vehicles in aerial imagery. The anchor box loss function in the LMNet network is the EIOU LOSS loss function. Compared with other bounding box regression losses, EIOU LOSS considers the overlap area, the center point distance, and the actual differences in width and height, which significantly improves detection performance. The network input uses a resolution of 512×512. The head has four prediction output heads, of sizes 128×128, 64×64, 32×32, and 16×16, so that vehicles of different sizes are detected well and the false detection and missed detection rates are greatly reduced. To enhance feature extraction and reduce redundant information, a spatial attention mechanism (SAM) is added before stage2_1 and stage2_2 of HRNet-W40, and a channel attention mechanism module SE_Block is added before FPN1, FPN2, FPN3, and FPN4 of the Neck. The structure of the LMNet network is shown in Figure 2, the SAM spatial attention mechanism module in Figure 3, and the SE_Block channel attention mechanism module in Figure 4.
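The text names SE_Block but leaves its internals to Figure 4. As a rough illustration, the sketch below assumes the common squeeze-and-excitation form (global average pooling, two fully connected layers, sigmoid gating) and uses plain Python lists; the weight matrices `w1` and `w2` are hypothetical, and biases are omitted for brevity.

```python
import math
from typing import List

Matrix = List[List[float]]


def se_block(feature: List[Matrix], w1: Matrix, w2: Matrix) -> List[Matrix]:
    """Reweight each channel of a [C][H][W] feature map by a learned gate.

    w1 has shape [C_reduced][C] and w2 has shape [C][C_reduced].
    """
    # Squeeze: one global average per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(feature, gates)]


# Two 2x2 channels with hypothetical weights (reduction from 2 channels to 1).
feature = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[1.0, 1.0]]
w2 = [[0.0], [1.0]]
out = se_block(feature, w1, w2)
```

The output has the same shape as the input; only the relative weight of each channel changes, which is how such a module can suppress redundant channels before the FPN layers.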
[0049] The vehicle detection process using the aforementioned LMNet network model as a detector includes:
[0050] The Input performs Mosaic enhancement, adaptive anchor box calculation, and adaptive image scaling on the input data. Mosaic enhancement stitches four training images into one new training image using random scaling, cropping, and arrangement. Adaptive anchor box calculation adjusts the size and proportion of the initial anchor boxes. Adaptive image scaling uniformly scales the original images to a standard size and obtains feature maps. After these operations at the Input, a 512×512×3 image is fed into the Backbone for feature extraction.
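Adaptive image scaling to the 512×512 input can be sketched as follows. The sketch assumes the aspect-ratio-preserving resize with symmetric padding that is customary in YOLOv5-style pipelines; the patent does not spell out the padding scheme, and the function names are hypothetical.

```python
from typing import Tuple


def letterbox_params(width: int, height: int, target: int = 512):
    """Return (scale, resized (w, h), padding (left, top, right, bottom))."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = target - new_w, target - new_h
    left, top = pad_x // 2, pad_y // 2
    return scale, (new_w, new_h), (left, top, pad_x - left, pad_y - top)


def box_to_original(box: Tuple[float, float, float, float], scale: float, padding) -> Tuple[float, ...]:
    """Map an (x1, y1, x2, y2) box from the padded input back to source pixels."""
    left, top = padding[0], padding[1]
    x1, y1, x2, y2 = box
    return ((x1 - left) / scale, (y1 - top) / scale,
            (x2 - left) / scale, (y2 - top) / scale)


# A 1920x1080 aerial frame, assumed here only as an example resolution.
scale, size, padding = letterbox_params(1920, 1080)
print(size, padding)  # (512, 288) (0, 112, 0, 112)
```

Keeping the scale and padding around is what allows detections made on the 512×512 input to be mapped back onto the original frame for tracking and counting.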
[0051] The Backbone uses HRNet-W40 as the feature extraction network. The processing steps of HRNet-W40 are as follows:
[0052] Stage 1: First, the input image is convolved twice using a 3×3 convolution with a stride of 2, so that the height (H) and width (W) of the image become H / 4 and W / 4 respectively. Then, a dense convolution block is used to extract features. The output feature map has size [128, 128, 256] and is fed into stage 2 through two SAM modules, one per branch;
[0053] Stage 2: First, a low-resolution branch is generated based on the previous stage. Then, features are extracted on each branch using a dense convolutional block. Finally, multi-scale fusion is performed to obtain the final output. The generated feature map sizes are [128, 128, 32] and [64, 64, 64], and these are passed to stage 3;
[0054] Stage 3: Based on the previous stage, a low-resolution branch is generated. Then, features are extracted on each branch using a dense convolutional block. Finally, multi-scale fusion is repeated, and the outputs, of sizes [128, 128, 32], [64, 64, 64], and [32, 32, 128], are input into stage 4;
[0055] Stage 4: Based on the previous stage, a low-resolution branch is generated. Then, features are extracted on each branch using a dense convolutional block. Finally, multi-scale fusion is repeated to obtain feature maps of sizes [128, 128, 32], [64, 64, 64], [32, 32, 128], and [16, 16, 256], which are denoted as Stage 4_1, Stage 4_2, Stage 4_3, and Stage 4_4, respectively. These feature maps are then input into the Neck through four SE_Block modules.
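The branch sizes listed for stages 2 to 4 follow a regular pattern: each new branch halves the spatial resolution and doubles the channel count. The short sketch below reproduces the sizes quoted in the text; the function name is hypothetical, and the base width of 32 channels is simply read off the listed sizes.

```python
from typing import List


def hrnet_branch_shapes(input_size: int = 512, base_channels: int = 32,
                        num_branches: int = 4) -> List[List[int]]:
    """Return [height, width, channels] for each parallel branch of a stage."""
    side = input_size // 4  # two stride-2 convolutions in stage 1
    return [
        [side // 2 ** i, side // 2 ** i, base_channels * 2 ** i]
        for i in range(num_branches)
    ]


for stage in (2, 3, 4):
    print(f"stage {stage}:", hrnet_branch_shapes(num_branches=stage))
```

The four stage-4 resolutions (128, 64, 32, and 16) match the four prediction head sizes given for the detector.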
[0056] The Neck network consists of an FPN structure, which is top-down. It passes and fuses high-level feature information through upsampling to obtain the feature map for prediction.
[0057] The Head is mainly used for the final detection part. Its detection heads at different scales detect target vehicles of different sizes, generating predicted boxes on the feature maps together with class probability and confidence information. In this embodiment, the EIOU_LOSS loss function is used as the predicted box regression loss of the LMNet network. EIOU_LOSS considers the overlap area, the center point distance, and the actual differences in width and height, resolves the ambiguous definition of aspect ratio in CIOU, and adds Focal Loss to address the sample imbalance problem in bounding box regression. The LMNet network loss function consists of three parts, the confidence loss $L_{conf}$, the category loss $L_{cla}$, and the bounding box coordinate loss $L_{EIoU}$, and is given by: $L_{total} = L_{conf} + L_{cla} + L_{EIoU}$.
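As an illustration of the three terms that the box regression loss considers, the sketch below computes the commonly published EIoU form for a single pair of boxes in (x1, y1, x2, y2) format: one minus IoU, plus the squared center distance over the squared diagonal of the enclosing box, plus the squared width and height differences over the squared enclosing width and height. The focal weighting mentioned in the text is omitted, and the function name is hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def eiou_loss(pred: Box, target: Box, eps: float = 1e-9) -> float:
    """EIoU loss for one predicted box and one ground-truth box."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Overlap term: intersection over union.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + tw * th - inter + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Center distance term, normalized by the enclosing box diagonal.
    center_dist = ((px1 + px2 - tx1 - tx2) / 2) ** 2 + ((py1 + py2 - ty1 - ty2) / 2) ** 2
    center_term = center_dist / (cw ** 2 + ch ** 2 + eps)

    # Width and height difference terms.
    width_term = (pw - tw) ** 2 / (cw ** 2 + eps)
    height_term = (ph - th) ** 2 / (ch ** 2 + eps)

    return 1.0 - iou + center_term + width_term + height_term


print(eiou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # close to 0 for identical boxes
print(eiou_loss((0, 0, 10, 10), (2, 2, 14, 12)))  # larger for a shifted, resized box
```

Penalizing width and height differences directly, rather than through an aspect ratio, is what removes the ambiguity noted for CIOU: two boxes with the same aspect ratio but different sizes still incur a penalty.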
[0058] The Head output receives the feature layer outputs of four different dimensions from the Neck network: D1 (128×128×256), D2 (64×64×256), D3 (32×32×256), and D4 (16×16×256). Then, the loss function is used to predict the location information and confidence level of the vehicle targets in the image to obtain the vehicle location information.
[0059] In the traffic flow statistics method, the aforementioned LMNet detection model serves as the vehicle target detector, extracting vehicles from urban road scenes and obtaining the location information of the vehicle detection boxes. The improved Strong SORT algorithm uses an NSA Kalman filter to predict the state of each vehicle detection box in the next frame and uses the Hungarian algorithm to match the predicted state with the detection results of the next frame. The NSA Kalman filter is then updated, thereby achieving vehicle tracking. The algorithm predicts the vehicle trajectories in the next frame, which may comprise several candidate trajectories. These trajectories are checked, and valid trajectories are retained. Vanilla global linear matching is performed on the valid trajectories to obtain each vehicle's motion trajectory. Strong SORT is an improvement on Deep SORT target tracking; the improved version uses a WRN as the deep appearance feature extractor, which significantly improves the tracking of occluded targets.
[0060] It is worth noting that the SORT algorithm, building on traditional algorithms, uses Kalman filtering to handle frame-to-frame association and the Hungarian algorithm to solve the association measurement, resulting in a performance improvement of several tens of times. However, SORT suffers from frequent ID switches, so it is only suitable for targets with minimal occlusion and relatively stable motion. Deep SORT performs association by combining motion and appearance information for a more accurate metric. It uses a CNN to extract features, increasing robustness to missed detections and occlusion. Furthermore, it is easy to implement, efficient, and suitable for online scenarios.
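The matching of predicted track states to detections is a minimum-cost assignment problem. The sketch below solves it by exhaustive search, which returns the same optimum as the Hungarian algorithm on small matrices but is only practical for a handful of tracks; it is a stand-in for illustration, not the Hungarian algorithm itself. The function name, the `max_cost` gate, and the example costs are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def min_cost_assignment(cost: List[List[float]],
                        max_cost: float = float("inf")) -> List[Tuple[int, int]]:
    """Return (track_index, detection_index) pairs with minimum total cost.

    Pairs whose individual cost exceeds max_cost are dropped after matching.
    """
    n_rows = len(cost)
    n_cols = len(cost[0]) if cost else 0
    if n_rows == 0 or n_cols == 0:
        return []
    if n_rows <= n_cols:
        # Choose a distinct detection for every track.
        best = min(permutations(range(n_cols), n_rows),
                   key=lambda cols: sum(cost[r][c] for r, c in enumerate(cols)))
        pairs = list(enumerate(best))
    else:
        # More tracks than detections: choose a distinct track per detection.
        best = min(permutations(range(n_rows), n_cols),
                   key=lambda rows: sum(cost[r][c] for c, r in enumerate(rows)))
        pairs = sorted((r, c) for c, r in enumerate(best))
    return [(r, c) for r, c in pairs if cost[r][c] <= max_cost]


# Rows are tracks, columns are detections; values are hypothetical distances.
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9]]
print(min_cost_assignment(cost))  # [(0, 0), (1, 1)]
```

Detections left without a pair would start provisional trajectories, and tracks left without a pair would be treated as missed in that frame.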
[0061] Strong SORT's improvements over Deep SORT are mainly reflected in two branches, as shown in the lower half of Figure 5. For the appearance branch, a more powerful appearance feature extractor, BoT, replaces the original simple CNN. The Deep SORT feature library is also replaced by a new feature update strategy based on an exponential moving average (EMA), which improves matching quality and reduces time consumption. The appearance branch feature extractor uses ResNeSt50 as its backbone, extracting more discriminative features. For the motion branch, ECC is used for camera motion compensation. Furthermore, the ordinary Kalman filter is easily affected by low-quality detections and ignores information about the scale of the detection noise. To address this issue, the NSA Kalman algorithm is borrowed, which calculates the noise covariance adaptively. In addition, matching uses both appearance and motion information rather than appearance feature distance alone, and vanilla global linear matching replaces Deep SORT's cascaded matching.
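The EMA feature update strategy can be sketched in a few lines: instead of storing a library of past appearance features per track, a single running feature is blended with each newly matched detection feature. The momentum value of 0.9 is an assumption for illustration, and the function name is hypothetical.

```python
from typing import List


def ema_update(track_feature: List[float], detection_feature: List[float],
               alpha: float = 0.9) -> List[float]:
    """Blend the track's running appearance feature with a new detection feature."""
    return [alpha * e + (1.0 - alpha) * f
            for e, f in zip(track_feature, detection_feature)]


# Toy 2-dimensional features; real appearance embeddings are much longer.
running = [1.0, 0.0]
for new_feature in ([0.0, 1.0], [0.0, 1.0]):
    running = ema_update(running, new_feature)
print(running)
```

Because only one vector per track is kept and compared, matching cost is computed once per track rather than once per stored feature, which is where the time saving comes from.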
[0062] Strong SORT uses Mahalanobis distance to calculate the distance between two frames, and Mahalanobis distance only measures spatial distance, which can lead to serious identity switch problems. The improved Strong SORT of this embodiment therefore introduces the appearance feature cosine distance to measure the similarity of the content within the bounding boxes. Furthermore, to accelerate training, the appearance feature extractor in the appearance branch uses a Wide Residual Network (WRN) from the re-identification domain instead of ResNeSt50. This model achieves faster training with a similar number of parameters while maintaining network performance. The improved Strong SORT is shown in Figure 6.
[0063] This embodiment selects the improved Strong SORT tracker as the tracker in the tracking stage. This tracker uses deep association features, and its tracking performance depends on accurate detection results from the detector. The inputs are the bounding box location information, confidence score, and image features from the LMNet detection results. The confidence score is mainly used for bounding box selection, while the bounding box location information and image features are used for matching calculations with the tracker. The prediction module uses a Kalman filter, and the update module uses the EMA update strategy together with Hungarian algorithm matching.
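The appearance feature cosine distance compares the direction of two embedding vectors and ignores their magnitude. A minimal sketch follows, with hypothetical function names and toy feature vectors standing in for WRN embeddings.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero feature as uninformative
    return 1.0 - dot / (norm_a * norm_b)


def appearance_cost_matrix(track_features: List[Sequence[float]],
                           detection_features: List[Sequence[float]]) -> List[List[float]]:
    """Rows are tracks, columns are detections."""
    return [[cosine_distance(t, d) for d in detection_features] for t in track_features]


tracks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
detections = [[0.0, 2.0, 0.0], [3.0, 0.1, 0.0]]
for row in appearance_cost_matrix(tracks, detections):
    print([round(v, 3) for v in row])
```

Two vehicles that are close together in the image can still have a large cosine distance if they look different, which is the property that helps reduce identity switches.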
[0064] To prevent situations in multi-target tracking where one target covers multiple targets or multiple detections fall on a single target, the Strong SORT tracker uses an eight-dimensional state space in which (u, v, r, h) are the directly observed variables of the vehicle target state: (u, v) represents the center coordinates of the vehicle target detection box, r represents its aspect ratio, and h represents its height. For prediction, the algorithm uses an NSA Kalman filter to predict the target trajectory in the next frame. The Hungarian algorithm is then used to match the predicted state with the detection results of the next frame, followed by a Kalman filter update to track moving vehicles.
[0065] A vehicle tracking trajectory is only valid if a vehicle detection box successfully matches it in consecutive images; otherwise, it is discarded. The matching between detection boxes and vehicle trajectories can be solved using the Hungarian algorithm. For individual vehicles, occasional detection failures may occur during tracking. To ensure continued tracking of the target, the improved Strong SORT algorithm, after confirming a trajectory's validity, performs vanilla global linear matching, using the appearance feature cosine distance and motion information, thus forming the vehicle's trajectory. For each trajectory, the number of frames between the last successfully matched frame and the current frame is recorded. This counter increments during Kalman filter prediction and is reset to 0 when the trajectory is associated with a measurement. When the number of frames exceeds a set threshold, the target vehicle is considered to have left the current field of view, and the trajectory is deleted. When a detection from the detector cannot be matched to an existing trajectory, a provisional trajectory is generated. If this trajectory cannot be re-matched in adjacent frames, it is deleted. Based on the intersections of the motion trajectories with the virtual counting line, the number of vehicles crossing the virtual counting line is determined, which gives the vehicle count.
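The prediction step can be illustrated with a deliberately reduced sketch. It assumes, as in Deep SORT, that the four unobserved state dimensions are the velocities of (u, v, r, h) and that motion is constant-velocity; only the state mean is propagated and the covariance is left out. It also shows the NSA idea of scaling the measurement noise by one minus the detection confidence, so that confident detections are trusted more during the update. All names and numbers are hypothetical.

```python
from typing import List


def predict_state(state: List[float], dt: float = 1.0) -> List[float]:
    """Constant-velocity prediction for [u, v, r, h, du, dv, dr, dh]."""
    u, v, r, h, du, dv, dr, dh = state
    return [u + dt * du, v + dt * dv, r + dt * dr, h + dt * dh, du, dv, dr, dh]


def nsa_measurement_noise(base_noise: List[float], confidence: float) -> List[float]:
    """Scale the diagonal measurement noise by (1 - detection confidence)."""
    return [(1.0 - confidence) * n for n in base_noise]


state = [100.0, 50.0, 0.5, 40.0, 2.0, -1.0, 0.0, 0.5]
print(predict_state(state))

base_noise = [4.0, 4.0, 0.01, 4.0]  # hypothetical variances for (u, v, r, h)
print(nsa_measurement_noise(base_noise, confidence=0.9))
print(nsa_measurement_noise(base_noise, confidence=0.3))
```

A full filter would also propagate the covariance matrix and compute the Kalman gain; the point here is only how the state layout and the confidence-dependent noise fit together.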
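The trajectory bookkeeping described above (provisional trajectories, the missed-frame counter, and deletion after a threshold) can be sketched as a small state machine. The thresholds `n_init` and `max_age` and all names are hypothetical; the patent only states that a set threshold is used.

```python
from dataclasses import dataclass


@dataclass
class Track:
    track_id: int
    hits: int = 1                  # the detection that created the track
    time_since_update: int = 0     # frames since the last matched detection
    state: str = "tentative"       # "tentative", "confirmed", or "deleted"

    def predict(self) -> None:
        """Called once per frame during Kalman prediction."""
        self.time_since_update += 1

    def update(self, n_init: int = 3) -> None:
        """Called when a detection is matched to this track."""
        self.hits += 1
        self.time_since_update = 0
        if self.state == "tentative" and self.hits >= n_init:
            self.state = "confirmed"

    def mark_missed(self, max_age: int = 30) -> None:
        """Called when no detection is matched in the current frame."""
        if self.state == "tentative" or self.time_since_update > max_age:
            self.state = "deleted"


# A provisional track that is matched in the next two frames becomes confirmed.
track = Track(track_id=1)
for _ in range(2):
    track.predict()
    track.update()
print(track.state)
```

In each frame, every track is first predicted, then matched tracks call `update` and unmatched tracks call `mark_missed`; only confirmed tracks would contribute trajectories to the counting line test.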
[0066] In summary, this embodiment acquires real-time video from a visible light camera; marks traffic lines at intersections where traffic flow needs to be counted; uses an LMNet network model to extract vehicles captured by a drone and obtain their location information; utilizes an improved Strong SORT algorithm to track vehicles based on their location information, thereby acquiring their movement trajectories; and marks virtual counting lines, counting the number of vehicles based on their movement trajectories crossing these lines. This achieves end-to-end detection and counting. Compared to existing technologies, this invention effectively improves detection accuracy, reduces false positive and false negative rates, and enhances tracking performance.
[0067] Second Embodiment
[0068] This embodiment provides an electronic device, which includes a processor and a memory; wherein the memory stores at least one instruction, which is loaded and executed by the processor to implement the method of the first embodiment.
[0069] The electronic device can vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPUs) and one or more memories, wherein the memories store at least one instruction that is loaded by the processor and executed in accordance with the above method.
[0070] Third Embodiment
[0071] This embodiment provides a computer-readable storage medium storing at least one instruction, which is loaded and executed by a processor to implement the method of the first embodiment described above. The computer-readable storage medium may be a ROM, random access memory, CD-ROM, magnetic tape, floppy disk, or optical data storage device, etc. The instruction stored therein can be loaded and executed by a processor in a terminal.
[0072] Furthermore, it should be noted that the present invention can be provided as a method, apparatus, or computer program product. Therefore, embodiments of the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Moreover, embodiments of the present invention can take the form of a computer program product implemented on one or more computer-usable storage media containing computer-usable program code.
[0073] Embodiments of the present invention are described with reference to flowchart illustrations and / or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0074] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing terminal equipment to cause a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0075] It should also be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.
[0076] Finally, it should be noted that the above description represents a preferred embodiment of the present invention. It should be pointed out that although preferred embodiments have been described, those skilled in the art, once they understand the basic inventive concept of the present invention, can make various improvements and modifications without departing from the principles described herein. These improvements and modifications should also be considered within the scope of protection of the present invention. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present invention.
Claims
1. A vehicle recognition and statistical method based on a multi-attention mechanism network, characterized in that, include: To obtain real-time video footage of roads requiring traffic flow statistics from drone aerial photography; A ladder-like multi-attention network LMNet, designed based on YOLOv5 and HRNet, is used as a detector to detect vehicle targets in the real-time video and obtain the position information of the vehicles. By utilizing the improved Strong SORT network, the corresponding vehicles are tracked based on the acquired vehicle location information, thereby obtaining the movement trajectory of the corresponding vehicles. Virtual counting lines are drawn in road scenes captured by drones. Based on the acquired vehicle movement trajectories passing through the virtual counting lines, the number of vehicles is counted to obtain the traffic flow statistics for the corresponding road. The ladder-style multi-attention network LMNet adopts a one-stage structure. The LMNet network includes: an input layer, a backbone network, a neck network layer, and a head output layer. The backbone network uses HRNet-W40; the network input has a resolution of 512×512, and the head output has a total of four prediction output heads, which are 128×128, 64×64, 32×32, and 16×16 respectively; spatial attention mechanisms are added before stage2_1 and stage2_2 of HRNet-W40, and channel attention mechanisms are added before FPN1, FPN2, FPN3, and FPN4 of the Neck network layer; The improved Strong SORT network uses apparent feature cosine distance to calculate the distance between two frames; The appearance feature extractor of the appearance branch of the improved Strong SORT network replaces ResNeSt50 with a wide residual network (WRN) in the re-identification domain.
2. The vehicle recognition and statistical method based on a multi-attention mechanism network as described in claim 1, characterized in that the anchor box loss function in the LMNet network is the EIOU_LOSS loss function.
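For reference, the EIoU loss as commonly defined in the literature adds a center-distance penalty and separate width and height penalties to the IoU term. The sketch below follows that common definition for a single pair of boxes; the function name `eiou_loss`, the `(x1, y1, x2, y2)` box format, and the `eps` stabilizer are assumptions, and the patent's own implementation may differ.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def eiou_loss(pred: Box, target: Box, eps: float = 1e-9) -> float:
    """EIoU loss for one predicted box and one ground-truth box."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Squared distance between box centers, normalized by the squared
    # diagonal of the enclosing box.
    dx = (px1 + px2) / 2.0 - (tx1 + tx2) / 2.0
    dy = (py1 + py2) / 2.0 - (ty1 + ty2) / 2.0
    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Width and height differences, normalized by the enclosing box sides.
    width_term = (pw - tw) ** 2 / (cw * cw + eps)
    height_term = (ph - th) ** 2 / (ch * ch + eps)

    return 1.0 - iou + center_term + width_term + height_term
```

Unlike a plain IoU loss, this value keeps changing with box position when the two boxes do not overlap, because the center and side-length terms remain non-zero.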
3. The vehicle recognition and statistical method based on a multi-attention mechanism network as described in claim 2, characterized in that the loss function consists of three parts: confidence loss, category loss, and bounding box coordinate loss.
4. The vehicle recognition and statistical method based on a multi-attention mechanism network as described in claim 3, characterized in that the process by which the LMNet network detects vehicle targets in the real-time video includes:
The Input stage performs Mosaic enhancement, adaptive anchor box calculation, and adaptive image scaling on the input data. Mosaic enhancement stitches four training images into a new training image through random scaling, cropping, and arrangement. Adaptive anchor box calculation adjusts the size and aspect ratio of the initial anchor boxes. Adaptive image scaling uniformly scales the original images to a standard size, from which feature maps are then obtained.
After processing at the Input stage, the 512×512×3 image is fed into the Backbone network for feature extraction. The Backbone network uses HRNet-W40 as the feature extraction network, and the HRNet-W40 network processes the image as follows:
Stage 1: First, the input image is convolved twice using a 3×3 convolution with a stride of 2, so that the height H and width W of the image become H / 4 and W / 4 respectively. A dense convolution block is then used to extract features. The output feature map of size [128, 128, 256] is passed through two SAM modules, whose outputs respectively enter stage 2.
Stage 2: First, a low-resolution branch is generated based on the previous stage. Each branch then extracts features using a dense convolution block. Finally, multi-scale fusion is performed to obtain the output. The generated feature map sizes are [128, 128, 32] and [64, 64, 64] respectively, and these are input into stage 3.
Stage 3: A low-resolution branch is generated based on the previous stage. Each branch then extracts features using a dense convolution block. Finally, multi-scale fusion is repeated, and the output sizes are [128, 128, 32], [64, 64, 64], and [32, 32, 128] respectively, which are input into stage 4.
Stage 4: A low-resolution branch is generated based on the previous stage. Each branch then extracts features using a dense convolution block. Finally, multi-scale fusion is repeated to obtain feature maps of sizes [128, 128, 32], [64, 64, 64], [32, 32, 128], and [16, 16, 256] respectively. These feature maps are input into the Neck network layer through four channel attention mechanism modules.
The Neck network layer consists of a top-down FPN structure, which passes and fuses feature information from higher levels through upsampling to obtain the feature maps used for prediction.
The Head output layer performs the final detection. Detection heads at different scales are used to detect target vehicles of different sizes, generate prediction boxes on the feature maps, and generate class probability and confidence information. The Head output layer receives the feature layer outputs of four different dimensions from the Neck network layer, and then uses the EIOU_LOSS loss function to predict the location information and confidence of the vehicle targets in the image, thereby obtaining the vehicle location information.
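The feature map sizes listed for stages 2 to 4 follow a regular pattern: each added branch halves the spatial side and doubles the channel count of the branch before it. The sketch below reproduces the listed sizes from that pattern. The function name `branch_shapes` and the constants are hypothetical, and the halving and doubling rule is inferred from the numbers in claim 4 rather than stated by it.

```python
from typing import List

INPUT_SIZE = 512     # network input resolution given in claim 1
BASE_STRIDE = 4      # two stride-2 convolutions in stage 1 give H/4 and W/4
BASE_CHANNELS = 32   # channels of the highest-resolution branch as listed in claim 4


def branch_shapes(stage: int, input_size: int = INPUT_SIZE) -> List[List[int]]:
    """Return [height, width, channels] for every branch output of a stage.

    Assumes stage k (2 <= k <= 4) has k parallel branches, where branch i
    has stride BASE_STRIDE * 2**i and BASE_CHANNELS * 2**i channels.
    """
    if not 2 <= stage <= 4:
        raise ValueError("only stages 2 to 4 have multi-branch outputs")
    shapes = []
    for i in range(stage):
        side = input_size // (BASE_STRIDE * 2 ** i)
        shapes.append([side, side, BASE_CHANNELS * 2 ** i])
    return shapes
```

Calling `branch_shapes(4)` yields spatial sides of 128, 64, 32, and 16, which match the four prediction head sizes given in claim 1.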
5. The vehicle recognition and statistical method based on a multi-attention mechanism network as described in claim 1, characterized in that, when virtual counting lines are drawn in the road scene captured by the drone, one virtual counting line is drawn for each road direction, and the virtual counting lines for the different directions are kept parallel to each other.
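Counting vehicles whose trajectories pass a virtual counting line can be sketched as a segment intersection test between consecutive trajectory points and the line. The names `count_crossings`, `_side`, and `_segments_cross`, the dictionary of track IDs to center points, and the rule that each track is counted at most once are all assumptions made for illustration; the claims do not specify these details.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def _side(a: Point, b: Point, p: Point) -> float:
    """Signed area test: positive or negative depending on which side of a-b p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def _segments_cross(p: Point, q: Point, a: Point, b: Point) -> bool:
    """True if segment p-q strictly crosses segment a-b (touching does not count)."""
    d1, d2 = _side(a, b, p), _side(a, b, q)
    d3, d4 = _side(p, q, a), _side(p, q, b)
    return d1 * d2 < 0 and d3 * d4 < 0


def count_crossings(
    trajectories: Dict[int, List[Point]],
    line_start: Point,
    line_end: Point,
) -> int:
    """Count tracks whose trajectory crosses the counting line at least once."""
    counted = 0
    for points in trajectories.values():
        for prev, curr in zip(points, points[1:]):
            if _segments_cross(prev, curr, line_start, line_end):
                counted += 1
                break  # count each track at most once
    return counted
```

With one counting line per road direction, as in claim 5, the function would be called once per line, with each line passed as its own pair of endpoints.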