An object detection and tracking method for overhead-view scenes

By combining an improved YOLO11 network with a ByteTrack tracking algorithm that uses an NWD-IoU dynamic adaptive fusion metric mechanism, the method addresses the accuracy and stability problems of detecting and tracking construction site personnel from an overhead perspective, and achieves high-precision detection and continuous trajectory tracking of tiny worker targets.

CN122244756A · Pending · Publication Date: 2026-06-19 · INNER MONGOLIA UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: INNER MONGOLIA UNIV OF TECH
Filing Date: 2026-03-18
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for personnel detection and tracking from an overhead view of construction sites suffer from limitations in target positioning accuracy, severe background noise interference, and frequent trajectory interruptions and identity switching during multi-target tracking.

Method used

An improved YOLO11 network is combined with a ByteTrack tracking algorithm based on a target-scale NWD-IoU dynamic adaptive fusion metric mechanism. The neck network of YOLO11 is reconstructed with the HD-Star module, which performs feature fusion by combining a star-shaped basic fusion branch unit, a dual-path dynamic cueing local-context-aware branch unit, and an end reorganization unit. The dynamic adaptive fusion metric mechanism is used for data association.

Benefits of technology

It improves the detection accuracy of worker movement trajectories in construction sites from an overhead perspective, accurately extracts the edge features of tiny workers in complex backgrounds, and outputs smooth and continuous similarity trajectories, reducing false detections and missed detections.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides an object detection and tracking method for overhead view scenes, comprising: performing target detection in an improved YOLO11 algorithm on each video frame of a video frame sequence of a construction site under overhead view, obtaining a set of worker detection boxes carrying confidence scores in the video frames; the improved YOLO11 algorithm is a model that reconstructs the neck network of YOLO11 using the HD-Star module; employing the ByteTrack tracking algorithm, which introduces a target-scale-based NWD-IoU dynamic adaptive fusion metric mechanism, to associate the set of worker detection boxes carrying confidence scores in the video frames with a set of historical worker motion trajectories, obtaining a set of worker motion trajectories carrying labels in the video frames; and integrating the set of worker motion trajectories carrying labels in all video frames in a temporal sequence to obtain the complete motion trajectory of each worker. This application can improve the detection accuracy of worker motion trajectories in construction sites under overhead view scenes.

Description

Technical Field

[0001] This application relates to the field of personnel detection at construction sites, and includes, but is not limited to, a method for object detection and tracking in overhead scenes.

Background Technology

[0002] Currently, in construction site worker detection scenarios, mainstream technologies mostly rely on single-stage object detection networks such as the YOLO series, primarily achieving multi-scale feature fusion through feature pyramids (e.g., FPN / PANet) and simple channel concatenation operations. However, when the application scenario shifts to an overhead view, these conventional solutions reveal significant limitations. First, worker targets are small in an overhead view, typically appearing only as hard hats or shoulder outlines. After multiple strided downsampling steps, conventional networks lose much of the local texture detail of these tiny targets. At the same time, simple concatenation (Concat) operations cannot effectively bridge the semantic gap between shallow high-resolution features and deep high-semantic features, which significantly limits the network's accuracy in locating small targets. Second, the background of construction sites is extremely complex. Scaffolding, scattered building materials, and machinery are easily confused with worker features in an overhead view; for example, yellow machinery parts might be misidentified as hard hats. Existing conventional attention mechanisms often lack background filtering guided by target-specific priors, so they easily introduce global noise and cause serious false positives and false negatives.

[0003] Furthermore, in the multiple object tracking (MOT) stage that follows construction site worker detection, the mainstream ByteTrack algorithm has achieved excellent performance in general scenarios because of its strategy of associating all detection boxes. Its core data association mechanism relies heavily on Kalman filter state prediction and an intersection over union (IoU) distance measure. However, when it is applied directly to construction worker tracking in high-angle overhead scenes, traditional IoU matching reveals critical flaws. From this perspective, workers are mostly tiny targets occupying only a few pixels, so the absolute areas of their bounding boxes are extremely small. At this scale, even a jitter of one or two pixels in the detection box, such as the deformation caused by a worker bending over or waving, can cause the IoU between the target boxes in consecutive frames to drop sharply, even to zero. In addition, construction sites commonly contain dense scaffolding that occludes workers; when a small target is briefly occluded or the trajectory prediction deviates slightly, the predicted box and the actual detection box often do not overlap at all. In that case the traditional IoU metric becomes completely ineffective: the system mistakenly concludes that the target has disappeared or that a new target has appeared, which leads to severe track fragmentation and frequent identity switching.

Summary of the Invention

[0004] Based on the above problems, this application provides an object detection and tracking method for overhead scenes. It aims to improve the accuracy of target detection by using the improved YOLO11 reconstructed from the HD-Star module, and by using the ByteTrack tracking algorithm which introduces a target-scale-based NWD-IoU dynamic adaptive fusion metric mechanism to output a smooth and continuous similarity trajectory, thereby improving the detection accuracy of worker movement trajectories in construction sites under overhead scenes.

[0005] The technical solution of this application embodiment is implemented as follows:

[0006] This application provides an object detection and tracking method for overhead scenes. The method includes: acquiring a video frame sequence of a construction site under overhead view; inputting each video frame in the video frame sequence into an improved YOLO11 for target detection to obtain a set of worker detection boxes carrying confidence scores in the video frame; wherein, the improved YOLO11 is a model that reconstructs the neck network of YOLO11 using a hierarchical dynamic star HD-Star module, and the HD-Star module is a module composed of a star-shaped basic fusion branch unit, a dual-path dynamic cueing local-context-aware branch unit, and an end reorganization unit; using the ByteTrack tracking algorithm, which introduces a target-scale-based NWD-IoU dynamic adaptive fusion metric mechanism, to perform data association between the set of worker detection boxes carrying confidence scores in the video frame and the set of historical worker motion trajectories, to obtain a set of worker motion trajectories carrying labels in the video frame; and concatenating the set of worker motion trajectories carrying labels in all video frames in the video frame sequence in a temporal order to obtain the complete motion trajectory of each worker in the video frame sequence.
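The final step of the method, concatenating the labeled per-frame results in temporal order, can be illustrated with a minimal sketch. The data layout used here (one list of `(label, box)` pairs per frame, with boxes as `(cx, cy, w, h)`) and the function name `build_trajectories` are assumptions made for illustration; the application does not specify them.

```python
from collections import defaultdict


def build_trajectories(frames):
    """Group per-frame labeled boxes into one time-ordered trajectory per label.

    frames: list indexed by frame number; each entry is a list of
            (label, box) pairs with box = (cx, cy, w, h).
    Returns a dict mapping label -> list of (frame_index, box).
    """
    trajectories = defaultdict(list)
    for frame_idx, tracks in enumerate(frames):
        for label, box in tracks:
            trajectories[label].append((frame_idx, box))
    return dict(trajectories)


# Hypothetical tracker output for three frames; worker 2 is missed in frame 1.
frames = [
    [(1, (10.0, 10.0, 4.0, 4.0)), (2, (50.0, 20.0, 5.0, 5.0))],
    [(1, (11.0, 10.5, 4.0, 4.0))],
    [(1, (12.0, 11.0, 4.0, 4.0)), (2, (52.0, 21.0, 5.0, 5.0))],
]
trajectories = build_trajectories(frames)
```

Because the label assigned during data association is the grouping key, any identity switch in the tracker shows up directly as a split or merged trajectory in this output.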

[0007] In some embodiments, the improved YOLO11 includes: a backbone network, a neck network having four HD-Star modules, and a detection head; the step of inputting each video frame in the video frame sequence into the improved YOLO11 for target detection to obtain a set of worker detection boxes carrying confidence scores in the video frame includes: for each video frame in the video frame sequence, using the backbone network to sequentially perform multi-layer convolution and downsampling operations on the video frame to obtain feature maps of different scales of the video frame; using the neck network having four HD-Star modules to fuse and enhance the feature maps of different scales of the video frame to obtain enhanced multi-scale features of the video frame; and inputting the enhanced multi-scale features of the video frame into the detection head to obtain a set of worker detection boxes carrying confidence scores in the video frame.

[0008] In some embodiments, the feature maps of different scales of the video frame include: shallow feature maps, mid-level feature maps, and deep feature maps with progressively decreasing resolution; the enhanced multi-scale features of the video frame include: first-scale features, second-scale features, and third-scale features with progressively decreasing resolution; the process of fusing and enhancing the feature maps of different scales of the video frame using the neck network with four HD-Star modules to obtain the enhanced multi-scale features of the video frame includes: using the first upsampling module in the neck network to upsample the deep feature map to obtain a first upsampled map, and using the first HD-Star module to perform cross-layer fusion of the first upsampled map and the mid-level feature map to obtain a first intermediate feature map; using the first C3k2 module to extract features from the first intermediate feature map to obtain mid-level features, using the second upsampling module to upsample the mid-level features to obtain a second intermediate feature map, using the second HD-Star module to perform cross-layer fusion of the second intermediate feature map and the shallow feature map to obtain a third intermediate feature map, and using the second C3k2 module to extract features from the third intermediate feature map to obtain the first-scale features; using the first convolution module to downsample the first-scale features to obtain first downsampled features, using the third HD-Star module to perform cross-layer fusion of the first downsampled features and the mid-level features to obtain a fourth intermediate feature map, and using the third C3k2 module to extract features from the fourth intermediate feature map to obtain the second-scale features; and using the second convolution module to downsample the second-scale features to obtain second downsampled features, using the fourth HD-Star module to perform cross-layer fusion of the second downsampled features and the deep feature map to obtain a fifth intermediate feature map, and using the fourth C3k2 module to extract features from the fifth intermediate feature map to obtain the third-scale features.

[0009] In some embodiments, the step of using the first HD-Star module to perform cross-layer fusion of the first upsampled map and the mid-level feature map to obtain a first intermediate feature map includes: using the star-shaped basic fusion branch unit of the first HD-Star module, first performing element-wise multiplication on the convolved first upsampled map and the convolved mid-level feature map to obtain fused features, and then performing convolution mapping on the fused features to obtain basic fused features; using the dual-path dynamic cueing local-context-aware branch unit of the first HD-Star module, first performing feature extraction on the convolved first upsampled map and the convolved mid-level feature map under two perception windows of different granularity to obtain the dual-granularity intermediate features of the first upsampled map and the dual-granularity intermediate features of the mid-level feature map, and then using dynamic channel weights to adaptively recalibrate the dual-granularity intermediate features of the first upsampled map and the dual-granularity intermediate features of the mid-level feature map to obtain the dual-granularity attention features of the first upsampled map and the dual-granularity attention features of the mid-level feature map; and using the end reorganization unit of the first HD-Star module, sequentially performing channel-dimension concatenation, convolutional compression, and RepConv reorganization on the basic fused features, the dual-granularity attention features of the first upsampled map, and the dual-granularity attention features of the mid-level feature map to obtain the first intermediate feature map.
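The star-shaped basic fusion branch described above (convolve both inputs, multiply element-wise, then apply a convolution mapping) can be sketched in plain Python on tiny feature maps. This is only an illustration under stated assumptions: the convolutions are taken to be pointwise (1x1), the feature maps are nested lists shaped `[C][H][W]`, and the names `conv1x1` and `star_fusion` and all weights are hypothetical, since the application does not give kernel sizes or parameters.

```python
def conv1x1(x, weight):
    """Pointwise (1x1) convolution: mix channels independently at each pixel.

    x:      feature map as nested lists [C_in][H][W]
    weight: channel-mixing matrix [C_out][C_in]
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [sum(w_c * x[c][i][j] for c, w_c in enumerate(row)) for j in range(width)]
            for i in range(height)
        ]
        for row in weight
    ]


def star_fusion(up, mid, w_up, w_mid, w_out):
    """Convolve both inputs, multiply them element-wise, then map the product."""
    a = conv1x1(up, w_up)
    b = conv1x1(mid, w_mid)
    fused = [
        [[va * vb for va, vb in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]
    return conv1x1(fused, w_out)


# Two-channel 2x2 inputs with identity input convolutions (hypothetical values).
up = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
mid = [[[1, 0], [0, 1]], [[2, 2], [2, 2]]]
identity = [[1, 0], [0, 1]]
basic_fused = star_fusion(up, mid, identity, identity, w_out=[[1, 1]])
```

The multiplication acts as a gate: a position contributes to the fused feature only where both the upsampled map and the mid-level map respond, which is the property the application relies on to suppress background responses present in only one of the two inputs.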

[0010] In some embodiments, the dual-path dynamic cueing local-context-aware branch unit includes two attention branches; the dynamic channel weights include a first weight for the first upsampled map and a second weight for the mid-level feature map; before the dynamic channel weights are used to adaptively recalibrate the dual-granularity intermediate features of the first upsampled map and the dual-granularity intermediate features of the mid-level feature map to obtain the dual-granularity attention features of the first upsampled map and the dual-granularity attention features of the mid-level feature map, the method includes: using the global average pooling layers in the two attention branches to extract environmental context information from the first upsampled map and the mid-level feature map respectively, to obtain the context features of the first upsampled map and the context features of the mid-level feature map; and using the lightweight multilayer perceptrons and activation functions in the two attention branches to perform linear mapping and nonlinear activation on the context features of the first upsampled map and the context features of the mid-level feature map respectively, to obtain the first weight and the second weight; the step of using the dynamic channel weights to adaptively recalibrate the dual-granularity intermediate features of the first upsampled map and the dual-granularity intermediate features of the mid-level feature map to obtain the dual-granularity attention features of the first upsampled map and the dual-granularity attention features of the mid-level feature map includes: using the first weight to adaptively recalibrate the dual-granularity intermediate features of the first upsampled map to obtain the dual-granularity attention features of the first upsampled map, and using the second weight to adaptively recalibrate the dual-granularity intermediate features of the mid-level feature map to obtain the dual-granularity attention features of the mid-level feature map.
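A minimal sketch of one attention branch follows: global average pooling produces a context vector, a small multilayer perceptron maps it to one weight per channel, and the weights rescale the features. The application names only "linear mapping and nonlinear activation", so the specific choices here (one hidden layer, ReLU, a sigmoid output, and all parameter values) are assumptions, as are the function names.

```python
import math


def global_avg_pool(x):
    """Per-channel mean over all spatial positions; x is [C][H][W]."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def linear(v, weight, bias):
    """Dense layer: weight is [out][in], bias is [out]."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weight, bias)]


def channel_weights(x, w1, b1, w2, b2):
    """GAP -> linear -> ReLU -> linear -> sigmoid (activation choices assumed)."""
    context = global_avg_pool(x)
    hidden = [max(0.0, h) for h in linear(context, w1, b1)]
    logits = linear(hidden, w2, b2)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def recalibrate(features, weights):
    """Scale every channel of `features` by its dynamic weight."""
    return [[[weights[c] * v for v in row] for row in ch] for c, ch in enumerate(features)]


# Hypothetical two-channel input: channel 0 is active, channel 1 is silent.
x = [[[4.0, 4.0], [4.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
weights = channel_weights(x, w1=[[1.0, 0.0]], b1=[0.0], w2=[[1.0], [-1.0]], b2=[0.0, 0.0])
attended = recalibrate(x, weights)
```

In the method, this branch would be instantiated twice, once for the first upsampled map and once for the mid-level feature map, and the resulting weights would be applied to the corresponding dual-granularity intermediate features rather than to the raw input used in this toy example.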

[0011] In some embodiments, the construction process of the improved YOLO11 includes: obtaining an initial detection model built on YOLO11, and reconstructing the neck network of the initial detection model based on the HD-Star module to obtain a detection model to be trained; obtaining multiple videos of construction workers collected at existing construction sites under different lighting conditions and from an overhead view; and training the detection model to be trained with a joint loss, using the image library corresponding to the multiple videos of construction workers, to obtain the improved YOLO11; wherein the joint loss includes: binary cross-entropy (BCE) loss, complete intersection over union (CIoU) loss, and distribution focal loss (DFL).
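The three terms of the joint loss can be sketched for a single prediction as follows. The definitions of BCE, CIoU, and DFL used here are the standard published ones; how the application weights and aggregates them is not stated, so the weights `w_cls`, `w_box`, and `w_dfl`, the single-edge DFL, and all function names are assumptions for illustration.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for one predicted class probability."""
    eps = 1e-7
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def ciou_loss(box, gt):
    """1 - CIoU for two boxes given as (x1, y1, x2, y2)."""
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    inter_w = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    inter_h = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = inter_w * inter_h
    iou = inter / (w * h + wg * hg - inter)
    # Squared centre distance over the squared diagonal of the enclosing box.
    rho2 = ((box[0] + box[2] - gt[0] - gt[2]) / 2) ** 2 \
        + ((box[1] + box[3] - gt[1] - gt[3]) / 2) ** 2
    c2 = (max(box[2], gt[2]) - min(box[0], gt[0])) ** 2 \
        + (max(box[3], gt[3]) - min(box[1], gt[1])) ** 2
    # Aspect-ratio consistency term and its trade-off factor.
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / (1.0 - iou + v) if v > 0 else 0.0
    return 1.0 - (iou - rho2 / c2 - alpha * v)


def dfl(probs, target):
    """Distribution focal loss for one box edge.

    probs:  softmax distribution over integer distance bins
    target: continuous distance, with 0 <= target < len(probs) - 1
    """
    left = int(math.floor(target))
    right = left + 1
    return -((right - target) * math.log(probs[left])
             + (target - left) * math.log(probs[right]))


def joint_loss(cls_prob, cls_target, box, gt, edge_probs, edge_target,
               w_cls=1.0, w_box=1.0, w_dfl=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (w_cls * bce(cls_prob, cls_target)
            + w_box * ciou_loss(box, gt)
            + w_dfl * dfl(edge_probs, edge_target))


example = joint_loss(0.9, 1.0, (0.0, 0.0, 4.0, 4.0), (1.0, 0.0, 5.0, 4.0),
                     [0.1, 0.6, 0.3], 1.25)
```

Unlike a plain IoU loss, the CIoU term still carries a gradient signal when the two boxes do not overlap, because the centre-distance penalty remains nonzero; this matters for tiny targets, where non-overlap is common early in training.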

[0012] In some embodiments, the target-scale-based NWD-IoU dynamic adaptive fusion metric mechanism is as follows:

[0013] to [0017] (The five formulas defining this mechanism are not reproduced in this text; only their separators survived extraction.)

[0018] where the symbols in the formulas denote, in order: the similarity between the worker detection box of the video frame and the predicted detection box of the video frame, the predicted detection box being obtained by Kalman filtering based on the historical worker motion trajectory set; the adaptive fusion penalty weight; the normalized Wasserstein distance (NWD) between the worker detection box and the predicted detection box; the exponential function; the intersection over union (IoU) of the areas of the worker detection box and the predicted detection box; the Wasserstein distance between the worker detection box and the predicted detection box; the smooth scaling constant; the center coordinates and absolute width and height of the worker detection box; the center coordinates and absolute width and height of the predicted detection box; the preset minimum bounding box area and the preset maximum bounding box area; and the area of the worker detection box and the area of the predicted detection box.
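Since the formulas themselves are not available in this text, the following is only a sketch of one plausible form that is consistent with the quantities listed above. The Wasserstein distance and NWD follow the standard definition for boxes modeled as two-dimensional Gaussians. The fusion weight (a linear ramp on the mean box area, clamped between the minimum and maximum area), the convex combination of NWD and IoU, and every default value (`c`, `area_min`, `area_max`) are assumptions, not the application's actual formulas.

```python
import math


def corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def iou(a, b):
    ax1, ay1, ax2, ay2 = corners(a)
    bx1, by1, bx2, by2 = corners(b)
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * max(0.0, min(ay2, by2) - max(ay1, by1))
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def wasserstein(a, b):
    """2-Wasserstein distance between the Gaussians N((cx, cy), diag(w^2/4, h^2/4))."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                     + ((a[2] - b[2]) / 2) ** 2 + ((a[3] - b[3]) / 2) ** 2)


def nwd(a, b, c):
    """Normalized Wasserstein distance with smooth scaling constant c."""
    return math.exp(-wasserstein(a, b) / c)


def fusion_weight(a, b, area_min, area_max):
    """ASSUMED form: 1.0 for tiny boxes (trust NWD), 0.0 for large boxes (trust IoU)."""
    mean_area = (a[2] * a[3] + b[2] * b[3]) / 2
    t = (mean_area - area_min) / (area_max - area_min)
    return 1.0 - min(1.0, max(0.0, t))


def similarity(det, pred, c=12.8, area_min=16.0, area_max=1024.0):
    """ASSUMED fusion: alpha * NWD + (1 - alpha) * IoU; defaults are placeholders."""
    alpha = fusion_weight(det, pred, area_min, area_max)
    return alpha * nwd(det, pred, c) + (1.0 - alpha) * iou(det, pred)


tiny_det, tiny_pred = (10.0, 10.0, 4.0, 4.0), (15.0, 10.0, 4.0, 4.0)
big_det, big_pred = (100.0, 100.0, 40.0, 40.0), (104.0, 100.0, 40.0, 40.0)
```

With these placeholder settings, the two tiny boxes do not overlap, so their IoU is zero, yet `similarity(tiny_det, tiny_pred)` remains well above zero because the weight shifts entirely to NWD; for the two large boxes the weight shifts entirely to IoU and the fused score equals the plain IoU.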

[0019] In some embodiments, when the video frame is the first frame in the video frame sequence, the historical worker motion trajectory set is generated using temporal physical dynamics; when the video frame is not the first frame in the video frame sequence, the worker motion trajectory set in the previous video frame associated with the video frame in the video frame sequence is determined as the historical worker motion trajectory set.

[0020] The beneficial effects of the technical solutions provided in this application include at least the following:

[0021] The object detection and tracking method for overhead scenes provided in this application uses an improved YOLO11 to perform object detection on each acquired video frame. Specifically, the feature fusion stage of the improved YOLO11 abandons the crude channel concatenation and linear addition of the traditional YOLO structure and replaces them entirely with the HD-Star module. The HD-Star module consists of a star-shaped basic fusion branch unit, a dual-path dynamic cueing local-context-aware branch unit, and an end reorganization unit. It combines the non-linear feature multiplication effect of the star operation (Star Operation) with a dynamic prompt mechanism (Dynamic Prompt) driven by global context. Together with local-global dual-granularity perception, it can actively suppress complex construction site background clutter (such as reflective building materials and shadow interference) in the underlying feature space, and can therefore accurately extract the edge features of extremely small workers even under the limited perspective of shooting from an extremely high angle, improving the target detection accuracy for each video frame. Then, in the cross-frame spatiotemporal association stage, the method breaks through the limitation of traditional ByteTrack, which relies solely on the rigid intersection over union (IoU) for spatial alignment, by introducing a ByteTrack tracking algorithm based on the target-scale NWD-IoU dynamic adaptive fusion metric mechanism. This mechanism models the bounding box of a tiny, highly deformable worker as a continuous two-dimensional Gaussian distribution. Even if the predicted box and the detection box do not overlap at all because of slight camera shake or a sudden change in worker posture (causing the IoU to drop instantly to zero), the mechanism can still use the second-order optimal transport (Wasserstein) distance between the Gaussian distributions to output a smooth and continuous similarity, which keeps the trajectory continuous. Thus, this application proposes an end-to-end multi-worker tracking method based on the tracking-by-detection paradigm. It reconstructs, at the system level, the underlying mathematical logic of feature perception, background suppression, and temporal association, which can improve the detection accuracy of worker movement trajectories in construction sites under overhead views.
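The scale sensitivity of IoU described here can be checked with a few lines of arithmetic. The sketch below shifts a 4x4 box and a 40x40 box horizontally by the same number of pixels and prints the IoU for each, together with the NWD of the tiny box. The box sizes, the shifts, and the constant `c = 12.8` are illustrative choices, not values from the application.

```python
import math


def iou_xywh(a, b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    iw = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    ih = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def nwd_xywh(a, b, c=12.8):
    """Normalized Wasserstein distance between boxes modeled as 2D Gaussians."""
    w2 = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                   + ((a[2] - b[2]) / 2) ** 2 + ((a[3] - b[3]) / 2) ** 2)
    return math.exp(-w2 / c)


def shifted(box, dx):
    """Return the same box moved dx pixels to the right."""
    return (box[0] + dx, box[1], box[2], box[3])


tiny = (100.0, 100.0, 4.0, 4.0)     # a worker a few pixels wide
large = (100.0, 100.0, 40.0, 40.0)  # a target at a conventional scale

for dx in (0, 1, 2, 4):
    print(dx,
          round(iou_xywh(tiny, shifted(tiny, dx)), 3),
          round(iou_xywh(large, shifted(large, dx)), 3),
          round(nwd_xywh(tiny, shifted(tiny, dx)), 3))
```

A shift of 2 pixels leaves the large box with an IoU above 0.9 but cuts the tiny box to one third, and a shift of 4 pixels drives the tiny box's IoU to exactly zero while its NWD decays only gradually, which is why an IoU-only association loses such targets.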

[0022] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit the technical solutions provided in the embodiments of this application.

Attached Figure Description

[0023] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort, wherein:

[0024] Figure 1 is a flowchart of an object detection and tracking method for overhead scenes provided in an embodiment of this application;

[0025] Figure 2 is a schematic diagram of the structure of an improved YOLO11 provided in an embodiment of this application;

[0026] Figure 3 is a schematic diagram of the internal structure of an HD-Star module provided in an embodiment of this application;

[0027] Figure 4 is a schematic diagram of the data flow for trajectory association using the ByteTrack tracking algorithm with a target-scale-based NWD-IoU dynamic adaptive fusion metric mechanism, provided in an embodiment of this application.

Detailed Implementation

[0028] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. The following embodiments are used to illustrate this application, but are not intended to limit the scope of this application. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0029] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.

[0030] It should be noted that the terms "first, second, and third" used in the embodiments of this application are merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first, second, and third" can be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

[0031] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments of this application pertain. It should also be understood that terms such as those defined in general dictionaries should be understood to have a meaning consistent with their meaning in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.

[0032] Example 1:

[0033] Figure 1 is a flowchart of an object detection and tracking method for overhead scenes provided in an embodiment of this application. The method can be executed by an electronic device, such as a computer or server, and is explained below with reference to Figure 1:

[0034] Step 101: Obtain the video frame sequence of the construction site under test from an overhead view.

[0035] In some embodiments, a drone equipped with a high-definition gimbal camera can be used to record live video from a height of 30 meters above the construction site under test, resulting in a video frame sequence. The video frames in this sequence may include workers, scaffolding, and dust netting at the construction site under test.

[0036] It should be noted that an overhead view refers to viewing the construction site from a higher position. Furthermore, the number of video frames included in the video frame sequence, i.e., its length and corresponding resolution, can be determined according to the actual situation, and this application does not impose any limitations on this.

[0037] Step 102: Input each video frame in the video frame sequence into the improved YOLO11 for target detection to obtain a set of worker detection boxes carrying confidence scores in the video frame.

[0038] The improved YOLO11 is a model that reconstructs the neck network of YOLO11 using a hierarchical dynamic star (HD-Star) module. The HD-Star module is composed of a star-shaped basic fusion branch unit, a dual-path dynamic cueing local-context-aware branch unit, and an end reorganization unit.

[0039] In some embodiments, the HD-Star module is not a traditional single-path convolution module, but a branched feature fusion structure composed of a star-shaped basic fusion branch unit, a dual-path dynamic cueing local-context-aware branch unit, and an end reorganization unit.

[0040] It should be noted that YOLO11 consists of a backbone network, a neck network, and a detection head.

[0041] In some embodiments, the improved YOLO11 includes: a backbone network, a neck network having four of the HD-Star modules, and a detection head. Correspondingly, step 102 can be implemented by steps 1021 to 1023 (not shown in Figure 1).

[0042] Step 1021: For each video frame in the video frame sequence, the backbone network is used to sequentially perform multi-layer convolution and downsampling operations on the video frame to obtain feature maps of different scales of the video frame.

[0043] Step 1022: Using the neck network with 4 HD-Star modules, the feature maps of different scales of the video frame are fused and enhanced to obtain the enhanced multi-scale features of the video frame.

[0044] Step 1023: Input the enhanced multi-scale features of the video frame into the detection head to obtain a set of worker detection boxes carrying confidence scores in the video frame.

[0045] In some embodiments, the feature maps of the video frame at different scales include: shallow feature maps, mid-level feature maps, and deep feature maps with progressively decreasing resolution; the enhanced multi-scale features of the video frame include: first-scale features, second-scale features, and third-scale features with progressively decreasing resolution. Correspondingly, in the structure of the improved YOLO11 shown in Figure 2, 201 is the backbone network, 202 is the neck network with 4 HD-Star modules, and 203 is the detection head. The above step 1022 can be achieved through the following steps A1 to A4:

[0046] Step A1: Using the first upsampling module in the neck network, the deep feature map is upsampled to obtain a first upsampled map. Then, using the first HD-Star module, the first upsampled map and the mid-level feature map are fused across layers to obtain a first intermediate feature map.

[0047] Step A2: Using the first C3k2 module, extract features from the first intermediate feature map to obtain mid-level features. Using the second upsampling module, upsample the mid-level features to obtain a second intermediate feature map. Using the second HD-Star module, fuse the second intermediate feature map and the shallow feature map across layers to obtain a third intermediate feature map. Using the second C3k2 module, extract features from the third intermediate feature map to obtain the first scale feature.

[0048] Step A3: Using the first convolution module, downsample the first scale feature to obtain the first downsampled feature. Using the third HD-Star module, perform cross-layer fusion of the first downsampled feature and the mid-level features to obtain the fourth intermediate feature map. Using the third C3k2 module, extract features from the fourth intermediate feature map to obtain the second scale feature.

[0049] Step A4: Using the second convolution module, downsample the second scale feature to obtain the second downsampled feature, and using the fourth HD-Star module, perform cross-layer fusion of the second downsampled feature and the deep feature map to obtain the fifth intermediate feature map, and use the fourth C3k2 module to extract features from the fifth intermediate feature map to obtain the third scale feature.
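The data flow of steps A1 to A4 can be traced with a small shape-only sketch. It does not implement any layer; it only propagates `(channels, height, width)` tuples through the neck to show which tensors meet at each HD-Star node. The input shapes (a 640x640 frame with P3, P4, and P5 at strides 8, 16, and 32), the channel counts, the x2 upsampling factor, the 3x3 stride-2 convolution with padding 1, and the rule that a fusion node outputs the channel count of its second input are all assumptions for illustration.

```python
def upsample(shape):
    """Upsampling module: doubles the spatial size (factor assumed)."""
    c, h, w = shape
    return (c, h * 2, w * 2)


def downsample(shape):
    """Convolution module: 3x3 kernel, stride 2, padding 1 (padding assumed)."""
    c, h, w = shape
    return (c, (h + 2 - 3) // 2 + 1, (w + 2 - 3) // 2 + 1)


def hd_star(a, b):
    """Stand-in for an HD-Star fusion node.

    Checks that both inputs share a spatial size and returns the fused shape,
    assumed here to take the channel count of the second input.
    """
    assert a[1:] == b[1:], f"spatial mismatch: {a} vs {b}"
    return b


def c3k2(shape):
    """Stand-in for a C3k2 module; assumed to preserve the shape."""
    return shape


def neck(p3, p4, p5):
    # Step A1: upsample the deep map and fuse it with the mid-level map.
    first_intermediate = hd_star(upsample(p5), p4)
    # Step A2: refine, upsample, and fuse with the shallow map.
    mid_features = c3k2(first_intermediate)
    first_scale = c3k2(hd_star(upsample(mid_features), p3))
    # Step A3: downsample and fuse with the retained mid-level features.
    second_scale = c3k2(hd_star(downsample(first_scale), mid_features))
    # Step A4: downsample again and fuse with the deep map.
    third_scale = c3k2(hd_star(downsample(second_scale), p5))
    return first_scale, second_scale, third_scale


# Hypothetical backbone outputs for a 640x640 frame.
p3, p4, p5 = (256, 80, 80), (512, 40, 40), (1024, 20, 20)
scales = neck(p3, p4, p5)
```

Tracing the shapes this way makes the two passes explicit: the first two fusion nodes move information from low resolution to high resolution, and the last two move the refined high-resolution information back down, so each of the three outputs has been fused at least once with a neighbouring scale.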

[0050] Here, each “Upsample” shown in 202 of Figure 2 is an upsampling module (the first upsampling module in step A1 and the second upsampling module in step A2), and each “Conv” shown in 202 is a convolution module (the first convolution module in step A3 and the second convolution module in step A4).

[0051] In addition, with continued reference to Figure 2, the third "Conv" from the top of the backbone network 201 outputs the shallow feature map (P3), which has the highest resolution and contains the finest information, such as the edges of safety helmets in the video frame; the fourth "Conv" from the top outputs the mid-level feature map (P4), which has a moderate resolution; and the "C2PSA" in the backbone network 201 outputs the deep feature map (P5), which has the lowest resolution. The specific implementation of the backbone network 201 follows existing technology and is not described in detail here.

[0052] In some embodiments, the neck network 202 in Figure 2 can be divided into two stages according to its data flow: a top-down stage and a bottom-up stage. It comprises four key HD-Star modules (HD-Star fusion nodes), specifically:

[0053] In the top-down phase: First, the deep feature map (P5) output by 201 is upsampled and then input together with the mid-level feature map (P4) into the first HD-Star module for the first cross-layer fusion. Because P5 has a larger receptive field and stronger global semantic expression ability, while P4 retains richer mesoscale structural information, this process injects deep semantic priors into the mid-level features and generates intermediate feature representations that combine environmental perception with structural discrimination. Subsequently, the features output by the first HD-Star module are enhanced by C3k2 (yielding the mid-level features corresponding to P4) and upsampled again, and are then fused with the high-resolution shallow feature map (P3) output by 201 in the second HD-Star module to obtain the first-scale features corresponding to P3. Because P3 retains relatively complete edge texture and local geometric details, the output of this process better preserves the high-frequency information of small targets, such as helmet edges and head and shoulder contours, and forms a high-resolution feature layer suitable for small target detection.

[0054] In the bottom-up phase: First, the first-scale features corresponding to P3 output from the top-down phase are downsampled using a 3x3 convolution (Conv) with a stride of 2. These features are then input into the third HD-Star module for reverse aggregation together with the mid-level features corresponding to P4 retained from the top-down phase. The main purpose of this step is to feed the enhanced shallow localization information back into the mesoscale semantic space, which improves the model's ability to represent mesoscale targets, partially occluded targets, and targets with scale variations, and generates the second-scale features corresponding to P4. Next, the second-scale features corresponding to P4 are enhanced again by C3k2 feature extraction and downsampled by Conv. Finally, they are fused with the deep feature map (P5) in the fourth HD-Star module at the global level to form the third-scale features corresponding to P5. This layer primarily preserves high-level global semantic information and provides stable contextual support for the entire neck network 202.

[0055] Here, the four HD-Star modules mentioned above replace the Concat and C3k2 feature processing combination in the original YOLO11 neck network. Unlike traditional fusion methods that rely on channel concatenation and convolution stacking, the HD-Star module performs structured modeling through its internal star-shaped basic fusion branch unit, dual-path dynamic cueing local-context-aware branch unit, and terminal reorganization unit each time high-level semantic features interact with low-level spatial detail features. This is more conducive to strengthening cross-layer consistent response regions, preserving weak features of small targets, and suppressing pseudo-response interference caused by complex construction site backgrounds.
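
To make the data flow of the two phases easier to follow, the sketch below traces only the spatial side length of the feature maps through the four HD-Star fusion nodes. It assumes a 640*640 input and backbone strides of 8, 16, and 32 for P3, P4, and P5; these strides are an assumption (they are not stated in this application), and the function name and dictionary keys are hypothetical.

```python
def trace_neck_resolutions(input_size=640, strides=(8, 16, 32)):
    """Return the feature-map side length at each HD-Star fusion node."""
    p3, p4, p5 = (input_size // s for s in strides)  # assumed strides
    trace = {}

    # Top-down phase: upsample by 2, then fuse with the next shallower level.
    up5 = p5 * 2
    assert up5 == p4, "upsampled P5 must match P4 before fusion"
    trace["hd_star_1: up(P5) + P4"] = p4
    up4 = p4 * 2
    assert up4 == p3, "upsampled P4-level features must match P3 before fusion"
    trace["hd_star_2: up(P4') + P3"] = p3

    # Bottom-up phase: a stride-2 3x3 convolution halves the side length.
    down3 = p3 // 2
    assert down3 == p4
    trace["hd_star_3: down(P3') + P4'"] = down3
    down4 = down3 // 2
    assert down4 == p5
    trace["hd_star_4: down(P4'') + P5"] = down4
    return trace


for node, side in trace_neck_resolutions().items():
    print(f"{node}: {side}x{side}")
```

Under these assumptions the four fusion nodes operate at side lengths 40, 80, 40, and 20, which matches the description of P3 as the high-resolution layer for small targets.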

[0056] Correspondingly, regarding the execution operations within the HD-Star module, taking step A1 above, "using the first HD-Star module to perform cross-layer fusion of the first upsampled image and the intermediate feature map to obtain the first intermediate feature map," as an example, it can be implemented through the following steps B1 to B3:

[0057] Step B1: Using the star-shaped basic fusion branch unit of the first HD-Star module, firstly multiply the first upsampled image after convolution operation and the middle layer feature map after convolution operation element-wise to obtain the fused feature, and then perform convolution mapping on the fused feature to obtain the basic fused feature.

[0058] Step B2: Using the dual-path dynamic prompting local-context-aware branch unit of the first HD-Star module, feature extraction is first performed on the first upsampled image after convolution and the middle-layer feature map after convolution under two different granularity perception windows to obtain the dual-granularity intermediate features of the first upsampled image and the dual-granularity intermediate features of the middle-layer feature map. Then, using dynamic channel weights, adaptive recalibration is performed on the dual-granularity intermediate features of the first upsampled image and the dual-granularity intermediate features of the middle-layer feature map to obtain the dual-granularity attention features of the first upsampled image and the dual-granularity attention features of the middle-layer feature map.

[0059] Step B3: Using the terminal reorganization unit of the first HD-Star module, the basic fusion features, the dual-granularity attention features of the first upsampled image, and the dual-granularity attention features of the intermediate feature map are, in sequence, concatenated along the channel dimension, compressed by convolution, and reshaped by RepConv to obtain the first intermediate feature map.

[0060] In some embodiments, reference may be made to the internal operation flow of the HD-Star module shown in Figure 3. Taking 301 as the first upsampled image and 302 as the intermediate feature map as an example, the flow includes the following three main parts:

[0061] Part 1: Operations within the Star Operation of the basic fusion branch unit 303, i.e., multiplicative interaction modeling in the basic fusion branch. In traditional cross-layer feature fusion, the common approach is to perform element-wise addition or channel concatenation directly after alignment. However, in overhead scenes, where tiny worker targets occupy only a few pixels in a high-altitude view, this linear fusion easily drowns the weak target response in background noise, especially under strong interference from complex backgrounds such as steel bars, equipment edges, and reflective areas. To address this, the HD-Star module introduces the Star Operation in the basic fusion branch, using element-wise multiplication instead of traditional linear addition to explicitly model the multiplicative coupling between cross-layer features:

[0062] $F_{\mathrm{fuse}} = X_1 \odot X_2$  Formula (1);

[0063] where $F_{\mathrm{fuse}}$ is the fused feature; $X_1$ is the feature obtained after Conv processing (convolution operation) of the first upsampled image 301; $X_2$ is the feature obtained after Conv processing (convolution operation) of the mid-layer feature map 302; and $\odot$, the Star Operation, denotes the Hadamard product. Unlike additive fusion, element-wise multiplication emphasizes the consistency of features at different levels within the same spatial location: when a region exhibits strong responses in both the high-level semantic features and the low-level local detail features, that region is enhanced after multiplicative mapping; conversely, isolated responses or background noise appearing in only a single path have their fusion value suppressed. Therefore, the Star Operation is more effective in strengthening cross-layer consistent activation in target-related regions and reducing interference from spurious background responses.

[0064] To further supplement local spatial information, the Star Operation branch passes the fusion result through a 3x3 convolution for context extraction, yielding the basic fusion feature 306 output by the basic fusion branch:

[0065] $F_{\mathrm{base}} = \mathrm{Conv}_{3\times 3}(F_{\mathrm{fuse}})$  Formula (2);

[0066] where $F_{\mathrm{base}}$ is the basic fusion feature, $F_{\mathrm{fuse}}$ is the fused feature of formula (1), and $\mathrm{Conv}_{3\times 3}(\cdot)$ denotes a local convolutional mapping. This Star Operation branch provides the foundational feature representation for cross-layer multiplicative interactions throughout the HD-Star module.
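
The suppression effect described for the Star Operation can be illustrated with a toy example. The sketch below compares element-wise multiplication with element-wise addition on two hand-made 2x2 response maps; the values and the helper names are hypothetical, and the 3x3 convolution of formula (2) is left out.

```python
def hadamard(a, b):
    """Element-wise product of two equally sized 2D maps."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def elementwise_add(a, b):
    """Element-wise sum of two equally sized 2D maps."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Toy responses after Conv: the target sits at (0, 1) in both paths.
deep = [[0.0, 0.9],
        [0.0, 0.8]]     # (1, 1): spurious response in the deep path only
shallow = [[0.0, 0.9],
           [0.7, 0.0]]  # (1, 0): spurious response in the shallow path only

star = hadamard(deep, shallow)
added = elementwise_add(deep, shallow)
print("star :", star)    # only the cross-layer consistent location stays non-zero
print("added:", added)   # both single-path responses survive
```

With multiplication, a response must be present in both paths to survive, whereas addition passes each single-path response through unchanged.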

[0067] Part Two: Dual-path dynamic prompt local-context-aware branch units. As shown at 304 and 305 in Figure 3, to balance the local details of small targets with broader contextual information, the HD-Star module employs Local-Global Attention (LGA) for dual-granularity feature modeling in each dynamic prompt local-context-aware branch unit. The core idea is to extract local high-frequency textures and neighborhood contextual semantics in parallel through patch-aware windows of different scales, thereby achieving a joint representation of multi-scale discriminative information with lower computational overhead.

[0068] For example, for 304 and 305 shown in Figure 3, the two perception branches are configured as follows:

[0069] 1. P=2, that is: fine-grained local perception branch (patch_size=2): it forces the network to perform self-attention calculation within an extremely limited 2*2 local receptive field, to obtain the high-frequency edge response of the arc shape of the tiny worker's safety helmet, and avoid the loss of micro-geometric features in the pooling operation.

[0070] 2. P=4, i.e., coarse-grained global perception branch (patch_size=4): focuses on the perception and verification of the macro-environment surrounding the target. It aggregates a wider range of neighborhood information through a 4*4 window to help verify whether the highlighted pixel block is connected to the dark columnar pixels representing the outline of the worker's body.

[0071] In this way, the HD-Star module can not only sensitively capture the local details of small targets, but also use contextual information from a larger neighborhood for auxiliary discrimination, thereby improving the detection robustness in complex backgrounds.
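
The two perception windows can be pictured as two different partitions of the same feature map. The sketch below only shows the windowing step for patch_size=2 and patch_size=4 on a toy 4x4 map; the attention computation inside each window is not specified in this passage and is therefore omitted, and the function name is hypothetical.

```python
def split_into_patches(feature, patch_size):
    """Split a 2D map (list of rows) into non-overlapping square windows."""
    h, w = len(feature), len(feature[0])
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patches.append([row[left:left + patch_size]
                            for row in feature[top:top + patch_size]])
    return patches


feature = [[r * 4 + c for c in range(4)] for r in range(4)]
fine = split_into_patches(feature, 2)    # fine-grained branch: four 2x2 windows
coarse = split_into_patches(feature, 4)  # coarse-grained branch: one 4x4 window
print(len(fine), "fine windows, first:", fine[0])
print(len(coarse), "coarse window(s)")
```

Each fine window sees only four neighboring positions, while the coarse window covers the full 4x4 neighborhood, which mirrors the split between local edge detail and surrounding context.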

[0072] In some embodiments, the dual-path dynamic prompt local-context-aware branch unit within the HD-Star module enhances the model's adaptability to complex construction site scene changes through adaptive channel recalibration for environmental changes. Specifically, a Dynamic Prompt mechanism is introduced within each dynamic prompt local-context-aware branch unit, which adaptively generates channel recalibration weights based on the global context of the current input features.

[0073] Correspondingly, the aforementioned dual-path dynamic prompting local-context-aware branch unit includes: two attention branches, and the dynamic channel weights include: a first weight of the first upsampled image and a second weight of the intermediate feature map; that is, before executing step B2 above, "using the dynamic channel weights to adaptively recalibrate the dual-granularity intermediate features of the first upsampled image and the dual-granularity intermediate features of the intermediate feature map to obtain the dual-granularity attention features of the first upsampled image and the dual-granularity attention features of the intermediate feature map", the following steps C1 and C2 can be executed first:

[0074] Step C1: Using the global average pooling layers in the two attention branches, extract the environmental context information from the first upsampled image and the middle feature map respectively, to obtain the context features of the first upsampled image and the context features of the middle feature map.

[0075] Step C2: Using a lightweight multilayer perceptron and activation function in the two attention branches, linear mapping and nonlinear activation are performed on the context features of the first upsampled map and the context features of the middle layer feature map respectively to obtain the first weight and the second weight.

[0076] Here, denote the input feature of a given attention branch (e.g., the first upsampled image) as $X \in \mathbb{R}^{C \times H \times W}$. First, its environmental context feature $z$ is extracted through global average pooling:

[0077] $z = \mathrm{GAP}(X) = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} X(:, i, j)$  Formula (3);

[0078] Subsequently, the context feature $z$ is input to a lightweight multilayer perceptron and undergoes linear mapping and nonlinear activation in sequence, yielding the channel weights, i.e., the first weight $w$:

[0079] $w = \sigma(\mathrm{MLP}(z))$  Formula (4);

[0080] where $\sigma(\cdot)$ denotes the nonlinear activation, and $w$ can also be regarded as the dynamic prompt mask corresponding to the current input (e.g., the first upsampled image).

[0081] Correspondingly, the step B2 above, "using dynamic channel weights to adaptively recalibrate the dual-granularity intermediate features of the first upsampled image and the dual-granularity intermediate features of the middle-layer feature map to obtain the dual-granularity attention features of the first upsampled image and the dual-granularity attention features of the middle-layer feature map," can be achieved through the following step C3:

[0082] Step C3: Using the first weight, adaptively recalibrate the two-granularity intermediate features of the first upsampled image to obtain the two-granularity attention features of the first upsampled image, and using the second weight, adaptively recalibrate the two-granularity intermediate features of the middle layer feature map to obtain the two-granularity attention features of the middle layer feature map.

[0083] Following the description above, for an intermediate feature in a given attention branch, such as the dual-granularity intermediate feature $F$ of the first upsampled image, the dual-granularity attention feature $\tilde{F}$ of the first upsampled image is obtained through channel-by-channel recalibration:

[0084] $\tilde{F} = w \otimes F$  Formula (5);

[0085] Here, $w$ is the first weight obtained from formula (4) and $\otimes$ denotes channel-wise multiplication. Unlike fixed learnable templates, $w$ is generated in real time from the global context of the current input features, so it can adaptively adjust the importance of different channels according to specific scene conditions. In complex construction site environments such as strong light, shadow, backlight, loess camouflage, and metallic reflection, it can improve the preservation of effective target features while suppressing invalid channel responses that are strongly correlated with environmental interference, thereby enhancing the discriminative power of the features.

[0086] It should be noted that in the HD-Star module, Dynamic Prompt does not act as a separate, unified pre-module applied to all features. Instead, it is embedded within each attention branch, dynamically recalibrating the intermediate representations of the corresponding branch. This design is more in line with the characteristics of a branched structure and allows different input branches to adaptively adjust according to their respective feature states.
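
Steps C1 to C3 can be sketched for a single attention branch as follows. This is a minimal illustration, not the implementation of this application: the two-layer perceptron with ReLU and a sigmoid output is an assumption (the text only says "linear mapping and nonlinear activation"), and all function names and toy weights are hypothetical.

```python
import math


def global_average_pool(feature):
    """feature: list of C channels, each an H x W list of rows -> C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def linear(vec, weight, bias):
    """Dense layer; weight is a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def dynamic_prompt_weights(feature, w1, b1, w2, b2):
    context = global_average_pool(feature)                    # step C1
    hidden = [max(0.0, h) for h in linear(context, w1, b1)]   # ReLU (assumed)
    logits = linear(hidden, w2, b2)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]       # sigmoid (assumed)


def recalibrate(feature, weights):
    """Step C3: scale every channel by its own weight."""
    return [[[wt * v for v in row] for row in ch] for ch, wt in zip(feature, weights)]


# Two 2x2 channels with constant values 1.0 and 3.0 (toy input).
feature = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]    # hypothetical MLP weights
w2, b2 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
weights = dynamic_prompt_weights(feature, w1, b1, w2, b2)
attended = recalibrate(feature, weights)
print("channel weights:", weights)
```

Because the weights are computed from the pooled statistics of the input itself, a different input to the same branch produces different channel weights, which is the point made about real-time generation.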

[0087] Part Three: Terminal reorganization unit, i.e., a multi-branch parallel topology based on RepConv. As shown in Figure 3, the outputs 306, 307, and 308 of the first and second parts are concatenated along the channel dimension at 309, and the final output is generated through convolutional compression and RepConv reshaping, ensuring rich gradient backpropagation and diverse feature extraction. The RepConv operation comprises three independent branches: a 3x3 depthwise convolution branch extracts the main spatial features, a 1x1 pointwise convolution branch enables cross-channel feature interaction, and an identity mapping branch mitigates vanishing gradients. During the deployment and inference phases of the improved YOLO11, the weights of these three branches are reparameterized and folded into a single 3x3 convolutional layer through a mathematically equivalent transformation. This design allows the module to benefit from the high-precision representation capabilities of multiple branches during training, while achieving relatively lightweight and low-latency computation during inference.
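
The folding of the three branches into one 3x3 kernel relies on the linearity of convolution. The sketch below checks the equivalence on a single channel with integer values; it ignores any normalization layers and the multi-channel case, and the function names and kernel values are hypothetical.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (odd square kernel)."""
    r = len(kernel) // 2
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + r][dj + r] * image[y][x]
            out[i][j] = acc
    return out


def fold_branches(k3, k1):
    """Fold a 3x3 kernel, a 1x1 weight, and an identity branch into one 3x3 kernel."""
    folded = [row[:] for row in k3]
    folded[1][1] += k1 + 1  # the 1x1 weight and the identity both act on the center tap
    return folded


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
k3 = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
k1 = 3

branch3 = conv2d_same(image, k3)
branch1 = conv2d_same(image, [[k1]])
three_branch = [[a + b + c for a, b, c in zip(r3, r1, rid)]
                for r3, r1, rid in zip(branch3, branch1, image)]
single = conv2d_same(image, fold_branches(k3, k1))
print(three_branch == single)
```

The three-branch sum and the folded single convolution produce the same output, so the extra branches add no inference cost once folded.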

[0088] The construction process of the improved YOLO11 in the above embodiments can be implemented by the following steps D1 to D3:

[0089] Step D1: Obtain the initial detection model built on YOLO11, and reconstruct the neck network of the initial detection model based on the HD-Star module to obtain the detection model to be trained.

[0090] Step D2: Acquire multiple videos of construction workers from an existing construction site, taken from an overhead view and under different lighting conditions.

[0091] Step D3: Using the image library corresponding to the multiple videos of construction workers, train the detection model to be trained based on joint loss to obtain the improved YOLO11.

[0092] The joint loss includes: binary cross-entropy (BCE) loss, complete intersection over union (CIoU) loss, and distribution focal loss (DFL).
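
Two of the three loss terms can be written compactly. The sketch below gives a scalar binary cross-entropy and a CIoU loss in its commonly used form (overlap, normalized center distance, and aspect-ratio consistency) for boxes given as (cx, cy, w, h). The distribution focal loss and the weighting between the terms are not specified in this passage and are omitted; the function names are hypothetical.

```python
import math


def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def ciou_loss(box, gt):
    """Return 1 - CIoU for two boxes in (cx, cy, w, h) format."""
    x1, y1, x2, y2 = box[0] - box[2] / 2, box[1] - box[3] / 2, box[0] + box[2] / 2, box[1] + box[3] / 2
    gx1, gy1, gx2, gy2 = gt[0] - gt[2] / 2, gt[1] - gt[3] / 2, gt[0] + gt[2] / 2, gt[1] + gt[3] / 2
    inter = max(0.0, min(x2, gx2) - max(x1, gx1)) * max(0.0, min(y2, gy2) - max(y1, gy1))
    union = box[2] * box[3] + gt[2] * gt[3] - inter
    iou = inter / union
    # Squared center distance over squared diagonal of the smallest enclosing box.
    rho2 = (box[0] - gt[0]) ** 2 + (box[1] - gt[1]) ** 2
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2
    # Aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (math.atan(gt[2] / gt[3]) - math.atan(box[2] / box[3])) ** 2
    alpha = v / (1.0 - iou + v) if v > 0 else 0.0
    return 1.0 - (iou - rho2 / c2 - alpha * v)


gt_box = (10.0, 10.0, 4.0, 8.0)
shifted = (12.0, 10.0, 4.0, 8.0)   # same size, shifted 2 px to the right
print("CIoU loss, perfect match:", ciou_loss(gt_box, gt_box))
print("CIoU loss, 2 px shift   :", ciou_loss(shifted, gt_box))
```

For a box only 4 pixels wide, a 2-pixel shift already produces a large loss, which illustrates why tiny targets are so sensitive to coordinate errors.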

[0093] In some embodiments, the construction of the improved YOLO11 specifically includes the following aspects:

[0094] The first aspect involves data acquisition and cleaning. Drones equipped with high-definition gimbals can be used to record 1920*1080, 30 FPS video from a height of 30 meters at existing construction sites under various lighting conditions, such as strong sunlight at dawn, midday, and weak sunlight at dusk. To eliminate the high redundancy between adjacent frames, offline frame extraction can be performed at a rate of 1 frame out of every 5. Full-frame blurry images caused by severe drone shaking are then removed, ultimately yielding an original image library of high-quality still images. This library is then randomly divided into training, validation, and test sets in an 8:1:1 ratio.

[0095] It should be noted that for aerial views, the top of the bounding box must strictly match the highlighted edge of the worker's safety helmet, and the bottom must include the visible outline of the worker's shoulders or torso. For workers partially obscured by scaffolding or dust netting, as long as their safety helmet or some torso features are clearly identifiable, they should be labeled with a separate bounding box to force the network to strengthen its ability to extract local features during the training phase.
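
The frame extraction and dataset split described for the first aspect can be sketched as follows. The blur filtering and the annotation step are omitted, the fixed random seed is an assumption added for reproducibility, and the function names are hypothetical.

```python
import random


def sample_frame_indices(total_frames, step=5):
    """Keep 1 frame out of every `step` frames."""
    return list(range(0, total_frames, step))


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle and split into train/val/test according to integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seed is an assumption, for repeatability
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


frames = sample_frame_indices(30 * 60)   # one minute of 30 FPS video
train, val, test = split_dataset(frames)
print(len(frames), len(train), len(val), len(test))
```

One minute of 30 FPS video yields 360 candidate frames, which split into 288, 36, and 36 images under the 8:1:1 ratio.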

[0096] The second aspect is the end-to-end training strategy, which centers on reconstructing the initial detection model built on YOLO11 with the HD-Star module to obtain the target detection model. The input image for training is an RGB color image with a default resolution of 640*640. Specific settings are as follows:

[0097] 1. HD-Star module exclusive hyperparameters: LGA receptive field window configuration: fine-grained branch set to patch_size=2, coarse-grained branch set to patch_size=4.

[0098] 2. Data Tensorization and Forward Propagation Initialization: After preprocessing and data augmentation (e.g., Mosaic, random cropping), the 1080P construction site image is uniformly scaled and padded to a standard input resolution of 640*640. Subsequently, the image pixel color values are normalized to the [0, 1] interval and converted into a four-dimensional floating-point tensor that the system can process, with dimensions $B \times 3 \times 640 \times 640$, where $B$ is the batch size. This tensor serves as the initial data input for forward propagation into the backbone network of the detection model to be trained.

[0099] 3. Multi-scale Feature Extraction in the Backbone Network: Tensor flow undergoes multiple convolutional and downsampling operations sequentially within the backbone network. During this process, features are extracted progressively from shallow to deep layers: shallow features, such as layer P3, retain high-frequency spatial details like the edges and geometric contours of a worker's safety helmet; deep features, such as layer P5, extract low-frequency abstract semantic features of the macroscopic construction site environment, including scaffolding and machinery. These feature maps at different scales are then fed to the neck network in a hierarchical manner.

[0100] 4. Multi-scale feature fusion and background suppression based on HD-Star module: The neck network is the most critical feature reconstruction stage. In the traditional feature pyramid structure, features at different levels are usually fused by channel splicing or element-wise addition. However, this application introduces the HD-Star module in the key fusion node to enhance the discriminative interaction between high-level semantic information and low-level spatial details.

[0101] For the two input features, the HD-Star module first performs channel alignment using 1*1 convolutions to obtain intermediate representation features of uniform dimension. Then, the HD-Star module models the features in parallel along three paths:

[0102] (1) Star Basic Fusion Branch: In the basic fusion path, the HD-Star module uses element-wise multiplication instead of traditional element-wise addition to perform multiplicative interaction modeling on the intermediate representation features of the two paths. This operation can enhance the consistent response of high and low layer features in the same spatial location and suppress isolated noise in a single path, thereby improving the discriminability of small worker targets in complex backgrounds. Subsequently, the fusion result is further refined through convolution to extract local contextual information and form the output of the basic fusion branch.

[0103] (2) Dynamically prompted dual-granularity attention branches: In addition to the basic fusion branch, the HD-Star module also establishes independent attention enhancement paths for the two intermediate representation features. Within each attention enhancement path branch, a dynamic prompting mechanism based on global context is introduced. Specifically, firstly, global average pooling is used to extract the environmental description of the current input features, and then a lightweight multilayer perceptron is used to generate dynamic channel weights. These weights are used to adaptively recalibrate the intermediate features within the branch to improve the response of the effective target channel and reduce the interference caused by complex environmental factors such as shadows, reflective steel, and loess background.

[0104] (3) Dynamic cue-driven dual-granularity attention branch: Based on dynamic cue modulation, a dual-granularity LGA branch is further used for spatial structure modeling. The fine-grained branch uses a smaller patch perception window to preserve small target edges, textures, and local geometric information; the coarse-grained branch uses a larger patch perception window to integrate contextual cues around the target to help distinguish workers from complex background pseudo-targets. The outputs of the two types of branches are finally concatenated along the channel dimension to form a dual-granularity attention enhancement feature.

[0105] After the above modeling, the HD-Star module integrates the two dual-granularity attention features with the Star basic fusion features, and obtains the final output through convolutional compression and RepConv reshaping, providing a more discriminative feature representation for subsequent detection heads.

[0106] 5. Result Mapping of the (Decoupled) Detection Head of the Detection Model to be Trained: After processing by the HD-Star module, the result is fed into the detection head of the detection model to be trained. This detection head abandons the traditional coupled output, physically separating the classification task and the bounding box regression task in the spatial feature dimension, effectively eliminating feature conflicts between multi-task learning. For the spatial location on the feature map, the classification branch is specifically responsible for outputting the confidence probability of the target being a worker (since only workers are detected, it degenerates into a single-class binary probability); the regression branch focuses on outputting the coordinate offset of the predicted bounding box. Thanks to the extremely high purity features provided by the HD-Star module at the front end, the native decoupled head can accurately complete the result mapping of tiny workers with extremely high efficiency.

[0107] 6. Error quantization based on multi-task joint loss function:

[0108] After forward propagation, the mathematical deviation between the predictions of the trained detection model and the actual human annotations can be calculated within the loss function space. The total loss is jointly driven by the multi-task branches of the detection head:

[0109] (1) Classification loss: Binary cross-entropy loss is adopted to penalize false positives in the classification branch, in which complex backgrounds such as scaffolding and loess are mistaken for workers.

[0110] (2) Bounding box regression loss: It consists of the complete intersection over union (CIoU) loss and the distribution focal loss (DFL). Because the tiny workers in aerial images are extremely sensitive to coordinate errors, the CIoU loss jointly measures the overlapping area, the Euclidean distance between center points, and the consistency of the aspect ratio, and thus quantifies small geometric deviations; the DFL, in turn, transforms bounding box regression into the prediction of a discrete probability distribution, forcing the network to focus quickly on the pixels near the worker's boundary, which further improves the regression accuracy of the bounding box edges of small targets.

[0111] 7. Gradient Backpropagation and Adaptive Update of Network Weights: After calculating the total loss, the backpropagation mechanism is triggered. Following the chain rule, the gradient of the loss function starts from the detection head of the model to be trained, passes through the HD-Star modules layer by layer, and finally reaches the backbone network. In this step, the optimizer updates all convolutional kernel weights in the model to be trained using the calculated gradient matrix and a set learning rate. The final saved model weights possess the ability to accurately capture minute workers even in extremely complex construction scenarios, thus providing an extremely pure and high-confidence source of underlying trajectory coordinate sequences for subsequent research on worker trajectory prediction methods in construction site scenarios.
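
As a concrete illustration of the data tensorization in item 2 above, the sketch below computes the scaling and padding needed to bring a 1920*1080 frame to a 640*640 input, plus the resulting tensor shape for a given batch size. Symmetric padding and the channel-first layout are assumptions, and the function names are hypothetical.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Uniform scale and symmetric padding (left, right, top, bottom) to dst x dst."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = dst - new_w, dst - new_h
    padding = (pad_x // 2, pad_x - pad_x // 2, pad_y // 2, pad_y - pad_y // 2)
    return scale, (new_w, new_h), padding


def normalize_pixel(value):
    """Map an 8-bit color value to the [0, 1] interval."""
    return value / 255.0


def input_tensor_shape(batch_size, dst=640):
    """Channel-first shape of the input tensor (layout is an assumption)."""
    return (batch_size, 3, dst, dst)


scale, resized, padding = letterbox_params(1920, 1080)
print("scale:", scale, "resized:", resized, "padding:", padding)
print("tensor shape for a batch of 16:", input_tensor_shape(16))
```

A 1080P frame is scaled by one third to 640*360 and padded by 140 rows at the top and bottom, so a worker that is 30 pixels tall in the original frame is only about 10 pixels tall at the network input.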

[0112] Step 103: Using the ByteTrack tracking algorithm, which incorporates a target-scale-based NWD-IoU dynamic adaptive fusion metric mechanism, the set of worker detection boxes carrying confidence scores and the set of historical worker motion trajectories in the video frame are correlated to obtain the set of worker motion trajectories carrying labels in the video frame.

[0113] In some embodiments, the ByteTrack tracking algorithm is a high-performance multi-target tracking algorithm. Its core innovation lies in a simple yet effective data association strategy called BYTE. Unlike previous methods that only utilize high-confidence detection boxes, ByteTrack retains all detection boxes and divides them into two sets—high-scoring and low-scoring—for a two-stage matching process. First, the algorithm matches high-scoring boxes with existing trajectories to update clear targets. Then, for trajectories that fail to match high-scoring boxes, the algorithm allows a second matching with low-scoring boxes. Since low-scoring boxes typically correspond to occluded targets, this secondary matching mechanism effectively recovers the trajectories of occluded targets, significantly reducing missed detections and ID switching problems caused by the temporary disappearance of targets. While maintaining real-time operating speed, the ByteTrack tracking algorithm achieved then-leading tracking accuracy.
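
The two-stage BYTE association can be sketched as below. This is a simplified illustration: it uses greedy IoU matching, whereas the published ByteTrack solves a linear assignment problem, and it leaves out Kalman prediction and track creation or deletion. The thresholds, function names, and toy boxes are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou):
    """Greedy matching by descending IoU (simplification of linear assignment)."""
    pairs = sorted(((iou(t, d), ti, di)
                    for ti, t in tracks.items()
                    for di, d in detections.items()), reverse=True)
    matches, used_tracks, used_dets = {}, set(), set()
    for score, ti, di in pairs:
        if score < min_iou:
            break
        if ti in used_tracks or di in used_dets:
            continue
        matches[ti] = di
        used_tracks.add(ti)
        used_dets.add(di)
    return matches


def byte_associate(tracks, detections, high_thresh=0.5, min_iou=0.3):
    """tracks: {track_id: box}; detections: list of (box, score)."""
    high = {i: box for i, (box, s) in enumerate(detections) if s >= high_thresh}
    low = {i: box for i, (box, s) in enumerate(detections) if s < high_thresh}
    first = greedy_match(tracks, high, min_iou)              # stage 1: high scores
    remaining = {t: b for t, b in tracks.items() if t not in first}
    second = greedy_match(remaining, low, min_iou)           # stage 2: low scores
    return {**first, **second}


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [((1, 1, 11, 11), 0.9), ((21, 21, 31, 31), 0.3)]  # second one is occluded
print(byte_associate(tracks, detections))
```

Track 2 would be lost if only the high-score detection were used; the second stage recovers it from the low-score box.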

[0114] In some embodiments, the NWD-IoU dynamic adaptive fusion metric mechanism based on the target scale is as follows:

[0115] $S(d, p) = \lambda \cdot \mathrm{NWD}(d, p) + (1 - \lambda) \cdot \mathrm{IoU}(d, p)$  Formula (6);

[0116] $\mathrm{NWD}(d, p) = \exp\left(-\frac{\sqrt{W_2^2(d, p)}}{C}\right)$  Formula (7);

[0117] $W_2^2(d, p) = (cx_d - cx_p)^2 + (cy_d - cy_p)^2 + \frac{(w_d - w_p)^2 + (h_d - h_p)^2}{4}$  Formula (8);

[0118] Formula (9);

[0119] Formula (10);

[0120] where $S(d, p)$ is the similarity between the worker detection box $d$ of the video frame and the predicted detection box $p$ of the video frame; the predicted detection box $p$ is predicted by Kalman filtering from the set of historical worker motion trajectories; $\lambda$ is the adaptive fusion penalty weight; $\mathrm{NWD}(d, p)$ is the normalized Wasserstein distance between $d$ and $p$; $\exp(\cdot)$ is the exponential function; $\mathrm{IoU}(d, p)$ is the ratio of the intersection area to the union area of $d$ and $p$; $W_2^2(d, p)$ is the Wasserstein distance between $d$ and $p$; $C$ is the smooth scaling constant; $(cx_d, cy_d)$ and $(w_d, h_d)$ are the center coordinates and the absolute width and height of $d$, respectively; $(cx_p, cy_p)$ and $(w_p, h_p)$ are the center coordinates and the absolute width and height of $p$, respectively; $A_{\min}$ and $A_{\max}$ are the preset minimum bounding box area and the preset maximum bounding box area, respectively; and $A_d$ and $A_p$ are the area of the worker detection box $d$ and the area of the predicted detection box $p$, respectively.

[0121] In some embodiments, after the front-end improved YOLO11 completes pixel-level processing of video frames and outputs a series of two-dimensional detection bounding boxes carrying confidence scores, these discrete spatial data are transmitted to the association layer of the Multiple Object Tracking (MOT) system for highly robust trajectory association, i.e., the corresponding step 103 is executed.

[0122] To overcome the limitation of overlap-based metrics, under which zero overlap between small targets means a lost association, this application abandons the traditional discrete measurement based on set intersection, turns to the statistical theory of probability distributions, and introduces the normalized Wasserstein distance (NWD) into the construction of the cost matrix.

[0123] It is important to note that the core theoretical insight of the NWD mechanism is that the projection of a target onto an image should not be viewed mechanically as a rigid discrete rectangle in which all pixels have equal weight, but should instead be modeled as a continuous two-dimensional probability density distribution. This abstraction of geometric continuity lays the mathematical foundation for tolerating minute boundary blurring and coordinate jitter. Here, given a tiny detection box $b = (cx, cy, w, h)$ of a video frame defined by its center coordinates $(cx, cy)$ and its absolute width and height $(w, h)$, its inscribed ellipse can be taken and mapped to a two-dimensional Gaussian distribution $\mathcal{N}(\mu, \Sigma)$; that is, the bounding box is modeled by a two-dimensional Gaussian distribution. In this probability model, the mean vector $\mu$ determines the physical center of gravity of the target and is taken directly as the center coordinates of the detection box:

[0124] $\mu = [cx, cy]^{T}$  Formula (11);

[0125] Correspondingly, the covariance matrix $\Sigma$ determines the scale and spread of the Gaussian distribution in space. Since the target bounding box is usually axis-aligned, the x-axis and y-axis are assumed to be independent with no covariance, so the covariance matrix $\Sigma$ is diagonal:

[0126] $\Sigma = \begin{bmatrix} w^2/4 & 0 \\ 0 & h^2/4 \end{bmatrix}$  Formula (12);

[0127] Through this profound mathematical abstraction, the contribution weight of each pixel within the bounding box to the target's identity representation is differentiated: pixels in the central region (typically the brightest top feature of a worker's hard hat) are assigned the highest attention weight in the probability distribution, while pixels near the edge of the bounding box (often mixed with a large amount of architectural background or ghosting of limb movement) have their weights smoothly decay outwards according to a Gaussian curve. This greatly aligns with the focusing mechanism of the human visual system when viewing small targets from above.
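
The Gaussian modeling above leads directly to a closed-form distance. The sketch below maps a box to a mean and a diagonal covariance, assuming the covariance diag(w^2/4, h^2/4) that follows from the inscribed ellipse in the usual NWD formulation, and checks that the second-order Wasserstein distance between two such Gaussians equals a simple expression in the box parameters. The function names and sample boxes are hypothetical.

```python
import math


def box_to_gaussian(cx, cy, w, h):
    """Model a box as a 2D Gaussian: mean = center, covariance = diag(w^2/4, h^2/4)."""
    return (cx, cy), (w * w / 4.0, h * h / 4.0)


def wasserstein2_sq_gaussians(g1, g2):
    """Squared 2-Wasserstein distance between two Gaussians with diagonal covariance."""
    (m1, c1), (m2, c2) = g1, g2
    mean_term = sum((a - b) ** 2 for a, b in zip(m1, m2))
    # For diagonal covariances this is the squared Frobenius norm of the
    # difference between the matrix square roots.
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(c1, c2))
    return mean_term + cov_term


def wasserstein2_sq_boxes(b1, b2):
    """Simplified form written directly in box parameters (cx, cy, w, h)."""
    (cx1, cy1, w1, h1), (cx2, cy2, w2, h2) = b1, b2
    return (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2 + ((w1 - w2) ** 2 + (h1 - h2) ** 2) / 4.0


box_a = (100.0, 100.0, 6.0, 8.0)
box_b = (103.0, 104.0, 4.0, 10.0)
via_gaussians = wasserstein2_sq_gaussians(box_to_gaussian(*box_a), box_to_gaussian(*box_b))
via_boxes = wasserstein2_sq_boxes(box_a, box_b)
print(via_gaussians, via_boxes)
```

Because the covariances are diagonal, no matrix square roots or decompositions are needed, which is why the simplified form is cheap enough for edge deployment.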

[0128] Correspondingly, after all predicted boxes and worker detection boxes in the video frame have been modeled as two-dimensional Gaussian distributions, evaluating the physical similarity between two boxes naturally becomes calculating the distance between the two probability distributions. To avoid the computational overhead of complex matrix operations, this application directly uses the reduced and simplified second-order Wasserstein distance as the algebraic expression actually executed at the bottom layer of the system, as shown in formula (8) above, where:

[0129] Center-point spatial offset term: this term quantifies the squared Euclidean distance between the center points of the two bounding boxes in the two-dimensional plane. Its core physical significance lies in providing position tolerance. Even if a small target under a high-angle overhead view moves rapidly, so that the predicted box and the detected box separate completely (the intersection area is exactly 0), this term still outputs a stable, continuous, non-zero distance value as long as the centers of the two boxes remain close in space, thereby avoiding the association break that Intersection over Union (IoU) suffers when the boxes do not intersect.

[0130] Width and height difference term: this term quantifies the scale change caused by perspective changes during target movement. Its core physical significance lies in providing deformation tolerance. In complex construction scenarios, a worker bending or turning, or even slight camera shake, can drastically change the aspect ratio of a target only a few pixels in size. This term penalizes differences in width and height, allowing a certain degree of pose deformation while effectively filtering out background noise with an excessively large size difference, such as a mismatch between a tiny worker in the distance and a large-scale distractor in the foreground.

[0131] Here, the result calculated by formula (8) Essentially, it is an absolute distance value representing physical differences, which needs to be subjected to a nonlinear transformation as shown in formula (7); where, This is a crucial smooth scaling constant, typically set to a value related to the absolute average size and height of all objects in the current dataset. The exponential function in Equation (7) guarantees that when the distance between two bounding boxes... When it approaches 0 (completely coincident), The response value approaches 1 infinitely; however, as the spatial separation distance gradually increases, The rating does not drop to 0 immediately like IoU, but rather shows a gradual and smooth decline.

[0132] It should be noted that the smooth scaling constant is the core parameter for mapping the Wasserstein distance to [0, 1]. In the construction site scenario example, its empirical value is 12.5 (this value is highly correlated with the average absolute pixel area of small targets in the current image).
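
The two-step computation described above (a Wasserstein-type distance, then an exponential mapping into [0, 1]) can be sketched as follows. This is a minimal illustration rather than the formulas of this application: the box layout (cx, cy, w, h), the function names, the factor of 1/4 on the width and height term, and the square root inside the exponential are assumptions taken from the commonly used NWD formulation. Only the scaling constant 12.5 comes from the text.

```python
import math


def wasserstein_sq(box_a, box_b):
    """Squared second-order Wasserstein distance between two boxes modeled as 2D Gaussians.

    Boxes are (cx, cy, w, h). The 1/4 factor on the size term is an assumption
    borrowed from the common NWD formulation.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    center_term = (cxa - cxb) ** 2 + (cya - cyb) ** 2      # center point spatial offset
    size_term = ((wa - wb) ** 2 + (ha - hb) ** 2) / 4.0    # width and height difference
    return center_term + size_term


def nwd(box_a, box_b, c=12.5):
    """Map the distance into (0, 1]; c is the smooth scaling constant (12.5 in the text)."""
    return math.exp(-math.sqrt(wasserstein_sq(box_a, box_b)) / c)


# Two 4x6 pixel boxes whose centers are 6 pixels apart: they do not overlap,
# so their IoU is 0, yet the NWD score stays well above 0.
score = nwd((10.0, 10.0, 4.0, 6.0), (16.0, 10.0, 4.0, 6.0))
```

Under these assumptions, identical boxes score exactly 1, and the score decays smoothly as the boxes move apart instead of collapsing to 0 once they stop overlapping.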

[0133] In this application, although NWD has a clear advantage in associating small targets, industrial sites are not populated by workers alone; large cranes or concrete trucks may also be present. For large-scale targets, traditional IoU aligns target boundaries more rigorously and reduces drift errors. To give the algorithm generalization capability across scales, this scheme further proposes an NWD-IoU fusion metric cost matrix mechanism, corresponding to formula (6), in which the fusion coefficient is designed as a dynamic, adaptive fusion penalty weight. During actual inference, it adjusts automatically according to the absolute area of the worker detection boxes output by the improved YOLO11. When the target captured by the detector occupies a very small proportion of the pixels (e.g., workers in a high-altitude scene), NWD is assigned an extremely high weighting coefficient and dominates the cost calculation, absorbing the coordinate fluctuations of small targets with high tolerance; when the target area grows (e.g., large nearby machinery), the influence of NWD is dynamically reduced and the weight of IoU rises, ensuring strict boundary alignment of rigid objects.

[0134] Here, considering the computing power constraints of edge deployment in industrial settings, and to avoid high-frequency nonlinear exponential calculations, this solution constructs the dynamic adaptive fusion penalty weight as a low-computational-cost piecewise linear mapping function of the absolute area of the target. The calculation formula is shown in formula (9).

[0135] It should be noted that, regarding the preset minimum and maximum bounding box areas in formula (9):

[0136] 1. The preset minimum bounding box area is the micro-scale threshold. When the worker is at the edge of the image, extremely far from the lens, or severely occluded, the effective pixels are very few (the bounding box area falls below this threshold), and the fusion penalty weight is forced to 1. The similarity is then taken over entirely by NWD, which uses the continuity of the two-dimensional Gaussian distribution to absorb the coordinate jitter of small targets with high tolerance, solving the failure in which the IoU very easily drops to zero on tiny targets.

[0137] 2. The preset maximum bounding box area is the standard-scale threshold. When the worker is directly below the camera (in the center area), close enough, and fully visible, the bounding box area is relatively large (the bounding box area exceeds this threshold), and the fusion penalty weight is forced to 0. The target then has enough pixels to support it, and the similarity degenerates into a pure IoU measurement, ensuring strict boundary alignment of clear worker outlines and reducing the positional drift caused by excessive smoothing of the Gaussian distribution.

[0138] 3. In the transition zone between the two thresholds, as the area changes with the worker's position and posture, NWD and IoU are fused by linear weighting, ensuring the continuity of the association metric values and the stability of the algorithm when the same worker moves through the image (from far to near, from the edge to the center).
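
The three cases above can be illustrated with a short sketch of a piecewise linear weight and the resulting fused similarity. The function names, the use of a single `area` argument, and the exact linear form in the transition zone are assumptions made for illustration. The thresholds 144 and 1024 are the values the description gives for a 640*640 input.

```python
def fusion_weight(area, a_min=144.0, a_max=1024.0):
    """Piecewise linear weight on the NWD term (assumed form of the mapping).

    Returns 1 at or below a_min, 0 at or above a_max, and falls linearly in between.
    """
    if area <= a_min:
        return 1.0
    if area >= a_max:
        return 0.0
    return (a_max - area) / (a_max - a_min)


def fused_similarity(nwd_value, iou_value, area, a_min=144.0, a_max=1024.0):
    """Blend NWD and IoU with the area-dependent weight (assumed convex combination)."""
    lam = fusion_weight(area, a_min, a_max)
    return lam * nwd_value + (1.0 - lam) * iou_value


tiny = fused_similarity(0.8, 0.0, area=100.0)      # tiny box: NWD only
large = fused_similarity(0.8, 0.6, area=2000.0)    # large box: IoU only
middle = fused_similarity(0.8, 0.2, area=584.0)    # halfway through the transition zone
```

Because the weight is continuous at both thresholds, the fused score does not jump when a worker's box grows or shrinks across a threshold, which is the continuity property the third case describes.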

[0139] Here, to ensure that this application is universally applicable to monitoring devices with different focal lengths and resolutions, this application does not set the preset minimum and maximum bounding box areas to rigid, absolute fixed values. Instead, it proposes an adaptive dynamic computation strategy based on the input image resolution and the prior statistical distribution of the dataset.

[0140] That is, during the initialization phase, the values of the preset minimum and maximum bounding box areas are tied by a strict proportional mapping to the physical resolution of the high-definition video stream (or image frame) currently input to the network. The calculation formulas are as follows:

[0141] Formula (13);

[0142] Formula (14);

[0143] In the formulas, the two coefficients are relative area proportion coefficients. In actual engineering deployments, the coefficient for the minimum bounding box area preferably ranges from 0.0005 to 0.0015 (the upper limit of the pixel ratio of extremely small targets), and the coefficient for the maximum bounding box area preferably ranges from 0.008 to 0.012 (the lower limit of the pixel ratio of targets at a conventional scale). Through this dynamic decoupling based on global resolution, the metric division thresholds that best match the current field of view can be derived automatically, regardless of whether the front-end camera is a 1080P or a 4K camera.
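
A minimal sketch of the resolution-proportional thresholds follows. It assumes that each threshold is simply its proportion coefficient multiplied by the total pixel count of the input frame, which is one reading of the strict proportional mapping described above. The function name and the default coefficients (0.001 and 0.01, picked from inside the preferred ranges) are hypothetical.

```python
def area_thresholds(width, height, alpha=0.001, beta=0.01):
    """Derive (minimum area, maximum area) from the frame resolution.

    Assumption: each threshold equals its coefficient times width * height.
    alpha and beta are hypothetical picks from the preferred ranges in the text.
    """
    pixels = width * height
    return alpha * pixels, beta * pixels


a_min_1080p, a_max_1080p = area_thresholds(1920, 1080)
a_min_4k, a_max_4k = area_thresholds(3840, 2160)
```

Under this assumption a 4K frame, which has four times the pixels of a 1080P frame, receives thresholds four times as large, so the same worker at the same relative size falls into the same regime on either camera.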

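[0144] In practical applications, based on an input scale of 640×640, the upper limit for extremely small targets (the preset minimum bounding box area) is set to 144, and the lower limit for conventional-scale targets (the preset maximum bounding box area) is set to 1024.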

[0145] In some embodiments, when the video frame is the first frame in the video frame sequence, the set of historical worker movement trajectories is generated using temporal physical dynamics.

[0146] If the video frame is not the first frame in the video frame sequence, the set of worker movement trajectories in the previous video frame associated with the video frame in the video frame sequence is determined as the historical worker movement trajectory set.

[0147] In this application, the overall execution sequence of the ByteTrack tracking algorithm with the target-scale-based NWD-IoU dynamic adaptive fusion metric mechanism (an improved version of ByteTrack) is organized as a highly robust temporal state machine. Its Kalman filter state prediction stage is as follows:

[0148] Once a new worker target first appears in the frame and is stably detected, an independent lifecycle is initialized for it, and an 8-dimensional Kalman temporal state vector is constructed:

[0149] Formula (15);

[0150] In the formula, the first two components are the two-dimensional spatial coordinates of the center point of the worker's bounding box; the third component is the aspect ratio of the bounding box (width relative to height); the fourth component is the absolute height of the bounding box. The four components that follow are the first-order velocity components in the pixel plane that correspond to these four geometric variables.

[0151] Before each new video frame arrives in the streaming media, the Kalman filter, based on a constant velocity motion model of the physical world, performs a predictive step of the state equation to anticipate the coordinate range in which the target might appear in the current frame:

[0152] Formula (16);

[0153] Formula (17);

[0154] In the above equations, the prior state estimate is the target state at the current time step predicted from the information of the previous time step; the posterior state estimate of the previous time step is the output of the previous round of Kalman filtering; the system state transition matrix maps the latter to the former. The prior error covariance matrix of the prediction stage characterizes the uncertainty of the predicted state; the posterior error covariance matrix of the previous time step reflects the confidence of the previous estimate; the transpose of the state transition matrix also enters the covariance prediction; and the process noise covariance matrix represents the sources of uncertainty in the system's dynamic model. Here, the state transition matrix introduces the inter-frame time difference to establish a linear differential mapping between the kinematic variables:

[0155] Formula (18).
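
The prediction step described above can be sketched in plain Python as a standard constant-velocity Kalman prediction. The state ordering (center x, center y, aspect ratio, height, then the four velocities), the block structure of the transition matrix, and all function names are assumptions made for illustration. The noise values in the example are arbitrary.

```python
def transition_matrix(dt):
    """8x8 constant-velocity transition matrix: each geometric variable gains dt * velocity."""
    f = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
    for i in range(4):
        f[i][i + 4] = dt
    return f


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def predict(x, p, q, dt=1.0):
    """Prior state = F x; prior covariance = F P F^T + Q."""
    f = transition_matrix(dt)
    x_prior = [sum(f[i][k] * x[k] for k in range(8)) for i in range(8)]
    fpft = matmul(matmul(f, p), transpose(f))
    p_prior = [[fpft[i][j] + q[i][j] for j in range(8)] for i in range(8)]
    return x_prior, p_prior


identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
noise = [[0.01 if i == j else 0.0 for j in range(8)] for i in range(8)]
state = [100.0, 50.0, 0.5, 20.0, 2.0, -1.0, 0.0, 0.5]
x_prior, p_prior = predict(state, identity, noise, dt=1.0)
```

In this example the predicted center moves from (100, 50) to (102, 49) and the height grows from 20 to 20.5, while the position variance increases because the velocity uncertainty and the process noise are both folded into it.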

[0156] In some embodiments, in step 103 above, the data association logic executed using the ByteTrack tracking algorithm, which incorporates a target-scale-based NWD-IoU dynamic adaptive fusion metric mechanism, is shown in Figure 3. Apart from the target-scale-based NWD-IoU dynamic adaptive fusion metric mechanism, the specific implementations are all existing technologies:

[0157] 1. Input and Classification of Detection Boxes: Receive the set of all worker detection boxes 401 output by the improved YOLO11 network for the current frame, and set a high-score confidence threshold (0.6) and a low-score confidence threshold (0.1). Worker detection boxes with a confidence score ≥ 0.6 are assigned to the high-score worker detection box set 402 (representing workers with complete features and no obvious occlusion), while worker detection boxes with a confidence score between 0.1 and 0.6 are assigned to the low-score worker detection box set 403 (representing workers with blurred edges or heavily occluded by building materials).

[0158] 2. Trajectory State Prediction: A Kalman filter is used to predict the state of the historical worker motion trajectories (404) from the previous video frame. Based on the target's motion model, the center point position and bounding box size of the existing trajectory in the current video frame are predicted, thus obtaining the predicted trajectory set (405) for the current video frame. This step provides spatial prior information for subsequent association matching.

[0159] 3. First Association 406: Extract the center point coordinates and the width and height of all boxes in the high-score worker detection box set 402 and in the predicted trajectory set, and map them to continuous two-dimensional Gaussian distributions. Calculate the second-order Wasserstein distance in the multidimensional feature space and introduce the adaptive scaling constant to convert it into the NWD. Combine the NWD with the traditional IoU using the area-based dynamic adaptive weight to generate the first-round cost matrix, which is then solved for the optimal assignment using the Hungarian algorithm. Successfully matched trajectories have their states updated, unmatched high-score worker detection boxes 402 are added to the remaining detection box set, and unmatched trajectories are added to the remaining trajectory set.

[0160] 4. Second Association 407: A second round of matching is performed between the low-score worker detection boxes 403 and the remaining trajectory set from the first association. Because the boxes in the low-score worker detection box set 403 usually correspond to targets that are small, severely occluded, or motion-blurred, and traditional IoU is extremely sensitive to slight displacements of small targets (it easily becomes 0), matching on IoU alone is ineffective. This step therefore primarily uses the NWD-IoU fusion association metric: the detection boxes and the predicted trajectory boxes are modeled as two-dimensional Gaussian distributions, NWD provides a smooth and effective distance even when the boxes do not overlap, and it is combined with the bounding box overlap information (IoU) into a comprehensive similarity. If a match succeeds, the low-score box is taken to belong to a real target that is occluded or small, and the corresponding trajectory state is updated to recover the target; if the low-score box still fails to match, it is treated as background noise and deleted.

[0161] 5. Trajectory Management and Lifecycle Maintenance: High-score detection boxes that were not matched in the first round are initialized as new trajectories if they meet the initialization conditions. Remaining trajectories that failed to match in both rounds of association are marked as lost. To handle long-term occlusion, the algorithm retains these lost trajectories for a certain period (e.g., 30 frames). If a trajectory is re-associated within 30 frames (especially if recovered via NWD in the second round of low-score box mining), its tracking state is restored; if it remains unmatched after 30 frames, it is permanently deleted from the trajectory set.
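
The confidence split and the two association rounds in steps 1 to 4 can be sketched as follows. Several parts are stand-ins chosen to keep the example short: `toy_similarity` scores boxes by center distance only and replaces the fused NWD-IoU similarity, `best_assignment` searches all assignments exhaustively instead of running the Hungarian algorithm (workable only for a handful of boxes), and the acceptance threshold `min_sim=0.3` is a hypothetical value. The 0.6 and 0.1 confidence thresholds come from the text.

```python
import math
from itertools import permutations


def toy_similarity(box_a, box_b, c=12.5):
    """Placeholder score from center distance only; stands in for the fused NWD-IoU similarity."""
    return math.exp(-math.hypot(box_a[0] - box_b[0], box_a[1] - box_b[1]) / c)


def best_assignment(sim, min_sim=0.3):
    """Exhaustive optimal assignment over a tracks x detections similarity matrix.

    A small stand-in for the Hungarian algorithm. Pairs below min_sim stay unmatched.
    """
    n_t = len(sim)
    n_d = len(sim[0]) if sim else 0
    slots = list(range(n_d)) + [None] * n_t  # None means the track stays unmatched
    best_total, best_pairs = -1.0, []
    for perm in permutations(slots, n_t):
        pairs = [(t, d) for t, d in enumerate(perm)
                 if d is not None and sim[t][d] >= min_sim]
        total = sum(sim[t][d] for t, d in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_pairs


def associate(tracks, detections, similarity, high=0.6, low=0.1, min_sim=0.3):
    """Two-round association: high-score boxes first, then low-score boxes."""
    high_dets = [d for d in detections if d["score"] >= high]
    low_dets = [d for d in detections if low <= d["score"] < high]

    sim1 = [[similarity(t["box"], d["box"]) for d in high_dets] for t in tracks]
    pairs1 = best_assignment(sim1, min_sim)
    matched = [(tracks[t], high_dets[d]) for t, d in pairs1]
    rest_tracks = [t for i, t in enumerate(tracks) if i not in {p[0] for p in pairs1}]
    new_candidates = [d for i, d in enumerate(high_dets) if i not in {p[1] for p in pairs1}]

    sim2 = [[similarity(t["box"], d["box"]) for d in low_dets] for t in rest_tracks]
    pairs2 = best_assignment(sim2, min_sim)
    matched += [(rest_tracks[t], low_dets[d]) for t, d in pairs2]
    lost = [t for i, t in enumerate(rest_tracks) if i not in {p[0] for p in pairs2}]
    return matched, new_candidates, lost  # unmatched low-score boxes are discarded
```

In use, `matched` pairs update their trajectories, `new_candidates` are the unmatched high-score boxes that may start new trajectories, and `lost` holds the trajectories that failed both rounds.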

[0162] The retention window is set to 30 frames because, at a video frame rate of 30 FPS, 30 frames correspond to 1 second of physical time. The typical occlusion time for workers passing under scaffolding is usually less than 1 second, so this window ensures effective retrieval while avoiding invalid accumulation in the memory pool.
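
The retention window can be illustrated with a small lifecycle helper. The dictionary fields (`id`, `lost_frames`, `state`) and the function name are hypothetical. The sketch assumes a trajectory is dropped once it has gone unmatched for more than 30 consecutive frames.

```python
MAX_LOST_FRAMES = 30  # about 1 second at 30 FPS, per the text


def update_lifecycle(tracks, matched_ids, max_lost=MAX_LOST_FRAMES):
    """Reset matched tracks, age unmatched ones, and drop those past the window."""
    kept = []
    for track in tracks:
        if track["id"] in matched_ids:
            track["lost_frames"] = 0
            track["state"] = "tracked"
        else:
            track["lost_frames"] += 1
            track["state"] = "lost"
        if track["lost_frames"] <= max_lost:
            kept.append(track)
    return kept


# A worker who disappears behind scaffolding for 30 frames is still retained.
pool = [{"id": 7, "lost_frames": 0, "state": "tracked"}]
for _ in range(30):
    pool = update_lifecycle(pool, matched_ids=set())
```

After the loop the trajectory is still in the pool in the lost state; a match on the next frame restores it, and one more unmatched frame removes it.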

[0163] Step 104: Concatenate the worker motion trajectory sets carrying tags in all video frames of the video frame sequence in chronological order to obtain the complete motion trajectory of each worker in the video frame sequence.

[0164] Here, steps 101 to 103 are performed on each video frame in the video frame sequence. After the processing cycle for each video frame in the video frame sequence is completely finished, the coordinates of the detection boxes of all active workers in that video frame, along with their unique identifiers (IDs), are overlaid and rendered on the screen of that video frame sequence. Correspondingly, the sets of worker motion trajectories carrying tags in all video frames in the video frame sequence are spliced together in time sequence, and a highly structured JSON or CSV trajectory sequence log is generated, recording the physical coordinates of the center point and kinematic features of each ID (worker) as it evolves over time, thus obtaining the complete motion trajectory of each worker in the video frame sequence.
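
The splicing and logging step can be sketched with the Python standard library. The input layout (a list of frame index and per-track tuples), the field names, and the function names are hypothetical, and the sketch records only the center coordinates rather than the full set of kinematic features.

```python
import csv
import io
import json
from collections import defaultdict


def stitch_trajectories(frames):
    """Group per-frame (track_id, cx, cy) records into one time-ordered list per ID."""
    trajectories = defaultdict(list)
    for frame_index, tracks in sorted(frames, key=lambda f: f[0]):
        for track_id, cx, cy in tracks:
            trajectories[track_id].append({"frame": frame_index, "cx": cx, "cy": cy})
    return dict(trajectories)


def to_json(trajectories):
    """Serialize the trajectories; integer IDs become string keys in JSON."""
    return json.dumps(trajectories, sort_keys=True)


def to_csv(trajectories):
    """Flatten the trajectories into one CSV row per (ID, frame)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["track_id", "frame", "cx", "cy"])
    for track_id in sorted(trajectories):
        for point in trajectories[track_id]:
            writer.writerow([track_id, point["frame"], point["cx"], point["cy"]])
    return buffer.getvalue()


# Frames may arrive out of order; stitching sorts them by frame index.
frames = [(1, [(1, 12.0, 20.0)]), (0, [(1, 10.0, 20.0), (2, 50.0, 60.0)])]
trajectories = stitch_trajectories(frames)
```

Each ID then maps to its own chronologically ordered list of points, which either serializer can write out as the trajectory log.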

[0165] The object detection and tracking method for overhead scenes provided in this application uses an improved YOLO11 to perform object detection on each acquired video frame. Specifically, the feature fusion stage of the improved YOLO11 abandons the crude channel concatenation and linear addition of the traditional YOLO structure and replaces them entirely with the HD-Star module. The HD-Star module consists of a star-shaped basic fusion branch unit, a dual-path dynamic cueing local-context-aware branch unit, and an end reorganization unit. It combines the nonlinear feature multiplication effect of the star operation (Star Operation) with a dynamic conditional cueing mechanism driven by global context (Dynamic Prompt). Together with local-global dual-granularity perception, it suppresses complex construction site background clutter (such as reflective building materials and shadow interference) in the underlying feature space, so that the edge features of extremely small workers can be extracted accurately even under the limited perspective of an extremely high shooting angle, improving the target detection accuracy for each video frame. Then, in the cross-frame spatiotemporal association stage, the method removes the limitation of traditional ByteTrack, which relies solely on the rigid intersection over union for spatial alignment, and introduces a ByteTrack tracking algorithm with a target-scale-based NWD-IoU dynamic adaptive fusion metric mechanism. This mechanism abstractly models the bounding boxes of tiny and highly deformable workers as continuous two-dimensional Gaussian distributions. Even if the predicted box and the detection box completely separate in space due to slight camera shake or sudden changes in worker posture (causing the IoU to drop instantly to zero), the mechanism can still use the second-order optimal transport distance between the Gaussian distributions to output a smooth and continuous similarity measure. This application thus proposes an end-to-end multi-worker tracking method based on the tracking-by-detection paradigm, with a system-level reconstruction of feature perception, background suppression, and temporal association, which improves the detection accuracy of worker movement trajectories in construction sites under overhead views.

[0166] It should be noted that, in the embodiments of this application, if the above-described object detection and tracking method for overhead scenes is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiments of this application, or the part that contributes to the related technology, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause an electronic device to execute all or part of the methods described in the various embodiments of this application.

[0167] It should be understood that the phrase "one embodiment" or "an embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "in one embodiment" or "in an embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. It should be understood that in the various embodiments of this application, the sequence numbers of the above-described processes do not imply a sequential order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application. The sequence numbers of the above-described embodiments are merely descriptive and do not represent the superiority or inferiority of the embodiments.

[0168] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0169] In the several embodiments provided in this application, it should be understood that the disclosed methods can be implemented in other ways.

[0170] The methods disclosed in the several method embodiments provided in this application can be arbitrarily combined without conflict to obtain new method embodiments.

[0171] The features disclosed in the methods provided in this application can be arbitrarily combined without conflict to obtain new method embodiments.

[0172] The above description is merely an embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A method for object detection and tracking in overhead scenes, characterized in that the method includes: obtaining a video frame sequence of the construction site under test from an overhead view; inputting each video frame in the video frame sequence into the improved YOLO11 for target detection to obtain a set of worker detection boxes carrying confidence scores in the video frame, wherein the improved YOLO11 is a model that reconstructs the neck network of YOLO11 using a hierarchical dynamic star (HD-Star) module, and the HD-Star module is composed of a star-shaped basic fusion branch unit, a dual-path dynamic cueing local-context-aware branch unit, and an end reorganization unit; using the ByteTrack tracking algorithm, which incorporates a target-scale-based NWD-IoU dynamic adaptive fusion metric mechanism, to associate the set of worker detection boxes carrying confidence scores with the set of historical worker motion trajectories in the video frame to obtain the set of worker motion trajectories carrying labels in the video frame; and stitching together, in chronological order, the sets of worker motion trajectories carrying labels in all video frames of the video frame sequence to obtain the complete motion trajectory of each worker in the video frame sequence.

2. The method according to claim 1, characterized in that, The improved YOLO11 includes: a backbone network, a neck network with four HD-Star modules, and a detection head; the step of inputting each video frame in the video frame sequence into the improved YOLO11 for target detection, and obtaining a set of worker detection boxes carrying confidence scores in the video frame, includes: For each video frame in the video frame sequence, the backbone network is used to perform multi-layer convolution and downsampling operations on the video frame in sequence to obtain feature maps of different scales of the video frame. Using the neck network with four HD-Star modules, feature maps of different scales of the video frame are fused and enhanced to obtain the enhanced multi-scale features of the video frame. The enhanced multi-scale features of the video frame are input into the detection head to obtain a set of worker detection boxes carrying confidence scores in the video frame.

3. The method according to claim 2, characterized in that, The feature maps of the video frame at different scales include: shallow feature maps, mid-level feature maps, and deep feature maps with progressively decreasing resolution; the enhanced multi-scale features of the video frame include: first-scale features, second-scale features, and third-scale features with progressively decreasing resolution; the neck network with four HD-Star modules is used to fuse and enhance the feature maps of the video frame at different scales to obtain the enhanced multi-scale features of the video frame, including: The deep feature map is upsampled using the first upsampling module in the neck network to obtain a first upsampled map. The first HD-Star module is then used to fuse the first upsampled map with the middle feature map across layers to obtain a first intermediate feature map. The first C3k2 module is used to extract features from the first intermediate feature map to obtain mid-level features. The second upsampling module is used to upsample the mid-level features to obtain a second intermediate feature map. The second HD-Star module is used to fuse the second intermediate feature map and the shallow feature map across layers to obtain a third intermediate feature map. The second C3k2 module is used to extract features from the third intermediate feature map to obtain the first scale features. The first convolution module is used to downsample the first scale feature to obtain the first downsampled feature. The third HD-Star module is used to fuse the first downsampled feature with the middle layer feature to obtain the fourth intermediate feature map. The third C3k2 module is used to extract features from the fourth intermediate feature map to obtain the second scale feature. The second convolution module is used to downsample the second scale feature to obtain the second downsampled feature. 
The fourth HD-Star module is used to fuse the second downsampled feature with the deep feature map across layers to obtain the fifth intermediate feature map. The fourth C3k2 module is used to extract features from the fifth intermediate feature map to obtain the third scale feature.

4. The method according to claim 3, characterized in that, The step of using the first HD-Star module to perform cross-layer fusion of the first upsampled image and the intermediate feature map to obtain the first intermediate feature map includes: Using the star-shaped basic fusion branch unit of the first HD-Star module, the first upsampled image after convolution and the middle layer feature map after convolution are first multiplied element-wise to obtain fused features, and then the fused features are convolved and mapped to obtain basic fused features. Using the dual-path dynamic prompting local-context-aware branch unit of the first HD-Star module, feature extraction is first performed on the first upsampled image after convolution and the middle feature map after convolution under two different granularity perception windows to obtain the dual-granularity intermediate features of the first upsampled image and the dual-granularity intermediate features of the middle feature map. Then, using dynamic channel weights, adaptive recalibration is performed on the dual-granularity intermediate features of the first upsampled image and the dual-granularity intermediate features of the middle feature map to obtain the dual-granularity attention features of the first upsampled image and the dual-granularity attention features of the middle feature map. Using the end reshaping unit of the first HD-Star module, the basic fusion features, the dual-granularity attention features of the first upsampled map, and the dual-granularity attention features of the middle layer feature map are sequentially spliced, convolutionally compressed, and RepConv reshaped in the channel dimension to obtain the first intermediate feature map.

5. The method according to claim 4, characterized in that, The dual-path dynamic cueing local-context-aware branch unit includes: two attention branches; the dynamic channel weights include: a first weight of the first upsampled image and a second weight of the middle layer feature map; Before using dynamic channel weights to adaptively recalibrate the dual-granularity intermediate features of the first upsampled image and the dual-granularity intermediate features of the middle-layer feature map to obtain the dual-granularity attention features of the first upsampled image and the dual-granularity attention features of the middle-layer feature map, the process includes: Using the global average pooling layer in the two attention branches, environmental context information is extracted from the first upsampled image and the middle layer feature map respectively, to obtain the context features of the first upsampled image and the context features of the middle layer feature map; A lightweight multilayer perceptron and activation function in two attention branches are used to perform linear mapping and nonlinear activation on the context features of the first upsampled map and the context features of the middle layer feature map respectively, to obtain the first weight and the second weight. The step of adaptively recalibrating the dual-granularity intermediate features of the first upsampled image and the dual-granularity intermediate features of the middle-layer feature map using dynamic channel weights to obtain the dual-granularity attention features of the first upsampled image and the dual-granularity attention features of the middle-layer feature map includes: Using the first weight, the dual-granularity intermediate features of the first upsampled image are adaptively recalibrated to obtain the dual-granularity attention features of the first upsampled image. 
Then, using the second weight, the dual-granularity intermediate features of the middle-layer feature map are adaptively recalibrated to obtain the dual-granularity attention features of the middle-layer feature map.

6. The method according to any one of claims 1 to 5, characterized in that the construction process of the improved YOLO11 includes: obtaining the initial detection model built on YOLO11, and reconstructing the neck network of the initial detection model based on the HD-Star module to obtain the detection model to be trained; acquiring multiple videos of construction workers from an existing construction site, taken from an overhead view and under different lighting conditions; and using the image library corresponding to the multiple videos of construction workers to train the detection model to be trained based on a joint loss to obtain the improved YOLO11, wherein the joint loss includes: binary cross-entropy loss, complete intersection over union (CIoU) loss, and distribution focal loss (DFL).

7. The method according to claim 1, characterized in that, in the target-scale-based NWD-IoU dynamic adaptive fusion metric mechanism: the similarity is computed between the worker detection box of the video frame and the predicted detection box of the video frame, the predicted detection box being predicted using Kalman filtering based on the historical worker movement trajectory set; the similarity combines, through the adaptive fusion penalty weight, the normalized Wasserstein distance between the worker detection box and the predicted detection box with the intersection over union of the areas of the worker detection box and the predicted detection box; the normalized Wasserstein distance is obtained by applying an exponential function to the Wasserstein distance between the worker detection box and the predicted detection box, scaled by the smooth scaling constant; the Wasserstein distance is computed from the center coordinates and absolute width and height of the worker detection box and the center coordinates and absolute width and height of the predicted detection box; and the adaptive fusion penalty weight is determined from the preset minimum bounding box area, the preset maximum bounding box area, the area of the worker detection box, and the area of the predicted detection box.

8. The method according to claim 1, characterized in that, when the video frame is the first frame in the video frame sequence, the set of historical worker movement trajectories is generated using temporal physical dynamics; and if the video frame is not the first frame in the video frame sequence, the set of worker movement trajectories in the previous video frame associated with the video frame in the video frame sequence is determined as the historical worker movement trajectory set.