Behavior data labeling processing method and system, and electronic device

A hierarchical pre-labeled model combined with a confidence-driven differential labeling strategy addresses the low efficiency and insufficient accuracy of labeling the behavior data used to train behavior recognition models, achieving efficient and accurate behavior data labeling.

CN122244840A · Pending · Publication Date: 2026-06-19 · ANHUI KAIYANG TECHNOLOGY CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ANHUI KAIYANG TECHNOLOGY CO LTD
Filing Date
2026-03-27
Publication Date
2026-06-19

Figure CN122244840A_ABST

Abstract

This invention provides a method, system, and electronic device for behavioral data annotation and processing, relating to the field of image data processing. The method, through innovative model structure design, can accurately identify small targets and their complex poses, significantly improving the annotation accuracy of behavioral data. In addition, the method can achieve automatic batch annotation and differentiated human-machine annotation of behavioral data by combining annotation processing strategies with confidence results, greatly improving annotation efficiency while ensuring the annotation effect of behavioral data.

Description

Technical Field

[0001] This invention relates to the field of image data processing, and in particular to a method, system, and electronic device for behavioral data annotation and processing.

Background Technology

[0002] Using relevant behavior recognition models to identify and analyze distracted behaviors of drivers, such as smoking, drinking water, and making phone calls, has become an important means to improve vehicle driving safety and enhance vehicle intelligence. The training of behavior recognition models relies on a large amount of labeled behavioral data, and the current mainstream labeling methods are mainly divided into two categories: manual labeling and fully automated labeling.

[0003] Manual annotation can guarantee high accuracy, but it suffers from low efficiency, high labor costs, and susceptibility to subjective factors. When the data volume is large in particular, manual annotation struggles to meet the time requirements of practical applications. Fully automatic annotation outputs annotation results directly through preset algorithms. Although it is more efficient, it is prone to misjudgment in complex scenarios, such as the subtle movements of the hand and cigarette when smoking, the different shapes of the water cup when drinking water, and the changing relative position of the mobile phone and head when making a phone call. This results in insufficient annotation accuracy and degrades the training of subsequent behavior recognition models.

Summary of the Invention

[0004] In view of this, the purpose of this invention is to provide a method, system and electronic device for behavioral data annotation and processing. This method, through innovative model structure design, can accurately identify small targets and their complex postures, significantly improving the annotation accuracy of behavioral data. In addition, this method can achieve automatic batch annotation and differentiated human-machine annotation of behavioral data by combining annotation processing strategies with confidence results, thereby greatly improving annotation efficiency while ensuring the annotation effect of behavioral data.

[0005] In a first aspect, embodiments of the present invention provide a behavioral data annotation and processing method, the method comprising:
collecting posture images of target objects in a vehicle, and constructing an initial sample set and a sample set to be labeled for the target objects based on the behavioral data of the posture images;
constructing a pre-labeled model based on the target objects, and training the pre-labeled model using the initial sample set to obtain a basic training model, wherein the backbone network of the pre-labeled model has a hierarchical structure;
using the basic training model to obtain the pre-labeled samples corresponding to the sample set to be labeled and their confidence results;
determining, based on the confidence results, the annotation processing strategy corresponding to the pre-labeled samples, and using the annotation processing strategy to obtain the labeled dataset corresponding to the behavioral data.

[0006] Optionally, the step of collecting posture images of target objects in the vehicle and constructing an initial sample set and a sample set to be labeled for the target objects based on the behavioral data of the posture images includes:
identifying the target objects contained in the cockpit area of the target vehicle, wherein the target objects include at least a mobile phone, cigarettes, and a water cup;
when the target vehicle is detected traveling on the target route, collecting multiple posture images of the target objects using the camera components deployed in the target vehicle;
determining the behavioral data of the posture images based on the type parameters corresponding to the target objects, randomly selecting a first image set from the posture images according to the behavioral data, and using the posture images to determine a second image set corresponding to the first image set;
labeling the first image set using the type parameters to obtain the initial sample set corresponding to the target objects;
determining the sample set to be labeled corresponding to the target objects based on the second image set.

[0007] Optionally, the step of constructing a pre-labeled model based on the target objects includes:
initializing a target detection model using the type parameters corresponding to the target objects, and obtaining the backbone network, neck network, and head network of the target detection model;
updating the backbone network using multiple sequentially connected convolutional neural network modules, wherein the backbone network produces a first-scale feature map and a second-scale feature map according to preset downsampling parameters; the resolution of the first-scale feature map is greater than that of the second-scale feature map; the first-scale feature map is used to detect the first behavioral data corresponding to a first target object, the second-scale feature map is used to detect the second behavioral data corresponding to a second target object, and the first target object is smaller than the second target object;
constructing the first target detection head and the second target detection head of the target detection model based on the first-scale feature map and the second-scale feature map, respectively, and updating the neck network and the head network through the two detection heads;
constructing the pre-labeled model from the updated backbone network, neck network, and head network, wherein the pre-labeled model determines the behavioral data through the first-scale and second-scale feature maps.

[0008] Optionally, the step of training the pre-labeled model using the initial sample set to obtain a basic training model includes:
determining the first, second, and third convolutional modules corresponding to the first target detection head in the pre-labeled model, wherein the first convolutional module performs feature enhancement on the first-scale feature map, the second convolutional module produces the category prediction result and bounding box prediction result corresponding to the first behavioral data, and the third convolutional module updates the parameter count of the first target detection head;
determining the category classification loss function of the pre-labeled model based on the category prediction result, and the bounding box regression loss function of the pre-labeled model based on the bounding box prediction result;
determining the training parameters of the pre-labeled model from the first, second, and third convolutional modules;
training the pre-labeled model on the initial sample set according to the training parameters, and obtaining the loss value of the pre-labeled model in real time using the category classification loss function and the bounding box regression loss function;
when the loss value satisfies the preset loss threshold relationship, stopping training and taking the current pre-labeled model as the basic training model.

[0009] Optionally, the step of using the basic training model to obtain the pre-labeled samples corresponding to the sample set to be labeled and their confidence results includes:
determining the confidence threshold and intersection-over-union (IoU) threshold for the sample set to be labeled based on the type parameters corresponding to the target objects;
determining the batch inference strategy of the basic training model based on the confidence threshold and the IoU threshold;
inputting the sample set to be labeled into the basic training model, using the batch inference strategy to control the basic training model to output the pre-labeled information corresponding to the target objects, and obtaining the category result, bounding box coordinates, and category confidence of the target objects contained in the pre-labeled information;
determining the pre-labeled samples corresponding to the sample set to be labeled from the category results and bounding box coordinates, and determining the confidence results of the pre-labeled samples from the category confidence.

[0010] Optionally, the step of determining the annotation processing strategy corresponding to the pre-labeled samples based on the confidence results includes:
determining the confidence value corresponding to the confidence result based on the category confidence;
if the confidence value is not less than the first confidence threshold, determining the first annotation processing strategy for the pre-labeled sample based on the bounding box coordinates and the category result;
if the confidence value is greater than the second confidence threshold and less than the first confidence threshold, determining the second annotation processing strategy for the pre-labeled sample based on the category confidence;
if the confidence value is equal to the second confidence threshold, determining the third annotation processing strategy for the pre-labeled sample based on the type parameters corresponding to the target objects.

[0011] Optionally, the step of obtaining the labeled dataset corresponding to the behavioral data using the annotation processing strategy includes:
when the annotation processing strategy is the first annotation processing strategy, determining the bounding box corresponding to the posture image from the bounding box coordinates, and determining the first labeled dataset for the behavioral data under the category result based on the intersection-over-union (IoU) between the bounding box and the target object;
when the annotation processing strategy is the second annotation processing strategy, determining the second labeled dataset based on the pre-labeled samples, and updating the pre-labeled samples into the initial sample set and the sample set to be labeled based on the category confidence;
when the annotation processing strategy is the third annotation processing strategy, randomly selecting, according to the behavioral data, a third image set that has been labeled from the pre-labeled samples, and determining the third labeled dataset based on the third image set.

[0012] Optionally, after the step of obtaining the labeled dataset corresponding to the behavioral data using the annotation processing strategy, the method further includes:
determining the labeled sample set corresponding to the basic training model based on the first and third labeled datasets;
updating the labeled sample set into the initial sample set, and training the pre-labeled model using the updated initial sample set to obtain the iterative training model corresponding to the basic training model.

[0013] In a second aspect, embodiments of the present invention provide a behavioral data annotation and processing system, the system comprising:
a sample set construction module, used to collect posture images of target objects in a vehicle and to construct the initial sample set and the sample set to be labeled for the target objects based on the behavioral data of the posture images;
a basic training model construction module, used to construct a pre-labeled model based on the target objects and to obtain the basic training model by training the pre-labeled model using the initial sample set, wherein the backbone network of the pre-labeled model has a hierarchical structure;
a pre-labeled sample acquisition module, used to obtain the pre-labeled samples corresponding to the sample set to be labeled and their confidence results using the basic training model;
a labeled dataset determination module, used to determine the annotation processing strategy corresponding to the pre-labeled samples based on the confidence results and to obtain the labeled dataset corresponding to the behavioral data using that strategy.

[0014] In a third aspect, embodiments of the present invention also provide an electronic device comprising a processor and a memory. The memory stores computer-executable instructions that can be executed by the processor, and the processor executes the computer-executable instructions to implement the steps of the behavioral data annotation and processing method provided in the first aspect.

[0015] This invention provides a method, system, and electronic device for behavioral data annotation. In the process of annotating behavioral data of a vehicle driver, the method first acquires posture images of targets within the vehicle, and constructs an initial sample set and a sample set to be annotated based on the behavioral data from the posture images. Then, a pre-annotation model is constructed based on the targets, and a basic training model is obtained by training the pre-annotation model using the initial sample set. The backbone network of the pre-annotation model has a hierarchical structure, and a target detection head corresponding to the target is included in the pre-annotation model. The target detection head is used to acquire behavioral data using the target scale feature map output by the backbone network. Subsequently, the pre-annotated samples corresponding to the sample set to be annotated and their corresponding confidence results are obtained using the basic training model. Finally, an annotation processing strategy corresponding to the pre-annotated samples is determined based on the confidence results, and the annotation processing strategy is used to obtain the labeled dataset corresponding to the behavioral data. This method, through innovative model structure design, can accurately identify small targets and their complex postures, significantly improving the annotation accuracy of behavioral data. Furthermore, this method can achieve automatic batch annotation and differentiated human-machine annotation of behavioral data by combining the annotation processing strategy with the confidence results, greatly improving annotation efficiency while ensuring the annotation effect of behavioral data.

[0016] Other features and advantages of the invention will be set forth in the following description, and will be apparent in part from the description, or may be learned by practicing the invention. The objects and other advantages of the invention are realized and obtained through the structures particularly pointed out in the description and the drawings.

[0017] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.

Brief Description of the Drawings

[0018] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0019] Figure 1 is a flowchart of a behavioral data annotation and processing method provided in an embodiment of the present invention;
Figure 2 is a flowchart of step S101 of the method;
Figure 3 is a flowchart of constructing the pre-labeled model based on the target object in step S102 of the method;
Figure 4 is a flowchart of training the pre-labeled model using the initial sample set to obtain the basic training model in step S102 of the method;
Figure 5 is a flowchart of step S103 of the method;
Figure 6 is a flowchart of determining the annotation processing strategy corresponding to the pre-labeled samples based on the confidence results in step S104 of the method;
Figure 7 is a flowchart of obtaining the labeled dataset corresponding to the behavioral data using the annotation processing strategy in step S104 of the method;
Figure 8 is a flowchart of the steps following step S104 of the method;
Figure 9 is a flowchart of another behavioral data annotation and processing method provided in an embodiment of the present invention;
Figure 10 is a schematic diagram of the structure of the pre-labeled model used in the method;
Figure 11 is a schematic diagram of the structure of a behavioral data annotation and processing system provided in an embodiment of the present invention;
Figure 12 is a schematic diagram of the structure of another behavioral data annotation and processing system provided in an embodiment of the present invention;
Figure 13 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.

[0020] Reference numerals: 100 - Sample set construction module; 200 - Basic training model construction module; 300 - Pre-labeled sample acquisition module; 400 - Labeled dataset determination module; 500 - Uncertainty assessment module; 600 - Model iteration module; 101 - Processor; 102 - Memory; 103 - Bus; 104 - Communication interface.

Detailed Description

[0021] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below in conjunction with the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0022] To facilitate understanding of this embodiment, the behavioral data annotation and processing method disclosed in this embodiment of the invention is introduced first. As shown in Figure 1, the method includes:
Step S101: Collect the posture images of the target objects in the vehicle, and construct the initial sample set and the sample set to be labeled for the target objects based on the behavioral data of the posture images.

[0023] The system collects posture images of the target (i.e., the driver) during vehicle movement, focusing on capturing image frames related to distracting behaviors such as smoking, drinking water, and making phone calls. Simultaneously, scene parameters corresponding to these images (such as ambient lighting and road conditions) are recorded to ensure sample diversity. Based on the collected posture images, effective behavioral data containing characteristics of driver distraction are extracted and rationally divided into an initial sample set and a sample set to be labeled. The initial sample set is used for training the subsequent pre-labeled model, while the sample set to be labeled is used to obtain accurate labeled data through model pre-labeling and differential processing, laying a data foundation for the efficient advancement and accuracy assurance of the entire labeling process.

[0024] Step S102: Construct a pre-labeled model based on the target object, and train the pre-labeled model using the initial sample set to obtain a basic training model; wherein, the backbone network of the pre-labeled model is a hierarchical structure.

[0025] A pre-labeled model is constructed based on the features of the target objects (the driver and related small targets such as cigarettes, mobile phones, and water cups). The model builds on the YOLOv8 framework and is optimized to be both lightweight and accurate. Specifically, the backbone network is replaced with a customized MobileNetV3-Small hierarchical structure, which reduces the model's parameter count and computational complexity. In addition, for example, a 160×160 multi-scale feature branch and a dedicated lightweight target detection head can be added to strengthen feature extraction and detection accuracy for small targets such as cigarettes. After the model is constructed, it is trained on the initial sample set built in step S101; iteratively adjusting the model parameters yields a stable and accurate basic training model.

[0026] Step S103: Use the basic training model to obtain the pre-labeled samples and their corresponding confidence results for the sample set to be labeled.

[0027] The sample set to be labeled constructed in step S101 is input in batches into the basic training model trained in step S102. The model automatically completes the pre-labeling of all samples to be labeled, and outputs the corresponding behavior category (such as distracting behavior like smoking, drinking water, or making a phone call, or normal driving behavior) and specific labeling results for each sample. At the same time, the model will simultaneously calculate and output the confidence score for each pre-labeled sample. This confidence score is used to quantify the reliability of the pre-labeling results, providing a core judgment basis for the subsequent formulation of differentiated labeling strategies, realizing efficient batch pre-labeling of samples to be labeled, and significantly reducing the basic workload of manual labeling.
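The batch pre-labeling step can be illustrated with a minimal sketch of how the confidence threshold and IoU threshold from paragraph [0009] might shape the output of one image. The source does not state how the IoU threshold is applied; greedy per-class non-maximum suppression is assumed here because that is its usual role in YOLO-style detectors. The `Detection` type, the function names, and the threshold values used in the usage note are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    label: str                               # category result, e.g. "cigarette"
    box: Tuple[float, float, float, float]   # bounding box (x1, y1, x2, y2) in pixels
    confidence: float                        # category confidence in [0, 1]


def iou(a, b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets: List[Detection], conf_thr: float, iou_thr: float) -> List[Detection]:
    """Drop boxes below conf_thr, then apply greedy per-class NMS (assumed use of iou_thr)."""
    candidates = sorted(
        (d for d in dets if d.confidence >= conf_thr),
        key=lambda d: d.confidence,
        reverse=True,
    )
    kept: List[Detection] = []
    for d in candidates:
        # Keep d only if it does not heavily overlap a stronger box of the same class.
        if all(k.label != d.label or iou(k.box, d.box) <= iou_thr for k in kept):
            kept.append(d)
    return kept
```

Calling `filter_detections(dets, conf_thr=0.25, iou_thr=0.5)` on the raw boxes of one image would return the surviving category results, bounding box coordinates, and category confidences that make up a pre-labeled sample; the two threshold values here are placeholders, since the source derives them from the type parameters of the target objects.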

[0028] Step S104: Determine the annotation processing strategy corresponding to the pre-annotated samples based on the confidence results, and use the annotation processing strategy to obtain the labeled dataset corresponding to the behavioral data.

[0029] Based on the confidence scores output in step S103, a confidence stratification mechanism is adopted to divide all pre-labeled samples into three categories: high confidence, medium confidence, and no confidence. Differentiated collaborative labeling strategies are implemented for different categories to balance labeling efficiency and manpower costs while ensuring labeling accuracy. Specifically, high-confidence samples (with high reliability of pre-labeling results) are processed using a "rapid manual inspection" mode, requiring only quick manual verification without re-labeling. No-confidence samples (with invalid or unrecognizable pre-labeling results) are processed using a "sampling labeling + residual merging" strategy. A portion of the samples are fully manually labeled, and the labeled samples are merged into the initial sample set, while the remaining unsampled samples are retained for further processing. Medium-confidence samples (with uncertain pre-labeling results) are temporarily stored and will be re-labeled after iterative optimization of the basic training model. Through these differentiated processing steps, a precise and standardized behavioral data labeling dataset is obtained, providing high-quality data support for the subsequent training of the driver distraction behavior recognition model.
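The confidence stratification can be sketched as a small routing function. This is an illustration under stated assumptions, not the claimed implementation: the threshold values (0.8 and 0.0) are hypothetical, the enum and function names are invented for the example, and values below the second threshold are routed to the third strategy even though paragraph [0010] only defines the case where the confidence equals the second threshold.

```python
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class Strategy(Enum):
    QUICK_MANUAL_CHECK = "first"     # high confidence: rapid manual inspection
    DEFER_TO_NEXT_ROUND = "second"   # medium confidence: hold until the model is retrained
    SAMPLE_AND_LABEL = "third"       # no confidence: sampled full manual labeling


def choose_strategy(confidence: float,
                    first_thr: float = 0.8,    # assumed value, not from the source
                    second_thr: float = 0.0    # assumed value, not from the source
                    ) -> Strategy:
    if confidence >= first_thr:
        return Strategy.QUICK_MANUAL_CHECK
    if confidence > second_thr:
        return Strategy.DEFER_TO_NEXT_ROUND
    # The source specifies confidence == second_thr; "<=" is an assumption.
    return Strategy.SAMPLE_AND_LABEL


def partition(samples: Iterable[Tuple[str, float]]) -> Dict[Strategy, List[str]]:
    """Group (sample_id, confidence) pairs by the strategy that would handle them."""
    buckets: Dict[Strategy, List[str]] = {s: [] for s in Strategy}
    for sample_id, confidence in samples:
        buckets[choose_strategy(confidence)].append(sample_id)
    return buckets
```

Keeping the routing in one pure function makes it straightforward to adjust the thresholds per target type, which is how the source describes them being determined.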

[0030] Optionally, step S101, which collects posture images of the target objects within the vehicle and constructs an initial sample set and a sample set to be labeled based on the behavioral data from the posture images, includes the following, as shown in Figure 2:
Step S201: Identify the target objects contained in the cockpit area of the target vehicle; wherein the target objects include at least a mobile phone, cigarettes, and a water cup.

[0031] First, the types of target objects that need to be collected and labeled in the driver's cabin area of the target vehicle are clearly identified. Considering the identification requirements for three typical distraction behaviors of drivers (smoking, drinking water, and making phone calls), at least three core items are determined: mobile phones, cigarettes, and water cups. This classification clarifies the core objects for subsequent image acquisition, behavioral data extraction, and sample labeling, defining a clear scope for the entire sample construction process and ensuring that the collection work accurately meets the actual needs of distraction behavior identification.

[0032] Step S202: When the target vehicle is detected to be traveling in the target driving route, the camera components deployed in the target vehicle are used to collect and acquire multiple pose images corresponding to the target object.

[0033] When the executing entity (such as the host computer, controller, server, etc.) detects that the target vehicle is driving normally on the preset target route, the camera component deployed in the target vehicle's cockpit is activated to carry out real-time acquisition of the target object's posture images. The camera component uses an in-vehicle high-definition camera with parameters set to a resolution of 1920×1080 and a frame rate of 30fps to ensure the clarity and continuity of the acquired images. The acquisition process strictly adheres to real-world driving scenarios, covering four key time periods: morning rush hour (7:00-9:00), midday (12:00-14:00), evening rush hour (17:00-19:00), and nighttime (21:00-23:00) to obtain target object posture data under different lighting conditions. The acquisition routes cover various scenarios such as urban main roads, expressways, residential roads, and parking lots, including complex backgrounds such as dense pedestrian areas, dense billboards, and tree obstructions, as well as simple backgrounds such as empty parking lots and suburban roads, ensuring that the acquired posture images are comprehensive and diverse, adaptable to the complex scenario requirements of subsequent model training.
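The acquisition settings above can be captured as a small configuration, sketched here for illustration. The camera parameters and the four time periods come from the text; the dictionary layout, the key names, the function name, and the choice of half-open intervals (start inclusive, end exclusive) are assumptions.

```python
from datetime import time
from typing import Optional

# Camera parameters stated in the text.
CAMERA = {"width": 1920, "height": 1080, "fps": 30}

# The four acquisition periods stated in the text (key names are hypothetical).
CAPTURE_WINDOWS = {
    "morning_rush": (time(7, 0), time(9, 0)),
    "midday": (time(12, 0), time(14, 0)),
    "evening_rush": (time(17, 0), time(19, 0)),
    "night": (time(21, 0), time(23, 0)),
}


def capture_window(t: time) -> Optional[str]:
    """Return the name of the acquisition period containing t, or None if outside all periods."""
    for name, (start, end) in CAPTURE_WINDOWS.items():
        if start <= t < end:  # half-open interval: an assumption
            return name
    return None
```

Tagging each frame with its period name at capture time would make it easy to check later that all four lighting conditions are represented in the sample sets.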

[0034] Step S203: Determine the behavior data of the posture image according to the type parameter corresponding to the target object, randomly select the first image set from the posture image according to the behavior data, and use the posture image to determine the second image set corresponding to the first image set.

[0035] All collected posture images are classified according to the type parameters corresponding to the target objects, that is, the specific posture and type of each target object. For mobile phones these include holding the phone vertically against the ear, holding it horizontally, and partial occlusion by the steering wheel; for cigarettes, holding the cigarette in the hand or in the mouth; for water cups, holding a transparent glass, a thermos, or a coffee cup. Classification establishes the behavioral data and the associated distraction behavior category for each image. Then, according to the distribution of the behavioral data, samples are randomly selected from all collected posture images (which ultimately form a raw dataset of 150,000 images) to form the first image set: 500 samples are drawn for each of the three target object categories (mobile phones, cigarettes, and water cups) and their corresponding distraction behaviors. All remaining unselected posture images form the second image set corresponding to the first image set. Together, the two sets constitute the complete raw dataset.
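The split into a first and second image set can be sketched as per-category random sampling. The figure of 500 samples per category comes from the text; the function name, the `(image_id, category)` input format, and the fixed random seed are assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def split_first_second(images: Sequence[Tuple[str, str]],
                       per_class: int = 500,
                       seed: int = 0) -> Tuple[List[str], List[str]]:
    """images: (image_id, category) pairs. Returns (first_set_ids, second_set_ids)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    by_class = defaultdict(list)
    for image_id, category in images:
        by_class[category].append(image_id)

    first: List[str] = []
    for category in sorted(by_class):
        ids = by_class[category]
        # Draw per_class images without replacement (or all of them if fewer exist).
        first.extend(rng.sample(ids, min(per_class, len(ids))))

    chosen = set(first)
    # Everything not drawn forms the second image set.
    second = [image_id for image_id, _ in images if image_id not in chosen]
    return first, second
```

With three categories this yields a first image set of 1,500 images for manual labeling, while the remainder of the raw dataset becomes the second image set.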

[0036] Step S204: After annotating the first image set using the type parameter, the initial sample set corresponding to the target object is obtained.

[0037] Based on the type parameters of the target objects (the pose and type standards of various target objects), the first image set is precisely labeled. The labeling content must clearly distinguish the specific type and pose of the target object, as well as the corresponding distraction behavior category, to ensure the accuracy, standardization, and consistency of the labeling results. After all images in the first image set are labeled, this labeled sample set becomes the initial sample set corresponding to the target objects, which is used for the basic training of the subsequent pre-labeled model (basic training model), providing high-quality labeled samples to support the model.

[0038] Step S205: Determine the sample set to be labeled corresponding to the target object based on the second image set.

[0039] Based on the second image set divided in step S203, it is first screened and sorted to remove invalid samples such as blurry images, severely occluded target objects, and those without valid target objects, ensuring sample quality. After screening, the remaining valid samples are the sample set to be labeled corresponding to the target objects. This sample set will be input into the trained basic model for batch pre-labeling and is the core object for subsequent differential labeling processing, laying the foundation for efficient and accurate batch labeling.
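One way the screening of blurry or empty samples might look is sketched below. The source does not say how blur is measured; the variance of the Laplacian is used here as a common sharpness heuristic, and the threshold value, the function names, and the plain nested-list image format are all assumptions. Occlusion screening is not covered by this sketch.

```python
from typing import List


def laplacian_variance(gray: List[List[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels of a grayscale image."""
    h, w = len(gray), len(gray[0])
    values = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            values.append(lap)
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def is_valid_sample(gray: List[List[float]], num_targets: int,
                    blur_thr: float = 100.0) -> bool:   # blur_thr is an assumed value
    """Keep an image only if it contains a target object and is sharp enough."""
    return num_targets > 0 and laplacian_variance(gray) >= blur_thr
```

A low Laplacian variance indicates few sharp edges, so images falling below the threshold would be removed before the set is passed to the basic training model.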

[0040] Optionally, constructing a pre-labeled model based on the target object includes the following, as shown in Figure 3:
Step S301: Initialize the target detection model using the type parameters corresponding to the target object, and obtain the backbone network, neck network, and head network corresponding to the target detection model.

[0041] First, based on the type parameters of the target object (i.e., the pose, size, and features of the target object such as a mobile phone, cigarette, or water cup), a target detection model based on the YOLOv8 framework is initialized. After initialization, the three core network structures of the target detection model are identified and extracted: the backbone network responsible for feature extraction, the neck network responsible for feature fusion, and the head network responsible for target prediction. This lays the foundation for subsequent optimization and updates of the network structure, ensuring that the initialized model framework meets the detection requirements for targets related to driver distraction.

[0042] Step S302: Update the backbone network using multiple sequentially connected convolutional neural network modules; wherein, the backbone network obtains a first-scale feature map and a second-scale feature map according to preset downsampling parameters; wherein, the resolution of the first-scale feature map is greater than the resolution of the second-scale feature map; the first-scale feature map is used to detect the first behavioral data corresponding to the first target object, and the second-scale feature map is used to detect the second behavioral data corresponding to the second target object, and the size of the first target object is smaller than the size of the second target object.

[0043] Multiple sequentially connected convolutional neural network modules are used to update and optimize the backbone network obtained in step S301, ultimately replacing it with a customized MobileNetV3-Small backbone network. The updated backbone network adopts a MobileBottleneck hierarchical structure and balances model lightweighting against feature extraction capability through the combined effect of depthwise separable convolutions, channel attention mechanisms, and dynamic nonlinear activation functions. This reduces the number of model parameters and the computational complexity while still accurately capturing target object features. In addition, the backbone network outputs feature maps at four resolutions (160×160, 80×80, 40×40, and 20×20) according to the preset downsampling parameters. The highest-resolution 160×160 feature map serves as the first-scale feature map (P2) and is used specifically to detect the first behavioral data corresponding to smaller target objects such as cigarettes. The other three lower-resolution feature maps serve as second-scale feature maps, which, compared with feature maps generated by traditional upsampling, retain richer target object details and are used to detect the second behavioral data corresponding to relatively larger target objects such as mobile phones and water cups.
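
The relationship between the four output resolutions and the downsampling parameters can be illustrated with a short calculation. The sketch below assumes a 640×640 input and strides of 4, 8, 16, and 32; neither value is stated in this description, although both are consistent with the 160×160, 80×80, 40×40, and 20×20 feature maps. The names P3, P4, and P5 are likewise assumed by analogy with P2.

```python
# Assumed values: the input size and strides are not given in the description.
INPUT_SIZE = 640
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}


def feature_map_sizes(input_size, strides):
    """Return the side length of each feature map for a square input."""
    return {name: input_size // stride for name, stride in strides.items()}


sizes = feature_map_sizes(INPUT_SIZE, STRIDES)
first_scale = sizes["P2"]  # first-scale map, used for small targets such as cigarettes
second_scales = [sizes[name] for name in ("P3", "P4", "P5")]  # used for larger targets
print(first_scale, second_scales)
```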

[0044] Step S303: Construct the first target detection head and the second target detection head corresponding to the target detection model based on the first scale feature map and the second scale feature map, respectively, and update the neck network and the head network through the first target detection head and the second target detection head.

[0045] Based on the first-scale feature map (160×160) and the second-scale feature maps (80×80, 40×40, 20×20) output in step S302, the first and second target detection heads corresponding to the target detection model are constructed respectively. The first target detection head specifically corresponds to the first-scale feature map and is used to enhance the detection accuracy of small targets such as cigarettes. It adopts a two-layer structure of "feature enhancement + accurate prediction": the first layer is a depthwise separable convolutional module (consisting of a 3×3 depthwise convolution, Padding=1, and a 1×1 pointwise convolution), used to perform local feature enhancement on the 160×160 high-resolution feature map, accurately capturing the subtle contour information of small targets; the second layer is a 1×1 convolutional layer responsible for outputting the class prediction and bounding box prediction results. The class prediction branch uses the BCE loss function to optimize classification accuracy, and the bounding box prediction branch uses the CIoU loss function to optimize localization error. Simultaneously, a depthwise separable convolutional module is introduced after the convolutional layer to reduce the number of parameters while maintaining feature representation capabilities. The second target detection head corresponds to the original second-scale feature map, and the optimized structure is used to adapt to the detection of larger targets. After the dual target detection head is constructed, the neck network (used for feature fusion) and the head network (used for prediction output) are updated synchronously according to the feature output requirements of the two detection heads to ensure that the networks work together to form a complete multi-scale detection system that covers targets of different sizes.
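
The parameter saving attributed to the depthwise separable convolutional module can be checked with a simple weight count. The sketch below compares a standard 3×3 convolution with a 3×3 depthwise convolution followed by a 1×1 pointwise convolution; the channel width of 64 is a hypothetical value chosen for illustration, since the description does not specify it, and bias terms are omitted.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical channel width; the description does not specify it.
channels = 64
standard = standard_conv_params(channels, channels)
separable = depthwise_separable_params(channels, channels)
print(standard, separable, round(separable / standard, 3))
```

Under this assumed width, the separable module needs 4,672 weights against 36,864 for the standard convolution, roughly one eighth.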

[0046] Step S304: Construct a pre-labeled model based on the updated backbone network, neck network, and head network; wherein, the pre-labeled model is used to determine behavioral data through the first-scale feature map and the second-scale feature map.

[0047] The backbone network updated in step S302, and the neck and head networks updated in step S303 are integrated to ensure that the three network structures are synergistically adapted and their parameters are unified, forming a complete pre-labeled model. This pre-labeled model determines behavioral data through first-scale feature maps and second-scale feature maps. It has the core advantages of being lightweight and highly accurate. It can efficiently extract multi-scale target features through the optimized backbone network and accurately detect targets of different sizes through dual-target detection heads. In particular, it enhances the detection capability of small targets such as cigarettes, providing a stable model architecture support for subsequent pre-training using the initial sample set to obtain a basic training model.

[0048] Based on the pre-labeled model described above, the basic training model is obtained by training the pre-labeled model using the initial sample set. As shown in Figure 4, this includes: Step S401: Determine the first convolutional module, the second convolutional module, and the third convolutional module corresponding to the first target detection head in the pre-labeled model; wherein, the first convolutional module is used to perform feature enhancement processing on the first-scale feature map; the second convolutional module is used to obtain the category prediction result and bounding box prediction result corresponding to the first behavioral data; and the third convolutional module is used to update the parameter quantity corresponding to the first target detection head.

[0049] First, the three core convolutional modules of the first target detection head (corresponding to a 160×160 first-scale feature map, used for small target detection) in the pre-labeled model are accurately located, and the functional positioning of each module is clarified: The first convolutional module is a depthwise separable convolutional module (composed of 3×3 depthwise convolution, Padding=1 and 1×1 pointwise convolution), whose core function is to enhance local features of the first-scale feature map and accurately capture the subtle contour information of small targets such as cigarettes; the second convolutional module is a 1×1 convolutional layer, which is mainly used to output the category prediction results (such as smoking behavior) and bounding box prediction results of the first behavior data corresponding to the first target (small target), realizing target classification and localization; the third convolutional module is also a depthwise separable convolutional module, deployed after the second convolutional module, whose core function is to optimize and update the parameters of the first target detection head, reduce the computational cost while maintaining the feature expression capability, and adapt to the requirements of lightweight models.

[0050] Step S402: Determine the category classification loss function corresponding to the pre-labeled model based on the category prediction results, and determine the bounding box regression loss function corresponding to the pre-labeled model based on the bounding box prediction results.

[0051] Based on the prediction results output by the second convolutional module in step S401, two types of loss functions required for training the pre-labeled model are determined, forming a combined loss function to balance classification accuracy and localization accuracy: First, based on the category prediction results, the category classification loss function is determined to be the BCE loss function, used to optimize the model's classification accuracy for target object categories (such as mobile phones, cigarettes, and water cups) and corresponding distraction behaviors, reducing misclassification; second, based on the bounding box prediction results, the bounding box regression loss function is determined to be the CIoU loss function, used to optimize the localization accuracy of target object bounding boxes, reduce bounding box prediction errors, and is especially suitable for the precise localization needs of small targets. The two types of loss functions work together to provide the core basis for evaluating the model training effect.
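
For reference, the two loss functions named above can be written out for a single prediction. This is a minimal sketch using the commonly published definitions of BCE and CIoU, not the implementation used in this embodiment; the function names and the `eps` stabilizer are assumptions, and the weighting used to combine the two losses is not given in the description and is therefore not shown.

```python
import math


def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def ciou_loss(box, gt, eps=1e-7):
    """CIoU loss (1 - CIoU) for boxes in (x_min, y_min, x_max, y_max) format."""
    # Intersection over union
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    iou = inter / (w * h + wg * hg - inter + eps)
    # Squared centre distance over the squared diagonal of the enclosing box
    rho2 = ((box[0] + box[2] - gt[0] - gt[2]) / 2) ** 2 \
         + ((box[1] + box[3] - gt[1] - gt[3]) / 2) ** 2
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan(wg / (hg + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)
```

The centre-distance term keeps the gradient informative even when the predicted box and the true box do not overlap, which is one reason CIoU is often preferred for small targets.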

[0052] Step S403: Determine the training parameters corresponding to the pre-labeled model through the first convolutional module, the second convolutional module, and the third convolutional module.

[0053] Based on the characteristics of the three convolutional modules of the first target detection head, the lightweight requirements of the pre-labeled model, and the training hardware environment, the model training parameters are determined so that training is efficient and stable while overfitting is suppressed: the batch size is set to 16 to suit the hardware performance of the NVIDIA Quadro RTX 4000 (16GB VRAM), ensuring training speed and stability; the initial learning rate is set to 0.001, using the AdamW optimizer together with a learning rate schedule whose decay period is 10 rounds, to adjust the learning rate dynamically and improve model convergence speed; the weight decay coefficient is set to 0.0005 to improve the model's generalization ability by suppressing overfitting of model parameters; and auxiliary parameters such as the number of training rounds and validation frequency are specified to provide clear standards for subsequent model training.
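
The training parameters listed above can be gathered into a single configuration object. In the sketch below, the batch size, initial learning rate, decay period, and weight decay come from this description; the step-decay form of the schedule and the decay factor of 0.5 are assumptions, because the description gives only the decay period.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    batch_size: int = 16
    initial_lr: float = 0.001
    lr_decay_period: int = 10      # rounds between learning-rate decays
    weight_decay: float = 0.0005
    lr_decay_factor: float = 0.5   # assumption: the decay factor is not stated


def learning_rate(cfg, round_index):
    """Assumed step decay: scale the rate by lr_decay_factor every lr_decay_period rounds."""
    return cfg.initial_lr * cfg.lr_decay_factor ** (round_index // cfg.lr_decay_period)


cfg = TrainingConfig()
print([learning_rate(cfg, r) for r in (0, 9, 10, 25)])
```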

[0054] Step S404: Train the pre-labeled model using the initial sample set and according to the training parameters, and obtain the loss value corresponding to the pre-labeled model in real time using the category classification loss function and the bounding box regression loss function.

[0055] Within the PyTorch 2.0 deep learning framework, the initial sample set constructed in step S204 is input into the pre-labeled model. The model training process is initiated according to the training parameters determined in step S403, with a total of 100 training rounds. During training, every 10 rounds, 20% of the initial labeled dataset is used as a validation set to validate the model and evaluate its training performance. Simultaneously, the loss value of the pre-labeled model in each training round is calculated and obtained in real time using the class classification loss function (BCE) and bounding box regression loss function (CIoU) determined in step S402. This allows for real-time monitoring of the model's convergence and provides data support for determining when to stop training.
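
The validation schedule described above can be sketched as follows. The 100 rounds, the 10-round interval, and the 20% fraction come from this description; holding the validation samples out of the training portion, the fixed random seed, and the helper names are assumptions. The 1500 sample identifiers stand in for the initial sample set of 500 images per category described in this embodiment.

```python
import random

TOTAL_ROUNDS = 100
VALIDATION_INTERVAL = 10
VALIDATION_FRACTION = 0.2


def split_validation(sample_ids, fraction=VALIDATION_FRACTION, seed=0):
    """Shuffle the labeled samples and hold out a fraction for validation."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * fraction)
    return ids[n_val:], ids[:n_val]  # (training ids, validation ids)


def validation_rounds(total=TOTAL_ROUNDS, interval=VALIDATION_INTERVAL):
    """Rounds (1-based) after which validation is run."""
    return [r for r in range(1, total + 1) if r % interval == 0]


train_ids, val_ids = split_validation(range(1500))
print(len(train_ids), len(val_ids), validation_rounds()[:3])
```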

[0056] Step S405: When the loss value meets the preset loss threshold relationship, stop the training process of the pre-labeled model and determine the current pre-labeled model as the basic training model.

[0057] During training, the real-time loss value is continuously monitored to determine whether it meets the preset loss threshold relationship (i.e., the loss value tends to stabilize and no longer decreases significantly, or the loss value is lower than the preset threshold, and the validation set accuracy reaches the preset standard). When the loss value meets this threshold relationship, it indicates that the model has achieved the preset training effect, can accurately extract target features, and can effectively detect small and large targets. At this point, the training process of the pre-labeled model is stopped. The pre-labeled model that has been trained and has stable performance is determined as the basic training model for subsequent batch pre-labeling, providing model support for the pre-labeling work in step S103.
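
One possible reading of the preset loss threshold relationship is sketched below: training stops when the loss has either plateaued or fallen below a threshold, and the validation accuracy has also reached its target. The grouping of these conditions is an interpretation of the text above, and every numeric default (loss threshold, plateau window, plateau tolerance, accuracy target) is a hypothetical placeholder.

```python
def should_stop(loss_history, val_accuracy, *, loss_threshold=0.05,
                plateau_window=5, plateau_delta=1e-3, accuracy_target=0.9):
    """Return True when the loss condition and the validation condition both hold.

    All numeric defaults are hypothetical placeholders.
    """
    if not loss_history:
        return False
    below_threshold = loss_history[-1] < loss_threshold
    recent = loss_history[-plateau_window:]
    # Plateau: the last `plateau_window` losses vary by less than `plateau_delta`.
    plateaued = (len(recent) == plateau_window
                 and max(recent) - min(recent) < plateau_delta)
    return (below_threshold or plateaued) and val_accuracy >= accuracy_target
```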

[0058] Optionally, step S103 uses the basic training model to obtain the pre-labeled samples corresponding to the sample set to be labeled and their corresponding confidence results. As shown in Figure 5, this includes: Step S501: Determine the confidence threshold and intersection-over-union (IoU) threshold of the sample set to be labeled based on the type parameters corresponding to the target object.

[0059] By combining the type parameters of the target objects (i.e., the size, posture, and feature differences of target objects such as mobile phones, cigarettes, and water cups, especially taking into account the detection needs of small targets such as cigarettes), two core inference thresholds are scientifically set to ensure the accuracy of subsequent batch inference: the first is the confidence threshold, set to 0.5, used to filter obvious erroneous predicted boxes during the inference process, avoiding invalid annotation information from interfering with subsequent processing; the second is the intersection-over-union (IoU) threshold, set to 0.7, used for non-maximum suppression (NMS) processing to prevent the same target object from being labeled multiple times, ensuring the uniqueness and standardization of pre-labeling results. The setting of these two thresholds is fully adapted to the detection capabilities of the basic training model, balancing inference efficiency and pre-labeling accuracy.

[0060] Step S502: Determine the batch inference strategy corresponding to the basic training model based on the confidence threshold and the IoU threshold.

[0061] Based on the confidence threshold and intersection-over-union (IoU) threshold determined in step S501, a batch inference strategy corresponding to the basic training model is formulated, and the operational specifications of the inference process are clarified. This strategy has three parts: first, a batch input rule, under which the sample set to be labeled is input into the basic training model in preset batches to improve inference efficiency; second, a bounding box filtering rule, which discards predicted boxes whose confidence is below the 0.5 confidence threshold and retains the valid prediction results; and third, a redundant labeling rule, which performs NMS with an IoU threshold of 0.7 to remove duplicate labels for the same target object. The strategy also specifies that the pre-labeled results are stored as XML files in PASCAL VOC format, for convenient sample classification and manual processing later.
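
The filtering and redundancy rules can be illustrated with a plain-Python sketch of confidence filtering followed by greedy NMS. The thresholds of 0.5 and 0.7 come from this description; applying NMS per class, the tuple layout of a detection, and the function names are assumptions, and the sample detections are invented for illustration.

```python
CONF_THRESHOLD = 0.5
IOU_THRESHOLD = 0.7


def iou(a, b):
    """IoU of two boxes in (x_min, y_min, x_max, y_max) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(detections, conf_thr=CONF_THRESHOLD, iou_thr=IOU_THRESHOLD):
    """detections: list of (label, confidence, box). Returns the kept detections."""
    candidates = sorted((d for d in detections if d[1] >= conf_thr),
                        key=lambda d: d[1], reverse=True)
    kept = []
    for det in candidates:
        # Keep a box unless it overlaps a higher-scoring kept box of the same class.
        if all(k[0] != det[0] or iou(k[2], det[2]) <= iou_thr for k in kept):
            kept.append(det)
    return kept


# Illustrative detections (invented values).
dets = [("cigarette", 0.92, (10, 10, 50, 50)),
        ("cigarette", 0.80, (12, 11, 51, 50)),
        ("mobile phone", 0.30, (100, 100, 200, 200))]
print(filter_and_nms(dets))
```

In this example the second cigarette box overlaps the first with an IoU of about 0.90 and is suppressed, and the mobile phone box is discarded by the confidence filter.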

[0062] Step S503: Input the sample set to be labeled into the basic training model, and control the basic training model to output the pre-labeled information corresponding to the target object through the batch inference strategy, and obtain the category result, bounding box coordinates and category confidence of the target object contained in the pre-labeled information.

[0063] The sample set to be labeled constructed in step S205 is input in batches into the basic training model obtained in step S405 according to the batch inference strategy defined in step S502. The basic training model then performs inference on all samples to be labeled. During inference, the model automatically identifies the target objects in the samples, outputs the pre-labeled information corresponding to the target objects, and extracts three core parameters from the pre-labeled information: first, the target object category result (identifying the type of target object in the labeled sample, such as mobile phone, cigarette, or water cup, and the corresponding distraction behavior); second, the bounding box coordinates (in (x_min, y_min, x_max, y_max) format, accurately marking the location of the target object in the image); third, the category confidence (quantifying the reliability of the pre-labeling results and providing the core basis for subsequent sample stratification). All pre-labeling information is stored in real time in the preset XML format.
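
A pre-labeled record of this kind could be serialized as shown below. This is a sketch using Python's standard `xml.etree.ElementTree` module; the `filename`, `size`, `object`, `name`, and `bndbox` elements follow the usual PASCAL VOC layout, whereas the `confidence` element is an assumed extension for the category confidence, which is not part of the standard format. The file name and box values are invented.

```python
import xml.etree.ElementTree as ET


def to_voc_xml(filename, width, height, objects):
    """Build a PASCAL VOC style annotation for one image.

    objects: list of (label, confidence, (x_min, y_min, x_max, y_max)).
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    for label, confidence, box in objects:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        # Assumption: a non-standard element carrying the category confidence.
        ET.SubElement(obj, "confidence").text = f"{confidence:.4f}"
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), box):
            ET.SubElement(bndbox, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = to_voc_xml("frame_000001.jpg", 1920, 1080,
                      [("cigarette", 0.93, (812, 440, 846, 462))])
print(xml_text)
```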

[0064] Step S504: Use the category results and bounding box coordinates to determine the pre-labeled samples corresponding to the sample set to be labeled, and use the category confidence to determine the confidence results corresponding to the pre-labeled samples.

[0065] The core pre-annotated information extracted in step S503 is organized and integrated: Using the target object category results and bounding box coordinates as the core, the two are correlated to form a complete pre-annotated sample corresponding to each sample to be annotated, ensuring that each pre-annotated sample contains clear target object identification and location information. Simultaneously, the category confidence score corresponding to each pre-annotated sample is extracted separately, correlated, and archived to form the confidence score result for each pre-annotated sample. Finally, a complete set of pre-annotated samples and their corresponding confidence score results are obtained, providing a clear and reliable classification basis for subsequent differentiated human-machine annotation based on confidence score in step S104, achieving the core objective of batch pre-annotation and confidence score stratification.

[0066] Optionally, the annotation processing strategy corresponding to the pre-labeled samples is determined based on the confidence results. As shown in Figure 6, this includes: Step S601: Determine the confidence value corresponding to the confidence result based on the category confidence.

[0067] First, from the pre-labeled sample confidence results obtained in step S504, the category confidence score corresponding to each pre-labeled sample is extracted one by one, and its specific confidence score value is determined. This value is the core basis for judging the reliability of the pre-labeled sample and classifying the sample categories. It needs to be accurately correlated with the category results and bounding box coordinates of the corresponding pre-labeled sample to ensure the accuracy of subsequent hierarchical judgment and provide basic data support for the formulation of subsequent annotation processing strategies.

[0068] Step S602: If the confidence value is not less than the first confidence threshold, then determine the first annotation processing strategy corresponding to the pre-annotated sample based on the bounding box coordinates and the category result.

[0069] The first confidence threshold is preset to 0.9. The confidence value determined in step S601 is compared with this threshold. If the confidence value of a pre-labeled sample is not less than 0.9, the sample belongs to the high-confidence sample category (sample A). Such samples have high model prediction reliability and a low labeling error rate. In this case, based on the bounding box coordinates (x_min, y_min, x_max, y_max) of the sample and the category results, the corresponding first annotation processing strategy is determined, namely the "rapid manual inspection" mode. The annotator only needs to quickly check whether the category label and bounding box positioning of the sample are accurate, without repeating the complete annotation work, which maximizes annotation efficiency while ensuring annotation accuracy.

[0070] Step S603: If the confidence value is greater than the second confidence threshold and less than the first confidence threshold, then determine the second annotation processing strategy corresponding to the pre-annotated sample based on the category confidence.

[0071] The second confidence threshold is preset to 0. If the confidence value of a pre-labeled sample is greater than 0 and less than the first confidence threshold (0.9), then the sample belongs to the medium confidence sample category (sample B). The model prediction results for such samples have a certain degree of uncertainty, and there may be problems such as misclassification and bounding box positioning deviation, making it impossible to directly confirm the labeling validity. At this time, based on the class confidence of the sample (the core indicator reflecting the reliability of prediction), the corresponding second labeling processing strategy is determined, namely: temporary storage. Such samples are stored uniformly, and after the basic training model has undergone subsequent iterations and optimizations and the detection accuracy has improved, they are re-inputted into the model for pre-labeling, and the subsequent processing method is determined based on the new confidence results.

[0072] Step S604: If the confidence value is equal to the second confidence threshold, then determine the third annotation processing strategy corresponding to the pre-annotated sample based on the type parameter corresponding to the target object.

[0073] If the confidence value of a pre-labeled sample is equal to the second confidence threshold (0), it means that the sample belongs to the no-confidence sample (sample C), that is, the model has not identified any target object. Such samples may contain target object postures that the model has not learned (such as mobile phones held in special ways, cigarettes with severe obstruction) or complex scenes, and manual intervention is required to ensure the effectiveness of the annotation. At this time, based on the type parameters corresponding to the target object (the posture of the target object, the type standard), the corresponding third annotation processing strategy is determined, namely: the "sampling annotation + residual merging" strategy. A portion of the samples are extracted for complete manual annotation. After the annotation is completed, they are merged into the initial sample set for model iteration and optimization. The remaining unsampled samples are temporarily stored and reprocessed after the model is optimized.
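
Steps S602 to S604 amount to a three-way routing rule on the confidence value, which can be sketched as follows. The thresholds of 0.9 and 0 come from this description; the function name and the single-letter return values are assumptions that mirror the sample A, B, and C labels used above.

```python
FIRST_CONF_THRESHOLD = 0.9
SECOND_CONF_THRESHOLD = 0.0


def annotation_strategy(confidence):
    """Map a category confidence to sample class A, B, or C."""
    if confidence >= FIRST_CONF_THRESHOLD:
        return "A"  # first strategy: rapid manual inspection
    if confidence > SECOND_CONF_THRESHOLD:
        return "B"  # second strategy: temporary storage until the model improves
    return "C"      # third strategy: sampling annotation + residual merging


print([annotation_strategy(c) for c in (0.95, 0.9, 0.6, 0.0)])
```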

[0074] Optionally, the labeled dataset corresponding to the behavioral data is obtained using the annotation processing strategy. As shown in Figure 7, this includes: Step S701: When the annotation processing strategy is the first annotation processing strategy, the bounding box corresponding to the pose image is determined by using the bounding box coordinates, and the first annotation dataset corresponding to the behavior data under the category result is determined based on the intersection-over-union (IoU) between the bounding box and the target object.

[0075] When the annotation processing strategy is the first annotation processing strategy, the annotation work for the high-confidence sample (sample A) is carried out in the "rapid manual visual inspection" mode. Annotators use a simplified annotation tool, which only displays the sample pose image and the bounding boxes and category labels pre-annotated by the basic training model. Annotators focus on two verification tasks: first, calculating the intersection-over-union (IoU) between the pre-annotated bounding box and the true contour of the target object to determine if the bounding box accurately encloses the target object (IoU must be ≥0.9 with the true target object); second, verifying whether the category label matches the actual type of the target object and the corresponding distraction behavior. After verification, if the sample bounding box deviation is small and the category label is correct, the annotator directly clicks the "confirm" button to include the sample in the annotation dataset; if the bounding box deviation is large or the category label is incorrect, the annotator only fine-tunes the bounding box coordinates or modifies the category label, completing a quick correction before including the sample in the annotation dataset. Finally, this process integrates all qualified samples A to form the first annotation dataset of behavioral data under the corresponding category results.

[0076] Step S702: When the annotation processing strategy is the second annotation processing strategy, determine the second annotation dataset based on the pre-annotated samples, and update the pre-annotated samples to the initial sample set and the unannotated sample set based on the class confidence.

[0077] When the annotation processing strategy is the second annotation processing strategy, the corresponding medium-confidence samples (sample B) adopt the "temporarily stored for iterative processing" mode. Because the model prediction results for such samples have high uncertainty, with issues such as class misclassification and bounding box deviation, direct manual annotation is labor-intensive and inefficient. Therefore, all samples B are temporarily stored as samples to be pre-annotated, and are identified as the unprocessed part of the second annotation dataset. Simultaneously, based on the class confidence of sample B (reflecting the uncertainty of the model prediction), it is updated synchronously to the initial sample set and the unannotated sample set: samples with relatively clear features and confidence close to the first threshold (0.9) are merged into the initial sample set for subsequent model iteration training; the remaining samples remain in the unannotated sample set, waiting for the next round of basic training model iteration optimization before being re-inputted into the model for pre-annotation. This improves model accuracy, transforming some samples into high-confidence samples, reducing manual intervention costs, and they will then be gradually incorporated into the second annotation dataset.

[0078] Step S703: When the annotation processing strategy is the third annotation processing strategy, randomly select the third image set that has been annotated from the pre-annotated samples according to the behavioral data, and determine the third annotation dataset based on the third image set.

[0079] When the annotation processing strategy is the third annotation processing strategy, for the samples with no confidence (sample C), the "sampling annotation + residual merging" strategy is adopted. First, according to the distribution pattern of behavioral data, 500 samples of each of the three target objects (mobile phone, cigarette, and water cup) are randomly selected from sample C (totaling 1500 samples, denoted as sample D). Professional annotators perform full and accurate annotation, detailing the target object's category, pose, bounding box coordinates, and other information, supplementing unseen features not learned by the basic training model (such as target objects with special poses, complex occlusion scenes) to improve the model's generalization ability. This part of the accurately annotated sample D is the third image set that has been annotated. The third image set is integrated to form the third annotation dataset. At the same time, the remaining samples in sample C that were not sampled are directly merged into the medium-confidence samples (sample B) to form a new pre-annotated sample B for the next round of basic training model pre-annotation processing, realizing full utilization of sample resources.
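
The "sampling annotation + residual merging" step can be sketched as a per-class random draw. The figure of 500 samples per target object comes from this description. The sketch assumes that each no-confidence sample still carries the target category it was collected under (the model itself identified no target), and the function name, fixed seed, and sample identifiers are hypothetical.

```python
import random

SAMPLES_PER_CLASS = 500


def sample_for_annotation(sample_c, per_class=SAMPLES_PER_CLASS, seed=0):
    """sample_c: dict mapping target class -> list of unique sample ids.

    Returns (sample_d, remainder): ids sent for full manual annotation,
    and ids merged back into sample B for the next pre-labeling round.
    """
    rng = random.Random(seed)
    sample_d, remainder = [], []
    for label, ids in sample_c.items():
        chosen = set(rng.sample(ids, min(per_class, len(ids))))
        sample_d.extend(i for i in ids if i in chosen)
        remainder.extend(i for i in ids if i not in chosen)
    return sample_d, remainder


# Hypothetical pool sizes for illustration.
sample_c = {"mobile phone": [f"phone_{i}" for i in range(800)],
            "cigarette": [f"cig_{i}" for i in range(700)],
            "water cup": [f"cup_{i}" for i in range(600)]}
sample_d, remainder = sample_for_annotation(sample_c)
print(len(sample_d), len(remainder))
```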

[0080] Optionally, after the step of obtaining the labeled dataset corresponding to the behavioral data using the annotation processing strategy, as shown in Figure 8, the method also includes: Step S801: Determine the labeled sample set corresponding to the basic training model based on the first labeled dataset and the third labeled dataset.

[0081] Based on the first labeled dataset obtained in step S701 and the third labeled dataset obtained in step S703, an integrated labeled sample set is formed to meet the requirements of the basic training model iteration. The first labeled dataset consists of high-confidence samples (sample A), including samples that passed quick manual visual inspection and require no correction, as well as samples where bounding box deviations or category errors were found after visual inspection and were quickly corrected. These samples have high labeling accuracy and reliability. The third labeled dataset consists of samples D (500 images of each of the three target objects, totaling 1500 images) extracted from the no-confidence samples (sample C). These samples were fully and accurately labeled by professional annotators, supplementing unseen features not learned by the model (such as objects with special poses or complex occlusion scenes), effectively improving the model's generalization ability. After integrating and filtering these two datasets, removing duplicate and invalid samples, a standardized, high-quality labeled sample set is formed, providing core data support for model iteration training.
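
The integration and filtering of the two datasets can be sketched as a merge that drops invalid and duplicate records. The record schema (a dict with `image_id`, `label`, and `box` keys), the rule that a degenerate bounding box marks a record as invalid, and the definition of a duplicate as an identical image, label, and box are all assumptions made for illustration; the description does not define these criteria.

```python
def build_labeled_sample_set(first_dataset, third_dataset):
    """Merge two lists of annotation records, dropping invalid and duplicate ones.

    Each record is a dict with 'image_id', 'label', and 'box' keys (hypothetical schema).
    """
    merged, seen = [], set()
    for record in list(first_dataset) + list(third_dataset):
        x_min, y_min, x_max, y_max = record["box"]
        if x_max <= x_min or y_max <= y_min:
            continue  # invalid: degenerate bounding box
        key = (record["image_id"], record["label"], tuple(record["box"]))
        if key in seen:
            continue  # duplicate annotation
        seen.add(key)
        merged.append(record)
    return merged


# Invented records for illustration.
first = [{"image_id": "img_1", "label": "cigarette", "box": (10, 10, 40, 30)}]
third = [{"image_id": "img_1", "label": "cigarette", "box": (10, 10, 40, 30)},
         {"image_id": "img_2", "label": "water cup", "box": (5, 5, 5, 60)},
         {"image_id": "img_3", "label": "mobile phone", "box": (0, 0, 80, 120)}]
labeled_set = build_labeled_sample_set(first, third)
print([r["image_id"] for r in labeled_set])
```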

[0082] Step S802: After updating the labeled sample set to the initial sample set, the pre-labeled model is trained using the updated initial sample set to obtain the iterative training model corresponding to the basic training model.

[0083] First, the labeled sample set constructed in step S801 is synchronously updated to both the initial sample set and the sample set to be labeled, expanding and optimizing both sets to form an updated initial sample set. This sample set integrates the features of the original initial samples with newly added high-quality labeled samples, providing a more comprehensive coverage of the target object's pose and scene differences, thus offering richer training data for model iteration. Subsequently, iterative training is conducted using the same training environment and parameter settings as in step S102: a total of 100 training rounds are performed, with validation conducted every 10 rounds. The validation set remains 20% of the updated initial sample set. During training, model performance is monitored in real time, and the model with the best performance (highest labeling accuracy and positioning accuracy) on the validation set is selected as the iterative training model corresponding to the basic training model. This iterative training model will be applied to a new sample B to be pre-labeled, repeating the pre-labeling and differentiated labeling process described above to achieve continuous iterative optimization of the model, gradually improving pre-labeling accuracy and reducing the proportion of manual intervention.

[0084] The above embodiments include a behavioral data annotation processing method for model iteration which, as shown in Figure 9, includes the following steps: Step S901: Image acquisition and sample construction.

[0085] Image acquisition was conducted using a vehicle-mounted high-definition camera (1920×1080 resolution, 30fps) in real-world driving scenarios. Data was collected during four time periods: morning rush hour (7:00-9:00), midday (12:00-14:00), evening rush hour (17:00-19:00), and nighttime (21:00-23:00) to capture data under varying lighting conditions. The acquisition routes included urban main roads, expressways, residential roads, and parking lots, covering both complex backgrounds (such as densely populated areas, roads with numerous billboards, and roads obscured by trees) and simple backgrounds (such as empty parking lots and suburban roads). Target object poses were designed as follows: mobile phones included vertical holding, horizontal placement, and partial obstruction by the steering wheel; cigarettes included handheld and held in the mouth; and water cups included transparent glass cups, thermos cups, and coffee cups. The final dataset consisted of 150,000 images across three categories. 500 samples of each of the three categories of behavior are randomly selected from the original dataset as initial training samples, and are accurately labeled by humans to form the initial (labeled) dataset; the remaining samples are used as pre-labeled samples to be used by the model.

[0086] Step S902: Build the model and obtain the basic training model.

[0087] The core of this step is to construct a lightweight pre-labeled model that enhances small object detection, and to pre-train it using the initial labeled dataset to obtain a basic model with object recognition capabilities. The model is based on the YOLOv8 framework, with core optimizations including replacing the backbone network with a customized MobileNetV3-Small and adding a 160×160 scale feature branch and a corresponding small object detection head, as shown in Figure 10.

[0088] The backbone network adopts a MobileBottleneck hierarchical structure, achieving a balance between lightweight design and feature extraction capabilities through depthwise separable convolutions, channel attention mechanisms, and dynamic nonlinear activation functions. This backbone network directly outputs feature maps at four scales: 160×160, 80×80, 40×40, and 20×20. The 160×160 scale feature map (P2) is directly used for detecting smaller targets, retaining richer detail information compared to feature maps generated by traditional upsampling.

[0089] A new small target detection head was added to the 160×160 feature map to enhance the detection accuracy of smaller targets such as cigarettes. This detection head adopts a two-layer structure of "feature enhancement + accurate prediction": the first layer is a depthwise separable convolutional module (consisting of a 3×3 depthwise convolution with padding=1 and a 1×1 pointwise convolution) used to enhance local features of the 160×160 high-resolution feature map and capture subtle contour information of the target; the second layer is a 1×1 convolutional layer responsible for outputting class prediction and bounding box prediction results. The class prediction branch uses the BCE loss function to optimize classification accuracy, and the bounding box prediction branch uses the CIoU loss function to optimize localization error. Simultaneously, to reduce the computational overhead of the high-resolution feature map, a depthwise separable convolutional module is introduced after the convolutional layer, reducing the number of parameters while maintaining feature representation capabilities. The small target detection head (corresponding to the 160×160 feature map) and the original detection heads (corresponding to 80×80, 40×40, and 20×20 feature maps) form a multi-scale detection system, covering targets of different size ranges.
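
The parameter saving of a depthwise separable convolution relative to a standard convolution can be illustrated with simple arithmetic. The channel count of 64 is a hypothetical value chosen for the example, and bias terms are omitted; the embodiment does not state the channel widths of the detection head.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a k x k depthwise conv plus a 1 x 1 pointwise conv (bias omitted)."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv mixing the channels
    return depthwise + pointwise


def conv_output_size(size, k=3, padding=1, stride=1):
    """Spatial size after a convolution; 3 x 3 with padding=1 keeps the size."""
    return (size + 2 * padding - k) // stride + 1


C_IN = C_OUT = 64  # hypothetical channel width
std = standard_conv_params(C_IN, C_OUT)
dws = depthwise_separable_params(C_IN, C_OUT)
print(std, dws, round(std / dws, 2))
print(conv_output_size(160))
```

For this hypothetical width the separable module uses roughly one eighth of the weights of a standard 3×3 convolution, while the padding of 1 preserves the 160×160 resolution.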

[0090] Step S903: Batch inference and confidence stratification of pre-labeled samples based on the basic training model.

[0091] The goal of this step is to use the basic trained model to perform batch inference on the samples to be labeled, output preliminary labeling results, and classify the samples based on confidence levels, so as to provide a basis for subsequent differentiated human-machine labeling.

[0092] During the batch inference process, the samples to be pre-labeled are input into the basic training model obtained in step S902 for batch inference to obtain pre-labeled samples. During inference, a confidence threshold of 0.5 is set to filter out obviously erroneous bounding boxes; the IoU (Intersection over Union) threshold for NMS (Non-Maximum Suppression) is set to 0.7 to avoid the same target being labeled multiple times. The pre-labeling results are stored in an XML file in PASCAL VOC format, with each label containing three core parameters: the target category, the bounding box coordinates (x_min, y_min, x_max, y_max), and the category confidence, which facilitate subsequent sample classification and manual processing.
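
The two thresholds can be illustrated with a minimal pure-Python sketch of confidence filtering followed by greedy NMS. Several points are assumptions rather than statements of the embodiment: suppression is applied per class, the confidence comparison is inclusive, and the detection dictionaries and box values are invented for the example.

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, conf_thr=0.5, iou_thr=0.7):
    """Drop low-confidence boxes, then apply greedy per-class NMS."""
    candidates = sorted(
        (d for d in dets if d["conf"] >= conf_thr),
        key=lambda d: d["conf"],
        reverse=True,
    )
    kept = []
    for d in candidates:
        # Keep d unless a higher-scoring box of the same class overlaps too much.
        if all(k["label"] != d["label"] or iou(k["box"], d["box"]) <= iou_thr
               for k in kept):
            kept.append(d)
    return kept


dets = [
    {"label": "phone", "box": (100, 100, 200, 300), "conf": 0.93},
    {"label": "phone", "box": (105, 102, 205, 305), "conf": 0.81},      # near duplicate
    {"label": "cigarette", "box": (400, 220, 412, 228), "conf": 0.42},  # below 0.5
    {"label": "cup", "box": (300, 300, 360, 400), "conf": 0.66},
]
kept = filter_detections(dets)
print([(d["label"], d["conf"]) for d in kept])
```

In this example the second phone box overlaps the first with an IoU of about 0.88 and is suppressed, and the cigarette box is removed by the confidence threshold.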

[0093] Confidence stratification divides the pre-labeled samples into three categories based on class confidence. High-confidence samples (sample A): class confidence ≥ 0.9; the model predictions for these samples are highly reliable and the labeling error rate is low. Medium-confidence samples (sample B): 0 < class confidence < 0.9; the model predictions for these samples carry some uncertainty and may involve class misclassification or bounding box deviation, so further processing is required. No-confidence samples (sample C): no class confidence is output, meaning the model did not identify any target object; such samples may contain target object poses or scenes that the model has not learned and require manual intervention.
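
The three-way stratification can be sketched as a small function. The embodiment does not say how an image with several detections is reduced to a single confidence value; taking the lowest detection confidence, as `sample_confidence` does here, is an assumption made for illustration, as are the function names.

```python
from typing import List, Optional

HIGH_CONF = 0.9  # threshold for sample A, as stated in the text


def sample_confidence(detection_confs: List[float]) -> Optional[float]:
    """Reduce per-detection confidences to one value per image.

    Assumption: the least confident detection decides the stratum.
    Returns None when the model produced no detection at all.
    """
    return min(detection_confs) if detection_confs else None


def stratify(confidence: Optional[float]) -> str:
    """Return 'A' (high), 'B' (medium), or 'C' (no confidence output)."""
    if confidence is None or confidence <= 0:
        return "C"
    return "A" if confidence >= HIGH_CONF else "B"


for confs in ([0.95], [0.97, 0.62], []):
    print(confs, stratify(sample_confidence(confs)))
```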

[0094] Step S904: Differentiated human-machine collaborative annotation.

[0095] Based on the category confidence characteristics of the three types of samples, a differentiated human-computer interaction annotation strategy is designed to maximize efficiency while ensuring annotation accuracy. Sample A: quick manual visual inspection mode. The annotator uses a simplified annotation tool that displays only the image together with the bounding boxes and category labels pre-annotated by the model. The annotator focuses on verifying two things: (1) whether the bounding box accurately surrounds the target object (IoU with the real target object ≥ 0.9); and (2) whether the category label is correct. For samples with small bounding box deviations and correct categories, the annotator clicks the "Confirm" button directly to include them in the annotation dataset; for samples with large bounding box deviations or incorrect categories, the annotator performs a quick correction (adjusting only the bounding box coordinates or modifying the category label) and includes them in the annotation dataset after correction.
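
To give a sense of how strict an IoU of 0.9 is, the following sketch compares a pre-annotated box with a reference box. In the embodiment the annotator judges this visually; the reference box passed to the hypothetical `review_sample_a` function stands in for that judgment, and all coordinates are invented for the example.

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def review_sample_a(pred_box, pred_label, ref_box, ref_label, iou_min=0.9):
    """Return 'confirm' when box and label both pass, otherwise 'correct'."""
    if pred_label == ref_label and iou(pred_box, ref_box) >= iou_min:
        return "confirm"
    return "correct"


pred = (100, 100, 200, 300)
tight = review_sample_a(pred, "phone", (102, 101, 201, 302), "phone")
shifted = review_sample_a(pred, "phone", (110, 110, 210, 310), "phone")
wrong_label = review_sample_a(pred, "phone", (102, 101, 201, 302), "cup")
print(tight, shifted, wrong_label)
```

A shift of about two pixels keeps the IoU near 0.96 and the sample is confirmed, whereas a ten-pixel shift on this 100×200 box already lowers the IoU to roughly 0.75 and triggers a correction.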

[0096] Sample C: Sampling labeling + residual merging strategy; 500 samples of each of the three target objects are randomly selected from Sample C (1500 samples in total, denoted as Sample D), and professional labelers perform full and accurate labeling to supplement the unseen features that the model has not learned, so as to improve the model's generalization ability; the remaining samples in Sample C are directly merged into Sample B to form a new sample B to be pre-labeled by the model, which will be used for the next round of pre-labeling.

[0097] Sample B: Temporarily stored for iterative processing strategy; The model prediction results of Sample B have high uncertainty, and the workload of direct manual annotation is large. Therefore, it is temporarily stored as a sample to be pre-labeled. After the next round of model iteration and optimization, it will be re-labeled. By improving the model accuracy, some samples will be transformed into high-confidence samples, reducing manual intervention.
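
The routing of the three strata described above can be summarized in a short sketch. It assumes that each sample carries a category known from collection metadata (sample C has no model output to supply one), and the function name `route_samples` and the returned dictionary keys are hypothetical.

```python
import random


def route_samples(strata, per_class=500, seed=0):
    """Route strata A, B, C to their processing paths.

    strata: dict with keys 'A', 'B', 'C'; each value is a list of
    (image_id, category) tuples.
    """
    rng = random.Random(seed)
    by_class = {}
    for item in strata["C"]:
        by_class.setdefault(item[1], []).append(item)

    # Sample D: up to `per_class` images per category drawn from sample C.
    sample_d = []
    for items in by_class.values():
        sample_d.extend(rng.sample(items, min(per_class, len(items))))

    picked = set(sample_d)
    remaining_c = [s for s in strata["C"] if s not in picked]
    return {
        "visual_inspection": list(strata["A"]),             # sample A
        "full_manual_labeling": sample_d,                   # sample D
        "next_round_pool": list(strata["B"]) + remaining_c  # new sample B
    }


strata = {
    "A": [("a1", "phone")],
    "B": [("b1", "cup"), ("b2", "phone")],
    "C": [("c1", "phone"), ("c2", "phone"), ("c3", "phone"), ("c4", "cup")],
}
routes = route_samples(strata, per_class=2)
print({name: len(items) for name, items in routes.items()})
```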

[0098] Step S905: Model iteration processing.

[0099] The goal of this step is to optimize the model using newly added labeled data, achieving a closed loop of "pre-labeling - human-computer processing - model iteration - re-pre-labeling", and continuously improving the accuracy and efficiency of model pre-labeling.

[0100] During the construction of the labeled sample set, the labeled sample set is formed by collecting visually inspected and corrected samples from sample A and manually labeled samples from sample D.

[0101] During the model iterative training process, the labeled dataset is merged with the initial labeled dataset to form the total training dataset. The basic training model is iteratively trained using the same training environment and parameter settings as in step S902. The training rounds are 100, and the model is validated every 10 rounds. The model with the best performance on the validation set is selected as the iterative training model.
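
The schedule of 100 training rounds with validation every 10 rounds can be sketched as a checkpoint-selection loop. The callables `train_one_round` and `validate` are hypothetical placeholders for the actual training and validation routines, and the toy score curve is invented so that the example runs without a model.

```python
def iterative_training(train_one_round, validate, rounds=100, validate_every=10):
    """Train for `rounds` rounds and keep the best validated checkpoint."""
    best_score, best_round, best_state = float("-inf"), None, None
    for r in range(1, rounds + 1):
        state = train_one_round(r)  # returns the model state after round r
        if r % validate_every == 0:
            score = validate(state)
            if score > best_score:
                best_score, best_round, best_state = score, r, state
    return best_round, best_score, best_state


# Toy stand-ins: the "state" is just the round number, and the validation
# score peaks at round 70 and then declines, as if the model overfits.
best_round, best_score, best_state = iterative_training(
    train_one_round=lambda r: r,
    validate=lambda state: -(state - 70) ** 2,
)
print(best_round, best_score)
```

Selecting by validation score rather than taking the final round guards against keeping a model that has started to overfit the enlarged training set.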

[0102] In the closed-loop iteration process, the iteratively trained model is applied to a new sample B to be pre-labeled, and steps S903-S905 are repeated to achieve model iteration. In the long run, as the model pre-labeling accuracy improves, the proportion of manual intervention gradually decreases, and labor costs continue to decrease, balancing short-term labeling needs with long-term economic efficiency.
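
The closed loop can be sketched as a driver that repeatedly stratifies the pending pool, moves high-confidence samples into the labeled set, and retrains. This is a simplified illustration: the sampling of sample D is omitted, the stopping rule (an empty pool or a maximum iteration count) is an assumption, and the improvement of the model is simulated by adding 0.1 to each sample's base confidence per iteration.

```python
def closed_loop(pool, predict_confidence, retrain, max_iterations=5):
    """Repeat pre-labeling, selection, and retraining on the pending pool."""
    labeled, history = [], []
    for iteration in range(max_iterations):
        if not pool:
            break
        high, rest = [], []
        for sample in pool:
            conf = predict_confidence(sample, iteration)
            (high if conf is not None and conf >= 0.9 else rest).append(sample)
        labeled.extend(high)                    # sample A after visual inspection
        history.append(len(high) / len(pool))   # share resolved this round
        pool = rest                             # new sample B for the next round
        retrain(labeled)
    return labeled, pool, history


# Toy simulation: each sample is its own base confidence.
pending = [0.95, 0.85, 0.75, 0.65, 0.3]
labeled, remaining, history = closed_loop(
    pending,
    predict_confidence=lambda s, it: min(1.0, s + 0.1 * it),
    retrain=lambda data: None,  # placeholder for steps S902/S905
)
print(len(labeled), remaining, history)
```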

[0103] As demonstrated by the above behavioral data annotation processing method, this method, through innovative network structure design, can more accurately identify small targets and complex poses, significantly improving annotation accuracy. Combining automatic batch pre-annotation with differentiated human-machine annotation strategies, it greatly improves annotation efficiency while ensuring quality, effectively reducing manpower and hardware costs. In addition, the model has good scene adaptability, can cope with the challenges of different lighting and backgrounds, and ensures the consistency of annotation data through standardized processes, ultimately providing high-quality and highly reliable training data for the distraction behavior recognition model.

[0104] Corresponding to the above embodiments of the behavior data annotation and processing method, this embodiment of the invention also provides a behavior data annotation and processing system. As shown in Figure 11, the system includes: a sample set construction module 100, used to collect posture images of target objects in vehicles, and to construct the initial sample set and the sample set to be labeled of the target objects based on the behavioral data of the posture images; a basic training model construction module 200, used to build a pre-labeled model based on the target object, and to obtain the basic training model after training the pre-labeled model using the initial sample set, wherein the backbone network of the pre-labeled model is a hierarchical structure; a pre-labeled sample acquisition module 300, used to acquire, using the basic training model, the pre-labeled samples corresponding to the sample set to be labeled and their corresponding confidence results; and a labeled dataset determination module 400, used to determine the labeling processing strategy corresponding to the pre-labeled samples based on the confidence results, and to obtain the labeled dataset corresponding to the behavioral data using the labeling processing strategy.

[0105] This invention also provides another behavioral data annotation and processing system, as shown in Figure 12, which includes the following. Sample set construction module 100: equipped with a vehicle-mounted high-definition camera and a data storage unit, it is used to collect image data covering different time periods, different routes, and different target postures; it has data classification and filtering functions, and can extract samples from the original dataset to construct the initial labeled dataset and mark the remaining samples as samples to be pre-labeled.

[0106] The basic training model building module 200 includes a network building unit, a parameter configuration unit, and a training execution unit. The network building unit is used to build a customized MobileNetV3-Small backbone network, add a small object detection head, and a multi-scale detection system. The parameter configuration unit is used to set parameters such as batch size, learning rate, and loss function for model training. The training execution unit executes the model training process based on the PyTorch framework, monitors the training process, and outputs the basic training model.

[0107] The pre-labeled sample acquisition module 300 includes a data loading unit, a batch inference unit, and a result storage unit. The data loading unit is used to read samples to be pre-labeled in batches. The batch inference unit calls the basic training model to perform parallel inference and applies confidence threshold and NMS threshold to filter valid labeling results. The result storage unit stores the pre-labeling results in an XML file in PASCAL VOC format and associates it with the image file.
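
A result storage unit of this kind could be sketched with the Python standard library as follows. The standard PASCAL VOC object entry has no confidence field, so the `confidence` element written here is an assumed extension to carry the category confidence; the function name `to_voc_xml`, the file name, and the depth value of 3 are likewise assumptions.

```python
import xml.etree.ElementTree as ET


def to_voc_xml(filename, width, height, detections):
    """Serialize detections to a PASCAL VOC style XML string."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename  # links XML to the image
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"  # assumed RGB input

    for det in detections:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = det["label"]
        # Non-standard element (assumption) holding the category confidence.
        ET.SubElement(obj, "confidence").text = f"{det['conf']:.4f}"
        box = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), det["box"]):
            ET.SubElement(box, tag).text = str(int(round(value)))
    return ET.tostring(root, encoding="unicode")


xml_text = to_voc_xml(
    "img_000001.jpg", 1920, 1080,
    [{"label": "phone", "box": (100, 100, 200, 300), "conf": 0.93}],
)
parsed = ET.fromstring(xml_text)
print(parsed.find("object/name").text, parsed.find("object/confidence").text)
```

Keeping the confidence inside each object entry lets the later stratification step read a single file per image instead of a separate score table.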

[0108] Uncertainty assessment module 500: Calculates the category confidence of pre-labeled samples through the confidence analysis unit, divides the samples into three categories of confidence samples (high, medium, and none) according to the confidence threshold, and stores them in the corresponding directories to provide a basis for hierarchical human-machine annotation.

[0109] The labeled dataset determination module 400 provides differentiated labeling tools and processes, including a rapid visual inspection tool (for sample A processing), a precise labeling tool (for sample D processing), and a sample merging unit (for merging the remaining part of sample C into sample B); it also has labeling result review and quality control functions to ensure the accuracy of labeled data.

[0110] Model Iteration Module 600: Used to merge labeled datasets with the initial sample set, configure iterative training parameters, execute the model iterative training process, and output a new iteratively trained model; supports model version management and can backtrack the model performance at different iteration stages.

[0111] Data transmission and interaction between the modules are achieved through data interfaces. The dataset output by the sample set construction module 100 is transmitted to the basic training model construction module 200. The basic training model output by the basic training model construction module 200 is transmitted to the pre-labeled sample acquisition module 300. The pre-labeled samples output by the pre-labeled sample acquisition module 300 are transmitted to the uncertainty assessment module 500. The classification results output by the uncertainty assessment module 500 trigger the corresponding labeling process of the labeled dataset determination module 400. The labeled dataset output by the labeled dataset determination module 400 is transmitted to the model iteration module 600. The iteratively trained model output by the model iteration module 600 is fed back to the pre-labeled sample acquisition module 300, forming a closed-loop operation.

[0112] As can be seen from the above behavior data annotation and processing system, through innovative model structure design, the system can accurately identify small targets and their complex postures, significantly improving the annotation accuracy of behavior data. In addition, the system can achieve automatic batch annotation and differentiated human-machine annotation of behavior data by combining annotation processing strategies with confidence results, which greatly improves annotation efficiency while ensuring the annotation effect of behavior data.

[0113] The behavioral data annotation and processing system provided in this embodiment of the invention has the same implementation principle and technical effects as the aforementioned behavioral data annotation and processing method embodiment. For the sake of brevity, any parts not mentioned in the system embodiment can be referred to the corresponding content in the aforementioned behavioral data annotation and processing method embodiment.

[0114] This embodiment also provides an electronic device, the structural schematic diagram of which is shown in Figure 13. The device includes a processor 101 and a memory 102, wherein the memory 102 is used to store one or more computer instructions, which are executed by the processor 101 to implement the steps of the above-described behavioral data annotation processing method.

[0115] The electronic device shown in Figure 13 also includes a bus 103 and a communication interface 104, with the processor 101, communication interface 104 and memory 102 connected via the bus 103.

[0116] The memory 102 may include high-speed random access memory (RAM) and may also include non-volatile memory, such as at least one disk storage device. The bus 103 may be an ISA bus, PCI bus, or EISA bus, etc. The bus can be divided into address bus, data bus, control bus, etc. For ease of representation, the bus is represented in Figure 13 by a single double-headed arrow, but this does not mean that there is only one bus or one type of bus.

[0117] The communication interface 104 is used to connect to at least one user terminal and other network units through a network interface, and to send encapsulated IPv4 packets to the user terminal through the network interface.

[0118] Processor 101 may be an integrated circuit chip with signal processing capabilities. In implementation, each step of the above method can be completed by the integrated logic circuitry in the hardware of processor 101 or by instructions in software form. The processor 101 can be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this disclosure. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this disclosure can be directly manifested as execution by a hardware decoding processor, or execution by a combination of hardware and software modules in the decoding processor. The software module can reside in a mature storage medium in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory 102. The processor 101 reads the information in memory 102 and, in conjunction with its hardware, completes the steps of the method described in the foregoing embodiments.

[0119] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, devices, and methods can be implemented in other ways. The system embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. Furthermore, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Additionally, the coupling or direct coupling or communication connection shown or discussed may be through some communication interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.

[0120] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0121] In addition, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0122] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a processor-executable, non-volatile, computer-readable storage medium. Based on this understanding, the technical solution of this invention, essentially, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, electronic device, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0123] Finally, it should be noted that the above-described embodiments are merely specific implementations of the present invention, used to illustrate the technical solutions of the present invention, and not to limit it. The scope of protection of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can still modify or easily conceive of changes to the technical solutions described in the foregoing embodiments within the technical scope disclosed in the present invention, or make equivalent substitutions for some of the technical features; and these modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A method for labeling and processing behavioral data, characterized in that the method includes: acquiring posture images of target objects in a vehicle, and constructing an initial sample set and a sample set to be labeled for the target objects based on the behavioral data of the posture images; constructing a pre-labeled model based on the target object, and obtaining a basic training model by training the pre-labeled model using the initial sample set, wherein the backbone network of the pre-labeled model is a hierarchical structure; obtaining pre-labeled samples and their corresponding confidence results using the basic training model; and determining, based on the confidence results, the annotation processing strategy corresponding to the pre-labeled samples, and using the annotation processing strategy to obtain the labeled dataset corresponding to the behavioral data.

2. The behavioral data annotation and processing method according to claim 1, characterized in that the steps of acquiring posture images of target objects in a vehicle and constructing an initial sample set and a sample set to be labeled for the target objects based on the behavioral data of the posture images include: identifying the target objects contained in the cockpit area of the target vehicle, wherein the target objects include at least a mobile phone, cigarettes, and a water cup; when the target vehicle is detected to be traveling on the target route, collecting multiple posture images corresponding to the target object using the camera components deployed in the target vehicle; determining the behavioral data of the posture images according to the type parameter corresponding to the target object; randomly selecting a first image set from the posture images according to the behavioral data, and determining a second image set corresponding to the first image set using the posture images; after labeling the first image set using the type parameters, obtaining an initial sample set corresponding to the target object; and determining, based on the second image set, the sample set to be labeled corresponding to the target object.

3. The behavioral data annotation and processing method according to claim 1, characterized in that, Constructing a pre-labeled model based on the target object includes: The target detection model is initialized using the type parameters corresponding to the target object, and the backbone network, neck network, and head network corresponding to the target detection model are obtained. The backbone network is updated using multiple sequentially connected convolutional neural network modules; wherein the backbone network acquires a first-scale feature map and a second-scale feature map according to preset downsampling parameters; wherein the resolution of the first-scale feature map is greater than the resolution of the second-scale feature map; the first-scale feature map is used to detect the first behavioral data corresponding to the first target object, the second-scale feature map is used to detect the second behavioral data corresponding to the second target object, and the size of the first target object is smaller than the size of the second target object; The first target detection head and the second target detection head corresponding to the target detection model are constructed based on the first scale feature map and the second scale feature map, respectively, and the neck network and the head network are updated through the first target detection head and the second target detection head; The pre-labeled model is constructed based on the updated backbone network, neck network, and head network; wherein the pre-labeled model is used to determine the behavioral data through the first scale feature map and the second scale feature map.

4. The behavioral data annotation and processing method according to claim 3, characterized in that, The base training model is obtained by training the pre-labeled model using the initial sample set, including: The first convolutional module, the second convolutional module, and the third convolutional module corresponding to the first target detection head in the pre-labeled model are determined; wherein, the first convolutional module is used to perform feature enhancement processing on the first scale feature map; the second convolutional module is used to obtain the category prediction result and the bounding box prediction result corresponding to the first behavior data; and the third convolutional module is used to update the parameter quantity corresponding to the first target detection head. Based on the prediction results, the category classification loss function corresponding to the pre-labeled model is determined, and based on the bounding box prediction results, the bounding box regression loss function corresponding to the pre-labeled model is determined. The training parameters corresponding to the pre-labeled model are determined by the first convolutional module, the second convolutional module, and the third convolutional module. The pre-labeled model is trained using the initial sample set and according to the training parameters, and the loss value corresponding to the pre-labeled model is obtained in real time using the category classification loss function and the bounding box regression loss function. When the loss value meets the preset loss threshold relationship, the training process of the pre-labeled model is stopped, and the current pre-labeled model is determined as the basic training model.

5. The behavioral data annotation and processing method according to claim 1, characterized in that the steps for obtaining the pre-labeled samples and their corresponding confidence results corresponding to the sample set to be labeled using the basic training model include: determining, based on the type parameters corresponding to the target object, the confidence threshold and the intersection-over-union (IoU) threshold for the sample set to be labeled; determining the batch inference strategy corresponding to the basic training model based on the confidence threshold and the IoU threshold; inputting the sample set to be labeled into the basic training model, controlling the basic training model through the batch inference strategy to output the pre-labeled information corresponding to the target object, and obtaining the category result, bounding box coordinates, and category confidence of the target object contained in the pre-labeled information; and determining the pre-labeled samples corresponding to the sample set to be labeled using the category results and the bounding box coordinates, and determining the confidence results corresponding to the pre-labeled samples using the category confidence.

6. The behavioral data annotation and processing method according to claim 5, characterized in that, Based on the confidence results, the annotation processing strategy corresponding to the pre-labeled samples is determined, including: The confidence value corresponding to the confidence result is determined based on the category confidence. If the confidence value is not less than the first confidence threshold, then the first annotation processing strategy corresponding to the pre-annotated sample is determined based on the bounding box coordinates and the category result; If the confidence value is greater than the second confidence threshold and less than the first confidence threshold, then the second labeling processing strategy corresponding to the pre-labeled sample is determined based on the category confidence. If the confidence value is equal to the second confidence threshold, then the third annotation processing strategy corresponding to the pre-annotated sample is determined based on the type parameter corresponding to the target object.

7. The behavioral data annotation and processing method according to claim 6, characterized in that, The annotation processing strategy is used to obtain the labeled dataset corresponding to the behavior data, including: When the annotation processing strategy is the first annotation processing strategy, the bounding box coordinates are used to determine the bounding box corresponding to the pose image, and the intersection-over-union ratio between the bounding box and the target object is used to determine the first annotation dataset corresponding to the behavior data under the category result. When the annotation processing strategy is the second annotation processing strategy, the second annotation dataset is determined based on the pre-annotated samples, and the pre-annotated samples are updated to the initial sample set and the sample set to be annotated based on the class confidence. When the annotation processing strategy is the third annotation processing strategy, a third image set that has been annotated is randomly selected from the pre-annotated samples according to the behavioral data, and a third annotation dataset is determined based on the third image set.

8. The behavioral data annotation and processing method according to claim 7, characterized in that, After the step of obtaining the labeled dataset corresponding to the behavior data using the labeled processing strategy, the method further includes: Based on the first labeled dataset and the third labeled dataset, determine the labeled sample set corresponding to the basic training model; After updating the labeled sample set to the initial sample set, the pre-labeled model is trained using the updated initial sample set to obtain the iterative training model corresponding to the basic training model.

9. A behavioral data annotation and processing system, characterized in that, The system includes: Sample set construction module: used to collect posture images of target objects in vehicles, and construct an initial sample set and a sample set to be labeled for the target objects based on the behavioral data of the posture images; Basic training model construction module: used to construct a pre-labeled model based on the target object, and to obtain a basic training model by training the pre-labeled model using the initial sample set; wherein, the backbone network of the pre-labeled model is a hierarchical structure; Pre-labeled sample acquisition module: used to acquire pre-labeled samples and their corresponding confidence results corresponding to the sample set to be labeled using the basic training model; The labeled dataset determination module is used to determine the labeling processing strategy corresponding to the pre-labeled sample based on the confidence result, and to obtain the labeled dataset corresponding to the behavioral data using the labeling processing strategy.

10. An electronic device, characterized in that, The electronic device includes a processor and a memory, the memory storing computer-executable instructions that can be executed by the processor, the processor executing the computer-executable instructions to implement the steps of the behavioral data annotation processing method according to any one of claims 1 to 8.