A Pedestrian Detection Method for Intelligent Connected Vehicles Based on Improved YOLO v3
By improving the YOLO v3 model and combining edge computing and cloud server collaborative training, a pedestrian detection model was generated, which solved the problems of slow pedestrian detection speed and low accuracy in intelligent connected vehicles, improved the real-time performance and accuracy of detection, and enhanced road driving safety.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGSU UNIV
- Filing Date
- 2023-03-24
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, intelligent connected vehicles may cause traffic accidents by ignoring pedestrians during driving due to limited or obstructed driver vision. Furthermore, pedestrian detection models are insufficient in terms of detection speed and accuracy.
An improved YOLO v3 model is adopted and trained collaboratively by edge computing nodes and cloud servers to generate a pedestrian detection model. The K-means++ clustering algorithm is used to generate prior boxes, and the improved YOLO v3 model is deployed on edge computing nodes and cloud servers respectively for feature extraction and fusion. The hybrid domain attention mechanism CBAM is used to improve feature extraction capability, and edge-cloud collaborative training is combined to improve detection accuracy.
It improves the real-time performance and accuracy of pedestrian detection, reduces the probability of missed or false detections of small pedestrians, and enhances the road safety of intelligent connected vehicles.
[Figure: abstract drawing, CN116229424B]
Description
Technical Field
[0001] This invention pertains to vehicle networking technology, specifically relating to a pedestrian detection method for intelligent connected vehicles based on an improved YOLO v3.
Background Technology
[0002] With the development of computer science and technology and the gradual improvement of living standards, the number of vehicles on the road keeps increasing. Pedestrian detection, as a key technology in autonomous driving and driver assistance, plays an important role in road driving safety. However, in complex road traffic, drivers may not have enough time to react because of factors such as limited visibility, which leads to traffic accidents. Using computer vision technology to improve the pedestrian detection rate helps drivers learn the pedestrian situation on the road in advance and leaves sufficient time to adjust the driving direction; using edge computing nodes improves the real-time performance of pedestrian detection; using cloud servers to cooperate with edge computing nodes for model training compensates for the insufficient computing power of the edge computing nodes; and, in intelligent connected vehicles, it also enables the control system to capture more pedestrian information and plan the driving direction correctly, thereby improving road driving safety.
Summary of the Invention
[0003] Purpose of the invention: The purpose of this invention is to address the shortcomings of existing technologies and provide an intelligent connected vehicle pedestrian detection method based on an improved YOLO v3. This invention can solve the problem of accidents caused by drivers ignoring pedestrians during driving due to limited vision or obstructed line of sight.
[0004] Technical solution: The present invention provides a pedestrian detection method for intelligent connected vehicles based on an improved YOLO v3, comprising the following steps:
[0005] Step S100: Send road video data to the edge computing node;
[0006] Step S200: The edge computing node preprocesses the received road video data to obtain the corresponding RGB image dataset, and uploads the RGB image dataset to the cloud server;
[0007] Step S300: Edge computing nodes use the K-means++ clustering algorithm to generate prior boxes;
[0008] Step S400: Generate a pedestrian detection model. The specific process is as follows:
[0009] Step S401: Improve the YOLO v3 model network. The improved YOLO v3 model network includes 27 CBL modules, 3 upsampling modules, 3 feature fusion modules, 7 CBAM modules, and 4 detection heads.
[0010] Step S402: Segment the improved YOLO v3 model network and deploy the segmented improved YOLO v3 model network on edge computing nodes and cloud servers respectively; wherein, on the edge computing nodes, an input layer for feature extraction, one CBL module and one residual block are constructed; on the cloud server, four residual blocks and 27 CBL modules for feature extraction and feature fusion are constructed.
[0011] Step S403: Use edge-cloud collaboration to train the model based on transfer learning: Train the model multiple times on the cloud server based on the road pedestrian dataset (e.g., the Caltech pedestrian dataset) to obtain the optimal weights, use them as the base model, and deploy them to the edge computing nodes. At the same time, use the base model as the initial parameters to train on the edge computing nodes; aggregate the optimal weights obtained from training on the cloud server on the edge computing nodes and generate a pedestrian detection model.
[0012] Step S500: Evaluate the pedestrian detection model and detection performance obtained in step S403 on the edge computing nodes:
[0013] Step S501: Use the pedestrian detection model on the edge computing node to perform pedestrian detection on the test set, and calculate the detection accuracy of the model according to the PASCAL VOC standard.
[0014] Step S502: Use the weight with the highest detection accuracy for pedestrian detection;
[0015] Step S600: The edge computing node performs pedestrian detection and uses the roadside communication unit (RSU) to broadcast the pedestrian detection results to the intelligent connected vehicle.
[0016] Furthermore, the detailed process of step S100 is as follows:
[0017] Step S101: Use the onboard camera of the intelligent connected vehicle to capture road video data while driving on the road;
[0018] Step S102: The intelligent connected vehicle compresses the collected road video data;
[0019] Step S103: The intelligent connected vehicle sends the compressed road video data to the edge computing node using Vehicle-to-Network (V2N) technology.
[0020] Furthermore, the detailed process of step S200 includes:
[0021] Step S201: The edge computing node uses the RTSP protocol to receive road video data;
[0022] Step S202: The edge computing node converts the road video data format into PASCAL VOC format image data;
[0023] Step S203: The edge computing node uses median filtering and the gradient method to preprocess the image data obtained in step S202, including denoising, brightness adjustment, and conversion into RGB images;
[0024] Step S204: The edge computing node uploads the processed RGB image dataset to the cloud server for data storage.
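The denoising part of step S203 can be illustrated with a minimal sketch. It assumes a 3×3 window, edge-replicated borders, and a NumPy array of shape (height, width, 3); none of these parameters are stated in the source, and the function name `median_filter3` is hypothetical. The gradient method and the brightness adjustment are not covered here.

```python
import numpy as np

def median_filter3(image):
    """Apply a 3x3 median filter to each channel of an (H, W, 3) image.

    Assumption: borders are handled by replicating edge pixels.
    """
    h, w = image.shape[:2]
    padded = np.pad(image, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # Stack the nine shifted views so that axis 0 enumerates each pixel's window.
    windows = np.stack(
        [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)],
        axis=0,
    )
    return np.median(windows, axis=0).astype(image.dtype)

# Usage: a single bright outlier pixel is removed by the filter.
frame = np.full((5, 5, 3), 100, dtype=np.uint8)
frame[2, 2] = 255
clean = median_filter3(frame)
```

A median filter is a common choice for salt-and-pepper noise because an isolated outlier never becomes the median of its window.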
[0025] Further, in step S300, 12 prior boxes are obtained by clustering with the K-means++ clustering algorithm, using the intersection over union (IOU) between the ground truth boxes and the bounding boxes as the distance metric. The result with an accuracy of 83.20% is used as the final value of the prior boxes, namely [6, 14], [8, 17], [8, 23], [10, 20], [11, 25], [11, 34], [14, 29], [15, 38], [18, 47], [23, 58], [33, 87] and [62, 158];
[0026] The distance metric is calculated as follows:
[0027] distance(box,centroid)=1-IOU(box,centroid) (1)
[0028] In the above formula, box represents bounding boxes; centroid represents anchor boxes, i.e., the cluster centers of the bounding boxes; IOU represents the intersection-union ratio of bounding boxes and anchor boxes;
[0029] The IOU is the intersection-union ratio of the actual bounding box and the cluster center, and it is calculated as follows:
[0030] IOU = area(Gt ∩ Dr) / area(Gt ∪ Dr) (2)
[0031] In the above formula, Gt (Ground Truth) represents the ground truth bounding box of the target, and Dr (Detection Result) represents the predicted bounding box of the target.
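As a small illustration of the IOU definition, the following sketch computes the ratio for two axis-aligned boxes. The corner format (x_min, y_min, x_max, y_max) and the function name `iou` are assumptions made for the example; the source does not specify a box encoding.

```python
def iou(gt, dr):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlap rectangle (may be empty).
    ix_min, iy_min = max(gt[0], dr[0]), max(gt[1], dr[1])
    ix_max, iy_max = min(gt[2], dr[2]), min(gt[3], dr[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    area_dr = (dr[2] - dr[0]) * (dr[3] - dr[1])
    union = area_gt + area_dr - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1) = 1/7.
example = iou((0, 0, 2, 2), (1, 1, 3, 3))
```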
[0032] The logic of the K-means++ algorithm in this invention is as follows:
[0033] (1) Select K points that are sufficiently far apart as cluster centers;
[0034] (2) Calculate the distance between each point in the dataset and the cluster center;
[0035] (3) Select the next cluster center using the roulette wheel method;
[0036] (4) Calculate the mean IOU of each cluster and update the cluster center.
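The clustering steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes boxes are represented only by (width, height), that the roulette wheel weights are the squared 1 − IOU distances (the usual K-means++ choice), and that a cluster center is updated to the mean width and height of its members. The names `wh_iou` and `kmeanspp_anchors` are hypothetical.

```python
import numpy as np

def wh_iou(boxes, centroids):
    """IOU between (N, 2) box sizes and (K, 2) centroid sizes, all aligned at the origin."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeanspp_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs with distance = 1 - IOU; return anchors and mean IOU."""
    rng = np.random.default_rng(seed)
    boxes = np.asarray(boxes, dtype=float)
    # Seeding: first center uniformly at random, later centers by roulette wheel,
    # so that boxes far from every chosen center are more likely to be picked.
    centroids = [boxes[rng.integers(len(boxes))]]
    while len(centroids) < k:
        d = (1.0 - wh_iou(boxes, np.array(centroids))).min(axis=1)
        weights = d ** 2
        total = weights.sum()
        probs = weights / total if total > 0 else np.full(len(boxes), 1.0 / len(boxes))
        centroids.append(boxes[rng.choice(len(boxes), p=probs)])
    centroids = np.array(centroids)
    # Refinement: assign each box to its nearest center, then update the centers.
    for _ in range(iters):
        assign = np.argmin(1.0 - wh_iou(boxes, centroids), axis=1)
        updated = np.array([
            boxes[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    mean_iou = wh_iou(boxes, centroids).max(axis=1).mean()
    order = np.argsort(centroids[:, 0] * centroids[:, 1])
    return centroids[order], mean_iou

# Usage on a toy set with three obvious size groups.
toy_boxes = [[10, 20]] * 5 + [[30, 60]] * 5 + [[60, 120]] * 5
anchors, mean_iou = kmeanspp_anchors(toy_boxes, k=3)
```

With real annotations, `k=12` would correspond to the 12 prior boxes of step S300, and the returned mean IOU plays the role of the clustering accuracy.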
[0037] Furthermore, for the improved YOLO v3 model network in step S401,
[0038] Each CBL module includes a two-dimensional convolution (Darknet Conv2D) structure, batch normalization, and a Leaky ReLU non-linear activation function;
[0039] Each upsampling module uses 3×3 and 1×1 convolutions to change the size of the feature layer;
[0040] The three feature fusion modules stack the output feature layer of the second-to-last residual block of the backbone network with the output feature layer of the first upsampling module, stack the output feature layer of the third-to-last residual block with the output feature layer of the second upsampling module, and stack the output feature layer of the fourth-to-last residual block with the output feature layer of the third upsampling module.
[0041] The seven CBAM modules employ a hybrid domain attention mechanism to score both channel attention and spatial attention simultaneously, thereby reducing the loss of effective information.
[0042] The four detection heads are output feature layers of 13×13×18, 26×26×18, 52×52×18 and 104×104×18 respectively.
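The four output sizes can be checked with simple arithmetic. The sketch below assumes a 416×416 input (the canvas size used for data augmentation), downsampling strides of 32, 16, 8 and 4, the 12 prior boxes split evenly over the four heads, and a single pedestrian class; the strides and the class count are assumptions chosen because they reproduce the stated 18 channels. The function name `head_shapes` is hypothetical.

```python
def head_shapes(input_size=416, num_priors=12, strides=(32, 16, 8, 4), num_classes=1):
    """Return the (height, width, channels) of each detection head."""
    anchors_per_head = num_priors // len(strides)          # 12 / 4 = 3
    # Each anchor predicts 4 box offsets, 1 confidence and the class scores.
    channels = anchors_per_head * (4 + 1 + num_classes)    # 3 * 6 = 18
    return [(input_size // s, input_size // s, channels) for s in strides]

shapes = head_shapes()
# shapes == [(13, 13, 18), (26, 26, 18), (52, 52, 18), (104, 104, 18)]
```

The extra 104×104 head is the finest grid, which is consistent with the stated aim of reducing missed detections of small pedestrians.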
[0043] The backbone network of the improved YOLO v3 model network is based on Darknet53 and adds the hybrid domain attention mechanism CBAM, which includes spatial attention and channel attention. Specifically, CBAM modules are inserted directly between the residual blocks of the backbone network to enhance feature extraction. The feature layers input to these CBAM modules are 208×208×64, 104×104×128, 52×52×256, and 26×26×512. The backbone network adds one upsampling step that transforms the 52×52×128 feature layer into a 104×104×64 feature layer. The 104×104×128 feature layer output from the second residual block of the backbone network is then fused with the 104×104×64 feature layer in a manner similar to FPN and convolved, so that the final output has four detection heads. After the feature fusion outputs of the backbone, three more CBAM modules are added to improve feature extraction accuracy.
[0044] Furthermore, the detailed process of edge computing nodes and cloud servers performing edge-cloud collaborative training in step S403 is as follows:
[0045] First, the data is augmented, and convolution operations are performed on the edge computing nodes to obtain the 208×208×64 residual block output. (The data augmentation method is as follows: taking 416×416 as the base size, two proportions are randomly drawn within a small range to serve as the width and height of the newly generated image; the image is then placed on a 416×416 canvas, and the blank areas are filled with gray bars.)
[0046] Then, the convolution results are uploaded to the cloud server to complete the feature extraction and three-stage feature fusion of Darknet53; during the training process, the training set is input into the improved YOLO v3 model in batches for forward propagation and the loss value of the parameters is calculated.
[0047] Finally, four detection heads are output, and the training parameters are adjusted through backpropagation.
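The data augmentation described in the first step can be sketched as below. The proportion range (0.5 to 1.0), the gray value 128, the nearest-neighbour resize, and the random placement on the canvas are assumptions made for the example, and `random_letterbox` is a hypothetical name. In a real pipeline the bounding box annotations would have to be transformed with the same scale and offset.

```python
import numpy as np

def random_letterbox(image, size=416, low=0.5, high=1.0, gray=128, seed=None):
    """Resize an (H, W, 3) image to random proportions and paste it on a gray canvas."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    # Two random proportions give the width and height of the new image.
    new_w = max(1, int(size * rng.uniform(low, high)))
    new_h = max(1, int(size * rng.uniform(low, high)))
    # Nearest-neighbour resize using integer index maps.
    rows = (np.arange(new_h) * h) // new_h
    cols = (np.arange(new_w) * w) // new_w
    resized = image[rows][:, cols]
    # Gray canvas; the blank areas around the pasted image remain gray bars.
    canvas = np.full((size, size, 3), gray, dtype=image.dtype)
    top = int(rng.integers(0, size - new_h + 1))
    left = int(rng.integers(0, size - new_w + 1))
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas, (left, top, new_w, new_h)

# Usage: a black test frame is placed somewhere on the 416x416 gray canvas.
frame = np.zeros((100, 200, 3), dtype=np.uint8)
canvas, (left, top, new_w, new_h) = random_letterbox(frame, seed=0)
```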
[0048] The loss values calculated here include: predicted box loss, confidence loss, and class loss; the predicted box loss uses the sum of squared errors loss, and the confidence loss and class loss use binary cross-entropy, as shown in the following formulas:
[0049]
[0050] In the above formula, λ_coord is the penalty coefficient for coordinate prediction; λ_noobj is the penalty coefficient for the confidence when there is no target; S×S is the number of grid cells; M is the number of bounding boxes predicted for each grid cell; t_x, t_y, t_w and t_h are the x-coordinate, y-coordinate, width and height of the predicted target center; t′_x, t′_y, t′_w and t′_h are the x-coordinate, y-coordinate, width and height of the true target center; I_ij^obj indicates that the j-th candidate box in the i-th grid cell is responsible for detecting the target; I_ij^noobj indicates that the j-th candidate box in the i-th grid cell is not responsible for detecting the target; c_i is the predicted confidence of belonging to a certain category; c′_i is the true confidence of belonging to a certain category; p_i(c) is the predicted probability that the target in the i-th grid cell belongs to a certain category; p′_i(c) is the true probability that the target in the i-th grid cell belongs to a certain category; c is the category, and classes is the total number of categories.
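One loss consistent with this description (sum of squared errors for the box terms, binary cross-entropy for the confidence and class terms) can be sketched as follows. The indicator symbols $I_{ij}^{obj}$ and $I_{ij}^{noobj}$ and the exact placement of the penalty coefficients are assumptions borrowed from the standard YOLO formulation, not taken from the source.

```latex
\begin{aligned}
Loss ={}& \lambda_{coord} \sum_{i=0}^{S \times S} \sum_{j=0}^{M} I_{ij}^{obj}
  \left[ (t_x - t'_x)^2 + (t_y - t'_y)^2 + (t_w - t'_w)^2 + (t_h - t'_h)^2 \right] \\
&- \sum_{i=0}^{S \times S} \sum_{j=0}^{M} I_{ij}^{obj}
  \left[ c'_i \log c_i + (1 - c'_i) \log (1 - c_i) \right] \\
&- \lambda_{noobj} \sum_{i=0}^{S \times S} \sum_{j=0}^{M} I_{ij}^{noobj}
  \left[ c'_i \log c_i + (1 - c'_i) \log (1 - c_i) \right] \\
&- \sum_{i=0}^{S \times S} \sum_{j=0}^{M} I_{ij}^{obj} \sum_{c \in classes}
  \left[ p'_i(c) \log p_i(c) + (1 - p'_i(c)) \log (1 - p_i(c)) \right]
\end{aligned}
```

The first line is the predicted box loss, the second and third lines are the confidence loss with and without a target, and the last line is the class loss.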
[0051] Here, Leaky ReLU is used as the activation function during forward propagation. The Leaky ReLU activation function performs a non-linear transformation on the output of the previous feature layer. The Leaky ReLU expression is as follows:
[0052] f(x) = x, if x > 0; f(x) = αx, if x ≤ 0 (4)
[0053] In the above formula, α is a fixed parameter within (0,1), and x is the output value;
[0054] The target detection results are filtered using nonmaximum suppression (NMS) based on the confidence parameter.
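A minimal sketch of confidence-based non-maximum suppression is given below. The confidence threshold 0.5, the IOU threshold 0.45, and the corner box format (x_min, y_min, x_max, y_max) are assumptions for illustration; the source does not state the values it uses, and the function name `nms` is hypothetical.

```python
import numpy as np

def nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    """Return the indices of the boxes kept after non-maximum suppression."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Drop low-confidence boxes, then visit the rest from highest to lowest score.
    idx = np.where(scores >= conf_thresh)[0]
    idx = idx[np.argsort(-scores[idx])]
    keep = []
    while idx.size > 0:
        best, rest = idx[0], idx[1:]
        keep.append(int(best))
        # IOU between the best box and every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        overlap = inter / (area_best + area_rest - inter)
        # Keep only boxes that do not overlap the best box too strongly.
        idx = rest[overlap <= iou_thresh]
    return keep

# Usage: box 1 duplicates box 0, box 3 is below the confidence threshold.
kept = nms(
    [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60], [0, 0, 10, 10]],
    [0.9, 0.8, 0.7, 0.3],
)
```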
[0055] Further, the PASCAL VOC standard in step S501 includes:
[0056] mAP, Precision, Recall, and F1 are calculated as follows:
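[0057] mAP = (1/N) × Σ_{k=1..N} AP_k
[0058] AP = ∫_0^1 p(r) dr
[0059] where N is the number of detection categories and p(r) is the Precision at Recall r.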
[0060] Precision = TP / (TP + FP); Recall = TP / (TP + FN)
[0061] Here, True Positive (TP): positive samples are predicted as positive samples; False Negative (FN): positive samples are predicted as negative samples; False Positive (FP): negative samples are predicted as positive samples; True Negative (TN): negative samples are predicted as negative samples.
[0062] F1 = 2 × Precision × Recall / (Precision + Recall)
[0063] That is, F1 is the harmonic mean of Precision and Recall.
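The Precision, Recall and F1 quantities can be computed directly from the TP, FP and FN counts. The sketch below uses the standard definitions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions for the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall and F1 from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1 is the harmonic mean of Precision and Recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Usage: 8 correct detections, 2 false alarms, 2 missed pedestrians.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

In this example Precision, Recall and F1 all equal 0.8, because the harmonic mean of two equal values is that value.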
[0064] Furthermore, the detailed method of step S502 is as follows:
[0065] The model with the highest mAP is selected as the pedestrian detection model.
[0066] The steps of pedestrian detection are as follows: (1) read the model parameters; (2) input the detection data; (3) use the prior boxes of K-means++ clustering as the initial boxes for pedestrian detection, decode the target boxes, and obtain the coordinates of the pedestrian center point and the width and height parameters.
[0067] The method for decoding the target bounding box is as follows: resize the image without distortion; normalize the image; generate prior bounding box parameters (x, y, w, h); generate prior bounding box adjustment parameters; and return the output result as the predicted bounding box.
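The decoding step can be sketched as below, assuming the standard YOLO v3 parameterization: the center is the sigmoid of the predicted offset plus the grid cell index, scaled by the stride, and the width and height are the prior box size multiplied by the exponential of the predicted adjustment. The source does not spell out these formulas, so they are an assumption, as are the tensor layout (row, column, anchor, 4) and the name `decode_boxes`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_boxes(raw, priors, input_size=416):
    """Decode raw head outputs into (center_x, center_y, width, height) in pixels.

    raw:    (G, G, A, 4) array of adjustment parameters t_x, t_y, t_w, t_h
    priors: (A, 2) array of prior box (width, height) in pixels
    """
    raw = np.asarray(raw, dtype=float)
    priors = np.asarray(priors, dtype=float)
    grid = raw.shape[0]
    stride = input_size / grid
    # cols[i, j] = j (x index), rows[i, j] = i (y index) of each grid cell.
    cols, rows = np.meshgrid(np.arange(grid), np.arange(grid))
    cx = (sigmoid(raw[..., 0]) + cols[..., None]) * stride
    cy = (sigmoid(raw[..., 1]) + rows[..., None]) * stride
    w = priors[:, 0] * np.exp(raw[..., 2])
    h = priors[:, 1] * np.exp(raw[..., 3])
    return np.stack([cx, cy, w, h], axis=-1)

# Usage: zero adjustments place each box at its cell center with the prior size.
priors_13 = [[23, 58], [33, 87], [62, 158]]
decoded = decode_boxes(np.zeros((13, 13, 3, 4)), priors_13)
```

The three largest prior boxes are paired with the 13×13 head here only for illustration; how the 12 prior boxes are assigned to the four heads is not stated in the source.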
[0068] Beneficial Effects: This invention utilizes an improved YOLO v3 model with edge-cloud collaborative training to effectively address the issues of slow pedestrian detection speed and low model accuracy in intelligent connected vehicle scenarios. Using edge computing nodes improves the real-time performance of pedestrian detection; using cloud servers to collaborate with edge computing nodes for model training compensates for the insufficient computing power of edge computing nodes; simultaneously, in intelligent connected vehicles, the control system can capture more pedestrian information to correctly plan driving directions, thereby improving road safety. The improvements to the YOLO v3 algorithm in this invention enhance the accuracy of pedestrian detection and reduce the problems of missed and false detections of small pedestrians.
Attached Figure Description
[0069] Figure 1 is a flowchart of pedestrian detection in one embodiment of the present invention;
[0070] Figure 2 is a pedestrian detection structure diagram in one embodiment of the present invention;
[0071] Figure 3 is a diagram of the improved YOLO v3 model network structure in one embodiment of the present invention;
[0072] Figure 4 is a diagram of the existing YOLO v3 model network structure.
Detailed Implementation
[0073] The technical solution of the present invention will be described in detail below, but the scope of protection of the present invention is not limited to the embodiments described.
[0074] Example 1:
[0075] As shown in Figure 1, the pedestrian detection method for intelligent connected vehicles based on the improved YOLO v3 in this embodiment includes the following steps:
[0076] Step S100: Send road video data to the edge computing node;
[0077] Step S200: The edge computing node preprocesses the received road video data to obtain the corresponding RGB image dataset, and uploads the RGB image dataset to the cloud server for data storage.
[0078] Step S300: Edge computing nodes use the K-means++ clustering algorithm to generate prior boxes;
[0079] Step S400: Generate a pedestrian detection model. The specific process is as follows:
[0080] Step S401: Improve the YOLO v3 model network. The improved YOLO v3 model network includes 27 CBL modules, 3 upsampling modules, 3 feature fusion modules, 7 CBAM modules, and 4 detection heads.
[0081] Step S402: Segment the improved YOLO v3 model network and deploy the segmented improved YOLO v3 model network on edge computing nodes and cloud servers respectively; wherein, on the edge computing nodes, an input layer for feature extraction, one CBL module and one residual block are constructed; on the cloud server, four residual blocks and 27 CBL modules for feature extraction and feature fusion are constructed.
[0082] Step S403: Use edge-cloud collaboration to train the model based on transfer learning: Train the model multiple times on the cloud server based on the publicly available road pedestrian dataset to obtain the optimal weights, use it as the base model, and deploy it to the edge computing node. At the same time, use the base model as the initial parameters to train on the edge computing node; aggregate the model parameters trained by edge-cloud collaboration on the edge computing node and generate a pedestrian detection model.
[0083] Step S500: Evaluate the pedestrian detection model and detection performance obtained in step S403 on the edge computing nodes:
[0084] Step S501: Use the pedestrian detection model on the edge computing node to perform pedestrian detection on the test set, and calculate the detection accuracy of the model according to the PASCAL VOC standard.
[0085] Step S502: Select a model based on its detection accuracy using a voting mechanism, and use it for pedestrian detection;
[0086] Step S600: The edge computing node performs pedestrian detection and uses the roadside communication unit (RSU) to broadcast the pedestrian detection results to the intelligent connected vehicle.
[0087] Example 2:
[0088] As shown in Figure 2, the detection process in this embodiment is the same as in Embodiment 1. The pedestrian detection system structure includes a road video data acquisition terminal, an edge computing node, and a cloud server. The road video data acquisition terminal is the onboard camera of an intelligent connected vehicle, which collects road video data during driving and communicates with the roadside communication unit (RSU) via V2N. After the onboard camera collects data, the data is broadcast to the RSU in real time. After receiving the video data, the RSU transmits it to the edge computing node using the RTSP protocol. The edge computing node preprocesses the images with basic image processing algorithms; the preprocessing includes noise reduction, brightness adjustment, and conversion to RGB images. The edge-cloud collaborative model network based on the improved YOLO v3 model is trained on the image data to generate a model at the edge computing node, and the PASCAL VOC standard is applied to evaluate the model and determine the model weights for final application. Pedestrian detection is performed at the edge computing node using these model weights, and the pedestrian detection results are sent to the RSU, which then broadcasts them to the intelligent connected vehicle terminal. The edge computing node includes an edge computing software unit, an intelligent processing unit, an embedded operating system, and an edge computing hardware unit. The intelligent processing unit of the edge node is the core unit of this invention: it implements the image preprocessing functions, clusters the anchor boxes, and completes part of the training tasks of the improved YOLO v3. The cloud server stores and trains on the global data, undertakes part of the computation of the edge node model training, and aggregates the obtained parameters with the edge node parameters.
[0089] Example 3:
[0090] The other detection procedures in this embodiment are the same as in Embodiment 1. The improved YOLO v3 model network is shown in Figure 3, and Figure 4 shows the existing YOLO v3 model network structure. Comparing Figure 3 with Figure 4 shows that the backbone network of this invention, based on the existing Darknet53, adds the hybrid domain attention mechanism CBAM, which includes spatial attention and channel attention. CBAM modules are inserted directly between the residual blocks of the backbone network to enhance feature extraction. The feature layers input to these CBAM modules are 208×208×64, 104×104×128, 52×52×256, and 26×26×512. The backbone adds one upsampling step that transforms the 52×52×128 feature layer into a 104×104×64 feature layer. The 104×104×128 feature layer output from the second residual block of the backbone network is then fused with the 104×104×64 feature layer in a manner similar to FPN and convolved to output four detection heads. After the feature fusion outputs of the backbone, three more CBAM modules are added to improve feature extraction accuracy.
Claims
1. A pedestrian detection method for intelligent connected vehicles based on an improved YOLO v3, characterized in that it includes the following steps:
Step S100: Send road video data to the edge computing node;
Step S200: The edge computing node preprocesses the received road video data to obtain the corresponding RGB image dataset, and uploads the RGB image dataset to the cloud server;
Step S300: Edge computing nodes use the K-means++ clustering algorithm to generate prior boxes;
Step S400: Generate a pedestrian detection model. The specific process is as follows:
Step S401: Improve the YOLO v3 model network. The improved YOLO v3 model network includes 27 CBL modules, 3 upsampling modules, 3 feature fusion modules, 7 CBAM modules, and 4 detection heads. Each CBL module includes a two-dimensional convolution (Darknet Conv2D) structure, batch normalization, and a Leaky ReLU non-linear activation function; each upsampling module uses 3×3 and 1×1 convolutions to change the size of the feature layer; the three feature fusion modules stack the output feature layer of the second-to-last residual block of the backbone network with the output feature layer of the first upsampling module, stack the output feature layer of the third-to-last residual block with the output feature layer of the second upsampling module, and stack the output feature layer of the fourth-to-last residual block with the output feature layer of the third upsampling module; the seven CBAM modules employ a hybrid domain attention mechanism to score both channel attention and spatial attention simultaneously; the four detection heads are output feature layers of 13×13×18, 26×26×18, 52×52×18 and 104×104×18 respectively;
Step S402: Segment the improved YOLO v3 model network and deploy the segmented improved YOLO v3 model network on edge computing nodes and cloud servers respectively; wherein, on the edge computing nodes, an input layer for feature extraction, one CBL module and one residual block are constructed; on the cloud server, four residual blocks and 27 CBL modules for feature extraction and feature fusion are constructed;
Step S403: Use edge-cloud collaboration to train the model based on transfer learning: train the model multiple times on the cloud server based on the road pedestrian dataset to obtain the optimal weights, use them as the base model, and deploy them to the edge computing nodes; at the same time, use the base model as the initial parameters to train on the edge computing nodes; aggregate the optimal weights obtained from training on the cloud server on the edge computing nodes and generate a pedestrian detection model;
Step S500: Evaluate the pedestrian detection model and detection performance obtained in step S403 on the edge computing nodes:
Step S501: Use the pedestrian detection model on the edge computing node to perform pedestrian detection on the test set, and calculate the detection accuracy of the model according to the PASCAL VOC standard;
Step S502: Use the weight with the highest detection accuracy for pedestrian detection;
Step S600: The edge computing node performs pedestrian detection and uses the roadside communication unit (RSU) to broadcast the pedestrian detection results to the intelligent connected vehicle.
2. The pedestrian detection method for intelligent connected vehicles based on an improved YOLO v3 according to claim 1, characterized in that the detailed process of step S100 is as follows:
Step S101: Use the onboard camera of the intelligent connected vehicle to capture road video data while driving on the road;
Step S102: The intelligent connected vehicle compresses the collected road video data;
Step S103: The intelligent connected vehicle sends the compressed road video data to the edge computing node using Vehicle-to-Network (V2N) technology.
3. The pedestrian detection method for intelligent connected vehicles based on an improved YOLO v3 according to claim 1, characterized in that the detailed process of step S200 includes:
Step S201: The edge computing node uses the RTSP protocol to receive road video data;
Step S202: The edge computing node converts the road video data format into PASCAL VOC format image data;
Step S203: The edge computing node uses median filtering and the gradient method to preprocess the image data obtained in step S202, including denoising, brightness adjustment, and conversion into RGB images;
Step S204: The edge computing node uploads the processed RGB image dataset to the cloud server for data storage.
4. The pedestrian detection method for intelligent connected vehicles based on an improved YOLO v3 according to claim 1, characterized in that, in step S300, 12 prior boxes are obtained by clustering with the K-means++ clustering algorithm, using the intersection over union (IOU) between the ground truth boxes and the bounding boxes as the distance metric. The result with an accuracy of 83.20% is used as the final value of the prior boxes, namely [6, 14], [8, 17], [8, 23], [10, 20], [11, 25], [11, 34], [14, 29], [15, 38], [18, 47], [23, 58], [33, 87] and [62, 158];
The distance metric is calculated as follows:
distance(box,centroid)=1-IOU(box,centroid) (1)
In the above formula, box represents bounding boxes; centroid represents anchor boxes, i.e., the cluster centers of the bounding boxes; IOU represents the intersection-union ratio of bounding boxes and anchor boxes;
The IOU is the intersection-union ratio of the actual bounding box and the cluster center, and it is calculated as follows:
IOU = area(Gt ∩ Dr) / area(Gt ∪ Dr) (2)
In the above formula, Gt represents the ground truth bounding box of the target, and Dr represents the predicted bounding box of the target.
5. The pedestrian detection method for intelligent connected vehicles based on an improved YOLO v3 according to claim 1, characterized in that the detailed process of edge computing nodes and cloud servers performing edge-cloud collaborative training in step S403 is as follows:
First, the data is augmented, and then convolution operations are performed on the edge computing nodes to obtain the 208×208×64 residual block output;
Then, the convolution result is uploaded to the cloud server to complete the feature extraction and three-stage feature fusion of Darknet53; during training, the training set is input into the improved YOLO v3 model in batches, forward propagation is performed, and the loss values of the parameters are calculated;
Finally, four detection heads are output, and the training parameters are adjusted through backpropagation.
The calculated loss values include: predicted box loss, confidence loss, and class loss; the predicted box loss uses the sum of squared errors loss, and the confidence loss and class loss use binary cross-entropy, as given by formula (3).
In formula (3), λ_coord is the penalty coefficient for coordinate prediction; λ_noobj is the penalty coefficient for the confidence when there is no target; S×S is the number of grid cells; M is the number of bounding boxes predicted for each grid cell; t_x, t_y, t_w and t_h are the x-coordinate, y-coordinate, width and height of the predicted target center; t′_x, t′_y, t′_w and t′_h are the x-coordinate, y-coordinate, width and height of the true target center; I_ij^obj indicates that the j-th candidate box in the i-th grid cell is responsible for detecting the target; I_ij^noobj indicates that the j-th candidate box in the i-th grid cell is not responsible for detecting the target; c_i is the predicted confidence of belonging to a certain category; c′_i is the true confidence of belonging to a certain category; p_i(c) is the predicted probability that the target in the i-th grid cell belongs to a certain category; p′_i(c) is the true probability that the target in the i-th grid cell belongs to a certain category; c is the category, and classes is the total number of categories.
The forward propagation described above uses Leaky ReLU as the activation function, as shown in the following formula:
f(x) = x, if x > 0; f(x) = αx, if x ≤ 0 (4)
In the above formula, α is a fixed parameter within (0,1), and x is the output value;
The target detection results are filtered using non-maximum suppression (NMS) based on the confidence parameter.
6. The pedestrian detection method for intelligent connected vehicles based on an improved YOLO v3 according to claim 1, characterized in that the detailed method of step S502 includes: the model with the highest mAP is selected as the pedestrian detection model. The steps of pedestrian detection are as follows: (1) reading model parameters; (2) inputting detection data; (3) using the prior boxes of K-means++ clustering as the initial boxes for pedestrian detection, and performing target box decoding to obtain the coordinates of the pedestrian center point and the width and height parameters; wherein the target box decoding method includes: resizing the image without distortion; normalizing the image; generating prior box parameters (x, y, w, h); generating prior box adjustment parameters; and returning the output result as the prediction box.