A monocular-camera-based real-time 3D detection method, system, device and medium for traffic cones
By combining a monocular camera with YOLOv5 and PatchNet algorithms, the three-dimensional spatial information of traffic cones can be detected in real time, solving the problems of high cost and insufficient accuracy in existing technologies. This achieves low-cost, high-precision traffic cone detection, supporting traffic management and autonomous driving.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIDIAN UNIV
- Filing Date
- 2024-01-25
- Publication Date
- 2026-06-23
AI Technical Summary
In existing technologies, the detection of three-dimensional spatial information of traffic cones relies on lidar or depth cameras, which results in high costs and inconvenient maintenance. Existing algorithms are insufficient in terms of detection accuracy and speed, and cannot meet the needs of real-time traffic management.
A monocular camera combined with the YOLOv5 algorithm is used to detect the bounding box of the cone. PatchNet, a deep residual network, is used to regress the key points of the cone edge, and the 3D spatial information is obtained through the inverse projection algorithm, which reduces the detection cost and improves real-time performance and accuracy.
It achieves low-cost, high-precision 3D spatial information detection of traffic cones, and can provide accurate road conditions in real time in complex traffic environments, providing support for traffic management and autonomous driving.
Figure CN117726880B_ABST
Description
Technical Field
[0001] This invention belongs to the field of computer vision and image processing technology, and specifically relates to a method, system, device and medium for real-time 3D detection of traffic cones based on a monocular camera.
Background Technology
[0002] Traffic cones are commonly used as markers in road construction and traffic management on complex road sections, and they have significant practical value. Accurately detecting the information of traffic cones in three-dimensional space can improve road safety, traffic efficiency, and the effectiveness of construction management. In autonomous driving scenarios, vehicle-mounted cameras also need to detect traffic cones in real time to comply with traffic commands and avoid obstacles. Single-stage object detection algorithms can detect the bounding boxes of cones in images in real time, but obtaining the depth information of cones often requires the use of LiDAR or depth cameras, the high cost of which limits their application in various industrial scenarios. Advances in computer vision have shown that even monocular cameras can be used to reveal the physical location of cones in the 3D world, and using monocular cameras is more cost-effective and maintainable.
[0003] Significant progress has been made in the detection of traffic signs and traffic lights in recent years. However, real-time detection of traffic cones has not received enough attention. Existing target detection algorithms for road traffic cones can detect target bounding boxes in real time. However, to obtain information about the cones in three-dimensional space, it is necessary to use LiDAR or depth cameras. The obvious disadvantage of this approach is that it is expensive and difficult to maintain.
[0004] Patent application CN114724105A discloses a cone recognition method in complex backgrounds based on a cloud-edge-device architecture. The cloud-edge-device architecture consists of a cloud server, an edge controller, and a terminal. The terminal is responsible for collecting image data and uploading the collected images of the road conditions ahead to the edge controller with which it establishes communication. The edge controller identifies the location information of the cones in the image and sends it back to the terminal device. However, since this method requires the deployment of edge controllers on both sides of the road in advance to establish communication with the terminal, the terminal device cannot directly process the image data, resulting in problems such as complex structure, high cost, and large detection latency.
[0005] Patent application CN112183485A discloses a traffic cone detection and localization method, system, and storage medium based on deep learning. By detecting traffic cones in color images and depth images respectively and matching the results of the two, the final category and three-dimensional spatial location of the traffic cones are obtained. However, since this method must obtain the depth information of the cones by using a binocular camera, and the YOLOv4 algorithm used in the color image has insufficient detection accuracy and speed, it results in problems such as high cost and poor detection performance.
Summary of the Invention
[0006] To overcome the shortcomings of the existing technology, the present invention aims to provide a method, system, device, and medium for real-time 3D detection of traffic cones based on a monocular camera. The method utilizes the YOLOv5 algorithm to detect traffic cones and extracts individual image blocks of each traffic cone from the original image. Then, it uses PatchNet based on a deep residual network (ResNet) to regress the key points of the traffic cone edges and completes the key point mapping through an inverse projection algorithm, thereby obtaining the information of the cones in three-dimensional space. The present invention is simple to implement, low in cost, and can be deployed in surveillance cameras to provide timely road condition information, creating conditions for traffic management departments to optimize traffic planning and decision-making processes.
[0007] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0008] A method for real-time 3D detection of traffic cones based on a monocular camera includes the following steps:
[0009] S1. Collect image frames containing traffic cones captured from a monocular camera in different scenarios, construct a traffic cone dataset and manually label the bounding boxes of traffic cones. Use data augmentation to expand the traffic cone dataset, use the expanded traffic cone dataset to train a YOLOv5 neural network model, and use the trained parameter weights to perform 2D object detection on the traffic cone images to obtain the category and bounding box of the traffic cones.
[0010] S2, extract each traffic cone image patch using the category and bounding box obtained in step S1, and perform keypoint regression on the traffic cone image patch using PatchNet based on deep residual network (ResNet);
[0011] S3. Use the inverse projection algorithm to map the key points of the traffic cones after regression in step S2 to three-dimensional space to obtain the three-dimensional spatial information of the traffic cones.
[0012] The specific process of step S1 is as follows:
[0013] S1-1: Collect image frames containing traffic cones captured from a monocular camera in real road scenes, construct a traffic cone image dataset. The traffic cone image dataset includes samples from different times of day, different seasons, different light intensities, and different angles. Manually label the traffic cone image dataset to obtain the category of traffic cones and the normalized pixel coordinates of the bounding boxes.
[0014] S1-2, the traffic cone image dataset obtained in step S1-1 is augmented with data augmentation, including random scaling and cropping, random horizontal flipping and rotation, mixing (MixUp), Gaussian noise addition, HSV transformation, and copy and paste.
[0015] S1-3, cluster the traffic cone bounding boxes manually labeled in step S1-1, and pre-generate three prior boxes of different sizes;
[0016] S1-4: Set the hyperparameters of the YOLOv5 neural network model. Randomly shuffle the traffic cone image dataset processed in step S1-2 and divide it into training, validation, and test sets. Train iteratively on the training and validation sets to update the parameters of the YOLOv5 neural network model, and retain the weights from the training epoch with the highest mean average precision (mAP). Evaluate the detection performance on the test set; if the detection precision reaches 88% or higher and the recall reaches 85% or higher, training is considered complete. Use the trained parameters to perform 2D object detection on the traffic cone images: the YOLOv5 neural network model predicts the normalized offset of the actual position of the cone relative to the prior boxes generated in step S1-3, and finally outputs the bounding box information of each traffic cone.
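The prior-box generation in step S1-3 can be illustrated with a short sketch. The source does not say which clustering algorithm or distance measure is used, so the following assumes plain k-means with Euclidean distance on normalized (width, height) pairs. The function name `kmeans_anchors` and the sample boxes are hypothetical.

```python
import numpy as np

def kmeans_anchors(wh, k=3, iters=50, seed=0):
    """Cluster (width, height) pairs into k prior boxes with plain k-means (assumed method)."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centres from k distinct labelled boxes.
    centres = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # Assign every box to its nearest centre in width-height space.
        dists = np.linalg.norm(wh[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its members; keep it if the cluster is empty.
        new_centres = np.array([
            wh[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # Sort by area so the priors run from small to large.
    return centres[np.argsort(centres.prod(axis=1))]

# Hypothetical normalised (w, h) labels for near, mid-range, and distant cones.
boxes = [(0.02, 0.04), (0.021, 0.041), (0.05, 0.10),
         (0.052, 0.098), (0.10, 0.20), (0.11, 0.21)]
anchors = kmeans_anchors(boxes)
```

Object detectors in the YOLO family often use an IoU-based distance for this step instead of the Euclidean one; which variant the invention uses is not stated.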
[0017] The specific process of step S2 is as follows:
[0018] S2-1, Based on the bounding box information of each traffic cone obtained in step S1-4, use OpenCV to extract each traffic cone sub-image block from the detected image;
[0019] S2-2, Convert the image blocks of each traffic cone extracted in step S2-1 from RGB to HSV, and filter out pixels with V or S values lower than 0.3;
[0020] S2-3, after processing in step S2-2, the image blocks of each traffic cone are uniformly downsampled and input into PatchNet based on deep residual network (ResNet), and the coordinates of seven key points on the edge of the traffic cone are output. The seven key points include the vertex of the cone, the four points where the center stripe, the background and the upper / lower stripe intersect, and the two points on the left and right sides of the bottom of the cone. They are represented by p0, p1, p2, p3, p4, p5 and p6 from top to bottom and from left to right, respectively.
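Steps S2-1 and S2-2 can be sketched as follows. This is a minimal illustration under stated assumptions: the crop uses NumPy slicing rather than OpenCV, S and V are taken to be normalized to [0, 1], and "filtering out" a pixel is interpreted as setting it to zero. The function names `crop_patch` and `filter_low_sv` are hypothetical.

```python
import numpy as np

def crop_patch(image, box):
    """Cut one cone patch out of an H x W x 3 RGB image; box = (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

def filter_low_sv(patch_rgb, threshold=0.3):
    """Zero out pixels whose HSV saturation or value falls below the threshold."""
    rgb = patch_rgb.astype(float) / 255.0
    v = rgb.max(axis=2)                       # HSV value: the largest channel
    chroma = v - rgb.min(axis=2)
    # HSV saturation: chroma / value, defined as 0 for black pixels.
    s = np.divide(chroma, v, out=np.zeros_like(v), where=v > 0)
    keep = (s >= threshold) & (v >= threshold)
    filtered = patch_rgb * keep[..., None].astype(patch_rgb.dtype)
    return filtered, keep

# Hypothetical 2 x 2 image: orange cone body, gray asphalt, white stripe, dark shadow.
image = np.array([[[255, 100, 0], [90, 90, 90]],
                  [[250, 250, 250], [20, 10, 5]]], dtype=np.uint8)
filtered, keep = filter_low_sv(crop_patch(image, (0, 0, 2, 2)))
```

In this toy input only the saturated orange pixel survives: the gray and white pixels have near-zero saturation, and the dark pixel has a low value.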
[0021] Using prior pose information to constrain keypoint regression tasks, the loss function of PatchNet, based on a deep residual network (ResNet), consists of the mean square error of the absolute coordinates of each predicted keypoint and the vector parallelism error between keypoints. Its mathematical equation is:
[0022] $\mathrm{Loss}_{total} = L_{mse} + \gamma_{horz}\,(2 - V_{12}\cdot V_{34} - V_{34}\cdot V_{56}) + \gamma_{vert}\,(4 - V_{01}\cdot V_{13} - V_{13}\cdot V_{35} - V_{02}\cdot V_{24} - V_{24}\cdot V_{46})$.
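The loss above can be sketched in NumPy. The sketch assumes that each V_ij is the unit vector pointing from key point p_i to key point p_j, so that a dot product of 1 means two vectors are parallel. The source gives no values for the two γ weights, so the defaults below are placeholders, and the function name `patchnet_loss` is hypothetical.

```python
import numpy as np

def patchnet_loss(pred, target, gamma_horz=0.05, gamma_vert=0.05):
    """MSE on key point coordinates plus parallelism penalties on the predicted key points."""
    pred = np.asarray(pred, dtype=float)      # shape (7, 2): p0 .. p6
    target = np.asarray(target, dtype=float)

    def v(i, j):
        # Assumed definition: unit vector from p_i to p_j (epsilon avoids division by zero).
        d = pred[j] - pred[i]
        return d / (np.linalg.norm(d) + 1e-8)

    l_mse = np.mean((pred - target) ** 2)
    # The three horizontal vectors should be mutually parallel.
    horz = 2 - v(1, 2) @ v(3, 4) - v(3, 4) @ v(5, 6)
    # The segments along the left edge and along the right edge should be collinear.
    vert = (4 - v(0, 1) @ v(1, 3) - v(1, 3) @ v(3, 5)
              - v(0, 2) @ v(2, 4) - v(2, 4) @ v(4, 6))
    return l_mse + gamma_horz * horz + gamma_vert * vert

# Hypothetical ideal cone in patch coordinates: apex, then left/right pairs going down.
ideal = np.array([[40, 0], [30, 20], [50, 20], [20, 40],
                  [60, 40], [10, 60], [70, 60]], dtype=float)
loss_at_truth = patchnet_loss(ideal, ideal)
```

For a perfectly predicted, geometrically consistent cone both penalty terms vanish, so the total loss is close to zero; any bend in an edge or tilt between stripes raises it.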
[0023] The specific process of step S3 is as follows:
[0024] S3-1, Scale the key point pixel coordinates predicted from the traffic cone image in step S2-3 to the original image size;
[0025] S3-2, calibrate the monocular camera using a calibration board to obtain the camera intrinsic parameters: focal length $f$ and optical center coordinates $O_C$;
[0026] S3-3, Calculate the transformation matrix between the pixel coordinate system and the image coordinate system, and between the image coordinate system and the camera coordinate system, based on the monocular camera calibration parameters measured in step S3-2;
[0027] The origin of the pixel coordinate system is the top-left corner of the image, and the unit is pixels. The origin of the image coordinate system is the midpoint of the image, and the unit is millimeters (mm). The origin of the camera coordinate system is the optical center, and the unit is meters (m). Let P be a point in three-dimensional space, with coordinates $(x, y)$ in the pixel coordinate system, $(u, v)$ in the image coordinate system, and $(X_C, Y_C, Z_C)$ in the camera coordinate system, where:
[0028] The coordinate transformation relationship between pixel coordinate system and image coordinate system is:
[0029]
[0030] The transformation relationship between the image coordinate system and the camera coordinate system is:
[0031]
[0032] S3-4. Using the transformation matrix calculated in step S3-3, the pixel coordinates of the key points of the traffic cone are converted into three-dimensional coordinates in the camera coordinate system through the inverse projection algorithm, thereby obtaining the three-dimensional spatial information of the traffic cone relative to the monocular camera.
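The two transformation relationships named in step S3-3 are not reproduced in the text. Under the standard pinhole camera model, and using the coordinate definitions given in step S3-3 together with the per-pixel sizes dx and dy (in mm) from the description of Figure 3, they would plausibly take the following form. The symbols $x_0$ and $y_0$, denoting the pixel coordinates of the image-coordinate origin, are an assumption and do not appear in the source.

$$
x = \frac{u}{dx} + x_0, \qquad y = \frac{v}{dy} + y_0
$$

$$
u = f\,\frac{X_C}{Z_C}, \qquad v = f\,\frac{Y_C}{Z_C}
$$

Inverting both relations gives the inverse projection for a key point whose depth $Z_C$ is known:

$$
X_C = \frac{(x - x_0)\,dx}{f}\,Z_C, \qquad Y_C = \frac{(y - y_0)\,dy}{f}\,Z_C
$$

A single pixel fixes only the viewing ray, so $Z_C$ has to come from additional information such as the known geometry of the cone; the text does not state how it is obtained.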
[0033] In step S2-3, the PatchNet convolutional neural network based on a deep residual network (ResNet) takes an 80*80*3 sub-image patch as input and maps it to $\mathbb{R}^{14}$. The spatial dimension is chosen to be 80*80, which is the average size of the traffic cone bounding box. The input image first passes through a 64*7*7 convolutional layer, a BatchNorm layer, and a ReLU activation function. It is then passed sequentially through four ResNet Basic Blocks with 64, 128, 256, and 512 channels respectively. The front part of a ResNet Basic Block with C channels consists of a C*3*3 convolutional layer, a BatchNorm2D layer, a ReLU layer, a C*3*3 convolutional layer, and a BatchNorm3D layer. The output of BatchNorm3D and the input of the ResNet Basic Block are joined by a residual skip connection, and a final ReLU activation function is applied to obtain the output of the module. The output of the last ResNet Basic Block is fed into a 1*14 fully connected layer to obtain the coordinates of the seven key points on the edge of the traffic cone.
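The architecture described above can be sketched in PyTorch. The source does not specify strides, padding, how the skip connection handles a change in channel count, or how the last feature map is reduced before the fully connected layer. The stride-2 stem, the 1x1 projection on the skip path, and the global average pooling below are therefore assumptions. Both normalization layers in the block use `nn.BatchNorm2d`, since the feature maps are two-dimensional.

```python
import torch
from torch import nn

class BasicBlock(nn.Module):
    """conv3x3 -> BN -> ReLU -> conv3x3 -> BN, plus a skip connection and a final ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Assumption: a 1x1 projection aligns the skip path when the channel count changes.
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))

class PatchNet(nn.Module):
    """Maps an 80x80x3 cone patch to 14 values, read as seven (x, y) key points."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),  # stride assumed
            nn.BatchNorm2d(64),
            nn.ReLU(),
        )
        self.blocks = nn.Sequential(
            BasicBlock(64, 64), BasicBlock(64, 128),
            BasicBlock(128, 256), BasicBlock(256, 512),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # assumption: global average pooling
        self.head = nn.Linear(512, 14)

    def forward(self, x):
        features = self.pool(self.blocks(self.stem(x))).flatten(1)
        return self.head(features).reshape(-1, 7, 2)

# Usage: one hypothetical blank patch in N x C x H x W layout.
net = PatchNet().eval()
with torch.no_grad():
    keypoints = net(torch.zeros(1, 3, 80, 80))
```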
[0034] This invention also provides a 3D real-time detection system for traffic cones based on a monocular camera, comprising:
[0035] The traffic cone category and bounding box acquisition module is used to collect image frames containing traffic cones captured from monocular cameras in different scenarios, construct a traffic cone dataset, manually label the traffic cone bounding boxes, expand the traffic cone dataset with data augmentation, use the expanded traffic cone dataset to train a YOLOv5 neural network model, and use the trained parameter weights to perform 2D object detection on the traffic cone images to obtain the traffic cone category and bounding box.
[0036] Traffic cone image patch keypoint regression module: used to extract each traffic cone image patch by category and bounding box, and use PatchNet based on deep residual network (ResNet) to perform keypoint regression on the traffic cone image patch;
[0037] The 3D spatial information acquisition module for traffic cones is used to map the regressed key points of traffic cones into 3D space using an inverse projection algorithm, thereby obtaining the 3D spatial information of the traffic cones.
[0038] The present invention also provides a 3D real-time detection device for traffic cones based on a monocular camera, comprising:
[0039] Memory: a computer-readable device that stores a computer program implementing the above-mentioned method for real-time 3D detection of traffic cones based on a monocular camera;
[0040] Processor: Used to implement the aforementioned 3D real-time detection method for traffic cones based on a monocular camera when executing the computer program.
[0041] The present invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, enables the implementation of the aforementioned method for real-time 3D detection of traffic cones based on a monocular camera.
[0042] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0043] 1. This invention uses only a monocular camera to estimate the information of traffic cones in three-dimensional space. The monocular camera on the terminal device captures image frames of the traffic cones. In step S1-4, a trained YOLOv5 neural network model is used to detect the category and bounding box of each traffic cone from the image frames. In step S2-1, sub-image patches of each traffic cone are extracted using the categories and bounding boxes obtained in step S1-4. In step S2-3, seven key points on the edge of the traffic cone are predicted. In step S3-4, an inverse projection algorithm is used to convert the pixel coordinates of the key points of the cone into three-dimensional coordinates in the camera coordinate system, obtaining the three-dimensional spatial information of the traffic cone relative to the monocular camera. This method is lower in cost and more maintainable than previous solutions based on LiDAR and depth cameras.
[0044] 2. In this invention, the YOLOv5 single-stage target detection algorithm used in the traffic cone 2D detection submodule has the advantages of high detection accuracy and fast recognition speed. The self-designed PatchNet for regressing key points on the edge of the cone has a simplified network structure and makes full use of the prior pose of the cone to design the loss function, so it can predict the coordinates of the key points quickly and accurately. These features ensure that the algorithm can simultaneously take into account detection speed and accuracy. The high real-time processing capability enables the method to cope with complex traffic environments.
[0045] In summary, this invention fully utilizes the prior pose information of traffic cones, first detecting the bounding box of the traffic cone from the original image, then extracting the traffic cone sub-image blocks, regressing the seven key points of the cone body, and then using an inverse projection algorithm to complete the conversion from pixel coordinates to three-dimensional spatial coordinates. Finally, it obtains the three-dimensional spatial information of the cone, resulting in a simplified detection model structure and a combination of detection speed and accuracy. It has the advantages of good real-time detection performance and low application cost.
Attached Figure Description
[0046] Figure 1 is a schematic diagram of the key points and parallel vectors of the traffic cones in this invention.
[0047] Figure 2 is a schematic diagram of the PatchNet structure based on the deep residual network (ResNet) of this invention.
[0048] Figure 3 is a diagram showing the coordinate transformation relationship between the pixel coordinate system and the image coordinate system of this invention. The unit of the pixel coordinate system is pixels, and the unit of the image coordinate system is mm. The transformation relationship for each row is 1 pixel = dx mm, and the transformation relationship for each column is 1 pixel = dy mm.
[0049] Figure 4 is a diagram showing the coordinate transformation relationship between the camera coordinate system and the image coordinate system of this invention.
Detailed Implementation
[0050] The technical solution of the present invention will be further described below with reference to the accompanying drawings.
[0051] This invention mainly consists of three sub-modules: First, based on real-time video images captured by a camera, the YOLOv5 deep learning-based 2D object detection algorithm is used to detect traffic cones in the images in real time, identifying the bounding boxes of all traffic cones. Second, leveraging the prior pose of the traffic cones, PatchNet, based on a deep residual network (ResNet), is used to perform keypoint regression on each previously detected cone sub-image patch. A loss function is designed based on the relative pose characteristics of several key points, thus ensuring accurate keypoint regression. Third, an inverse projection algorithm is used to achieve mutual transformation between different coordinate systems and estimate the depth information of the traffic cones, thereby obtaining the information of the traffic cones in three-dimensional space.
[0052] This invention fully considers the prior pose information of the cone. First, the bounding box of the cone is detected from the original image. Then, sub-image blocks of the cone are extracted, and seven key points of the cone's body are regressed. Next, an inverse projection algorithm is used to convert pixel coordinates to 3D spatial coordinates, and finally, the 3D reconstruction of the cone is completed. The advantages of this invention are its simple and clear solution, balancing detection speed and accuracy while minimizing application costs.
[0053] Specifically, a real-time 3D detection method for traffic cones based on a monocular camera includes the following steps:
[0054] S1 mainly covers the collection, augmentation, and splitting of the traffic cone dataset, as well as 2D object detection of the cones: collect image frames containing traffic cones captured from monocular cameras in different scenarios, construct a traffic cone dataset and manually annotate the traffic cone bounding boxes, use data augmentation to expand the traffic cone dataset, use the expanded traffic cone dataset to train a YOLOv5 neural network model, and use the trained parameter weights to perform 2D object detection on the traffic cone images to obtain the category and bounding box of the traffic cones;
[0055] The specific process of step S1 is as follows:
[0056] S1-1: Collect image frames containing traffic cones captured from a monocular camera in real road scenes, construct a traffic cone image dataset. The traffic cone image dataset includes samples from different times of day, different seasons, different light intensities, and different angles. Manually label the traffic cone image dataset to obtain the category of traffic cones and the normalized pixel coordinates of the bounding boxes.
[0057] S1-2, the traffic cone image dataset obtained in step S1-1 is augmented with data augmentation, including random scaling and cropping, random horizontal flipping and rotation, mixing (MixUp), Gaussian noise addition, HSV transformation, and copy and paste.
[0058] S1-3, cluster the traffic cone bounding boxes manually labeled in step S1-1, and pre-generate three prior boxes of different sizes;
[0059] S1-4: Set the hyperparameters of the YOLOv5 neural network model. Randomly shuffle the traffic cone image dataset processed in step S1-2 and divide it into training, validation, and test sets. Train iteratively on the training and validation sets to update the parameters of the YOLOv5 neural network model, and retain the weights from the training epoch with the highest mean average precision (mAP). Evaluate the detection performance on the test set; if the detection precision reaches 88% or higher and the recall reaches 85% or higher, training is considered complete. Use the trained parameters to perform 2D object detection on the traffic cone images: the YOLOv5 neural network model predicts the normalized offset of the actual position of the cone relative to the prior boxes generated in step S1-3, and finally outputs the bounding box information of each traffic cone.
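The completion criterion in step S1-4 amounts to a simple check on test-set precision and recall. The sketch below assumes that detections have already been matched against the ground truth to give counts of true positives, false positives, and false negatives; the matching rule (for example an IoU threshold) is not stated in the source. The function names and the sample counts are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def training_complete(tp, fp, fn, min_precision=0.88, min_recall=0.85):
    """True when both thresholds from step S1-4 are met on the test set."""
    precision, recall = precision_recall(tp, fp, fn)
    return precision >= min_precision and recall >= min_recall

# Hypothetical test-set counts: 90 correct detections, 10 false alarms, 12 missed cones.
done = training_complete(tp=90, fp=10, fn=12)
```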
[0060] S2, extract each traffic cone image patch using the category and bounding box obtained in step S1, and perform keypoint regression on the traffic cone image patch using PatchNet based on deep residual network (ResNet);
[0061] The specific process of step S2 is as follows:
[0062] S2-1, Based on the bounding box information of each traffic cone obtained in step S1-4, use OpenCV to extract each traffic cone sub-image block from the detected image;
[0063] S2-2, Convert the image blocks of each traffic cone extracted in step S2-1 from RGB to HSV, and filter out pixels with V or S values below 0.3. This helps segment the cone by removing the asphalt background, and it also removes the stripes on each cone, since those areas lack texture.
[0064] S2-3, after processing in step S2-2, the image blocks of each traffic cone are uniformly downsampled and input into PatchNet based on a deep residual network (ResNet). As shown in Figure 2, PatchNet takes 80*80*3 sub-image patches as input and maps them to $\mathbb{R}^{14}$. The spatial dimension is chosen to be 80*80, which is the average size of the traffic cone bounding box. The input image first passes through a 64*7*7 convolutional layer, a BatchNorm layer, and a ReLU activation function. It is then passed sequentially through four ResNet Basic Blocks with 64, 128, 256, and 512 channels respectively. The number of channels is increased in order to extract both local features and deep semantic information from the image. The front part of a ResNet Basic Block with C channels consists of a C*3*3 convolutional layer, a BatchNorm2D layer, a ReLU layer, a C*3*3 convolutional layer, and a BatchNorm3D layer. The output of BatchNorm3D and the input of the ResNet Basic Block are joined by a residual skip connection, and a final ReLU activation function is applied to obtain the output of the module. The output of the last ResNet Basic Block is then fed into a 1*14 fully connected layer to obtain the coordinates of the seven key points on the edge of the traffic cone. As shown in Figure 1, the seven key points include the vertex of the cone, the four points where the central stripe, the background, and the upper / lower stripe intersect, and the two points on the left and right sides of the bottom of the cone, which are represented by p0, p1, p2, p3, p4, p5, and p6 from top to bottom and from left to right, respectively.
[0065] To accurately predict the coordinates of key points on traffic cones, prior pose information is used to constrain the key point regression task. The loss function of PatchNet, based on a deep residual network (ResNet), consists of the mean square error of the absolute coordinates of each predicted key point and the vector parallelism error between key points. Its mathematical equation is as follows:
[0066] $\mathrm{Loss}_{total} = L_{mse} + \gamma_{horz}\,(2 - V_{12}\cdot V_{34} - V_{34}\cdot V_{56}) + \gamma_{vert}\,(4 - V_{01}\cdot V_{13} - V_{13}\cdot V_{35} - V_{02}\cdot V_{24} - V_{24}\cdot V_{46})$.
[0067] S3. Use the inverse projection algorithm to map the seven key points of the traffic cones after regression in step S2 to three-dimensional space to obtain the three-dimensional spatial information of the traffic cones.
[0068] The specific process of step S3 is as follows:
[0069] S3-1, Scale the key point pixel coordinates predicted from the traffic cone image in step S2-3 to the original image size;
[0070] S3-2, calibrate the monocular camera using a calibration board to obtain the camera intrinsic parameters: focal length $f$ and optical center coordinates $O_C$;
[0071] S3-3, Calculate the transformation matrix between the pixel coordinate system and the image coordinate system, and between the image coordinate system and the camera coordinate system, based on the monocular camera calibration parameters measured in step S3-2;
[0072] The origin of the pixel coordinate system is the top-left corner of the image, and the unit is pixels. The origin of the image coordinate system is the midpoint of the image, and the unit is millimeters (mm). The origin of the camera coordinate system is the optical center, and the unit is meters (m). Let P be a point in three-dimensional space, with coordinates $(x, y)$ in the pixel coordinate system, $(u, v)$ in the image coordinate system, and $(X_C, Y_C, Z_C)$ in the camera coordinate system, where:
[0073] As shown in Figure 3, the coordinate transformation relationship between the pixel coordinate system and the image coordinate system is:
[0074]
[0075] As shown in Figure 4, the transformation relationship between the image coordinate system and the camera coordinate system is:
[0076]
[0077] S3-4. Using the transformation matrix calculated in step S3-3, the pixel coordinates of the key points of the traffic cone are converted into three-dimensional coordinates in the camera coordinate system through the inverse projection algorithm, thereby obtaining the three-dimensional spatial information of the traffic cone relative to the monocular camera.
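Steps S3-1 and S3-4 can be sketched in NumPy using the pinhole camera model. The source does not explain how the depth of a cone is recovered from a single image, so `depth_from_known_height` below is a hypothetical stand-in that assumes an upright cone of known physical height and uses similar triangles. The sketch also assumes that all seven key points lie at the same depth and that `(x0, y0)` is the pixel position of the image-coordinate origin. All function names and calibration values are illustrative.

```python
import numpy as np

def patch_to_image(kp_patch, box, patch_size=80):
    """S3-1: rescale key points from the 80x80 patch back to original-image pixels."""
    x1, y1, x2, y2 = box
    scale = np.array([(x2 - x1) / patch_size, (y2 - y1) / patch_size])
    return np.asarray(kp_patch, dtype=float) * scale + np.array([x1, y1])

def depth_from_known_height(kp_img, cone_height_m, f_mm, dy_mm):
    """Hypothetical depth estimate by similar triangles; not taken from the source."""
    apex_y = kp_img[0, 1]                 # p0: vertex of the cone
    base_y = kp_img[5:7, 1].mean()        # p5, p6: bottom-left and bottom-right
    return (f_mm / dy_mm) * cone_height_m / (base_y - apex_y)

def back_project(kp_img, depth_m, f_mm, dx_mm, dy_mm, x0, y0):
    """S3-4: pixel -> image plane (mm) -> camera coordinates (m) at the given depth."""
    u = (kp_img[:, 0] - x0) * dx_mm       # image-plane coordinates in mm
    v = (kp_img[:, 1] - y0) * dy_mm
    z = np.full(len(kp_img), depth_m)
    return np.stack([u * depth_m / f_mm, v * depth_m / f_mm, z], axis=1)

# Hypothetical calibration: f = 4 mm, 0.002 mm square pixels, origin at pixel (960, 540).
kp_patch = [[40, 0], [30, 27], [50, 27], [20, 53], [60, 53], [0, 80], [80, 80]]
kp_img = patch_to_image(kp_patch, box=(900, 400, 1000, 540))
depth = depth_from_known_height(kp_img, cone_height_m=0.7, f_mm=4.0, dy_mm=0.002)
points_3d = back_project(kp_img, depth, f_mm=4.0, dx_mm=0.002, dy_mm=0.002, x0=960, y0=540)
```

Each row of `points_3d` is the estimated camera-frame position of one key point in meters.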
[0078] Simulation Experiment Analysis
[0079] I. Data Collection and Labeling
[0080] 1. The traffic cone image dataset collected from different traffic scenarios contains approximately 3,000 images, of which 80% are used as the training set, 10% as the validation set, and 10% as the test set.
[0081] 2. Approximately 5,000 traffic cone sub-image patches were extracted from the original complete cone image dataset and the coordinates of the key points of the cone were manually labeled. Similarly, 80% of them were used as the training set, 10% as the validation set, and 10% as the test set. The ground truth values of the key point coordinates of the cone were obtained by manual measurement.
[0082]
| Task | Training set | Validation set | Test set |
| --- | --- | --- | --- |
| 2D cone detection | 2400 | 300 | 300 |
| Keypoint regression | 4000 | 500 | 500 |
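The 80% / 10% / 10% split described above can be reproduced with a short sketch. The fixed random seed and the function name `split_dataset` are assumptions made for illustration; the source only states that the data are shuffled and divided in these proportions.

```python
import random

def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle the items and split them into training, validation, and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)    # seeded so the split is repeatable
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# Image indices stand in for the roughly 3,000 collected images.
train_set, val_set, test_set = split_dataset(range(3000))
```

With 3,000 images this gives 2,400 / 300 / 300 items, and with 5,000 patches it gives 4,000 / 500 / 500, matching the counts reported for the two tasks.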
[0083] II. Performance of the YOLOv5 neural network model in the 2D cone detection task
[0084]
[0085] The table above summarizes the performance of the traffic cone 2D detection submodule. The submodule achieves high precision and recall: it detects the vast majority of traffic cone bounding boxes while avoiding false detections.
[0086] III. Pixel Coordinate Prediction Error of PatchNet Based on Deep Residual Network (ResNet) in Keypoint Regression Task
[0087]
| Metric | Validation set | Test set |
| --- | --- | --- |
| MSE | 3.252 | 3.445 |
[0088] The pixel coordinate prediction error values of PatchNet, based on the deep residual network (ResNet), represent the accuracy and robustness of the model in various scenarios. This is crucial for obtaining the 3D spatial coordinates of traffic cones relative to the monocular camera through the inverse projection algorithm.
[0089] IV. 3D Spatial Coordinate Prediction Performance of Traffic Cones
[0090] The 3D spatial coordinate prediction results of 150 randomly selected traffic cones were compared with manually measured ground truth data. At a distance of about 10 meters from the monocular camera, the average Euclidean error of the predicted cone position was about 0.44 meters. At a distance of 20 meters, the average Euclidean error was about 0.82 meters. This error is small, roughly 4% of the distance in both cases, and is sufficient to allow autonomous vehicles to travel at speeds of over 65 km/h on roads with traffic cones ahead.
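The error metric used here, the average Euclidean distance between predicted and measured 3D positions, can be computed as in the sketch below. The sample coordinates are made up for illustration and are not the experimental data; the function name `mean_euclidean_error` is hypothetical.

```python
import numpy as np

def mean_euclidean_error(pred_xyz, true_xyz):
    """Average straight-line distance between predicted and measured 3D points (meters)."""
    pred_xyz = np.asarray(pred_xyz, dtype=float)
    true_xyz = np.asarray(true_xyz, dtype=float)
    return float(np.linalg.norm(pred_xyz - true_xyz, axis=1).mean())

# Two hypothetical cones: one off by 0.3 m sideways and 0.4 m in depth, one exact.
error_m = mean_euclidean_error([[0.3, 0.0, 10.4], [0.0, 0.0, 20.0]],
                               [[0.0, 0.0, 10.0], [0.0, 0.0, 20.0]])

# Relative error for the figures reported in the text: 0.44 m at 10 m and 0.82 m at 20 m.
relative_errors = [0.44 / 10, 0.82 / 20]
```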
[0091] In summary, compared with existing technologies, the YOLOv5 target detection algorithm used in the 2D traffic cone detection submodule of this invention has higher detection accuracy and faster recognition speed. The self-designed PatchNet network structure for regressing key points on the cone edge is simplified and makes full use of the prior pose of the cone to design the loss function, ensuring fast and accurate prediction of the key point coordinates on the cone edge and completing the stereo reconstruction of the cone in the 3D world. The solution is simple and clear, and the algorithm balances detection speed and accuracy. Since the entire process can be completed using only a monocular camera, the cost is also lower than that of solutions using LiDAR and depth cameras.
[0092] The 3D real-time detection method for traffic cones based on a monocular camera can acquire key information such as the position, size, and orientation of traffic cones, providing important support for fields such as traffic monitoring, autonomous driving, and road construction.
[0093] 1) This invention extracts the cone image patches from the original image and makes full use of the prior pose information of the cone. It uses PatchNet based on a deep residual network (ResNet) to regress the seven key points of the cone. Then, it uses the inverse projection algorithm to convert the pixel coordinates into three-dimensional coordinates in the camera coordinate system, thereby completing the stereo reconstruction of the cone in the three-dimensional world.
[0094] 2) This invention can also estimate the information of the cone in three-dimensional space in real time using a monocular camera. Compared with the previous solution based on LiDAR and depth camera, this method is lower in cost and more maintainable.
[0095] 3) This invention has high real-time processing capabilities. This method can cope with complex traffic environments and provide timely road condition information. By monitoring and analyzing traffic cones, it can provide important data support for traffic management departments and optimize traffic planning and decision-making processes.
[0096] The key points and protection points of this invention are:
[0097] 1. Extract cone image patches from the resulting image after 2D object detection;
[0098] 2. PatchNet, a cone keypoint prediction model designed based on a deep residual network (ResNet), and a loss function designed based on the prior pose information of the cone;
[0099] 3. Based on steps 1 and 2, scale each cone image patch to 80*80 and input it into the keypoint regression network to predict the coordinates of the seven keypoints of the cone.
[0100] 4. Based on step 3, according to the camera calibration parameters and the coordinate transformation matrix between coordinate systems, the key points of the cone are transformed from the pixel coordinate system into a set of three-dimensional geometric points in the camera coordinate system, thereby completing the 3D reconstruction of the cone.
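Steps 3 and 4 above can be illustrated with a minimal Python sketch. It assumes that the predicted keypoints are expressed in pixels of the 80*80 patch, that the intrinsics are given as focal lengths in pixels (fx, fy) and an optical center (cx, cy), and that a depth value is available for each keypoint. This text does not state how depth is resolved from a single camera, so the similar-triangles estimate from a known cone height is an assumption added purely for illustration. All function names are hypothetical.

```python
from typing import Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in image pixels


def patch_to_image(kp: Point2D, box: Box, patch_size: int = 80) -> Point2D:
    """Scale a keypoint from patch pixels back to full-image pixels.

    Assumes the patch was produced by resizing the bounding box to
    patch_size x patch_size without padding.
    """
    x_min, y_min, x_max, y_max = box
    scale_x = (x_max - x_min) / patch_size
    scale_y = (y_max - y_min) / patch_size
    return (x_min + kp[0] * scale_x, y_min + kp[1] * scale_y)


def depth_from_height(pixel_height: float, real_height_m: float, fy: float) -> float:
    """Estimate depth by similar triangles (an assumption, not from the text).

    Valid only for an upright cone of known physical height that is
    roughly parallel to the image plane.
    """
    return fy * real_height_m / pixel_height


def back_project(uv: Point2D, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Invert the pinhole projection for one pixel at a given depth (meters)."""
    u, v = uv
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

For example, a keypoint at the patch center of a box spanning (100, 200) to (180, 360) maps back to image pixel (140, 280). With fx = fy = 1000 pixels, an optical center of (640, 360), and a depth of 5 m, the pixel (740, 460) back-projects to (0.5, 0.5, 5.0) in the camera coordinate system.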
[0101] By embedding the proposed 3D real-time traffic cone detection method based on a monocular camera into a roadside visual sensor, the three-dimensional spatial information of all traffic cones within the road section can be obtained. This allows for the accurate delineation of the construction area and the road control zone, which are then visualized, providing real-time road information to traffic management departments.
[0102] When the proposed monocular camera-based 3D real-time traffic cone detection method is deployed in a vehicle control system, it can detect the three-dimensional information of each traffic cone in front of the vehicle in real time, capture the current road traffic conditions, and provide sufficient information for the vehicle to automatically avoid obstacles and follow traffic instructions.
[0103] This invention also provides a 3D real-time detection system for traffic cones based on a monocular camera, comprising:
[0104] Traffic cone category and bounding box acquisition module: used to carry out step 1, namely to collect image frames containing traffic cones captured by a monocular camera in different scenarios, construct a traffic cone dataset and manually label the traffic cone bounding boxes, expand the traffic cone dataset with data augmentation, train a YOLOv5 neural network model on the expanded dataset, and use the trained parameter weights to perform 2D object detection on the traffic cone images to obtain the traffic cone category and bounding box;
[0105] Traffic cone image patch key point regression module: used to carry out step 2, namely to extract each traffic cone image patch according to its category and bounding box, and to use PatchNet, which is based on a deep residual network (ResNet), to perform key point regression on the traffic cone image patch;
[0106] Traffic cone 3D spatial information acquisition module: used to carry out step 3, namely to map the regressed key points of the traffic cones into 3D space using the inverse projection algorithm, thereby obtaining the 3D spatial information of the traffic cones.
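The patch extraction performed by the second module, together with the HSV filtering described in claim 2 (pixels with S or V below 0.3 are filtered out), might be sketched as follows. The claims name OpenCV for this step; the sketch below uses only the Python standard library and nested lists so that it stays dependency-free. Treating "filtered out" as "set to black" is an assumption, and the function names are hypothetical.

```python
import colorsys
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0..255
Image = List[List[Pixel]]      # rows of pixels


def crop_patch(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Cut the bounding box (x_min, y_min, x_max, y_max) out of the image.

    x_max and y_max are exclusive, as with Python slices.
    """
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def filter_low_sv(patch: Image, threshold: float = 0.3) -> Image:
    """Suppress pixels whose saturation or value is below the threshold.

    Assumption: a suppressed pixel is replaced by black (0, 0, 0).
    """
    result: Image = []
    for row in patch:
        new_row: List[Pixel] = []
        for r, g, b in row:
            # colorsys expects channels in [0, 1] and returns h, s, v in [0, 1].
            _, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            keep = s >= threshold and v >= threshold
            new_row.append((r, g, b) if keep else (0, 0, 0))
        result.append(new_row)
    return result
```

Saturated, bright pixels such as the orange body of a cone pass the filter, while gray road surface (low S) and deep shadow (low V) are suppressed before the patch is resized for the keypoint network.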
[0107] The present invention also provides a 3D real-time detection device for traffic cones based on a monocular camera, comprising:
[0108] Memory: a computer-readable device that stores a computer program for the above-mentioned method for real-time 3D detection of traffic cones based on a monocular camera;
[0109] Processor: Used to implement the aforementioned 3D real-time detection method for traffic cones based on a monocular camera when executing the computer program.
[0110] The present invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, enables the implementation of the aforementioned method for real-time 3D detection of traffic cones based on a monocular camera.
Claims
1. A method for real-time 3D detection of traffic cones based on a monocular camera, characterized in that it includes the following steps:
S1. Collect image frames containing traffic cones captured by a monocular camera in different scenarios, construct a traffic cone dataset, and manually label the bounding boxes of the traffic cones; expand the traffic cone dataset with data augmentation, train a YOLOv5 neural network model on the expanded dataset, and use the trained parameter weights to perform 2D object detection on the traffic cone images to obtain the category and bounding box of each traffic cone;
S2. Extract each traffic cone image patch using the category and bounding box obtained in step S1, and perform keypoint regression on the traffic cone image patch using PatchNet, which is based on a deep residual network (ResNet);
S3. Use the inverse projection algorithm to map the key points of the traffic cones regressed in step S2 into three-dimensional space to obtain the three-dimensional spatial information of the traffic cones;
The specific process of step S1 is as follows:
S1-1. Collect image frames containing traffic cones captured by a monocular camera in real road scenes and construct a traffic cone image dataset. The dataset includes samples from different times of day, different seasons, different light intensities, and different angles. Manually label the dataset to obtain the category of each traffic cone and the normalized pixel coordinates of its bounding box;
S1-2. Augment the traffic cone image dataset obtained in step S1-1 using random scaling and cropping, random horizontal flipping and rotation, blending, Gaussian noise addition, HSV transformation, and copy-paste;
S1-3. Cluster the traffic cone bounding boxes manually labeled in step S1-1 and pre-generate three prior boxes of different sizes;
S1-4. Set the hyperparameters of the YOLOv5 neural network model. Randomly shuffle the traffic cone image dataset processed in step S1-2 and divide it into training, validation, and test sets. Train iteratively on the training and validation sets to update the parameters of the YOLOv5 neural network model, and retain the training weights of the round with the highest mean average precision (mAP). Evaluate the detection performance on the test set; if the detection precision reaches 88% or higher and the recall reaches 85% or higher, the training is considered complete. Use the trained parameters to perform 2D object detection on the traffic cone images. The YOLOv5 neural network model predicts the normalized offset of the actual position of the cone relative to the prior boxes generated in step S1-3, and finally outputs the bounding box information of each traffic cone;
In step S2, the PatchNet convolutional neural network based on a deep residual network (ResNet) takes an 80*80*3 sub-image patch as input and maps it to [formula not reproduced]. The spatial dimension is chosen to be 80*80, which is the average size of the traffic cone bounding box. The input image first passes through a 64*7*7 convolutional layer, a BatchNorm layer, and a ReLU activation function. It then passes sequentially through basic deep residual network modules whose channel count C is 64, 128, 256, and 512. The front part of a basic deep residual network module with C channels consists of a C*3*3 convolutional layer, a BatchNorm2D layer, a ReLU layer, a C*3*3 convolutional layer, and a BatchNorm3D layer. The output of BatchNorm3D and the input of the basic deep residual network module are joined by a skip residual connection, and the result passes through a ReLU activation function to give the output of the module. The output of the last basic deep residual network module is input into a 1*14 fully connected layer to obtain the coordinates of seven key points on the edge of the traffic cone.
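The PatchNet structure recited in claim 1 can be approximated by the following PyTorch sketch. The claim fixes the layer types, the channel counts (64, 128, 256, 512), the 80*80*3 input, and the 14-value output. Everything else is an assumption made so that the network runs: the strides and padding, the 1*1 projection on the skip path when the shape changes, the global average pooling before the fully connected layer, and the use of BatchNorm2d in both normalization positions (the feature maps here are two-dimensional). The class names are hypothetical.

```python
import torch
from torch import nn


class BasicResidualBlock(nn.Module):
    """Conv-BN-ReLU-Conv-BN main branch plus a skip connection, then ReLU."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # Assumption: project the input with a 1x1 convolution when the
        # channel count or resolution changes, so the two branches can be added.
        if stride != 1 or in_channels != out_channels:
            self.skip = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.skip = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.body(x) + self.skip(x))


class PatchNetSketch(nn.Module):
    """Maps an 80x80 RGB patch to seven (x, y) keypoints."""

    def __init__(self, num_keypoints: int = 7):
        super().__init__()
        self.num_keypoints = num_keypoints
        # 64 filters of size 7x7, followed by BatchNorm and ReLU.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(
            BasicResidualBlock(64, 64),
            BasicResidualBlock(64, 128, stride=2),
            BasicResidualBlock(128, 256, stride=2),
            BasicResidualBlock(256, 512, stride=2),
        )
        # Assumption: collapse the spatial map before the fully connected layer.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, 2 * num_keypoints)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.pool(self.blocks(self.stem(x))).flatten(1)
        return self.fc(features).view(-1, self.num_keypoints, 2)
```

With these assumed strides, an 80*80 input is reduced to a 5*5 feature map with 512 channels before pooling, and a batch of N patches yields an output of shape (N, 7, 2).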
2. The method for real-time 3D detection of traffic cones based on a monocular camera according to claim 1, characterized in that the specific process of step S2 is as follows:
S2-1. Based on the bounding box information of each traffic cone obtained in step S1-4, use OpenCV to extract each traffic cone sub-image patch from the detected image;
S2-2. Convert each traffic cone image patch extracted in step S2-1 from RGB to HSV, and filter out pixels whose V or S value is lower than 0.3;
S2-3. After the processing in step S2-2, uniformly downsample each traffic cone image patch and input it into PatchNet, which is based on a deep residual network (ResNet), to output the coordinates of seven key points on the edge of the traffic cone. The seven key points comprise the vertex of the cone, the four points where the center stripe, the background, and the upper / lower stripes intersect, and two points on the left and right sides of the bottom of the cone. These key points are denoted, from top to bottom and from left to right, by [symbols not reproduced];
Prior pose information is used to constrain the keypoint regression task: the loss function of PatchNet consists of the mean square error of the absolute coordinates of each predicted keypoint and the vector parallelism error between keypoints. Its mathematical expression is: [formula not reproduced].
3. The method for real-time 3D detection of traffic cones based on a monocular camera according to claim 1, characterized in that the specific process of step S3 is as follows:
S3-1. Scale the keypoint pixel coordinates predicted from the traffic cone image patch in step S2-3 to the original image size;
S3-2. Calibrate the monocular camera using a calibration board to obtain the camera intrinsic parameters, including the focal length [symbol not reproduced] and the optical center coordinates [symbols not reproduced];
S3-3. Based on the monocular camera calibration parameters measured in step S3-2, calculate the transformation matrix between the pixel coordinate system and the image coordinate system, and between the image coordinate system and the camera coordinate system. The origin of the pixel coordinate system is the top left corner of the image, and the unit is pixels. In the image coordinate system, the origin is the optical center at the midpoint of the image, and the unit is millimeters. The origin of the camera coordinate system is the optical center, and the unit is meters. Let P be a point in three-dimensional space whose coordinates are [symbols not reproduced] in the pixel coordinate system, [symbols not reproduced] in the image coordinate system, and [symbols not reproduced] in the camera coordinate system. The coordinate transformation between the pixel coordinate system and the image coordinate system is: [formula not reproduced]. The transformation between the image coordinate system and the camera coordinate system is: [formula not reproduced];
S3-4. Using the transformation matrix calculated in step S3-3, convert the pixel coordinates of the key points of the traffic cone into three-dimensional coordinates in the camera coordinate system through the inverse projection algorithm, thereby obtaining the three-dimensional spatial information of the traffic cone relative to the monocular camera.
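The formulas of step S3-3 do not appear in this text. For orientation only, the standard pinhole camera model relates the three coordinate systems as shown below. All symbols are assumed notation rather than the notation of the claim: (u, v) are pixel coordinates, (x, y) are image coordinates in millimeters, (X_c, Y_c, Z_c) are camera coordinates, d_x and d_y are the physical width and height of one pixel, (u_0, v_0) is the optical center in pixels, and f is the focal length in millimeters.

```latex
% Pixel <-> image coordinate system
u = \frac{x}{d_x} + u_0, \qquad v = \frac{y}{d_y} + v_0

% Image <-> camera coordinate system (perspective projection)
x = f\,\frac{X_c}{Z_c}, \qquad y = f\,\frac{Y_c}{Z_c}

% Inverse projection of a pixel at known depth Z_c
X_c = \frac{(u - u_0)\, d_x\, Z_c}{f}, \qquad
Y_c = \frac{(v - v_0)\, d_y\, Z_c}{f}
```

The last line shows that a single pixel determines only a ray through the optical center; a depth value Z_c, or an equivalent geometric constraint, is needed to fix the point in space.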
4. A 3D real-time traffic cone detection system based on a monocular camera for implementing the method of any one of claims 1-3, characterized in that it comprises: a traffic cone category and bounding box acquisition module, used to collect image frames containing traffic cones captured by a monocular camera in different scenarios, construct a traffic cone dataset, manually label the traffic cone bounding boxes, expand the traffic cone dataset with data augmentation, train a YOLOv5 neural network model on the expanded dataset, and use the trained parameter weights to perform 2D object detection on the traffic cone images to obtain the traffic cone category and bounding box; a traffic cone image patch keypoint regression module, used to extract each traffic cone image patch according to its category and bounding box, and to use PatchNet, which is based on a deep residual network (ResNet), to perform keypoint regression on the traffic cone image patch; and a traffic cone 3D spatial information acquisition module, used to map the regressed key points of the traffic cones into 3D space using an inverse projection algorithm, thereby obtaining the 3D spatial information of the traffic cones.
5. A 3D real-time detection device for traffic cones based on a monocular camera, characterized in that it comprises: a memory, which is a computer-readable device storing a computer program for the real-time 3D detection method for traffic cones based on a monocular camera as described in any one of claims 1-3; and a processor, used to implement the 3D real-time detection method for traffic cones based on a monocular camera as described in any one of claims 1-3 when executing the computer program.
6. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the 3D real-time detection method for traffic cones based on a monocular camera as described in any one of claims 1-3.