A driving environment perception method based on a vehicle-mounted monocular camera

By applying structural reparameterization to the network's upsampling module and using a multi-task deep learning model, multi-task perception from a single monocular camera is realized. This addresses the high sensor cost and low perception efficiency of existing approaches and improves the perception accuracy and speed of the autonomous driving system.

CN116311113B · Active · Publication Date: 2026-06-23 · SUZHOU INST FOR ADVANCED STUDY USTC

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SUZHOU INST FOR ADVANCED STUDY USTC
Filing Date
2023-02-10
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing monocular cameras cannot efficiently perform multi-task perception in autonomous driving, and the high cost of sensors makes them difficult to widely apply in ordinary civilian vehicles.

Method used

The system employs a structure-reparameterized upsampling module and a multi-task deep learning model to complete object detection, road drivable area segmentation, and lane line segmentation in a single inference. It utilizes an onboard monocular camera for environmental perception and combines an encoder-decoder network structure with an improved BiFPN network to optimize the upsampling module and improve inference speed and accuracy.

Benefits of technology

It improves the throughput of autonomous driving systems, reduces power consumption and memory usage, and enhances the accuracy of multi-task perception, especially outperforming traditional methods in semantic segmentation tasks.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a driving environment perception method based on a vehicle-mounted monocular camera, comprising structural reparameterization of an upsampling module and multi-task perception for autonomous driving. Compared with common linear interpolation and transposed convolution, using RepUpsample improves the accuracy of the network model to a certain extent. In semantic segmentation, the accuracy of three models (DeepLabv3, FPN and U-Net) was compared across different upsampling modules; RepUpsample improves the performance of the segmentation network across different network models, upsampling positions and network scales. Compared with bilinear interpolation, the average mIOU improves by 1.77% and the average pixel accuracy (P.A.) by 0.74%. Compared with transposed convolution, the average mIOU improves by 1.16% and the average P.A. by 0.35%.

Description

Technical Field

[0001] This invention relates to the field of autonomous driving, and in particular to a driving environment perception method based on an in-vehicle monocular camera.

Background Technology

[0002] In recent years, deep learning and computer vision technologies have rapidly emerged, and autonomous driving technology based on these technologies offers new solutions for improving traffic safety and efficiency. Autonomous driving systems do not tire and strictly adhere to traffic rules, possessing enormous potential in reducing accident rates. Autonomous driving combines multiple technologies, including artificial intelligence, communications, semiconductors, and automobiles. It involves a wide range of industries and can create enormous value, making it a fiercely contested arena for cross-industry competition and cooperation between the automotive and technology industries worldwide. Giants such as Google, Tesla, GM, and Baidu have invested heavily in developing autonomous driving technology. Driven by a combination of factors, including technological advancements, policy support, the entry of industry giants, capital investment, cost reductions, and clear application scenarios, autonomous driving technology, after more than a decade of exploration and development, is now at a critical juncture for commercialization.

[0003] An autonomous driving system is a comprehensive system integrating environmental perception, decision-making and control, and action execution. Autonomous driving systems are large and complex, and can generally be divided into three main modules based on function: perception, decision-making, and control. The perception module is defined as the collection and processing of environmental and in-vehicle information. It perceives information about the surrounding environment through various sensors, involving tasks such as road boundary detection, drivable area detection, traffic sign recognition, vehicle and pedestrian detection, and road surface information perception. The perception module is the cornerstone of the entire system; only when the perception system provides accurate information can the decision-making system make correct judgments. The robustness of the perception system directly affects the reliability of the entire autonomous driving system. The decision-making module can be understood as making decisions based on the perceived information, determining the appropriate working model, specifying the corresponding control strategy, and making driving decisions on behalf of the driver. The execution module, after the system makes a decision, implements the decision results to control the vehicle. All the vehicle's operating systems need to be able to connect to the decision-making system via a bus, and implement commands to precisely control driving actions such as acceleration, braking, steering, and lighting control to achieve autonomous vehicle control.

[0004] In the perception module, accurate perception requires the cooperation of multiple sensors. Current methods fall into three categories: those based on monocular images, those based on LiDAR point clouds, and those based on millimeter-wave radar. LiDAR and millimeter-wave radar are both active sensors: they emit detection signals, receive the signals reflected from objects, and compute an object's location and distance by comparing the emitted and received signals. Cameras, by contrast, are passive sensors that perceive objects through reflected light. Each of the three sensors has its own advantages and disadvantages. LiDAR generates high-precision 3D point clouds, but it is currently expensive, has a short lifespan, and suffers from significant noise in rain or snow. Millimeter-wave radar is less affected by weather and is sensitive to moving objects, but it has lower resolution and poorer discrimination ability. Monocular cameras offer low cost, high resolution, and rich visual information, but they cannot acquire depth information. For safety reasons, autonomous vehicles generally carry all three types of sensors, and Advanced Driving Assistance Systems (ADAS) use them for environmental perception. However, installing all of these sensors in ordinary civilian vehicles is too expensive, whereas an ordinary monocular camera is hardware that is already widely available. ADAS based on monocular cameras is therefore more practical.

Summary of the Invention

[0005] The purpose of this invention is to provide a driving environment perception method based on an in-vehicle monocular camera, aiming to solve the following problems:

[0006] Optimizing the upsampling module in neural networks: Deep neural networks commonly use transposed convolution or bilinear interpolation as upsampling modules. Bilinear interpolation has no learnable parameters, so inference is fast but expressive power is weak; transposed convolution has learnable parameters and stronger expressive power. Because the two operations are closely related, a multi-branch upsampling structure is used during the training phase, and the branches are losslessly merged into a single branch for the inference phase. This increases network capacity without sacrificing inference speed.

[0007] Multi-task perception for autonomous driving: The autonomous driving perception module needs to utilize various sensors on the vehicle to perceive the surrounding environment, including object detection, road drivable area segmentation, and lane detection. Monocular cameras are more practical due to their low cost and wider availability in hardware. This patent designs a multi-task algorithm for autonomous driving perception based on images captured by an onboard monocular camera, combined with a structural reparameterization upsampling module. This algorithm completes multiple tasks in a single inference, improving the overall system throughput and reducing power consumption and memory usage.

[0008] The technical solution of this invention is:

[0009] A driving environment perception method based on an in-vehicle monocular camera includes:

[0010] S1. Perform structural reparameterization on the upsampling module: This includes the training and inference phases, where...

[0011] During the training phase, a transposed convolutional layer of the upsampling module is expanded into a multi-branch layer, with one branch using a linear interpolation algorithm and the other branches using transposed convolutions with different kernel sizes.

[0012] During the inference phase, the multi-branch structure is reparameterized and transformed into a single-branch structure without loss.

[0013] S2, Multi-task perception for autonomous driving: Through a multi-task deep learning model, real-time perception reasoning based on a monocular camera is achieved, completing three tasks: target detection, road drivable area segmentation, and lane line segmentation.

[0014] Preferably, during the training phase in S1, a 1*1 convolutional layer is added after the linear interpolation to change the number of channels in the output feature map; batch normalization is added after each upsampling branch to further improve model performance.

[0015] Preferably, regarding the inference phase of S1 and taking double (2×) upsampling as an example, during training the three branches use bilinear interpolation followed by a 1*1 convolution, a 2*2 transposed convolution, and a 4*4 transposed convolution, respectively, each followed by a batch normalization layer; the input feature map is upsampled to obtain the output feature map.

[0016] The 4×4 transposed convolution has a kernel and a bias. Since it is a transposed convolution, padding is required before the convolution. The kernel actually used in the convolution process corresponds to the transposed-convolution kernel as shown in Formula 1: first transpose the first two dimensions, then reverse the last two dimensions;

[0017]

[0018] After convolution, the output is normalized by a BN layer. During inference, the convolution absorbs the parameters of the BN layer, and the merged transposed convolution parameters are shown in Equations 2 and 3.

[0019]

[0020]

[0021] Here γ, β, σ, and μ are the weight, bias, variance, and mean of the BN layer, respectively, which yields the weight $W_{4\times 4}$. Finally, the transposed convolution weights are obtained through the transformation in Formula 3.

[0022] For the 2×2 transposed convolution, the weights are obtained by zero-padding the outer ring of the 2×2 kernel to a size of 4×4 and then fusing the parameters of the BN layer.

[0023] For the branch of bilinear interpolation followed by a 1×1 convolution, the bilinear interpolation is converted into a 4×4 transposed convolution whose kernel has no bias, while the 1×1 convolution, which changes the number of channels, has both a kernel and a bias. The resulting 4×4 transposed convolution is first fused with the parameters of the 1×1 convolution, and the new weights are given by Formulas 4 and 5:

[0024] $W_{\mathrm{bilinear}} \leftarrow W_{1\times 1} \times W_{\mathrm{bilinear}}$ (4)

[0025] $b_{\mathrm{bilinear}} \leftarrow b_{1\times 1}$ (5)

[0026] Finally, the parameters of the BN layer are combined to obtain the weights.

[0027] After each of the three branches is independently transformed into a 4×4 transposed convolution, the weights and biases of the three transposed convolutions are added together to obtain the final result, as shown in Equations 6 and 7:

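[0028] $W \leftarrow W_{4\times 4} + W_{2\times 2} + W_{\mathrm{bilinear}}$ (6)

[0029] $b \leftarrow b_{4\times 4} + b_{2\times 2} + b_{\mathrm{bilinear}}$ (7)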

[0030] By reparameterizing the structure, the complex multi-branch structure during training can be compressed without loss.

[0031] Preferably, in S2, the overall structure of the multi-task deep learning model adopts an encoder-decoder network structure. For the three tasks, different decoder head networks are used respectively, while the encoder network is shared. The encoder network is further divided into a backbone network and a neck network according to different functions and positions. The backbone network directly receives input from the camera and is used to mine shallow feature information of the network. The neck network receives feature information from the backbone network, performs further feature fusion and feature extraction, obtains deeper feature information, and passes it to different decoder networks.

[0032] Preferably, the Backbone network uses ResNet, and the image captured by the monocular camera is scaled to 640*340 and passed through a multi-layer residual structure to obtain feature maps of different sizes.

[0033] Preferably, the Neck network adopts an improved BiFPN structure, which contains multiple upsampling operations, replacing the upsampling in the BiFPN with reparameterizable RepUpsample;

[0034] Multiple BiFPN connections constitute the Neck structure of the entire perceptual network. The feature maps input to the Neck are P3, P4, and P5. After passing through the first BiFPN, P3 to P7 with deeper features are obtained and input into the next BiFPN structure. After the concatenation of four BiFPN structures, the output of the Neck network is obtained, with feature map sizes of 5*3, 10*6, 20*12, 40*24, and 80*48, respectively.

[0035] Preferably, the road drivable area segmentation uses a semantic segmentation algorithm for pixel-by-pixel recognition and classification, accepts feature information from the backbone network and the neck network, and uses a BiFPN structure and FCN for classification; the multiple upsampling operations involved are replaced by RepUpsample; the road drivable area segmentation uses cross-entropy loss.

[0036] Preferably, the target detection adopts an anchor-based detection scheme, which pre-sets prior boxes, determines the probability of a target in each grid and the probability of each target category based on feature maps at different scales, and finally removes duplicate detection boxes by non-maximum suppression (NMS) to obtain the final detection result; the target detection loss function includes target classification loss, target localization loss and target confidence loss.

[0037] Preferably, the lane line segmentation is based on key point detection, which divides the image horizontally into multiple strips, then divides each strip into multiple blocks, predicts the position of the lane line in each strip, detects the position of key points on the lane line, connects key points belonging to the same lane line into a line, and calculates the loss using cross-entropy.

[0038] The advantages of this invention are:

[0039] Compared to ordinary linear interpolation and transposed convolution, this invention uses RepUpsample to improve the accuracy of the network model. In semantic segmentation tasks, comparing the accuracy of the DeepLabv3, FPN, and U-Net models with different upsampling modules shows that RepUpsample improves the performance of semantic segmentation networks regardless of the network model, upsampling location, or network size. Compared to bilinear interpolation, mIOU improves by an average of 1.77% and PA by an average of 0.74%. Compared to transposed convolution, mIOU improves by an average of 1.16% and PA by an average of 0.35%.

Attached Figure Description

[0040] The present invention will be further described below with reference to the accompanying drawings and embodiments:

[0041] Figure 1 is a structural diagram of RepUpsample;

[0042] Figure 2 is a diagram of the overall structure of the multi-task perception network;

[0043] Figure 3 is a diagram of the Backbone network structure;

[0044] Figure 4 is a diagram of the BiFPN network structure.

Detailed Implementation

[0045] The present invention proposes a driving environment perception method based on an in-vehicle monocular camera, which includes structural reparameterization of the upsampling module and multi-task perception for autonomous driving.

[0046] S1. Perform structural reparameterization on the upsampling module.

[0047] Upsampling plays an irreplaceable role in neural networks, with many network models requiring it to recover feature map scale and fuse multi-channel information. Common upsampling modules include linear interpolation and transposed convolution. Linear interpolation has no learnable parameters, resulting in fast inference speed but weak expressive power; transposed convolution, on the other hand, has learnable parameters and stronger expressive power. In fact, linear interpolation is a special type of transposed convolution, and every linear interpolation can be losslessly replaced by a transposed convolution. Based on this, and combined with the idea of reparameterization, we propose RepUpsample, a reparameterization method for upsampling layer structures.

[0048] During training, a transposed convolutional layer is expanded into a multi-branch layer. One branch uses a standard linear interpolation algorithm, while the other branches use transposed convolutions with different kernel sizes. Linear interpolation can only change the size of the feature map, not the number of channels in the output feature map; this can be changed by adding 1x1 convolutional layers. Additionally, batch normalization normalizes the feature map, improving the network's generalization ability. Adding batch normalization after each upsampling branch can further improve model performance. Adding an interpolation branch essentially provides a skip connection to the transposed convolution, allowing it to focus on learning residuals.

[0049] During the inference phase, the multi-branch structure can be reparameterized and losslessly transformed into a single-branch structure. Taking double (2×) upsampling as an example, during training the three branches use bilinear interpolation followed by a 1*1 convolution, a 2*2 transposed convolution, and a 4*4 transposed convolution, respectively, and each is then connected to a batch normalization layer, as shown in Figure 1. The input feature map is upsampled to obtain the output feature map.

[0050] (1) 4*4 transposed convolution branch

[0051] The 4×4 transposed convolution has a kernel and a bias. Since it is a transposed convolution, padding is required before the convolution. The kernel actually used in the convolution process corresponds to the transposed-convolution kernel as shown in Formula 1: first transpose the first two dimensions, then reverse the last two dimensions.

[0052]
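The correspondence between a transposed convolution and an ordinary convolution can be illustrated in one dimension. The following is a minimal sketch, not the patent's Formula 1: it assumes a single channel (so the transposition of the first two, channel, dimensions is a no-op and only the flip remains), stride 2, kernel size 4 and padding 1. The function names are hypothetical.

```python
def transposed_conv1d(x, kernel, stride=2, padding=1):
    """Scatter form: every input sample adds a scaled copy of the kernel to the output."""
    k = len(kernel)
    full = [0.0] * ((len(x) - 1) * stride + k)
    for i, xi in enumerate(x):
        for j, kj in enumerate(kernel):
            full[i * stride + j] += xi * kj
    return full[padding:len(full) - padding]


def conv_form(x, kernel, stride=2, padding=1):
    """Gather form: insert zeros, pad, then slide the flipped kernel over the input."""
    k = len(kernel)
    # Insert (stride - 1) zeros between neighbouring input samples.
    dilated = []
    for i, xi in enumerate(x):
        dilated.append(xi)
        if i < len(x) - 1:
            dilated.extend([0.0] * (stride - 1))
    # Pad both ends with (k - 1 - padding) zeros before convolving.
    pad = k - 1 - padding
    padded = [0.0] * pad + dilated + [0.0] * pad
    flipped = kernel[::-1]  # reverse the spatial dimension
    return [sum(padded[o + j] * flipped[j] for j in range(k))
            for o in range(len(padded) - k + 1)]


x = [1.0, 2.0, 3.0]
kernel = [0.1, 0.2, 0.3, 0.4]
direct = transposed_conv1d(x, kernel)
via_conv = conv_form(x, kernel)
print(len(direct), len(via_conv))  # both are twice the input length
```

Both forms produce the same six values for the three-sample input, which is why padding is needed before the convolution and why the kernel has to be flipped.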

[0053] After convolution, the output is normalized by a batch normalization (BN) layer. During inference, the convolution can absorb the parameters of the BN layer, which accelerates inference. The fused transposed convolution parameters are shown in Equations 2 and 3.

[0054]

[0055]

[0056] Here γ, β, σ, and μ are the weight, bias, variance, and mean of the BN layer, respectively, which yields the weight $W_{4\times 4}$. Finally, the transposed convolution weights are obtained through the transformation in Formula 3.
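The BN fusion described above can be sketched as follows. This is the standard way of folding an inference-time BN layer into the preceding linear layer, offered as an illustration rather than a reproduction of Equations 2 and 3. It assumes σ denotes the running variance, that a small `eps` is added under the square root (as most frameworks do), and that weights are grouped by output channel; `fuse_bn` is a hypothetical helper name.

```python
import math


def fuse_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (z - mean) / sqrt(var + eps) + beta into z = w * x + b.

    weights[c] is the flat list of kernel weights that produce output channel c.
    Returns the fused (weights, bias) so that y = w_fused * x + b_fused.
    """
    fused_w, fused_b = [], []
    for c in range(len(weights)):
        scale = gamma[c] / math.sqrt(var[c] + eps)   # per-output-channel scale
        fused_w.append([w * scale for w in weights[c]])
        fused_b.append((bias[c] - mean[c]) * scale + beta[c])
    return fused_w, fused_b


# One output channel with two taps, checked on a single input patch.
w, b = [[0.5, -1.0]], [0.2]
gamma, beta, mean, var = [1.5], [0.1], [0.3], [4.0]
fw, fb = fuse_bn(w, b, gamma, beta, mean, var)

patch = [2.0, 3.0]
z = sum(wi * xi for wi, xi in zip(w[0], patch)) + b[0]
bn_out = gamma[0] * (z - mean[0]) / math.sqrt(var[0] + 1e-5) + beta[0]
fused_out = sum(wi * xi for wi, xi in zip(fw[0], patch)) + fb[0]
```

For a transposed convolution the scale is applied along the output-channel axis of the kernel; in PyTorch's `ConvTranspose2d` that is the second weight dimension, since the weight shape is (in_channels, out_channels / groups, kH, kW).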

[0057] (2) 2*2 transposed convolution branch

[0058] For the 2×2 transposed convolution, the weights can be obtained by zero-padding the outer ring of the 2×2 kernel to a size of 4×4 and then fusing the parameters of the BN layer.
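Why zero-padding the 2×2 kernel is lossless can be checked numerically. The sketch below assumes a single channel, stride 2, padding 0 for the 2×2 transposed convolution and padding 1 for the 4×4 one; these padding values are assumptions chosen so that both produce an output twice the input size, and the helper names are hypothetical.

```python
def transposed_conv2d(x, kernel, stride=2, padding=0):
    """Single-channel 2D transposed convolution in scatter form."""
    h, w, k = len(x), len(x[0]), len(kernel)
    fh, fw = (h - 1) * stride + k, (w - 1) * stride + k
    full = [[0.0] * fw for _ in range(fh)]
    for i in range(h):
        for j in range(w):
            for a in range(k):
                for b in range(k):
                    full[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    # Crop `padding` rows and columns from every border.
    return [row[padding:fw - padding] for row in full[padding:fh - padding]]


def pad_kernel_to_4x4(k2):
    """Surround a 2x2 kernel with one ring of zeros."""
    k4 = [[0.0] * 4 for _ in range(4)]
    for a in range(2):
        for b in range(2):
            k4[a + 1][b + 1] = k2[a][b]
    return k4


x = [[1.0, 2.0], [3.0, 4.0]]
k2 = [[0.5, -1.0], [2.0, 0.25]]
out_2x2 = transposed_conv2d(x, k2, stride=2, padding=0)
out_4x4 = transposed_conv2d(x, pad_kernel_to_4x4(k2), stride=2, padding=1)
```

With these settings the two outputs are identical 4×4 maps, so the 2×2 branch can be expressed as a 4×4 transposed convolution before the branches are merged.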

[0059] (3) Bilinear interpolation followed by a 1*1 convolution branch

[0060] A bilinear interpolation layer can be losslessly transformed into a transposed convolution with a 4×4 kernel. This kernel has no bias, while the 1×1 convolution, which changes the number of channels, has both a kernel and a bias. The resulting 4×4 transposed convolution is first fused with the parameters of the 1×1 convolution, and the new weights are given by Formulas 4 and 5:

[0061] $W_{\mathrm{bilinear}} \leftarrow W_{1\times 1} \times W_{\mathrm{bilinear}}$ (4)

[0062] $b_{\mathrm{bilinear}} \leftarrow b_{1\times 1}$ (5)

[0063] Finally, the parameters of the BN layer are combined to obtain the weights.
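The equivalence between 2× bilinear interpolation and a fixed transposed convolution can be illustrated in one dimension. This sketch assumes the half-pixel (`align_corners=False`) sampling convention and edge replication at the borders; the patent text does not state its border handling, and with plain zero padding the first and last output samples would differ. In two dimensions the 4×4 kernel is the outer product of the 4-tap kernel with itself, because bilinear interpolation is separable.

```python
def bilinear_upsample_1d(x):
    """2x linear interpolation with half-pixel centres, clamped at the borders."""
    n, out = len(x), []
    for o in range(2 * n):
        src = (o + 0.5) / 2 - 0.5
        src = min(max(src, 0.0), n - 1)   # clamp the sampling position
        lo = int(src)
        hi = min(lo + 1, n - 1)
        t = src - lo
        out.append((1 - t) * x[lo] + t * x[hi])
    return out


def transposed_conv_upsample_1d(x):
    """Same result from a fixed 4-tap transposed convolution with stride 2."""
    kernel = [0.25, 0.75, 0.75, 0.25]
    xp = [x[0]] + list(x) + [x[-1]]       # replicate the border samples
    full = [0.0] * ((len(xp) - 1) * 2 + 4)
    for i, v in enumerate(xp):
        for j, kj in enumerate(kernel):
            full[2 * i + j] += v * kj
    return full[3:3 + 2 * len(x)]         # crop to exactly twice the input length


signal = [1.0, 4.0, 2.0, 8.0]
interp = bilinear_upsample_1d(signal)
tconv = transposed_conv_upsample_1d(signal)
```

Once the interpolation is written as a fixed-weight transposed convolution, the 1×1 convolution can be folded in as in Formulas 4 and 5, since composing two linear maps gives another linear map.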

[0064] (4) Multi-branch fusion

[0065] After each of the three branches is independently transformed into a 4×4 transposed convolution, the weights and biases of the three transposed convolutions are added together to obtain the final result, as shown in Equations 6 and 7:

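[0066] $W \leftarrow W_{4\times 4} + W_{2\times 2} + W_{\mathrm{bilinear}}$ (6)

[0067] $b \leftarrow b_{4\times 4} + b_{2\times 2} + b_{\mathrm{bilinear}}$ (7)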

[0068] By reparameterizing the structure, the complex multi-branch structure during training can be compressed without loss, improving inference efficiency while maintaining accuracy.
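The merge step relies only on linearity: summing the outputs of several transposed convolutions with the same kernel size, stride and padding equals one transposed convolution with the summed kernel and summed bias. A minimal 1D sketch follows; the kernel and bias values are made up for illustration and each branch is assumed to have already been converted to a 4-tap kernel with its BN folded in.

```python
def tconv1d(x, kernel, bias=0.0, stride=2, padding=1):
    """1D transposed convolution (scatter form) with a scalar bias."""
    full = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, xi in enumerate(x):
        for j, kj in enumerate(kernel):
            full[i * stride + j] += xi * kj
    return [v + bias for v in full[padding:len(full) - padding]]


def merge_branches(branches):
    """branches: list of (kernel, bias) pairs whose kernels have equal length."""
    kernel = [sum(taps) for taps in zip(*(k for k, _ in branches))]
    bias = sum(b for _, b in branches)
    return kernel, bias


branches = [
    ([0.10, 0.20, 0.30, 0.40], 0.05),    # learned 4-tap branch
    ([0.00, 0.70, -0.30, 0.00], -0.02),  # 2-tap branch, zero-padded to 4 taps
    ([0.30, 0.90, 0.90, 0.30], 0.10),    # interpolation branch scaled by a 1x1 weight
]
x = [1.0, -2.0, 0.5]

# Training-time view: run every branch and add the results.
multi = [sum(vals) for vals in zip(*(tconv1d(x, k, b) for k, b in branches))]

# Inference-time view: a single transposed convolution with merged parameters.
merged_kernel, merged_bias = merge_branches(branches)
single = tconv1d(x, merged_kernel, merged_bias)
```

The single-branch output matches the multi-branch sum up to floating-point rounding, while requiring only one kernel at inference.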

[0069] S2, Autonomous Driving Multi-Task Perception

[0070] Autonomous driving scenarios prioritize both accuracy and speed. A multi-task deep learning model enables real-time perception and inference based on a monocular camera, completing three tasks: object detection, road drivable area segmentation, and lane line segmentation. The model employs an encoder-decoder network structure. A different decoder head network is used for each of the three tasks, while the encoder network is shared. The encoder network is further divided into a backbone and a neck according to their roles and positions. The backbone receives input directly from the camera and sits in the relatively shallow layers of the network, where it extracts shallow feature information. The neck, located in the middle of the network, receives feature information from the backbone, performs further feature fusion and extraction to obtain deeper feature information, and then passes this information to the different decoder networks. The overall network structure is shown in Figure 2.
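The shared-encoder arrangement can be sketched structurally as below. The class name, head names and toy functions are hypothetical stand-ins (real modules would be neural networks); the point is only that the backbone and neck run once per frame and all three heads consume the same features.

```python
from typing import Callable, Dict, List


class MultiTaskPerception:
    """Shared encoder (backbone + neck) feeding several task-specific heads."""

    def __init__(self, backbone: Callable, neck: Callable, heads: Dict[str, Callable]):
        self.backbone = backbone
        self.neck = neck
        self.heads = heads

    def __call__(self, image):
        features = self.neck(self.backbone(image))   # computed once per frame
        return {name: head(features) for name, head in self.heads.items()}


# Toy stand-ins so the sketch runs end to end.
calls = {"backbone": 0}


def toy_backbone(image: List[float]) -> List[float]:
    calls["backbone"] += 1
    return [v * 2.0 for v in image]


def toy_neck(features: List[float]) -> List[float]:
    return [v + 1.0 for v in features]


model = MultiTaskPerception(
    toy_backbone,
    toy_neck,
    {
        "detection": lambda f: max(f),
        "drivable_area": lambda f: [v > 2.0 for v in f],
        "lane_line": lambda f: f.index(max(f)),
    },
)
outputs = model([0.5, 1.5, 1.0])
```

Running three separate single-task networks would instead invoke an encoder three times per frame, which is the throughput and memory cost the shared design avoids.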

[0071] (1) Backbone Network

[0072] The Backbone network employs the classic ResNet neural network, which performs well at image feature extraction. Images captured by the monocular camera are scaled to 640*340 pixels and processed through multiple residual structures to obtain feature maps of varying sizes, P1 to P5, as shown in Figure 3.

[0073] (2) Neck Network

[0074] The Neck network employs an improved BiFPN structure. The bidirectional pyramid structure facilitates the generation and fusion of features at different scales, resulting in feature maps that simultaneously contain multi-scale and multi-semantic information. The BiFPN structure includes numerous upsampling operations; replacing the upsampling in the BiFPN with the reparameterizable RepUpsample improves the flexibility of the upsampling module, adapts better to changes in feature map scale, and enhances overall network performance. A single BiFPN structure is shown in Figure 4.

[0075] Multiple BiFPN connections form the Neck structure of the entire perceptual network. The feature maps input to the Neck are P3, P4, and P5. After passing through the first BiFPN, P3 to P7, which have deeper features, are obtained and input into the next BiFPN structure. After concatenating four BiFPN structures, the output of the Neck network is obtained, with feature map sizes of 5*3, 10*6, 20*12, 40*24, and 80*48, respectively.
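The listed feature map sizes can be related to the input resolution with a short calculation. This sketch assumes that P3 to P7 have strides 8, 16, 32, 64 and 128 relative to the input (the usual feature-pyramid convention, not stated in the text) and that sizes round up; `pyramid_sizes` is a hypothetical helper.

```python
import math


def pyramid_sizes(width, height, strides=(8, 16, 32, 64, 128)):
    """(width, height) of the feature map at each stride, rounding up."""
    return [(math.ceil(width / s), math.ceil(height / s)) for s in strides]


# Sizes listed for the Neck output, ordered from P3 (largest) to P7 (smallest).
listed = [(80, 48), (40, 24), (20, 12), (10, 6), (5, 3)]

sizes_384 = pyramid_sizes(640, 384)
sizes_340 = pyramid_sizes(640, 340)
print(sizes_384 == listed)
print(sizes_340)
```

Under these assumptions a 640-wide input reproduces the listed widths (80 down to 5), and the listed heights (48 down to 3) correspond to an input height of 384; an input height of 340 would give a P3 height of 43 rather than 48.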

[0076] (3) Head network for segmenting drivable road areas

[0077] The road drivable area segmentation employs a semantic segmentation algorithm for pixel-by-pixel recognition and classification, receiving feature information from the backbone and neck. It also uses a BiFPN structure and FCN for classification; the numerous upsampling operations involved are replaced by RepUpsample. Cross-entropy loss is used for road drivable area segmentation.
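The pixel-wise cross-entropy loss mentioned above can be written out as a small sketch. It assumes unweighted averaging over pixels and integer class labels (for example 0 for background and 1 for drivable area); the patent does not specify class weighting, and `pixel_cross_entropy` is a hypothetical name.

```python
import math


def pixel_cross_entropy(logits, labels):
    """Mean cross-entropy over pixels.

    logits[p] is the list of per-class scores for pixel p,
    labels[p] is the index of the correct class for that pixel.
    """
    total = 0.0
    for scores, y in zip(logits, labels):
        m = max(scores)                                   # subtract the max for stability
        log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum - scores[y]                      # equals -log softmax(scores)[y]
    return total / len(labels)


# Two pixels, two classes: one confident and correct, one undecided.
loss = pixel_cross_entropy([[4.0, -4.0], [0.0, 0.0]], [0, 1])
```

A confident correct prediction contributes almost nothing to the loss, while the undecided pixel contributes log 2, so the average here is slightly above half of log 2.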

[0078] (4) Object detection head network

[0079] The object detection head is similar to the YOLOv5 network, employing an anchor-based detection scheme. It pre-defines prior bounding boxes and, based on feature maps at different scales, determines the probability of an object being present in each grid cell and the probability of each object category. Finally, non-maximum suppression (NMS) is used to remove duplicate detection boxes, yielding the final detection result. The object detection loss function includes object classification loss, object localization loss, and object confidence loss.
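The non-maximum suppression step can be sketched as the usual greedy procedure. The box format `(x1, y1, x2, y2)` and the IoU threshold of 0.5 are assumptions for illustration; the patent does not give either.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.68 and is removed, while the distant third box survives. Detectors commonly run NMS separately per object category.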

[0080] (5) Lane detection head network

[0081] Lane detection is based on keypoint detection. It horizontally divides the image into multiple strips, then further divides each strip into multiple blocks, predicting the position of the lane line within each strip. Compared to semantic segmentation methods for lane detection, using keypoints significantly reduces the computational cost of the network model. Lane detection obtains the positions of keypoints on the lane lines, connects keypoints belonging to the same lane line, and calculates the loss using cross-entropy.
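The strip-and-block formulation can be illustrated with a decoding sketch for a single lane line. It assumes that each strip is classified over its blocks plus one extra "no lane" class, and that a keypoint is placed at the centre of the chosen block; both are assumptions in the style of row-wise lane classification methods, not details given in the text. `decode_lane` is a hypothetical name.

```python
def decode_lane(row_logits, image_width, row_ys):
    """Turn per-strip block scores into (x, y) keypoints for one lane line.

    row_logits[r] holds one score per block followed by a final 'no lane' score.
    row_ys[r] is the vertical position assigned to strip r.
    """
    num_blocks = len(row_logits[0]) - 1
    block_width = image_width / num_blocks
    points = []
    for scores, y in zip(row_logits, row_ys):
        best = max(range(len(scores)), key=lambda c: scores[c])
        if best == num_blocks:            # last class: lane absent in this strip
            continue
        points.append(((best + 0.5) * block_width, y))
    return points


row_logits = [
    [0.1, 2.0, 0.3, 0.0, -1.0],   # strip 0: block 1 wins
    [0.0, 0.1, 3.0, 0.2, -1.0],   # strip 1: block 2 wins
    [0.0, 0.0, 0.0, 0.0, 5.0],    # strip 2: 'no lane' wins
]
keypoints = decode_lane(row_logits, image_width=640, row_ys=[300, 320, 340])
```

Connecting the returned keypoints in strip order gives the lane polyline. Because each strip is a classification over blocks, the training loss can be ordinary cross-entropy, and the output has one decision per strip instead of one per pixel, which is where the reduction in computation comes from.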

[0082] Compared to ordinary linear interpolation and transposed convolution, RepUpsample improves the accuracy of network models. Table 1 shows the accuracy of the DeepLabv3, FPN, and U-Net models when using different upsampling modules in semantic segmentation tasks.

[0083] Table 1. Impact of different upsampling methods on semantic segmentation networks

[0084]

[0085]

[0086] Based on the experimental results, RepUpsample, as an upsampling method, improves the performance of semantic segmentation networks regardless of the network model, upsampling location, or network size. Compared to bilinear interpolation, mIOU improves by an average of 1.77% and PA by an average of 0.74%. Compared to transposed convolution, mIOU improves by an average of 1.16% and PA by an average of 0.35%.
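For reference, the two metrics quoted above can be computed from a confusion matrix as sketched below. This follows the common definitions of pixel accuracy (PA) and mean intersection over union (mIOU); whether the patent's experiments average over classes in exactly this way is not stated, and the function names are hypothetical.

```python
def confusion_matrix(pred, target, num_classes):
    """cm[t][p] counts pixels whose true class is t and predicted class is p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class is correct."""
    correct = sum(cm[c][c] for c in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def mean_iou(cm):
    """Average over classes of TP / (TP + FP + FN), skipping absent classes."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(len(cm))) - tp
        fn = sum(cm[c]) - tp
        denom = tp + fp + fn
        if denom:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
cm = confusion_matrix(pred, target, num_classes=2)
pa = pixel_accuracy(cm)
miou = mean_iou(cm)
```

In this four-pixel example three pixels are correct, so PA is 0.75, while the per-class IoUs are 1/2 and 2/3, giving an mIOU of about 0.583. mIOU penalizes errors on small classes more strongly than PA does.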

[0087] Autonomous driving technology is still in a stage of rapid development. It's easy to imagine that future human transportation will be based on artificial intelligence and driven by autonomous vehicles. However, currently, autonomous driving technology is still immature. In addition to high software requirements, it also has additional hardware requirements. To achieve accurate perception, it requires the cooperation of multiple sensors, such as LiDAR, millimeter-wave radar, cameras, and inertial measurement units. Therefore, equipping older cars with a relatively intelligent in-vehicle AI system is often very difficult.

[0088] The network model proposed in this patent is based on a monocular camera, which has low hardware requirements as most vehicles are equipped with it, such as the camera in a dashcam. It also requires relatively low computational power. When pedestrians or lane departures are detected, it can provide timely warnings to the driver, thus assisting driving to some extent. Furthermore, with the development of intelligent transportation, vehicle-to-vehicle and vehicle-to-infrastructure (V2I) communication will become possible in the future, allowing different vehicles to exchange information at intersections and provide warnings about blind spots.

[0089] China's Skynet surveillance system has extensive coverage, with cameras installed at most urban intersections to monitor traffic violations. The network model proposed in this patent can also be applied to intersection monitoring equipment, providing artificial intelligence for edge devices. Through real-time inference, intersection cameras can detect traffic flow and control traffic light changes accordingly to ensure efficient vehicle passage at intersections. They can also detect pedestrians running red lights and issue warnings, as well as detect speeding, illegal parking, and other violations, recording vehicle license plate numbers and reporting them to regulatory authorities.

[0090] The above embodiments are only for illustrating the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement it accordingly. They should not be construed as limiting the scope of protection of the present invention. All modifications made according to the spirit and essence of the main technical solution of the present invention should be covered within the scope of protection of the present invention.

Claims

1. A driving environment perception method based on an in-vehicle monocular camera, characterized in that it includes: S1. Performing structural reparameterization on the upsampling module, which includes the training and inference phases, where... During the training phase, a transposed convolutional layer of the upsampling module is expanded into a multi-branch layer, with one branch using a linear interpolation algorithm and the other branches using transposed convolutions with different kernel sizes. During the inference phase, the multi-branch structure is reparameterized and losslessly transformed into a single-branch structure. For double upsampling, the three branches during training use bilinear interpolation followed by a 1×1 convolution, a 2×2 transposed convolution, and a 4×4 transposed convolution, respectively, and each is then connected to a batch normalization layer; the input feature map is upsampled to obtain the output feature map; the 4×4 transposed convolution has a kernel and a bias; since it is a transposed convolution, padding is required before the convolution, and the kernel used in the convolution process corresponds to the transposed-convolution kernel as shown in Formula 1: first transpose the first two dimensions, then reverse the last two dimensions; after convolution, the output is normalized by a BN layer; during inference, the convolution merges the parameters of the BN layer, and the merged transposed convolution parameters are shown in Equations 2 and 3, where γ, β, σ, and μ correspond to the weights, biases, variances, and means of the BN layer, respectively, to obtain the weights; finally, the transposed convolution weights are obtained through the transformation in Formula 3; for the 2×2 transposed convolution, the weights are obtained by zero-padding the outer ring of the 2×2 kernel to a size of 4×4 and then fusing the parameters of the BN layer; the bilinear interpolation followed by a 1×1 convolution is losslessly transformed into a transposed convolution with a 4×4 kernel, where the kernel obtained from the bilinear interpolation has no bias, while the 1×1 convolution, which changes the number of channels, has both a kernel and a bias; the resulting 4×4 transposed convolution is first fused with the parameters of the 1×1 convolution, and the new weights are shown in Formulas 4 and 5; finally, the parameters of the BN layer are combined to obtain the weights; after each of the three branches is independently transformed into a 4×4 transposed convolution, the weights and biases of the three transposed convolutions are added together to obtain the final result, as shown in Equations 6 and 7; by reparameterizing the structure, the complex multi-branch structure used during training can be compressed without loss. S2. Multi-task perception for autonomous driving: through a multi-task deep learning model, real-time perception inference based on a monocular camera is achieved, completing three tasks: target detection, road drivable area segmentation, and lane line segmentation.

2. The driving environment perception method based on an in-vehicle monocular camera according to claim 1, characterized in that, During the training phase in S1, linear interpolation is performed by adding 1x1 convolutional layers to change the number of channels in the output feature map; batch normalization is added after each upsampling branch to further improve model performance.

3. The driving environment perception method based on an in-vehicle monocular camera according to claim 1, characterized in that, In S2, the overall structure of the multi-task deep learning model adopts an encoder-decoder network structure. For the three tasks, different decoder head networks are used respectively, while the encoder network is shared. The encoder network is further divided into a backbone network and a neck network according to different functions and positions. The backbone network directly receives input from the camera and is used to mine shallow feature information of the network. The neck network receives feature information from the backbone network, performs further feature fusion and feature extraction, obtains deeper feature information, and passes it to different decoder networks.

4. The driving environment perception method based on an in-vehicle monocular camera according to claim 3, characterized in that, The Backbone network uses ResNet. The image captured by the monocular camera is scaled to 640*340 and then processed through a multi-layer residual structure to obtain feature maps of different sizes.

5. The driving environment perception method based on an in-vehicle monocular camera according to claim 3, characterized in that, The Neck network adopts an improved BiFPN structure, which contains multiple upsampling operations. The upsampling in BiFPN is replaced with reparameterizable RepUpsample. Multiple BiFPN connections constitute the Neck structure of the entire perceptual network. The feature maps input to the Neck are P3, P4, and P5. After passing through the first BiFPN, P3 to P7 with deeper features are obtained and input into the next BiFPN structure. After the concatenation of the four BiFPN structures, the output of the Neck network is obtained, with feature map sizes of 5*3, 10*6, 20*12, 40*24, and 80*48, respectively.

6. The driving environment perception method based on an in-vehicle monocular camera according to claim 5, characterized in that, The road drivable area segmentation uses a semantic segmentation algorithm for pixel-by-pixel recognition and classification. It receives feature information from the backbone network and the neck network, and uses a BiFPN structure and FCN for classification. This involves multiple upsampling operations, which are replaced by RepUpsample. The road drivable area segmentation uses cross-entropy loss.

7. The driving environment perception method based on an in-vehicle monocular camera according to claim 6, characterized in that, The target detection adopts an anchor-based detection scheme, which pre-sets prior boxes and determines the probability of a target in each grid and the probability of each target category based on feature maps at different scales. Finally, non-maximum suppression (NMS) is used to remove duplicate detection boxes to obtain the final detection result. The target detection loss function includes target classification loss, target localization loss, and target confidence loss.

8. The driving environment perception method based on an in-vehicle monocular camera according to claim 7, characterized in that, The lane line segmentation is based on key point detection. The image is horizontally divided into multiple strips, and each strip is further divided into multiple blocks. The position of the lane line in each strip is predicted. The key point position on the lane line is obtained by lane line detection. Key points belonging to the same lane line are connected into a line, and the loss is calculated using cross-entropy.