A High-Speed Road Obstacle Recognition Method Based on Haar Feature and Depth Estimation Feature Concatenation
By cascading Haar features and depth estimation features, and combining shadow segmentation and multi-feature fusion, the problem of low obstacle recognition accuracy in the transportation field is solved, achieving higher accuracy obstacle recognition and dynamic object detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTHEAST UNIV
- Filing Date
- 2025-06-06
- Publication Date
- 2026-06-30
AI Technical Summary
Existing target detection algorithms have low localization accuracy for small or heavily occluded targets in the traffic field, and are greatly affected by lighting, resulting in low obstacle recognition accuracy and poor dynamic object recognition capabilities, especially in complex scenarios.
We employ a method based on the concatenation of Haar features and depth estimation features, and combine shadow segmentation, multi-feature fusion and attention mechanisms with GPS to reconstruct 3D models, thereby improving obstacle recognition accuracy and the ability to identify dynamic objects.
The method improves recognition accuracy for road obstacles and for dynamic objects, and reduces the recognition errors caused by lighting effects such as shadows.
Abstract
Description
Technical Field
[0001] This invention belongs to the field of depth estimation and computer target detection technology, and specifically relates to a method for identifying obstacles on expressways based on the cascade of Haar features and depth estimation features.
Background Technology
[0002] With the advancement of science and technology, the application of computer object detection in the transportation field has gradually expanded, encompassing intelligent transportation systems, autonomous driving, and traffic monitoring. Research on computer object detection in the transportation field primarily focuses on the accurate detection and recognition of targets such as pedestrians, vehicles, traffic signs, and traffic lights. The development of deep learning methods and the enrichment of datasets have provided more opportunities and challenges for object detection research in the transportation field. As technology advances and application demands increase, object detection research in the transportation field will continue to develop and evolve.
[0003] Existing object detection algorithms still have some errors in object localization, especially for small or heavily occluded targets, where the localization accuracy is relatively low. Object detection algorithms perform poorly in complex scenes, such as those with dense targets, overlapping targets, or occlusion. Targets in these scenes are difficult to detect and localize accurately. Some object detection algorithms are ineffective at detecting small targets because small targets have weak visual features and are easily masked by background noise, leading to inaccurate detection results.
[0004] Existing object detection algorithms in the transportation field rarely consider the influence of lighting, leading to the identification of object shadows as objects. Furthermore, their ability to track and estimate objects in dynamic scenes is weak. They also perform poorly in depth estimation in areas with missing or indistinct textures, so they cannot fully describe an object's attributes and category information. This further leads to confusion between objects and to false detections.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention provides a method for identifying obstacles on expressways based on the cascaded use of Haar features and depth estimation features. This method effectively solves the problems of low obstacle recognition accuracy due to the influence of lighting and poor target recognition capability for dynamic objects in road obstacle target detection.
[0006] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0007] In a first aspect, a method for identifying obstacles on expressways is provided, comprising: acquiring road images and corresponding depth data from vehicles traveling on expressways; extracting depth features of the road images from the depth data, and performing shadow segmentation on the acquired road images in combination with the depth features of the road images to obtain segmented images; extracting Haar features and depth estimation features from the segmented images and performing parallel matching processing; introducing a similarity-based attention mechanism to fuse the matching results of Haar features and depth estimation features to obtain a fused feature vector; performing thresholding processing on the fused feature vector to obtain depth estimation results; reconstructing a three-dimensional model of the obstacle based on the depth estimation results; and combining GPS to achieve the identification of obstacles on expressways.
[0008] Furthermore, shadow segmentation is performed on the acquired road images, including: estimating scene lighting conditions using depth data; establishing a shadow mapping model based on the described lighting conditions to determine the shadow content of the image and analyze shadow information to distinguish between objects and shadows; shadow regions have different depth features, and shadow regions are segmented by combining depth information with the image; during actual rendering, the shadow mapping model uses shadow maps to determine whether a pixel is in shadow; the current pixel is transformed from the world coordinate system to the clipping space coordinate system of the light source's viewpoint, and the corresponding depth value is obtained from the shadow map using the corresponding coordinates; then, the depth value of the current pixel is compared with the depth value in the shadow map, and if the depth value of the current pixel is greater than the depth value in the shadow map, it means that the pixel is in shadow.
[0009] Furthermore, scene lighting conditions are estimated using depth data, including: acquiring a depth image of the scene using a depth sensor; introducing Gaussian smoothing to smooth the normals; and introducing an adaptive Gaussian kernel whose standard deviation $\sigma$ is adjusted dynamically to balance noise suppression and edge preservation; a Gaussian kernel is applied to each pixel location of the depth image:
[0010] (formula not reproduced in the source text; it defines $\sigma(u,v)$ in terms of $\sigma_0$, k, and $|\nabla I(u,v)|$)
[0011] where $\sigma_0$ is the standard deviation of the base Gaussian filter; k is the proportional coefficient, which controls the sensitivity of $\sigma$ to gradients; $|\nabla I(u,v)|$ is the magnitude of the image gradient, calculated using the Sobel operator; and $\sigma(u,v)$ is the standard deviation of the Gaussian kernel at pixel position (u,v);
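The gradient-adaptive smoothing described above can be sketched as follows. Since the formula itself is not available, the form `sigma0 / (1 + k * |grad|)` is an assumption chosen only because it shrinks the kernel near edges; the function names and default values are likewise hypothetical.

```python
import math

def sobel_magnitude(img, u, v):
    """Sobel gradient magnitude at row u, column v (interior pixels only)."""
    gx = (img[u-1][v+1] + 2*img[u][v+1] + img[u+1][v+1]) \
       - (img[u-1][v-1] + 2*img[u][v-1] + img[u+1][v-1])
    gy = (img[u+1][v-1] + 2*img[u+1][v] + img[u+1][v+1]) \
       - (img[u-1][v-1] + 2*img[u-1][v] + img[u-1][v+1])
    return math.hypot(gx, gy)

def adaptive_sigma(img, u, v, sigma0=2.0, k=0.05):
    # ASSUMED form: sigma decreases as the local gradient grows,
    # so edges are smoothed less than flat regions.
    return sigma0 / (1.0 + k * sobel_magnitude(img, u, v))

def smooth_depth_at(depth, u, v, sigma, radius=1):
    """Gaussian-weighted average of depth around (u, v) with the given sigma."""
    num = den = 0.0
    for du in range(-radius, radius + 1):
        for dv in range(-radius, radius + 1):
            w = math.exp(-(du*du + dv*dv) / (2.0 * sigma * sigma))
            num += w * depth[u+du][v+dv]
            den += w
    return num / den
```

In a flat region the kernel keeps its base width, while a strong vertical edge reduces it, which is the trade-off between noise suppression and edge preservation that the text describes.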
[0012] For each pixel, perform depth image 3D gradient and normalization calculation:
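[0013] $\mathbf{n} = \dfrac{\left(-\partial D/\partial x,\; -\partial D/\partial y,\; 1\right)}{\sqrt{(\partial D/\partial x)^2 + (\partial D/\partial y)^2 + 1}}$,
[0014] where $\mathbf{n}$ is the normalized normal vector; D is the depth image; and $\partial D/\partial x$ and $\partial D/\partial y$ are the rates of change of the depth value in the x and y directions, respectively;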
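A minimal sketch of the per-pixel normal computation, assuming central differences for the depth gradient and the common convention that the normal of a depth surface is proportional to (-dD/dx, -dD/dy, 1). The function name and the list-of-rows image layout are illustrative choices, not taken from the patent.

```python
import math

def depth_normal(D, x, y):
    """Unit normal at column x, row y of depth image D (list of rows).

    Uses central differences, so (x, y) must be an interior pixel.
    """
    dzdx = (D[y][x+1] - D[y][x-1]) / 2.0
    dzdy = (D[y+1][x] - D[y-1][x]) / 2.0
    norm = math.sqrt(dzdx*dzdx + dzdy*dzdy + 1.0)
    return (-dzdx / norm, -dzdy / norm, 1.0 / norm)
```

A fronto-parallel surface yields a normal along the viewing axis; a surface whose depth grows with x tilts the normal toward negative x.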
[0015] Calculate the plane fitting normal, combine the RANSAC algorithm and the least squares method to fit the plane, remove outliers, and estimate the normal;
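[0016] $\min_{a,b,c,d} \sum_{i=1}^{N} w_i \left(a x_i + b y_i + c z_i + d\right)^2$,
[0017] where $w_i$ is the weight of the i-th point; a, b, c, and d are the plane parameters, and the normal is (a, b, c); N is the total number of points participating in the plane fitting; $(x_i, y_i)$ are the two-dimensional pixel coordinates of the i-th point; and $z_i$ is the depth value of the i-th point;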
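The combination of RANSAC and least squares can be illustrated with the sketch below. It makes two simplifying assumptions that are not in the patent: the plane is parameterized explicitly as z = a·x + b·y + c (so its normal is proportional to (a, b, -1)), and all inliers are weighted equally. The tolerance, iteration count, and function names are hypothetical.

```python
import random

def _det3(m):
    return (m[0][0] * (m[1][1]*m[2][2] - m[1][2]*m[2][1])
          - m[0][1] * (m[1][0]*m[2][2] - m[1][2]*m[2][0])
          + m[0][2] * (m[1][0]*m[2][1] - m[1][1]*m[2][0]))

def _solve3(A, b):
    """Solve a 3x3 linear system by Cramer's rule; None if singular."""
    d = _det3(A)
    if abs(d) < 1e-12:
        return None
    sol = []
    for j in range(3):
        Aj = [row[:] for row in A]
        for i in range(3):
            Aj[i][j] = b[i]
        sol.append(_det3(Aj) / d)
    return sol

def _lstsq_plane(pts):
    """Least-squares fit of z = a*x + b*y + c via the normal equations."""
    A = [[0.0] * 3 for _ in range(3)]
    rhs = [0.0] * 3
    for x, y, z in pts:
        r = (x, y, 1.0)
        for i in range(3):
            rhs[i] += r[i] * z
            for j in range(3):
                A[i][j] += r[i] * r[j]
    return _solve3(A, rhs)

def ransac_plane(pts, tol=0.05, iters=200, seed=0):
    """RANSAC to reject outliers, then least squares on the inliers."""
    rng = random.Random(seed)
    best = []
    for _ in range(iters):
        sample = rng.sample(pts, 3)
        model = _solve3([[x, y, 1.0] for x, y, _ in sample],
                        [z for _, _, z in sample])
        if model is None:  # collinear sample, no unique plane
            continue
        a, b, c = model
        inliers = [p for p in pts if abs(a*p[0] + b*p[1] + c - p[2]) < tol]
        if len(inliers) > len(best):
            best = inliers
    return _lstsq_plane(best), best
```

The RANSAC loop only decides which points count as inliers; the final parameters come from the least-squares refit, which is the division of labor the text suggests.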
[0018] A multi-feature fusion method is used to detect shadow regions in an image, using a feature vector $\mathbf{f}_i$ that integrates depth features, normal features, intensity features, and RGB color features;
[0019] (formula not reproduced in the source text; it assembles $\mathbf{f}_i$ from the components defined below)
[0020] where $\theta_i$ is the direction angle of the normal vector $\mathbf{n}_i$; $I_i$ is the intensity information of pixel i; $R_i$, $G_i$, $B_i$ are the RGB channel color information of pixel i; $\varepsilon_i$ is a residual; and $\mathbf{f}_i$ is the combined feature vector used to describe the multimodal features of pixel i;
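[0021] (formula not reproduced in the source text; it defines the illumination compensation)
[0022] where $\mathbf{n}_i$ is the unit normal vector at pixel i; L is the illumination direction; k is an empirical coefficient used to adjust the intensity of illumination compensation; and $A$ is the ambient light;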
[0023] In the optimization objective of illumination estimation, in addition to considering pixels in non-shaded areas, feature vectors in shaded areas are also considered. The feature vectors of shaded areas are compared with those of non-shaded areas to improve the accuracy of illumination estimation, thus deriving the optimization objective of illumination estimation:
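[0024] (formula not reproduced in the source text; it gives the optimization objective of illumination estimation)
[0025] where $\bar{\mathbf{f}}_{ns}$ is the average value of the feature vectors of the non-shaded regions.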
[0026] Furthermore, depth estimation features are extracted, including: utilizing binocular vision depth estimation and using features at different levels to extract depth information; low-level features: converting the binocular image to a grayscale image and extracting the grayscale value of each pixel as a low-level feature.
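[0027] (formula not reproduced in the source text)
[0028] where $I(x,y)$ is the gray value at pixel (x,y); $L_x$, $L_y$, $L_z$ are the components of the light direction along x, y, and z; and $\lVert L \rVert$ is the magnitude of the light direction vector;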
[0029] Mid-level features: Each pixel and its neighborhood in the image are binary encoded to extract texture information as features; rotation-invariant CLBP with multiple window sizes of 3×3, 5×5, and 7×7 is used to form multi-scale texture features.
[0030] (formula not reproduced in the source text)
[0031] (formula not reproduced in the source text)
[0032] where $\sigma_{local}$ is the local standard deviation, used to improve robustness to noise; $F_{CLBP}$ is the set of multi-scale rotation-invariant local binary pattern features; CLBP features are calculated over the entire grayscale image $I$ using a window of size s×s, where s is the CLBP window size; $CLBP_s(x,y)$ is the local binary code value at pixel (x,y), reflecting the contrast relationship between this point and its neighboring pixels; $B(\cdot)$ is a binary function; $g_c$ is the grayscale value of the center pixel (x, y); and $g_{(u,v)}$ is the grayscale value of the neighboring pixel (u,v);
[0033] High-level features: The image after shadow segmentation is divided into multiple regions, and features such as shape, texture and color of each region are extracted to represent high-level information.
[0034] Further, Haar features are extracted, including: first, ROI cropping is performed on the image to remove non-road areas on both sides; then, extended Haar features are applied by sliding them across the image, where the feature response value is the sum of the pixel values covered by the white area minus the sum of the pixel values covered by the black area; in obstacle recognition, multiple Haar features at different locations and scales are combined by weighted averaging to capture multiple local features of the obstacle and obtain the final feature fusion response value; finally, the feature fusion response value is input into the AdaBoost classifier to obtain the candidate region of the obstacle.
[0035] Furthermore, the matching results of Haar features and depth estimation features are fused, including: independently converting the extracted subsets of Haar features and the subsets of depth estimation features into vector form; the Haar low-level feature vector is represented as:
[0036] $\mathbf{F}_H = [h_1, h_2, \ldots, h_N]$,
[0037] where $\mathbf{F}_H$ is the feature vector formed by transforming a subset of Haar features; $h_1$ is the first Haar low-level feature value; and $h_N$ is the N-th Haar low-level feature value;
[0038] In depth estimation, low-level features are represented by pixel value vectors, which arrange the pixel values of the image into a vector in a set order, with each pixel value serving as a dimension of the vector.
[0039] The low-level feature vector of depth estimation is represented as:
[0040] $\mathbf{F}_D = [p_1, p_2, \ldots, p_L]$,
[0041] where $\mathbf{F}_D$ is the vector of low-level pixel features extracted by the depth estimation task; $p_1$ is the feature value at the first pixel position; and $p_L$ is the feature value at the L-th pixel position;
[0042] Mid-level features are represented by statistical feature vectors. Texture information is extracted using local binary patterns (LBP), and the LBP histogram of each local region is used as one dimension of the vector.
[0043] High-level features are represented by region descriptors, which are then combined into a vector. For each region segmented from an image, the area, perimeter, circularity, color histogram, and color moments of each region are calculated to represent its features using region descriptors.
[0044] Define the complete set of Haar features $S_H$ and the complete set of depth features $S_D$:
[0045] (formula not reproduced in the source text)
[0046] The extracted low-, mid-, and high-level Haar features and the hierarchical depth estimation features are divided into multiple subsets; each subset represents a specific feature.
[0047] Divide $S_H$ into K subsets, each corresponding to a local feature or semantic region:
[0048] $S_H = \{H_1, H_2, \ldots, H_K\}$,
[0049] Similarly, divide $S_D$ into K subsets:
[0050] $S_D = \{D_1, D_2, \ldots, D_K\}$,
[0051] The dimensions of the Haar feature vector and the hierarchical depth estimation feature vector are then matched. If the two vectors have different dimensions, the shorter vector is padded with zero elements until the dimensions agree.
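The zero-padding step can be sketched in a few lines. The function name is hypothetical, and padding at the end of the vector is an assumption, since the text does not say where the zeros are placed.

```python
def pad_to_match(a, b):
    """Zero-pad the shorter of two feature vectors so both have equal length.

    ASSUMPTION: zeros are appended at the end of the shorter vector.
    """
    n = max(len(a), len(b))
    return a + [0.0] * (n - len(a)), b + [0.0] * (n - len(b))
```

For example, `pad_to_match([1.0, 2.0, 3.0], [4.0])` leaves the first vector unchanged and extends the second to `[4.0, 0.0, 0.0]`.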
[0052] Parallel feature matching computation is performed on each subset; the depth estimation results are used to assist the process of matching depth-estimated features and Haar features, employing depth-nearest neighbor matching.
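[0053] (formula not reproduced in the source text; it defines the matching between corresponding subsets)
[0054] where $D_i$ is the i-th depth feature subset and $H_i$ is the i-th Haar feature subset;
[0055] Let the depth of the depth estimation feature subset be D1, and the depth of the Haar feature subset be D2.
[0056] (formula not reproduced in the source text; it relates D1, D2, and C)
[0057] where C is the confidence level of the depth estimate;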
[0058] Triangulation is used to calculate the distance between matching points. A threshold is set to filter out the closest matching point pairs (A, B). For depth estimation feature A, the one with the smallest distance is selected as the nearest neighbor match.
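A minimal sketch of thresholded nearest-neighbor matching. It assumes that each feature is represented by a point (for example a 3D position) and that the distance is Euclidean; the patent obtains distances by triangulation, which is not reproduced here. The function and parameter names are hypothetical.

```python
import math

def nearest_matches(depth_feats, haar_feats, max_dist):
    """For each depth feature A, pick the closest Haar feature B.

    Returns (index_A, index_B, distance) triples; pairs whose distance
    exceeds max_dist are filtered out.
    """
    matches = []
    for i, a in enumerate(depth_feats):
        j, d = min(((j, math.dist(a, b)) for j, b in enumerate(haar_feats)),
                   key=lambda t: t[1])
        if d <= max_dist:
            matches.append((i, j, d))
    return matches
```

With depth features at (0, 0) and (10, 10), Haar features at (0, 1) and (5, 5), and a threshold of 2, only the first depth feature finds a match.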
[0059] Further, calculating the distance between matching points includes: first, receiving the previously estimated lighting condition data for lighting compensation and camera calibration; then calculating the disparity; and finally converting the disparity to depth. The disparity and image coordinates are first converted to camera coordinates and then to world coordinates, giving the depth value in the world coordinate system. The formulas after lighting compensation are as follows:
[0060] (formula not reproduced in the source text)
[0061] (formula not reproduced in the source text)
[0062] where B is the baseline distance; L is the focal length; d is the disparity; $\alpha$ is the light compensation factor; and $\bar{I}$ is the average grayscale value of the input image.
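For orientation, the standard rectified-stereo relation between disparity and depth is Z = B·L/d, using the text's symbols (B baseline, L focal length in pixels, d disparity). The sketch below implements only that textbook relation; the patent's lighting-compensation factor is not reproduced because its formula is unavailable, and the function name is hypothetical.

```python
def disparity_to_depth(d, baseline, focal):
    """Depth from disparity for a rectified stereo pair: Z = B * L / d.

    Returns None for non-positive disparity (no valid match).
    NOTE: the patent's light compensation factor is NOT applied here.
    """
    if d <= 0:
        return None
    return baseline * focal / d
```

With a 0.5 m baseline, an 800-pixel focal length, and a disparity of 10 pixels, this gives a depth of 40 m; halving the disparity doubles the depth.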
[0063] Furthermore, fusing the matching results of Haar features and depth estimation features also includes: taking as input the depth feature vector $D_i$ and its corresponding Haar feature vector $H_i$, both of dimension K; computing the similarity (or distance) between $D_i$ and $H_i$ with a similarity measure to obtain a similarity value sim; converting sim into attention weights that represent the similarity between $D_i$ and $H_i$, using the softmax function to turn the similarity values into a probability distribution, denoted attention; and performing weighted parallel fusion of $D_i$ and $H_i$ with the attention weights as weight coefficients, then concatenating the weighted feature vectors to obtain the fused feature vector.
[0064] $attention_j = \dfrac{\exp(sim_j)}{\sum_{k} \exp(sim_k)}$,
[0065] where $sim_j$ is the j-th similarity value, and $F_{fused}$ denotes the feature vector after feature fusion.
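The similarity-based attention fusion can be sketched as follows. Cosine similarity is an assumption (the text only says "similarity measures"), and the exact way the weighted subsets are combined is also assumed: here each pair of subsets is scaled by its attention weight and then concatenated.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def fuse(depth_subsets, haar_subsets):
    """Attention-weighted fusion of paired depth and Haar feature subsets.

    ASSUMPTIONS: cosine similarity as the measure; each pair is scaled by
    its softmax weight and the results are concatenated.
    """
    sims = [cosine(d, h) for d, h in zip(depth_subsets, haar_subsets)]
    attn = softmax(sims)
    fused = []
    for w, d, h in zip(attn, depth_subsets, haar_subsets):
        fused.extend(w * x for x in d)
        fused.extend(w * x for x in h)
    return attn, fused
```

Pairs whose depth and Haar descriptions agree receive a larger share of the attention, so they contribute more to the fused vector than pairs that disagree.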
[0066] Furthermore, based on the depth estimation results, a 3D model reconstruction of the obstacle is performed, including: acquiring a depth map, first mapping the depth values from the object coordinate system or camera coordinate system to the normalized device coordinate system; then, based on the camera's intrinsic parameters, mapping the normalized device coordinates to the pixel coordinate system.
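[0067] (formula not reproduced in the source text; it maps classifier output labels to depth-map pixel values)
[0068] The formula relates the range of the label values to the range of pixel values in the depth map (both ranges are lost in the source text): for each pixel (x, y) in the depth map, the corresponding classifier output label is mapped to the pixel value of the mapped depth map;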
[0069] Each pixel position in the depth image is converted into a normalized image plane coordinate system. Using the depth value and the corresponding normalized image plane coordinates, 3D point cloud data is generated. The depth value and the corresponding image coordinates are mapped to the 3D coordinate system to obtain the 3D position of each point. The 3D coordinates of each point are combined into a point cloud dataset. The points in the point cloud are connected into triangular patches to form a continuous 3D mesh model. The generated 3D mesh model is optimized to complete the 3D reconstruction.
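The point-cloud and mesh steps can be illustrated with a pinhole back-projection and a simple grid triangulation. This is a sketch under stated assumptions: a pinhole camera with intrinsics fx, fy, cx, cy (hypothetical parameter names), valid depth at every pixel, and a mesh built by splitting each 2×2 pixel cell into two triangles. The mesh optimization step is not covered.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project every pixel (u, v) with depth z to a 3D point.

    X = (u - cx) * z / fx, Y = (v - cy) * z / fy, Z = z (pinhole model).
    Points are emitted in row-major order so indices match grid_triangles.
    """
    pts = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            pts.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return pts

def grid_triangles(h, w):
    """Connect neighboring pixels of an h x w grid into two triangles per cell."""
    tris = []
    for v in range(h - 1):
        for u in range(w - 1):
            i = v * w + u
            tris.append((i, i + 1, i + w))
            tris.append((i + 1, i + w + 1, i + w))
    return tris
```

The triangle tuples index into the list returned by `depth_to_points`, so together they describe a continuous surface over the depth image.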
[0070] In a second aspect, a highway obstacle recognition system is provided, comprising a storage medium and a processor; the storage medium is used to store instructions; the processor is used to operate according to the instructions to execute the highway obstacle recognition method described in the first aspect.
[0071] Compared with existing technologies, the beneficial effects achieved by this invention are as follows: This invention acquires road images and corresponding depth data from vehicles traveling on expressways; extracts depth features of the road images from the depth data, and performs shadow segmentation on the acquired road images in combination with the depth features of the road images to obtain segmented images; extracts Haar features and depth estimation features from the segmented images and performs parallel matching processing; introduces a similarity-based attention mechanism to fuse the matching results of Haar features and depth estimation features to obtain a fused feature vector; performs thresholding processing on the fused feature vector to obtain depth estimation results; reconstructs 3D models of obstacles based on the depth estimation results; and combines GPS to achieve the identification of obstacles on expressways. This improves the recognition accuracy of road obstacle detection and the recognition capability for dynamic objects, and compensates for the sensitivity of Haar-based detection to illumination and its weak recognition of dynamic objects.
Attached Figure Description
[0072] Figure 1 is a schematic diagram of the main implementation process of an obstacle recognition method for expressways based on the concatenation of Haar features and depth estimation features, as provided in this embodiment of the invention;
[0073] Figure 2 is a schematic diagram of the ROI region in an embodiment of the present invention;
[0074] Figure 3 is a schematic diagram of the improved Haar feature classifier in an embodiment of the present invention.
Detailed Implementation
[0075] The present invention will be further described below with reference to the accompanying drawings. The following embodiments are only used to more clearly illustrate the technical solution of the present invention, and should not be used to limit the scope of protection of the present invention.
[0076] Example 1
[0077] As shown in Figure 1, a method for identifying obstacles on expressways based on the cascaded use of Haar features and depth estimation features includes: acquiring road images and corresponding depth data from vehicles traveling on expressways; extracting depth features of the road images from the depth data and performing shadow segmentation on the acquired road images in combination with the depth features of the road images to obtain segmented images; extracting Haar features and depth estimation features from the segmented images and performing parallel matching processing; introducing a similarity-based attention mechanism to fuse the matching results of Haar features and depth estimation features to obtain a fused feature vector; performing thresholding processing on the fused feature vector to obtain depth estimation results; reconstructing a 3D model of the obstacle based on the depth estimation results; and combining GPS to achieve the identification of obstacles on expressways.
[0078] Step 1: Collect the vehicle image dataset D and the depth dataset E from the depth sensor.
[0079] The vehicle image dataset D needs to distinguish between positive samples (with obstacles) and negative samples (without obstacles), and the obstacles in the positive samples need to be labeled. To improve the model's generalization ability and sensitivity to lighting conditions, the dataset should include different types of obstacles and samples under different lighting conditions. If the data volume is insufficient, data augmentation techniques such as image flipping, scaling, and cropping can be used to expand the dataset. A depth dataset E is collected, and filters are used to denoise the depth data to improve data quality. Since subsequent steps require depth estimation using binocular stereo vision, camera calibration is necessary to obtain the camera's intrinsic and extrinsic parameters to ensure the accuracy of the depth data.
[0080] The color images in the vehicle image dataset D are converted to grayscale images, and then the Sobel operator is used to calculate the gradient values in the horizontal and vertical directions on the grayscale images.
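[0081] Horizontal gradient value $G_x = [I(x+1,y-1) + 2I(x+1,y) + I(x+1,y+1)] - [I(x-1,y-1) + 2I(x-1,y) + I(x-1,y+1)]$,
[0082] Vertical gradient value $G_y = [I(x-1,y+1) + 2I(x,y+1) + I(x+1,y+1)] - [I(x-1,y-1) + 2I(x,y-1) + I(x+1,y-1)]$,
[0083] where $I(x+i, y+j)$, with $i, j \in \{-1, 0, 1\}$ and $(i,j) \neq (0,0)$, are the grayscale values of the eight neighboring pixels centered on the current pixel (x, y).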
[0084] For each pixel, calculate the autocorrelation matrix M based on the gradient values within the window using OpenCV:
[0085] $M = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix}$,
[0086] where $I_x$ is the intensity of the grayscale change of a pixel in the horizontal direction; $I_y$ is the intensity of the grayscale change of a pixel in the vertical direction; $\sum I_x^2$ is the total intensity of the horizontal edges within the window; and $\sum I_y^2$ is the total intensity of the vertical edges within the window.
[0087] For each pixel, calculate the eigenvalues $\lambda_1$ and $\lambda_2$ of the autocorrelation matrix M.
[0088] For each corner point, its pixel coordinates (x, y) in the image are paired with its actual world coordinates (X, Y, Z) on the calibration board.
[0089] Calculate the corner response function $R = \det(M) - k\,(\mathrm{trace}(M))^2$,
[0090] where $\det(M)$ is the determinant of the autocorrelation matrix M, $\mathrm{trace}(M)$ is the trace of M, and k is a constant (usually a small value, 0.04 to 0.06).
[0091] For each pixel, its corner response function value is compared with a pre-set threshold. If the value is greater than the threshold, it is considered a corner. The threshold is gradually adjusted from small to large until a threshold of 0.6 is obtained.
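The corner response used in this calibration step can be sketched directly from the window gradients. The function below is an illustration, not the embodiment's OpenCV code: it takes the horizontal and vertical gradient values inside one window as plain lists (a hypothetical interface) and evaluates det(M) minus k times the squared trace.

```python
def harris_response(ix, iy, k=0.04):
    """Harris corner response for one window.

    ix, iy: gradient values (horizontal, vertical) of the pixels in the window.
    """
    sxx = sum(a * a for a in ix)              # sum of Ix^2
    syy = sum(b * b for b in iy)              # sum of Iy^2
    sxy = sum(a * b for a, b in zip(ix, iy))  # sum of Ix*Iy
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace
```

A window with gradients in both directions gives a positive response (corner), a window with gradients in only one direction gives a negative response (edge), and a flat window gives zero; the response is then compared with the threshold as described above.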
[0092] Step 2: Obtain the depth features of the road image from the depth data, and combine the depth features of the road image to perform shadow segmentation on the acquired road image to obtain the segmented image.
[0093] Using the depth dataset E, scene lighting conditions are estimated. Based on these conditions, a shadow mapping model is built to determine the shadow content of the image and analyze shadow information to distinguish between objects and shadows. Shadow regions have different depth features; by combining depth information with the image, shadow region segmentation is achieved.
[0094] The specific steps for estimating scene lighting conditions using the depth dataset E are as follows:
[0095] Use a depth sensor (binocular stereo vision) to acquire depth images of the scene.
[0096] To reduce the impact of noise and the discontinuity of the normals, Gaussian smoothing is introduced to smooth the normals, and an adaptive Gaussian kernel is introduced whose standard deviation $\sigma$ is adjusted dynamically to balance noise suppression and edge preservation. A Gaussian kernel is applied to each pixel location of the image:
[0097] (formula not reproduced in the source text; it defines $\sigma(u,v)$ in terms of $\sigma_0$, k, and $|\nabla I(u,v)|$)
[0098] where $\sigma_0$ is the standard deviation of the base Gaussian filter; k is the proportional coefficient, which controls the sensitivity of $\sigma$ to gradients; $|\nabla I(u,v)|$ is the magnitude of the image gradient, calculated using the Sobel operator; and $\sigma(u,v)$ is the standard deviation of the Gaussian kernel at pixel position (u,v).
[0099] For each pixel, perform depth image 3D gradient and normalization calculation:
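[0100] $\mathbf{n} = \dfrac{\left(-\partial D/\partial x,\; -\partial D/\partial y,\; 1\right)}{\sqrt{(\partial D/\partial x)^2 + (\partial D/\partial y)^2 + 1}}$,
[0101] where $\mathbf{n}$ is the normalized normal vector; D is the depth image; and $\partial D/\partial x$ and $\partial D/\partial y$ are the rates of change of the depth value in the x and y directions, respectively.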
[0102] Calculate the plane fitting normal, combine the RANSAC algorithm and the least squares method to fit the plane, remove outliers, and estimate the normal;
[0103] $\min_{a,b,c,d} \sum_{i=1}^{N} w_i \left(a x_i + b y_i + c z_i + d\right)^2$,
[0104] where $w_i$ are the weights (the closer the point, the greater the weight); a, b, c, and d are the plane parameters, and the normal is (a, b, c); N is the total number of points participating in the plane fitting (i.e., the number of points in the point cloud or local neighborhood); $(x_i, y_i)$ are the two-dimensional pixel coordinates (image coordinate system) of the i-th point; and $z_i$ is the depth value of the i-th point.
[0105] Using depth and normal information, shadow regions in an image are detected, and lighting conditions are then inferred from the regions outside the shadows. A multi-feature fusion method is used to detect shadow regions in the image, using a feature vector $\mathbf{f}_i$ that integrates depth features, normal features, intensity features, and RGB color features;
[0106] (formula not reproduced in the source text; it assembles $\mathbf{f}_i$ from the components defined below)
[0107] where $\theta_i$ is the direction angle of the normal vector $\mathbf{n}_i$; $I_i$ is the intensity information of pixel i; $R_i$, $G_i$, $B_i$ are the RGB channel color information of pixel i; $\varepsilon_i$ is a residual; and $\mathbf{f}_i$ is the combined feature vector used to describe the multimodal features of pixel i;
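[0108] (formula not reproduced in the source text; it defines the illumination compensation)
[0109] where $\mathbf{n}_i$ is the unit normal vector at pixel i; L is the illumination direction; k is an empirical coefficient used to adjust the intensity of illumination compensation; and $A$ is the ambient light.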
[0110] In the optimization objective of illumination estimation, in addition to considering pixels in non-shaded areas, feature vectors in shaded areas are also considered. The feature vectors of shaded areas are compared with those of non-shaded areas to improve the accuracy of illumination estimation, thus deriving the optimization objective of illumination estimation:
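[0111] (formula not reproduced in the source text; it gives the optimization objective of illumination estimation)
[0112] where $\bar{\mathbf{f}}_{ns}$ is the average value of the feature vectors of the non-shaded regions.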
[0113] A shadow mapping model is established based on the described lighting conditions to determine the shadow content of the image.
[0114] First, a view matrix $V_{light}$ and a projection matrix $P_{light}$ are constructed from the position and direction of the light source. The scene is then rendered from the light source's perspective, and a depth map stores the depth value from the light source to each pixel in the scene. Next, for each vertex V in the scene, its clip-space coordinates from the light source's perspective are calculated:
[0115] $V_{clip} = P_{light} \cdot V_{light} \cdot V$,
[0116] and converted to NDC coordinates. The scene is then rendered normally from the perspective of the main camera. During actual rendering, the shadow map is used to determine whether a pixel is in shadow: the current pixel is transformed from world coordinates to the clip-space coordinates of the light source's view, and the corresponding depth value is read from the shadow map at these coordinates. The depth value of the current pixel (from depth dataset E) is then compared with the depth value in the shadow map; if the depth value of the current pixel is greater, the pixel is in shadow.
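The shadow test can be sketched as a small function. Several details are assumptions rather than parts of the patent: 4×4 row-major matrices, an NDC range of [-1, 1] mapped to [0, 1] for lookup, nearest-texel sampling, and a small depth bias (a common shadow-mapping safeguard against self-shadowing). All names are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]

def in_shadow(world_pt, light_view, light_proj, shadow_map, bias=1e-3):
    """Return True if world_pt is farther from the light than the stored depth."""
    clip = mat_vec(light_proj, mat_vec(light_view, list(world_pt) + [1.0]))
    ndc = [c / clip[3] for c in clip[:3]]  # perspective divide
    h, w = len(shadow_map), len(shadow_map[0])
    # Map NDC x, y from [-1, 1] to texel indices, clamped to the map.
    u = min(w - 1, max(0, int((ndc[0] * 0.5 + 0.5) * w)))
    v = min(h - 1, max(0, int((ndc[1] * 0.5 + 0.5) * h)))
    depth = ndc[2] * 0.5 + 0.5  # map NDC z to [0, 1]
    # ASSUMED bias term; set bias=0 for the plain "greater than" comparison.
    return depth > shadow_map[v][u] + bias
```

The comparison in the last line is the one the text describes: a pixel whose light-space depth exceeds the stored shadow-map depth is occluded from the light.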
[0117] Step 3: Extract Haar features and depth estimation features from the segmented image and perform parallel matching processing.
[0118] Representative and discriminative features are selected from the extracted Haar feature set, including low-level features (edges and lines), mid-level features (image textures and local patterns), and high-level features (obstacle contours). Haar features effectively capture key information in the image. Furthermore, considering the differences in representation and information content between depth estimation hierarchical features and Haar features, they are processed separately using parallel matching. This ensures the comprehensiveness and accuracy of feature extraction, providing a foundation for subsequent feature fusion.
[0119] The Haar feature extraction algorithm extracts image features as follows.
[0120] First, perform ROI cropping on the image to remove non-road areas such as buildings, railings, roadside trees, and sky. As shown in Figure 2, the area within the red box is the ROI region.
[0121] To extract features from various obstacles, this invention employs extended Haar features. The features are slid across the image, and the feature response value is the sum of the pixel values covered by the white areas minus the sum of the pixel values covered by the black areas.
[0122] In obstacle recognition, multiple Haar features at different locations and scales are weighted and averaged to capture multiple local features of the obstacle and obtain the final feature fusion response value.
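A minimal sketch of the Haar response and its weighted combination. It uses an integral image, the usual way to evaluate rectangle sums in constant time, and a single two-rectangle edge feature as a stand-in for the extended Haar feature set; the function names and the weighting scheme are illustrative assumptions.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixel values in the w x h rectangle with top-left corner (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_edge_response(ii, x, y, w, h):
    """Two-rectangle edge feature: white (left half) minus black (right half)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

def fused_response(responses, weights):
    """Weighted average of several Haar responses."""
    return sum(r * w for r, w in zip(responses, weights)) / sum(weights)
```

Evaluating `haar_edge_response` at many positions and scales and passing the results through `fused_response` mirrors the weighted averaging described above; the resulting value would then feed the classifier.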
[0123] Because features are extracted from the image after shadow segmentation, the AdaBoost classifier is improved accordingly. The improved AdaBoost classifier is shown in Figure 3.
[0124] The improved AdaBoost classifier employs a hierarchical cascaded structure and a multi-feature fusion strategy. First, a scene classifier (G1) is introduced to quickly distinguish scene categories in the input image, specifically moving vehicles on a highway, road signs and markings, and shadows of buildings or vehicles. An obstacle recognition classifier is added to the shadow classifier, outputting a region of interest (ROI) to narrow down the scope of subsequent processing. Building upon this, multiple cascaded feature classifiers (G2,...,Gn) are designed to perform fine-grained classification for different semantic targets. Each classifier selects the most suitable features based on the target characteristics and is trained independently to avoid interference between features, achieving feature diversity and targeted optimization of the classifiers.
[0125] For classifier optimization, dynamic feature weight adjustment is adopted, dynamically changing feature importance based on the classification results of the previous stage. The shadow classifier reduces the weight of color features and increases the weight of texture features to adapt to complex scene changes. At the same time, a classifier cascade pruning mechanism is introduced to prevent irrelevant regions from entering subsequent calculations by rejecting low-confidence regions early. In terms of decision-making mechanism, a multi-level feedback and fusion strategy is adopted. During forward propagation, the cascade decision-making process is followed step by step. The results of the subsequent classifiers can be fed back to the previous stage for optimization. Finally, the outputs of each classifier are fused through weighted voting, and non-maximum suppression (NMS) is used to solve the problem of overlapping detection of multiple classifiers.
[0126] In the improved AdaBoost classifier, the process of extracting obstacle candidate regions from shadow areas mainly consists of two stages. First, the scene classifier (G1) performs coarse-grained classification on the input image, determining the scene type and extracting the region of interest (ROI). Next, the shadow classifier (Gx) detects shadow regions within the ROI, generating shadow region masks using features such as color and texture. Simultaneously, the obstacle classifier (Gn) combines features such as geometric shape and depth information to detect obstacle candidate regions. Finally, through region intersection analysis and confidence fusion, the shadow regions and obstacle candidate regions are merged, redundant detections are removed, and non-maximum suppression (NMS) is used to determine the final obstacle candidate regions.
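The final non-maximum suppression step can be sketched with the standard greedy algorithm. Boxes are assumed to be (x1, y1, x2, y2) tuples and the IoU threshold of 0.5 is a hypothetical default; the patent does not state either.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```

When two classifiers report nearly the same obstacle region, only the higher-confidence detection survives, while a detection elsewhere in the image is unaffected.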
[0127] By utilizing binocular stereo vision depth estimation, depth information is extracted using features at different levels.
[0128] Low-level features (grayscale values): Convert the stereo image to a grayscale image and extract the grayscale value of each pixel as a low-level feature.
[0129] ,
[0130] where the symbols denote, in order: the gray value at pixel (x, y); the components of the light direction along the x, y, and z axes; and the magnitude of the light direction vector.
[0131] Mid-level features (texture information): Each pixel and its neighborhood in the image are binary encoded to extract texture information as features; rotation-invariant CLBP with multiple window sizes (3×3, 5×5, 7×7) is used to form multi-scale texture features.
[0132] ,
[0133] ,
[0134] where the symbols denote, in order: the local standard deviation, used to improve robustness to noise; the feature set of multi-scale rotation-invariant local binary patterns, with CLBP features computed over the entire grayscale image using an s×s window, where s is the CLBP window size; the local binary code value at pixel (x, y), which reflects the contrast between that pixel and its neighboring pixels; a binary function; the grayscale value of the center pixel (x, y); and the grayscale value of the neighboring pixel (u, v).
[0135] High-level features (obstacle shape, key points): Using the data from step two, the image after shadow segmentation is divided into multiple regions, and features such as shape, texture, and color of each region are extracted to represent high-level information.
[0136] Step 4: Introduce a similarity-based attention mechanism to fuse the matching results of Haar features and depth estimation features to obtain the fused feature vector.
[0137] Attention mechanisms can highlight important features and suppress irrelevant features, thereby improving the effect of feature fusion. The fused feature vectors are paired with their corresponding labels to form a training dataset. To ensure scale consistency among different features, the training data is normalized. Then, the training dataset is divided into a training set and a validation set. The fused object detection model is trained using the training set, and its performance is evaluated using the validation set, ensuring the model's robustness and accuracy in different scenarios.
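The normalization and dataset split described above can be sketched as follows. Min-max scaling, the 80/20 split ratio, and the fixed shuffle seed are assumptions made for illustration; the source states only that the data is normalized and divided into a training set and a validation set.

```python
import random


def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def split_dataset(samples, val_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split into (training, validation)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


features = [[0, 10], [5, 10], [10, 10]]
print(min_max_normalize(features))
train, val = split_dataset(list(range(10)))
print(len(train), len(val))
```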
[0138] The specific steps for depth estimation of objects in a scene using binocular stereo vision are as follows.
[0139] The extracted Haar feature subsets and depth estimation feature subsets are independently converted into vector form.
[0140] Haar low-level feature vector representation (the mid-level and high-level representations are analogous):
[0141] $V_{\text{Haar}} = [h_1, h_2, \ldots, h_N]$,
[0142] where $V_{\text{Haar}}$ is the feature vector formed from a subset of Haar features, $h_1$ is the first Haar low-level feature value, and $h_N$ is the N-th Haar low-level feature value;
[0143] In depth estimation, low-level features are represented by pixel value vectors, which arrange the pixel values of the image into a vector in a predetermined order, with each pixel value serving as a dimension of the vector.
[0144] The low-level feature vector of depth estimation is represented as:
[0145] $V_{\text{depth}} = [p_1, p_2, \ldots, p_L]$,
[0146] where $V_{\text{depth}}$ is the vector of low-level pixel features extracted by the depth estimation task, $p_1$ is the feature value at the first pixel position, and $p_L$ is the feature value at the L-th pixel position;
[0147] Mid-level features are represented by statistical feature vectors. Texture information is extracted using local binary patterns (LBP), and the LBP histogram of each local region is used as one dimension of the vector.
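[0148] $\mathrm{LBP}(x, y) = \sum_{i=0}^{P-1} s\big(I(x + \Delta x_i,\, y + \Delta y_i) - I(x, y)\big)\, 2^{i}$,
[0149] where $s(\cdot)$ is the sign function, equal to 1 if its argument is non-negative and 0 otherwise; $(\Delta x_i, \Delta y_i)$ is the coordinate offset of the i-th pixel in the neighborhood relative to the center pixel; $I(x, y)$ is the grayscale value at pixel (x, y); and $P$ is the number of pixels in the neighborhood.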
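The local binary pattern encoding and its histogram can be sketched as follows. This is the basic 3×3 LBP with a greater-than-or-equal comparison, not the rotation-invariant multi-scale CLBP variant named in the description; the neighbor ordering in `OFFSETS` is an arbitrary assumption.

```python
# (dx, dy) offsets of the 8 neighbors; the ordering is an illustrative choice.
OFFSETS = [(-1, -1), (0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0)]


def lbp_code(image, x, y):
    """8-bit LBP code at (x, y): bit i is set when neighbor i >= center."""
    center = image[y][x]
    code = 0
    for i, (dx, dy) in enumerate(OFFSETS):
        if image[y + dy][x + dx] >= center:
            code |= 1 << i
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            hist[lbp_code(image, x, y)] += 1
    return hist


gray = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
print(lbp_code(gray, 1, 1))
```

Each histogram bin counts the pixels whose LBP value equals the bin index, which is the quantity used as a texture descriptor for a local region.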
[0150] High-level features are represented by region descriptors, which are then combined into a vector. For each region segmented from the image, the area, perimeter, circularity, color histogram, and color moments of each region are calculated. Region descriptors are used to represent their features.
[0151] ,
[0152] where the histogram entry for index i is the number of pixels whose LBP value is i.
[0153] Define the complete set of Haar features and the complete set of depth features:
[0154] ,
[0155] The extracted low, medium, and high-level Haar features and the depth-estimated hierarchical features are divided into multiple subsets; each subset represents a specific feature.
[0156] The complete set of Haar features is divided into K subsets, each subset corresponding to a local feature or semantic region:
[0157] ,
[0158] Similarly, the complete set of depth features is divided into K subsets:
[0159] ,
[0160] The dimensions of the Haar feature vector and the hierarchical depth estimation feature vector are matched. If the two feature vectors differ in dimension, the lower-dimensional vector is padded with zero elements until the dimensions are equal.
[0161] ,
[0162] if , then:
[0163] ,
[0164] if , then:
[0165] ,
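The zero-padding rule for dimension matching can be sketched as follows. The function name `pad_to_match` is hypothetical; padding at the end of the shorter vector is an assumption, since the source says only that extra zero elements are added.

```python
def pad_to_match(haar_vec, depth_vec):
    """Zero-pad the shorter vector at its end so both have equal length."""
    n = max(len(haar_vec), len(depth_vec))

    def pad(vec):
        return list(vec) + [0.0] * (n - len(vec))

    return pad(haar_vec), pad(depth_vec)


haar, depth = pad_to_match([1.0, 2.0], [3.0, 4.0, 5.0])
print(haar, depth)
```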
[0166] Each subset is assigned to a multi-core CPU system for parallel feature matching computation. Each CPU core independently performs computation on its own subset. The matching process uses depth estimation results to assist in matching depth-estimated features with Haar features, employing depth-nearest neighbor matching.
[0167] ,
[0168] where the first term is the i-th depth feature subset and the second term is the i-th Haar feature subset.
[0169] Based on the pixel coordinates and corresponding depth information of the matched feature point pairs in the image, the consistency between the depth estimation feature subset and its corresponding subset is verified by calculating the difference in depth values of the matched feature point pairs. Let the depth value of the depth estimation feature subset be D1, and the depth value of the Haar feature subset be D2.
[0170] ,
[0171] Where C is the confidence level of the depth estimation.
[0172] Depth information provides the relative positional relationships of objects and is used to constrain the spatial range of the nearest neighbor search. Triangulation is used to calculate the distance between matching points, and a threshold is set so that only the closest matching point pairs (A, B) are retained.
[0173] ,
[0174] Set a threshold and retain the matching point pairs whose distance is small:
[0175] ,
[0176] For a depth-estimated feature A, calculate its Euclidean distance to all candidate Haar feature vectors B, and select the one with the smallest distance as the nearest neighbor match.
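[0177] $d(A, B) = \lVert \mathbf{f}_A - \mathbf{f}_B \rVert_2 = \sqrt{\sum_{k} \left(f_{A,k} - f_{B,k}\right)^2}$,
[0178] where $\mathbf{f}_A$ is the feature vector of depth estimation feature A, and $\mathbf{f}_B$ is the feature vector of candidate Haar feature B.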
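The nearest neighbor selection by Euclidean distance can be sketched as follows. The function name and the optional `max_distance` gate are illustrative assumptions; the depth-based restriction of the search range described earlier is not modeled here.

```python
import math


def nearest_haar_match(depth_feature, haar_candidates, max_distance=None):
    """Return (index, distance) of the closest candidate Haar feature vector.

    The index is None when there are no candidates or when the best
    distance exceeds the optional `max_distance` threshold.
    """
    best_index, best_dist = None, math.inf
    for i, candidate in enumerate(haar_candidates):
        dist = math.dist(depth_feature, candidate)  # Euclidean distance
        if dist < best_dist:
            best_index, best_dist = i, dist
    if max_distance is not None and best_dist > max_distance:
        return None, best_dist
    return best_index, best_dist


candidates = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
print(nearest_haar_match([0.0, 0.0], candidates))
```

Because each feature subset is matched independently, a call like this could be dispatched per subset to separate worker processes, which corresponds to the per-core parallel matching in the description.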
[0179] Binocular stereo vision depth estimation estimates the distance to objects in a scene. First, it receives pre-estimated lighting conditions for illumination compensation. Then, it calibrates both the left and right cameras. Next, it calculates the disparity: disparity = pixel coordinates in the left image - corresponding pixel coordinates in the right image. Finally, it converts the disparity to depth by first converting the disparity and image coordinates to camera coordinates, and then further converting them to world coordinates to obtain the depth value in world coordinates. The specific formula after illumination compensation is as follows:
[0180] ,
[0181] ,
[0182] where B is the baseline distance; L is the focal length; d is the disparity; H is an empirical coefficient used to adjust the strength of illumination compensation; and the remaining two symbols denote the illumination compensation factor and the average grayscale value of the input image, respectively.
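The disparity and depth computation can be sketched with the standard rectified-stereo relation, depth = baseline × focal length / disparity. This sketch deliberately leaves out the illumination compensation term, whose exact form is not assumed here, and the function names are hypothetical.

```python
def disparity(x_left, x_right):
    """Disparity of a matched point: left-image x minus right-image x."""
    return x_left - x_right


def depth_from_disparity(baseline, focal_length, disp):
    """Depth for a rectified stereo pair, without illumination compensation.

    baseline and the returned depth share one length unit; focal_length and
    disp are both in pixels.
    """
    if disp <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return baseline * focal_length / disp


d = disparity(420.0, 400.0)
print(depth_from_disparity(0.5, 800.0, d))
```

With a 0.5 m baseline, an 800-pixel focal length, and a 20-pixel disparity, this relation gives a depth of 20 m.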
[0183] Step 5: Threshold the fused feature vectors to obtain depth estimation results. Reconstruct the 3D model of the obstacle based on the depth estimation results, and combine it with GPS to identify obstacles on highways and expressways.
[0184] Thresholding is performed, and the Huber loss function is calculated to optimize the model's robustness and stability. The Huber loss function combines the advantages of the mean squared error and the absolute error, and remains robust in complex environments. The threshold is set to h × std, where std is the standard deviation and h is an empirical adjustment factor that can be tuned according to actual conditions. Finally, a 3D model of the obstacle is reconstructed from the depth estimation results. The 3D reconstruction describes and locates obstacles on highways, including their position and specific information, providing reliable data support for subsequent decision-making and processing.
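The Huber loss and the h × std threshold can be sketched as follows. Using the threshold as the Huber transition point `delta`, and computing std as the population standard deviation of a list of values, are assumptions made for illustration.

```python
def huber_loss(residual, delta):
    """Quadratic for |residual| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def threshold_from_std(values, h=1.0):
    """Threshold h * std, with std the population standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return h * variance ** 0.5


residuals = [2, 4, 4, 4, 5, 5, 7, 9]
delta = threshold_from_std(residuals, h=2.0)
print(delta, huber_loss(0.5, 1.0), huber_loss(3.0, 1.0))
```

Small residuals are penalized like squared error, while large residuals grow only linearly, which limits the influence of outlying depth estimates.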
[0185] The input depth feature vector and its corresponding Haar feature vector both have dimension K. A similarity measure is used to compute the similarity (or distance) between the two vectors, yielding a similarity value sim. The similarity value sim is converted into an attention weight that represents how similar the two vectors are: the softmax function converts the similarity values into a probability distribution, and the result is taken as the attention weight, denoted attention. The two vectors are then fused in parallel by weighting, with the attention weights as the weight coefficients, and the weighted feature vectors are concatenated to obtain the fused feature vector.
[0186] ,
[0187] where the first term is the j-th similarity value and the second term is the feature vector after feature fusion.
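The similarity-based attention fusion can be sketched as follows. Cosine similarity as the similarity measure, a softmax taken across the paired feature subsets, and concatenation of the weighted depth and Haar vectors are all assumptions; the source names the softmax and the concatenation but does not fix these details.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(values):
    """Numerically stable softmax over a non-empty list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(depth_vecs, haar_vecs):
    """Weight each (depth, Haar) pair by its attention and concatenate."""
    sims = [cosine_similarity(d, h) for d, h in zip(depth_vecs, haar_vecs)]
    attention = softmax(sims)
    fused = []
    for w, d, h in zip(attention, depth_vecs, haar_vecs):
        fused.extend(w * x for x in d)
        fused.extend(w * x for x in h)
    return attention, fused


attention, fused = attention_fuse([[1.0, 0.0], [0.0, 1.0]],
                                  [[1.0, 0.0], [0.0, 1.0]])
print(attention, fused)
```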
[0188] Obtain the depth map. First, map the depth values from the object coordinate system (or camera coordinate system) to the normalized device coordinate system. Then, based on the camera's intrinsic parameters, map the normalized device coordinates back to the pixel coordinate system. Orthographic projection is used throughout to reduce the computational load. The classification results are then mapped onto the depth map; since the classifier output is a label, label mapping is used.
[0189] ,
[0190] where the formula involves, in order: the range of the label values; the range of pixel values in the depth map; the classifier output label corresponding to each pixel (x, y) in the depth map; and the pixel value of the mapped depth map.
[0191] Based on the depth estimation results after threshold processing, the obstacle is reconstructed into a 3D model.
[0192] Convert each pixel location in the depth image to a normalized image plane coordinate system.
[0193] $x_n = \dfrac{x - c_x}{f_x}$,
[0194] $y_n = \dfrac{y - c_y}{f_y}$,
[0195] where $(x, y)$ are the pixel coordinates in the depth image, $(c_x, c_y)$ are the coordinates of the principal point of the image, and $f_x$ and $f_y$ are the camera's horizontal and vertical focal lengths.
[0196] Using depth values and their corresponding normalized image plane coordinates, 3D point cloud data is generated. Mapping the depth values and corresponding image coordinates to a 3D coordinate system yields the 3D position of each point.
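[0197] $X = x_n \cdot Z$,
[0198] $Y = y_n \cdot Z$,
[0199] $Z = D(x, y)$, where $D(x, y)$ is the depth value at pixel $(x, y)$ and $(x_n, y_n)$ are the normalized image plane coordinates of that pixel.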
[0200] The 3D coordinates (X, Y, Z) of each point are combined to form a point cloud dataset. Since the depth estimation results have already been thresholded, the filtering to remove outliers is omitted.
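The conversion from a depth image to a point cloud can be sketched with the standard pinhole back-projection. The function name `backproject`, the nested-list depth image, and the convention that a non-positive or missing depth marks an invalid pixel are assumptions made for illustration.

```python
def backproject(depth, fx, fy, cx, cy):
    """Convert a depth image (rows of depth values) to a list of 3D points.

    fx, fy are the focal lengths in pixels and (cx, cy) is the principal
    point. Pixels with a missing or non-positive depth are skipped.
    """
    points = []
    for y, row in enumerate(depth):
        for x, z in enumerate(row):
            if z is None or z <= 0:
                continue
            xn = (x - cx) / fx  # normalized image plane coordinates
            yn = (y - cy) / fy
            points.append((xn * z, yn * z, z))
    return points


depth_image = [[2.0, 0.0],
               [4.0, 2.0]]
print(backproject(depth_image, fx=2.0, fy=2.0, cx=0.0, cy=0.0))
```

The resulting list of (X, Y, Z) tuples is the point cloud that a subsequent triangulation step would turn into a mesh.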
[0201] The points in the point cloud are connected to form triangular patches, creating a continuous 3D mesh model. The generated 3D mesh model is then optimized, including smoothing, to complete the 3D reconstruction.
[0202] Finally, with the help of the GPS system, the location and specific information of obstacles on the expressway are determined, including the type of obstacle and its geometric data.
[0203] Embodiment 2
[0204] Based on the highway obstacle recognition method based on the concatenation of Haar features and depth estimation features described in Embodiment 1, this embodiment provides a highway obstacle recognition system based on the concatenation of Haar features and depth estimation features, including a storage medium and a processor; the storage medium is used to store instructions; the processor is used to operate according to the instructions to execute the highway obstacle recognition method based on the concatenation of Haar features and depth estimation features described in Embodiment 1.
[0205] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the technical principles of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
Claims
1. A method for identifying obstacles on expressways, characterized in that, include: Obtain road images and corresponding depth data from vehicles traveling on highways and expressways; Depth features of road images are obtained from depth data, and shadow segmentation is performed on the acquired road images in combination with the depth features of road images to obtain segmented images; Haar features and depth estimation features are extracted from the segmented image and then matched in parallel. A similarity-based attention mechanism is introduced to fuse the matching results of Haar features and depth estimation features to obtain a fused feature vector. Thresholding is applied to the fused feature vectors to obtain depth estimation results. Based on the depth estimation results, a 3D model of the obstacle is reconstructed, and GPS is used to identify obstacles on expressways. The process of shadow segmentation in the acquired road images includes: The scene lighting conditions are estimated using depth data. A shadow mapping model is built based on the described lighting conditions to determine the shadow content of the image and analyze the shadow information to distinguish between objects and shadows. Shadow regions have different depth features. By combining the depth information with the image, the shadow region is segmented. When performing actual rendering, the Shadow Mapping model uses shadow maps to determine whether a pixel is in shadow. It transforms the current pixel from the world coordinate system to the clip space coordinate system of the light source's view and obtains the corresponding depth value in the shadow map using the corresponding coordinates. Then, it compares the depth value of the current pixel with the depth value in the shadow map. If the depth value of the current pixel is greater than the depth value in the shadow map, it means that the pixel is in shadow. 
Extracting depth estimation features, including: By utilizing binocular vision depth estimation, depth information is extracted using features at different levels; Low-level features: Convert the binocular image to a grayscale image and extract the grayscale value of each pixel as a low-level feature; Mid-level features: Binary encoding is performed on each pixel and its neighborhood in the image to extract texture information as features; rotation-invariant CLBP with multiple window sizes of 3×3, 5×5, and 7×7 is used to form multi-scale texture features; High-level features: The image after shadow segmentation is divided into multiple regions, and features such as shape, texture and color of each region are extracted to represent high-level information.
2. The method for identifying obstacles on expressways according to claim 1, characterized in that, Estimating scene lighting conditions using depth data includes: Use a depth sensor to acquire depth images of the scene; Gaussian smoothing is introduced to smooth the normals, and an adaptive Gaussian kernel is introduced to dynamically adjust the normals. Parameters are used to balance noise suppression and edge preservation; a Gaussian kernel is applied to each pixel location of the depth image: , in, It is the standard deviation of the basic Gaussian filter; k is the proportional coefficient, which controls... Sensitivity to gradients; It is the magnitude of the image gradient, calculated using the Sobel operator; It represents the standard deviation of the Gaussian kernel at pixel position (u,v); For each pixel, perform depth image 3D gradient and normalization calculation: , in, It is the normalized normal vector; D is the depth image; , These are the rates of change of the depth value in the x and y directions, respectively; Calculate the plane fitting normal, combine the RANSAC algorithm and the least squares method to fit the plane, remove outliers, and estimate the normal; , in, ... , These are the two-dimensional pixel coordinates of the i-th point. It is the depth value of the i-th point; A multi-feature fusion method is used to detect shadow regions in an image, utilizing feature vectors. 
It integrates depth features, normal features, intensity features, and RGB color features; , in, It is the normal vector Direction angle; It is the intensity information of pixel i; , , These are the RGB channel color information of pixel i; It is a residual; It is a combined feature vector used to describe the multimodal features of pixel i; , in, is the unit normal vector at pixel i; L is the illumination direction; k is an empirical coefficient used to adjust the intensity of illumination compensation; It is ambient light; In the optimization objective of illumination estimation, in addition to considering pixels in non-shaded areas, feature vectors in shaded areas are also considered. The feature vectors of shaded areas are compared with those of non-shaded areas to improve the accuracy of illumination estimation, thus deriving the optimization objective of illumination estimation: , in, It is the average value of the feature vectors of the non-shaded regions.
3. The method for identifying obstacles on expressways according to claim 2, characterized in that, The low-level features are expressed as follows: , in, It is the gray value at pixel (x,y); , , They are x , y , z Direction of light; It is the magnitude of the light direction vector; The expression for the mid-level feature is: , , in, It is the local standard deviation, used to enhance the robustness of noise; It is a set of features of multi-scale rotation-invariant local binary patterns. In the entire grayscale image Above, CLBP features are calculated using a window size of s×s, where s is the window size of CLBP. It is the local binary code value at pixel (x,y), reflecting the contrast relationship between this point and its neighboring pixels. It is a binary function. It is the grayscale value of the center pixel (x, y). It is the grayscale value of the neighboring pixel (u,v).
4. The method for identifying obstacles on expressways according to claim 3, characterized in that, Extracting Haar features includes: First, perform ROI cropping on the image to remove the non-road areas on both sides; Using the extended Haar feature, slid across the image, the feature response value is the sum of the pixels covered by the white area minus the sum of the pixels covered by the black area; In obstacle recognition, multiple Haar features at different locations and scales are weighted and averaged to capture multiple local features of the obstacle and obtain the final feature fusion response value. The feature fusion response value is input into the AdaBoost classifier to obtain the candidate region of the obstacle.
5. The method for identifying obstacles on expressways according to claim 4, characterized in that, The matching results of Haar features and depth estimation features are fused, including: The extracted subsets of Haar features and the subsets of depth-estimated features are independently converted into vector form; The low-level eigenvectors of Haar are represented as follows: , in, It is a feature vector formed by transforming a subset of Haar features; It is the first Haar low-level eigenvalue. It is the Nth Haar low-level eigenvalue; In depth estimation, low-level features are represented by pixel value vectors, which arrange the pixel values of the image into a vector in a set order, with each pixel value serving as a dimension of the vector. The low-level feature vector of depth estimation is represented as: , in, It is a vector composed of low-level pixel features extracted by the depth estimation task; It is the feature value of the first pixel position. It is the feature value at the Lth pixel position; Intermediate features are represented by statistical feature vectors. Texture information is extracted using local binary mode, and the LBP histogram of each local region is used as one dimension of the vector. High-level features are represented by region descriptors, which are then combined into a vector. For each region segmented from an image, the area, perimeter, circularity, color histogram, and color moments of each region are calculated to represent its features using region descriptors. Define the complete set of Haar features The complete set of deep features : , The extracted low, medium, and high-level Haar features and the depth-estimated hierarchical features are divided into multiple subsets; each subset represents a specific feature. 
Will Divide into K subsets, each subset corresponding to a local feature or semantic region: , Similarly, Divide into K subsets , The dimensions of the Haar feature vector and the hierarchical feature vector of the depth estimation are matched. If the dimensions of the two feature vectors are inconsistent, the dimensions are made up by adding extra zero elements to the vector with smaller dimensions. Parallel feature matching computation is performed on each subset; the depth estimation results are used to assist the process of matching depth-estimated features and Haar features, employing depth-nearest neighbor matching. , in, For the i-th depth feature subset; For the i-th Haar feature subset; Let the depth of the depth estimation feature subset be D1, and the depth of the Haar feature subset be D2. , Where C is the confidence level of the depth estimate; Triangulation is used to calculate the distance between matching points. A threshold is set to filter out the closest matching point pairs (A, B). For depth estimation feature A, the one with the smallest distance is selected as the nearest neighbor match.
6. The method for identifying obstacles on expressways according to claim 5, characterized in that, Calculating the distance between matching points includes: First, pre-estimated lighting condition data is received to perform illumination compensation, and the camera is calibrated. Then, the disparity is calculated. Finally, the disparity is converted to depth: the disparity and image coordinates are first converted to coordinates in the camera coordinate system, and then further converted to coordinates in the world coordinate system to obtain the depth value in the world coordinate system. After illumination compensation, the specific formula is as follows: , , where B is the baseline distance; L is the focal length; d is the disparity; and the remaining two symbols denote the illumination compensation factor and the average grayscale value of the input image, respectively.
7. The method for identifying obstacles on expressways according to claim 6, characterized in that, The fusion of matching results between Haar features and depth estimation features also includes: The input depth feature vector and its corresponding Haar feature vector both have dimension K; a similarity measure is used to compute the similarity or distance between the two vectors, yielding a similarity value sim; the similarity value sim is converted into an attention weight representing how similar the two vectors are; the softmax function is used to convert the similarity values into a probability distribution, and the result is taken as the attention weight, denoted attention; the two vectors are fused in parallel by weighting, with the attention weights as the weight coefficients, and the weighted feature vectors are concatenated to obtain the fused feature vector: , where the first term is the j-th similarity value and the second term is the feature vector after feature fusion.
8. The method for identifying obstacles on expressways according to claim 7, characterized in that, Based on the depth estimation results, a 3D model of the obstacle is reconstructed, including: To obtain a depth map, the depth values are first mapped from the object coordinate system or camera coordinate system to the normalized device coordinate system; then, based on the camera's intrinsic parameters, the normalized device coordinates are mapped to the pixel coordinate system. , The range of the label values is: The range of pixel values in the depth map is For each pixel (x, y) in the depth map, the corresponding classifier output label is: , These are the pixel values of the mapped depth map; Each pixel position in the depth image is converted into a normalized image plane coordinate system. Using the depth value and the corresponding normalized image plane coordinates, 3D point cloud data is generated. The depth value and the corresponding image coordinates are mapped to the 3D coordinate system to obtain the 3D position of each point. The 3D coordinates of each point are combined into a point cloud dataset. The points in the point cloud are connected into triangular patches to form a continuous 3D mesh model. The generated 3D mesh model is optimized to complete the 3D reconstruction.
9. A highway obstacle recognition system, characterized in that, it includes a storage medium and a processor; the storage medium is used to store instructions; the processor is configured to operate according to the instructions to execute the method for identifying obstacles on expressways according to any one of claims 1 to 8.