Methods, devices, and storage media for pedestrian intent detection based on pedestrian characteristics and human-vehicle interaction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTHEAST UNIV
- Filing Date
- 2024-04-02
- Publication Date
- 2026-06-30
Figure CN118298227B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of pedestrian crossing intention detection, and in particular to a method, device and storage medium for pedestrian intention detection based on pedestrian characteristics and human-vehicle interaction.
Background Technology
[0002] Predicting pedestrian crossing intentions is fundamental to improving pedestrian safety in transportation engineering, especially in autonomous driving, and research on it has significant practical value. In recent years, deep learning-based pedestrian crossing intention recognition models have become mainstream because of their high accuracy and their ability to cover numerous environmental factors. However, existing models primarily consider the surrounding environment. Although environmental features can reflect pedestrian crossing intentions, the information they provide lacks universality and varies significantly between individuals. Vehicle status, by contrast, has a greater influence on pedestrian crossing intentions, yet limited research has focused on human-vehicle interaction, which explains the relatively low accuracy of current pedestrian crossing intention recognition methods. The availability of large LiDAR datasets now provides a data foundation for research on human-vehicle interaction.
[0003] For example, the paper "Pedestrians crossing intention anticipation based on dual-channel action recognition and hierarchical environmental context" proposes a multi-factor fusion network (MFFN) to predict pedestrians' crossing intentions. It includes a dual-channel action recognition sub-network that robustly recognizes pedestrian actions by adaptively fusing skeletal and appearance features. A hierarchical attention network and a lightweight semantic segmentation method are then used to achieve object-level and semantic-level perception of the traffic scene. Finally, a self-attention mechanism integrates the various factors to predict pedestrians' crossing intentions. However, the method does not consider the human-vehicle interaction process in traffic scenes. Similarly, the paper "Crossing or Not? Context-Based Recognition of Pedestrian Crossing Intention in the Urban Environment" proposes a pedestrian crossing intention recognition (PCIR) framework. It introduces a target-of-interest search module to find pedestrians who may be crossing the road while simultaneously performing scene perception. The action recognition module uses a 3D convolutional neural network to extract spatiotemporal features. Distance encoding is added to encode the distance between pedestrians and vehicles and the local traffic scene around the pedestrian, which improves recognition accuracy and demonstrates the robustness of the PCIR framework compared with methods based solely on skeletons. However, simply encoding the pedestrian-vehicle distance still cannot reflect the pedestrian-vehicle interaction process. More of the factors that influence pedestrians' crossing intentions need to be introduced to predict those intentions better.
[0004] While research on pedestrian crossing intention recognition has made some progress, the following shortcomings still exist:
[0005] 1. Most existing studies are based on pedestrians’ own motion characteristics and environmental features, such as pedestrian position or pedestrian posture characteristics, without considering the pedestrian’s understanding of the human-vehicle interaction process during the crossing of the street, and lack modeling and analysis of the human-vehicle interaction process.
[0006] 2. Most existing pedestrian crossing intention recognition methods lack an understanding of global features and cannot predict pedestrian crossing intentions by understanding global features from vehicle-mounted video, which makes it difficult to improve the accuracy of intention prediction.
Summary of the Invention
[0007] The purpose of this invention is to provide a method, device, and storage medium for pedestrian intent detection based on pedestrian characteristics and human-vehicle interaction.
[0008] The objective of this invention can be achieved through the following technical solutions:
[0009] A pedestrian intent detection method based on pedestrian characteristics and human-vehicle interaction includes:
[0010] Acquire driving recording video from vehicle-mounted cameras;
[0011] The process involves: performing person detection on dashcam video to obtain pedestrian detection results; extracting pedestrian key points from the results and saving a key point heatmap; determining the speed of each pedestrian based on the extracted key points; inputting the extracted key points into a first feature extraction network to obtain a first behavioral feature representing the classification of each pedestrian's motion state; inputting the key point heatmap into a second feature extraction network to obtain a second behavioral feature; and concatenating the first and second behavioral features to obtain the pedestrian behavioral feature.
[0012] Scene segmentation and depth estimation are performed on the driving record video to obtain scene segmentation map and depth estimation map. The scene segmentation map and depth estimation map are then concatenated and input into the third feature extraction network. Combined with vehicle speed, the human-vehicle interaction features of each frame of the driving record video are obtained.
[0013] The pedestrian detection results, pedestrian speed, pedestrian behavior features, and human-vehicle interaction features of the corresponding frames are input into the intent classification network to obtain the pedestrian's intent to cross the road detection results.
[0014] The vehicle-mounted camera equipment includes one or more vehicle-mounted camera devices, and each vehicle-mounted camera device includes at least one camera.
[0015] The process of performing person detection on the dashcam video to obtain pedestrian detection results specifically includes:
[0016] Human detection is performed on each frame of the dashcam video to obtain the pedestrians contained in each frame and the positional features of each pedestrian;
[0017] The pedestrians detected in each frame are tracked, and the pedestrians contained in each frame are merged to obtain the trajectory of each pedestrian.
[0018] The location feature of the pedestrian is specifically the bounding rectangle of the pedestrian.
[0019] The speed of each pedestrian is determined based on the extracted pedestrian key points, specifically including:
[0020] Based on the left and right waist feature points of the pedestrian, the position of the midpoint between the two waist feature points is calculated;
[0021] The pedestrian's speed is obtained based on the position of the pedestrian's waist midpoint in at least two consecutive frames.
[0022] The pedestrian movement status classification results include standing, walking, and running.
[0023] The HRNet network is used in the process of extracting pedestrian key points.
[0024] The intent classification network includes a first GRU module, a second GRU module, a third GRU module, a fourth GRU module, a first attention module, a second attention module, a third attention module, and a fully connected layer;
[0025] The process of inputting pedestrian detection results, pedestrian speed, pedestrian behavior features, and corresponding frame human-vehicle interaction features into an intent classification network to obtain pedestrian crossing intent detection results includes:
[0026] The pedestrian behavior features are input into the first GRU module, and the output of the first GRU module is obtained.
[0027] The output of the first GRU module and the pedestrian's speed are input into the second GRU module to obtain the output of the second GRU module;
[0028] The output of the second GRU module is combined with the pedestrian's position features and input into the third GRU module to obtain the output of the third GRU module.
[0029] The output of the third GRU module is input into the first attention module to obtain the first intent feature vector;
[0030] The human-vehicle interaction features are input into the fourth GRU module to obtain the output of the fourth GRU module;
[0031] The output of the fourth GRU module is input into the second attention module to obtain the second intent feature vector;
[0032] The first intent feature vector and the second intent feature vector are concatenated and then passed through the third attention module and the fully connected layer in sequence to obtain the pedestrian crossing intent detection result.
[0033] A pedestrian intent detection device based on pedestrian characteristics and human-vehicle interaction includes a memory, a processor, and a program stored in the memory, characterized in that the processor implements the method described above when executing the program.
[0034] A storage medium having a program stored thereon, which, when executed, implements the method described above.
[0035] Compared with the prior art, the present invention has the following beneficial effects:
[0036] 1. By integrating a wider range of pedestrian and environmental data, it is possible to better predict pedestrians' crossing intentions and effectively improve pedestrian safety in traffic scenarios.
[0037] 2. In current traffic scenarios, autonomous driving is used more and more frequently. Studying pedestrians' crossing intentions can help anticipate their behavior and improve the riding experience of autonomous driving.
Attached Figure Description
[0038] Figure 1 is a schematic diagram of the main steps of the method of the present invention;
[0039] Figure 2 is a schematic diagram of the results of pedestrian key point extraction;
[0040] Figure 3 is a schematic diagram illustrating the process of obtaining pedestrian behavioral characteristics;
[0041] Figure 4 is a schematic diagram of a scene segmentation map;
[0042] Figure 5 is a schematic diagram of a depth estimation map.
Detailed Implementation
[0043] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.
[0044] A pedestrian intent detection method based on pedestrian characteristics and human-vehicle interaction, as shown in Figure 1, includes:
[0045] (I) Obtaining driving recording video from vehicle-mounted cameras;
[0046] The vehicle-mounted camera equipment includes one or more vehicle-mounted camera devices, each of which includes at least one camera. For example, in some embodiments, video captured by a vehicle-mounted dashcam can be used, which is easier to obtain. In this embodiment, however, the video obtained is a forward-facing video synthesized from eight vehicle-mounted cameras.
[0047] (II) Obtaining pedestrian behavior characteristics
[0048] (2-1) First, perform person detection on the dashcam video to obtain pedestrian detection results, specifically including:
[0049] Human detection is performed on each frame of the dashcam video to obtain the pedestrians contained in each frame and the positional features of each pedestrian;
[0050] The pedestrians detected in each frame are tracked, and the pedestrians contained in each frame are merged to obtain the trajectory of each pedestrian.
[0051] This process is implemented using the Yolo v5 + Byte technical framework. A pedestrian detection framework based on Yolo v5 is built. To reduce the impact of pedestrian distortion from the vehicle's viewing perspective, the DIOU_Loss loss function used by Yolo v5 is replaced with SIOU_Loss, which additionally considers the vector angle between the ground truth bounding box and the predicted bounding box. The framework outputs the pedestrian bounding box coordinates. Specifically, SIOU_Loss is:
[0052]-[0060]
$$L_{box} = 1 - IoU + \frac{\Delta + \Omega}{2}$$
$$\Lambda = 1 - 2\sin^2\left(\arcsin(x) - \frac{\pi}{4}\right)$$
$$x = \frac{\left|b_{cy}^{gt} - b_{cy}\right|}{\sigma} = \sin(\alpha)$$
$$\sigma = \sqrt{\left(b_{cx}^{gt} - b_{cx}\right)^2 + \left(b_{cy}^{gt} - b_{cy}\right)^2}$$
$$\Delta = \sum_{k \in \{x, y\}} \left(1 - e^{-\gamma \rho_k}\right), \quad \gamma = 2 - \Lambda$$
$$\rho_x = \left(\frac{b_{cx}^{gt} - b_{cx}}{c_w}\right)^2, \quad \rho_y = \left(\frac{b_{cy}^{gt} - b_{cy}}{c_h}\right)^2$$
$$\Omega = \sum_{k \in \{w, h\}} \left(1 - e^{-\omega_k}\right)^{\theta}$$
$$\omega_w = \frac{\left|w - w^{gt}\right|}{\max\left(w, w^{gt}\right)}, \quad \omega_h = \frac{\left|h - h^{gt}\right|}{\max\left(h, h^{gt}\right)}$$
[0061] θ = 4
[0062] Where: $L_{box}$ represents the SIOU_Loss loss function, Λ represents the angle loss, Ω represents the shape loss, and Δ represents the distance loss. σ represents the distance between the center of the detection box and the center of the ground truth box. $c_w$ and $c_h$ represent the extents along the x-axis and the y-axis, respectively, of the smallest box enclosing the ground truth box and the detection box. $\rho_x$ and $\rho_y$ are intermediate quantities, as are γ, $\omega_w$ and $\omega_h$. x is an intermediate quantity representing the value of sin(α). $b_{cx}^{gt}$ and $b_{cx}$ represent the x-axis coordinates of the centers of the ground truth box and the detection box, and $b_{cy}^{gt}$ and $b_{cy}$ represent the corresponding y-axis coordinates. IoU represents the ratio of the intersection to the union of the detection box and ground truth box areas. w and h represent the width and height of the detection box, and $w^{gt}$ and $h^{gt}$ represent the width and height of the ground truth box.
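The loss terms defined above can be illustrated with a short sketch. It follows the commonly published SIoU formulation for a single pair of axis-aligned boxes given as (x1, y1, x2, y2); the function name `siou_loss` and the small `eps` stabiliser are assumptions for illustration and are not taken from the patent.

```python
import math


def siou_loss(pred, gt, theta=4.0, eps=1e-9):
    """SIoU box loss for one detection box and one ground truth box (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    w, h = px2 - px1, py2 - py1
    w_gt, h_gt = gx2 - gx1, gy2 - gy1

    # IoU: intersection area over union area
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (w * h + w_gt * h_gt - inter + eps)

    # Centre offsets and the extents of the smallest enclosing box
    dx = (gx1 + gx2) / 2 - (px1 + px2) / 2
    dy = (gy1 + gy2) / 2 - (py1 + py2) / 2
    c_w = max(px2, gx2) - min(px1, gx1)
    c_h = max(py2, gy2) - min(py1, gy1)

    # Angle loss: x = sin(alpha), zero when the centres coincide
    sigma = math.hypot(dx, dy)
    x = min(1.0, abs(dy) / sigma) if sigma > 0 else 0.0
    angle = 1 - 2 * math.sin(math.asin(x) - math.pi / 4) ** 2

    # Distance loss, weighted by the angle loss through gamma
    gamma = 2 - angle
    rho_x = (dx / (c_w + eps)) ** 2
    rho_y = (dy / (c_h + eps)) ** 2
    dist = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # Shape loss
    omega_w = abs(w - w_gt) / (max(w, w_gt) + eps)
    omega_h = abs(h - h_gt) / (max(h, h_gt) + eps)
    shape = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    return 1 - iou + (dist + shape) / 2
```

For identical boxes the loss is approximately zero, and it grows as the detection box drifts away from the ground truth box.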
[0063] The output pedestrian bounding box is the bounding rectangle of the pedestrian, which can also be used as the pedestrian's positional feature. Inputting the data sequence of the pedestrian bounding box changing over time into the GRU model can effectively capture and model the pedestrian's movement trajectory over time, showing the pedestrian's positional movement in the time dimension, and providing a basis for subsequent pedestrian crossing intention prediction. Of course, in some other embodiments, other methods can also be used to represent the pedestrian's positional features.
[0064] Then, the Kalman filter algorithm is used to estimate the pedestrian state based on the pedestrian position information detected in the current frame and predict the pedestrian position in the next frame. The detection result of the current frame is matched with the prediction result of the Kalman filter for the current frame in the previous time step to determine whether they are the same target.
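The predict-then-match cycle described above can be sketched with a constant-velocity Kalman filter on a single coordinate, for example the x-coordinate of the bounding box center. This is a minimal sketch: the class name `Kalman1D`, the noise values and the initial covariance are assumptions, and a practical tracker would filter the full box state rather than one coordinate.

```python
class Kalman1D:
    """Constant-velocity Kalman filter on one coordinate (state: position, velocity)."""

    def __init__(self, pos, q=0.01, r=1.0):
        self.p, self.v = float(pos), 0.0
        self.P = [[1.0, 0.0], [0.0, 10.0]]  # state covariance (assumed initial values)
        self.q, self.r = q, r               # process / measurement noise (assumed)

    def predict(self, dt=1.0):
        """Propagate the state one step and return the predicted position."""
        (a, b), (c, d) = self.P
        self.p += self.v * dt
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = diag(q, q)
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, z):
        """Correct the state with a measured position z."""
        (a, b), (c, d) = self.P
        s = a + self.r          # innovation variance
        k0, k1 = a / s, c / s   # Kalman gain for position and velocity
        y = z - self.p          # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]


def predict_next(measurements):
    """Filter a sequence of per-frame positions and predict the next frame's position."""
    kf = Kalman1D(measurements[0])
    for z in measurements[1:]:
        kf.predict()
        kf.update(z)
    return kf.predict()
```

The predicted position for the next frame is what a detection in that frame would be compared against when deciding whether it belongs to the same target.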
[0065] The algorithm uses data association to determine whether each pedestrian target belongs to an existing trajectory. High-scoring and low-scoring detection boxes for pedestrians are determined based on high and low score thresholds, respectively. Then, high-quality tracking trajectories are retained or newly created based on the similarity between the detection boxes and the tracking trajectory, and the positional overlap between the detection boxes and the trajectory, while low-scoring target trajectories are identified. Furthermore, in some embodiments, only high-scoring trajectories are used for subsequent pedestrian intent recognition.
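The two-stage association described above can be sketched as follows. This is a simplified illustration and not the Byte implementation itself: matching is greedy by IoU rather than an optimal assignment, the track boxes stand in for Kalman-predicted boxes, and the thresholds `high`, `low` and `iou_min` are assumed values.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def byte_associate(tracks, detections, high=0.6, low=0.1, iou_min=0.3):
    """tracks: {track_id: box}; detections: list of (box, score).

    Returns (matches, new_tracks): matches maps track_id -> detection index,
    new_tracks lists unmatched high-score detections that start new trajectories.
    """
    matches = {}
    free_tracks = set(tracks)

    def greedy(candidates):
        # Rank all (track, detection) pairs by overlap and accept the best first.
        pairs = sorted(((iou(tracks[t], detections[i][0]), t, i)
                        for t in free_tracks for i in candidates), reverse=True)
        used = set()
        for overlap, t, i in pairs:
            if overlap < iou_min:
                break
            if t in free_tracks and i not in used:
                matches[t] = i
                free_tracks.discard(t)
                used.add(i)
        return [i for i in candidates if i not in used]

    high_idx = [i for i, (_, s) in enumerate(detections) if s >= high]
    low_idx = [i for i, (_, s) in enumerate(detections) if low <= s < high]

    new_tracks = greedy(high_idx)  # stage 1: high-score boxes against all tracks
    greedy(low_idx)                # stage 2: low-score boxes against leftover tracks
    return matches, new_tracks
```

Low-score boxes can only extend an existing trajectory; they never start a new one, which is what keeps occluded pedestrians tracked without creating spurious trajectories.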
[0066] The pedestrian bounding box spatiotemporal data $P_{bb}$ consists, for each frame, of the coordinates of the top-left corner [x1, y1] and the bottom-right corner [x2, y2] of the bounding box. The bounding box matrix $P_{bb}$ has dimension m×4, where m is the observation time, that is, the number of frames observed to predict pedestrian intentions. t is defined as the decisive moment 0.5–4 seconds before the crossing event.
[0067] (2-2) Extract pedestrian key points from the obtained pedestrian detection results and save the key point heatmap during the key point extraction process;
[0068] The HRNet network is used to extract the pedestrian key points. The extraction results are shown in Figure 2; there are 17 pedestrian key points in total.
[0069] During the extraction process, a key point heatmap is generated. In the key point heatmap, each key point represents the area of probability distribution of that key point. The pixel value of each point in the area represents the probability of being selected as a key point. The higher the probability, the larger the pixel value.
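The relationship between a key point and its heatmap can be sketched with a synthetic example. In the method the heatmaps are produced by the key point network; here a Gaussian bump is generated by hand purely for illustration, and the function names and the `sigma` value are assumptions.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Synthetic heatmap whose pixel values peak at the key point (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def decode_keypoint(heatmap):
    """Return (x, y, value) of the highest-valued pixel, taken as the key point."""
    value, x, y = max((v, x, y) for y, row in enumerate(heatmap)
                      for x, v in enumerate(row))
    return x, y, value
```

Decoding keeps only the peak location, whereas the saved heatmap also retains how the probability is spread around it, which is the information passed to the second feature extraction network.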
[0070] Finally, the obtained pedestrian key point extraction results can be represented by a pedestrian key point coordinate set whose dimension is m×34, that is, 17 key points with two coordinates each for every one of the m frames.
[0071] (2-3) Determine the speed of each pedestrian based on the extracted pedestrian key points, specifically including:
[0072] Based on the left and right waist feature points of the pedestrian, the position of the midpoint between the two waist feature points is calculated;
[0073] The pedestrian's speed is obtained based on the position of the pedestrian's waist midpoint in at least two consecutive frames;
[0074] The left waist feature point and the right waist feature point are the key points numbered 12 and 11 in Figure 2, respectively. The pedestrian's speed $V_t$ is:
[0075] $$V_t = \frac{\sqrt{\left(x_{t+1} - x_t\right)^2 + \left(y_{t+1} - y_t\right)^2}}{\Delta t}$$
[0076] Where: $x_t$ and $y_t$ are the x-axis and y-axis coordinates of the waist midpoint in frame t, $x_{t+1}$ and $y_{t+1}$ are the x-axis and y-axis coordinates of the waist midpoint in frame t+1, and Δt is the time interval between frame t+1 and frame t.
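The speed calculation can be sketched as below. The sketch assumes key point coordinates in image pixels and a known frame interval, so the result is in pixels per second; the function names are hypothetical.

```python
import math


def waist_midpoint(left_waist, right_waist):
    """Midpoint of the left and right waist key points, each given as (x, y)."""
    return ((left_waist[0] + right_waist[0]) / 2,
            (left_waist[1] + right_waist[1]) / 2)


def pedestrian_speed(mid_t, mid_t1, dt):
    """Displacement of the waist midpoint between frame t and frame t+1, divided by dt."""
    return math.hypot(mid_t1[0] - mid_t[0], mid_t1[1] - mid_t[1]) / dt
```

For example, a midpoint that moves 3 pixels horizontally and 4 pixels vertically between two frames of a 30 frames-per-second video has a displacement of 5 pixels per frame, or about 150 pixels per second.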
[0077] (2-4) Input the extracted pedestrian key points into the first feature extraction network to obtain the first behavioral feature used to characterize the classification results of each pedestrian's motion state.
[0078] As shown in Figure 3, the first feature extraction network uses a spatiotemporal graph convolutional neural network. The resulting pedestrian motion state classification is one of standing, walking, or running, and is represented as a one-dimensional value.
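The spatial half of a spatiotemporal graph convolution can be sketched as a single propagation step over the skeleton graph. The edge list below is an assumed COCO-style skeleton with 0-based indices; the patent does not list the edges, and the numbering in Figure 2 may differ. Learned weights and the temporal convolution are omitted.

```python
import math

# Assumed COCO-style skeleton over 17 key points (0-based); not taken from the patent.
EDGES = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 6), (5, 7), (7, 9),
         (6, 8), (8, 10), (5, 11), (6, 12), (11, 12), (11, 13), (13, 15),
         (12, 14), (14, 16)]


def normalized_adjacency(n=17, edges=EDGES):
    """Symmetrically normalised adjacency D^(-1/2) (A + I) D^(-1/2)."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def spatial_graph_conv(x, adj):
    """One propagation step adj @ x; x holds one feature row per key point."""
    n, c = len(x), len(x[0])
    return [[sum(adj[i][k] * x[k][ch] for k in range(n)) for ch in range(c)]
            for i in range(n)]
```

Each key point's output mixes its own features with those of the joints it is connected to, which is how the network captures body posture before the temporal layers model motion across frames.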
[0079] (2-5) Input the key point heatmap into the second feature extraction network to obtain the second behavioral features.
[0080] As shown in Figure 3, the second feature extraction network uses a 3D residual network, and the extracted result is a one-dimensional vector.
[0081] (2-6) The first behavioral feature and the second behavioral feature are concatenated to obtain the pedestrian behavioral feature;
[0082] (III) Scene segmentation and depth estimation are performed on the driving video to obtain scene segmentation map and depth estimation map. The scene segmentation map and depth estimation map are then stitched together and input into the third feature extraction network. Combined with the vehicle speed, the human-vehicle interaction features of each frame of the driving video are obtained.
[0083] Specifically, in this embodiment, the input video frames are first semantically segmented using the Segment Anything large model to obtain the spatial distribution of moving and static objects in the scene. A scene segmentation map is then obtained by representing the global context with pixel-level semantic masks, as shown in Figure 4. It is denoted $E_{sc} = \{sc_{t-m}, sc_{t-m+1}, …, sc_t\}$, and its dimensions are m×224×224.
[0084] Next, the Depth Anything model is used to estimate depth for all categories in the scene, yielding a depth estimation map of the distance of all objects in the scene from the vehicle, as shown in Figure 5. It is denoted $E_{cd} = \{cd_{t-m}, cd_{t-m+1}, …, cd_t\}$, and its dimensions are m×224×224.
[0085] Next, a CNN is constructed from the VGG19 model pre-trained on the ImageNet dataset. The scene segmentation maps and depth estimation maps obtained in the previous steps are arranged as a 4D array with dimensions [m, 224, 224, 3] and used as its input. For each image, the feature map is taken from the fourth max pooling layer of VGG19, with a size of [512, 14, 14].
[0086] Each feature map is then average-pooled with a 14×14 kernel, and the results are flattened and concatenated to generate a feature vector of size [16, 256].
[0087] Finally, the feature vectors based on the global context and global depth heatmap are combined with the vehicle speed to generate human-vehicle interaction features.
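The pooling and fusion steps above can be sketched per frame as follows. A 14×14 average pooling over a 14×14 feature map reduces each channel to its mean, which is what `global_avg_pool` computes. Appending the vehicle speed by concatenation is an assumption, since the text only states that the features are combined with the vehicle speed, and the sketch does not reproduce the [16, 256] layout.

```python
def global_avg_pool(feature_map):
    """Average each channel of a C x H x W nested list down to one value."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


def interaction_feature(seg_features, depth_features, vehicle_speed):
    """Concatenate pooled segmentation and depth features with the vehicle speed.

    Concatenation is an assumed way of combining the three inputs.
    """
    return (global_avg_pool(seg_features)
            + global_avg_pool(depth_features)
            + [float(vehicle_speed)])
```

Applying `interaction_feature` to every observed frame gives the sequence of human-vehicle interaction features that the intent classification network consumes.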
[0088] (IV) The pedestrian detection results, pedestrian speed, pedestrian behavior characteristics, and human-vehicle interaction characteristics of the corresponding frame are input into the intent classification network to obtain the pedestrian crossing intent detection results.
[0089] The intent classification network includes a first GRU module, a second GRU module, a third GRU module, a fourth GRU module, a first attention module, a second attention module, a third attention module, and a fully connected layer;
[0090] The pedestrian detection results, pedestrian speed, pedestrian behavior features, and corresponding frame human-vehicle interaction features are input into the intent classification network to obtain the pedestrian crossing intent detection results, including:
[0091] The pedestrian behavior features are input into the first GRU module, and the output of the first GRU module is obtained.
[0092] The output of the first GRU module and the pedestrian's speed are input into the second GRU module to obtain the output of the second GRU module;
[0093] The output of the second GRU module is combined with the pedestrian's position features and input into the third GRU module to obtain the output of the third GRU module.
[0094] The output of the third GRU module is input into the first attention module to obtain the first intent feature vector;
[0095] The human-vehicle interaction features are input into the fourth GRU module to obtain the output of the fourth GRU module;
[0096] The output of the fourth GRU module is input into the second attention module to obtain the second intent feature vector;
[0097] The first intent feature vector and the second intent feature vector are concatenated and then passed through the third attention module and the fully connected layer in sequence to obtain the pedestrian crossing intent detection result.
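The data flow through the four GRU modules, the three attention modules and the fully connected layer can be sketched with untrained, randomly initialised weights. Everything below is illustrative: the hidden size, the bias-free GRU, the use of the last hidden state as the attention query, and the softmax gate standing in for the third attention module are all assumptions, since the patent does not specify these details.

```python
import math
import random


def _matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def _sigmoid(x):
    return 1 / (1 + math.exp(-x))


class GRU:
    """Minimal bias-free GRU; returns the hidden state at every time step."""

    def __init__(self, in_dim, hid, rng):
        self.hid = hid
        # One weight matrix per gate, acting on [input ; hidden]
        self.Wz, self.Wr, self.Wh = (_matrix(hid, in_dim + hid, rng) for _ in range(3))

    def __call__(self, seq):
        h, out = [0.0] * self.hid, []
        for x in seq:
            z = [_sigmoid(a) for a in _matvec(self.Wz, x + h)]
            r = [_sigmoid(a) for a in _matvec(self.Wr, x + h)]
            gated_h = [ri * hi for ri, hi in zip(r, h)]
            cand = [math.tanh(a) for a in _matvec(self.Wh, x + gated_h)]
            h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
            out.append(h)
        return out


def attention(seq, query):
    """Softmax-weighted sum of the time steps, scored by dot product with query."""
    scores = [sum(a * b for a, b in zip(h, query)) for h in seq]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * h[d] for w, h in zip(weights, seq))
            for d in range(len(seq[0]))]


def classify(behavior, speed, boxes, interaction, hid=8, seed=0):
    """Return a crossing probability from per-frame inputs (untrained weights)."""
    rng = random.Random(seed)
    gru1 = GRU(len(behavior[0]), hid, rng)
    gru2 = GRU(hid + 1, hid, rng)
    gru3 = GRU(hid + 4, hid, rng)
    gru4 = GRU(len(interaction[0]), hid, rng)

    h1 = gru1(behavior)                                   # behavior features
    h2 = gru2([h + [v] for h, v in zip(h1, speed)])       # + pedestrian speed
    h3 = gru3([h + list(b) for h, b in zip(h2, boxes)])   # + bounding box position
    first_intent = attention(h3, h3[-1])                  # first attention module

    h4 = gru4(interaction)                                # interaction features
    second_intent = attention(h4, h4[-1])                 # second attention module

    fused = first_intent + second_intent                  # concatenation
    top = max(fused)
    gate = [math.exp(a - top) for a in fused]             # third attention (assumed gate)
    total = sum(gate)
    gated = [a * g / total for a, g in zip(fused, gate)]

    w_fc = [rng.uniform(-0.1, 0.1) for _ in gated]        # fully connected layer
    return _sigmoid(sum(a * b for a, b in zip(gated, w_fc)))
```

The stacked arrangement feeds the behavior features in first and adds speed and position at later GRU stages, while the interaction branch is encoded separately and only merged after its own attention module.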
[0098] Finally, experiments were conducted to validate the results. Approximately 50% (880 samples) of the dataset was selected for training, 40% (719 samples) for testing, and 10% (243 samples) for validation. The dataset included vehicle speed, heading, and GPS coordinates. The overall experimental hyperparameters are shown in Table 1.
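The stated proportions are approximate, which can be checked directly from the sample counts given above. The helper name `split_fractions` is hypothetical.

```python
def split_fractions(train, test, val):
    """Return the total sample count and each split's share in percent."""
    total = train + test + val
    return total, [round(100 * n / total, 1) for n in (train, test, val)]
```

With 880, 719 and 243 samples the total is 1842, and the shares come to about 47.8%, 39.0% and 13.2%, consistent with the rounded figures of 50%, 40% and 10%.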
[0099] Table 1
[0100]
[0101] A comparative experiment was conducted, with Comparative Example 1 using only pedestrian behavior features for detection and Comparative Example 2 using only human-vehicle interaction features for detection. The results are shown in Table 2.
[0102] Table 2
[0103]
[0104] The results show that pedestrian behavior depth features and human-vehicle interaction features both influence pedestrians' intention to cross the street.
[0105] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
Claims
1. A method for pedestrian intent detection based on pedestrian characteristics and human-vehicle interaction, characterized in that it comprises: acquiring driving recording video from vehicle-mounted cameras; performing person detection on the driving recording video to obtain pedestrian detection results; extracting pedestrian key points from the obtained pedestrian detection results and saving the key point heatmap during the key point extraction process; determining the speed of each pedestrian based on the extracted pedestrian key points; inputting the extracted pedestrian key points into a first feature extraction network to obtain a first behavioral feature used to characterize the classification result of each pedestrian's motion state; inputting the key point heatmap into a second feature extraction network to obtain a second behavioral feature; concatenating the first behavioral feature and the second behavioral feature to obtain the pedestrian behavioral feature; performing scene segmentation and depth estimation on the driving recording video to obtain a scene segmentation map and a depth estimation map, concatenating the scene segmentation map and the depth estimation map and inputting them into a third feature extraction network, and, combined with the vehicle speed, obtaining the human-vehicle interaction features of each frame of the driving recording video; and inputting the pedestrian detection results, the pedestrian speed, the pedestrian behavioral feature, and the human-vehicle interaction features of the corresponding frames into an intent classification network to obtain the pedestrian crossing intent detection result.
2. The pedestrian intent detection method based on pedestrian characteristics and human-vehicle interaction according to claim 1, characterized in that the vehicle-mounted camera equipment includes one or more vehicle-mounted camera devices, and each vehicle-mounted camera device includes at least one camera.
3. The pedestrian intent detection method based on pedestrian characteristics and human-vehicle interaction according to claim 1, characterized in that the process of performing person detection on the driving recording video to obtain pedestrian detection results specifically includes: performing person detection on each frame of the driving recording video to obtain the pedestrians contained in each frame and the positional features of each pedestrian; and tracking the pedestrians detected in each frame and merging the pedestrians contained in each frame to obtain the trajectory of each pedestrian.
4. The pedestrian intent detection method based on pedestrian characteristics and human-vehicle interaction according to claim 3, characterized in that the positional feature of the pedestrian is specifically the bounding rectangle of the pedestrian.
5. The pedestrian intent detection method based on pedestrian characteristics and human-vehicle interaction according to claim 1, characterized in that determining the speed of each pedestrian based on the extracted pedestrian key points specifically includes: calculating, based on the left and right waist feature points of the pedestrian, the position of the midpoint between the two waist feature points; and obtaining the pedestrian's speed based on the position of the pedestrian's waist midpoint in at least two consecutive frames.
6. The pedestrian intent detection method based on pedestrian characteristics and human-vehicle interaction according to claim 1, characterized in that the pedestrian motion state classification results include standing, walking, and running.
7. The pedestrian intent detection method based on pedestrian characteristics and human-vehicle interaction according to claim 1, characterized in that the HRNet network is used in the process of extracting pedestrian key points.
8. The pedestrian intent detection method based on pedestrian characteristics and human-vehicle interaction according to claim 3, characterized in that the intent classification network includes a first GRU module, a second GRU module, a third GRU module, a fourth GRU module, a first attention module, a second attention module, a third attention module, and a fully connected layer; and the process of inputting the pedestrian detection results, pedestrian speed, pedestrian behavior features, and human-vehicle interaction features of the corresponding frames into the intent classification network to obtain the pedestrian crossing intent detection result includes: inputting the pedestrian behavior features into the first GRU module to obtain the output of the first GRU module; inputting the output of the first GRU module and the pedestrian's speed into the second GRU module to obtain the output of the second GRU module; combining the output of the second GRU module with the pedestrian's positional features and inputting them into the third GRU module to obtain the output of the third GRU module; inputting the output of the third GRU module into the first attention module to obtain the first intent feature vector; inputting the human-vehicle interaction features into the fourth GRU module to obtain the output of the fourth GRU module; inputting the output of the fourth GRU module into the second attention module to obtain the second intent feature vector; and concatenating the first intent feature vector and the second intent feature vector and passing the result through the third attention module and the fully connected layer in sequence to obtain the pedestrian crossing intent detection result.
9. A pedestrian intent detection device based on pedestrian characteristics and human-vehicle interaction, comprising a memory, a processor, and a program stored in the memory, characterized in that, when the processor executes the program, it implements the method as described in any one of claims 1-8.
10. A storage medium having a program stored thereon, characterized in that, when the program is executed, it implements the method as described in any one of claims 1-8.