Human action recognition method and device, storage medium and vehicle

By combining RGB acquisition devices and depth cameras to acquire images in the vehicle, and then performing fusion processing and deep neural network analysis, the problem of the limited number of sensors in the cabin was solved. This enabled accurate human motion recognition and interactive modeling under both normal and low light conditions, thus improving the accuracy of detection.

CN115578720B · Active · Publication Date: 2026-06-23 · ZHEJIANG ZEEKR INTELLIGENT TECH CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHEJIANG ZEEKR INTELLIGENT TECH CO LTD
Filing Date
2022-11-08
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Given the limited space inside the vehicle cabin and the limited number of sensors, existing technologies cannot accurately identify human movements under normal and low light conditions. Furthermore, the cabin sensors cannot operate in low light environments, resulting in low detection accuracy and an inability to effectively define the interaction between humans and objects.

Method used

The method combines RGB acquisition devices and depth cameras to acquire RGB images and depth images. The images are then fused to obtain a fused image. A deep neural network is used for target detection and human skeleton key point recognition to obtain the three-dimensional coordinate information of the detected target and human skeleton key points. The relative position information is then combined to identify the human action category.

Benefits of technology

It can accurately detect human movement categories under both normal and low light conditions, improving detection accuracy, and enhances detection reliability through human-object interaction modeling.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a human action recognition method and device, a storage medium and a vehicle. The method is applied to the vehicle, and comprises the following steps: acquiring an RGB image and a depth image; performing fusion processing on the RGB image and the depth image to obtain a fusion image; performing target detection on the fusion image to obtain detection frame information of a detection target, wherein the detection target comprises a human body and an object; obtaining three-dimensional coordinate information of the detection target according to the detection frame information and the depth image; performing human skeleton key point detection on the depth image to obtain three-dimensional coordinate information of human skeleton key points; and recognizing a human action category according to the three-dimensional coordinate information of the detection target and the three-dimensional coordinate information of the human skeleton key points. The human action recognition method can accurately detect the human action category under normal light and dark light conditions, and the detection accuracy is improved by using human and object interaction modeling.

Description

Technical Field

[0001] This invention relates to the field of vehicle technology, and in particular to a method, apparatus, storage medium, and vehicle for human motion recognition.

Background Technology

[0002] Human motion recognition has a wide range of applications, such as in car cabins, including behavior recognition and hazardous action detection. However, in cabin environments, accurately recognizing postures and detecting objects using a limited number of sensors has become a major challenge for the industry.

[0003] Action recognition is a technique for identifying different actions or behaviors from a video sequence input; it can be viewed as a classification task of video sequences. Action recognition methods are divided into traditional methods and deep learning methods. Traditional methods use manually designed computer vision algorithms to extract features from the video, and after processing, traditional machine learning classifiers, such as SVM (Support Vector Machine), are used to classify the extracted features to obtain the final action recognition result. Deep learning-based methods include single-stream methods, two-stream methods, and skeleton-based action recognition methods. Single-stream methods take continuous RGB color video frames as input, two-stream methods take one temporal stream and one spatial stream as input, and skeleton-based methods take human skeleton coordinates as input.

[0004] In related technologies, deep learning-based methods are the mainstream. For example, patent CN113486759A, "Method and Device for Identifying Dangerous Actions, Electronic Equipment and Storage Medium," employs a single-stream method based on deep learning. It classifies each frame of RGB image data collected inside the vehicle cabin to identify dangerous actions within the cabin. Patent CN110399767A, "Method and Device for Identifying Dangerous Actions of Personnel Inside a Vehicle, Electronic Equipment and Storage Medium," also uses a single-stream method based on deep learning. It divides each frame of image data in the video stream into regions, extracts features such as objects and body parts within specific regions, and finally classifies actions according to preset rules or correspondences to determine whether they are dangerous actions.

[0005] The drawbacks of the above methods are that, given the limited space inside the vehicle cabin and the restricted number of sensors, they cannot effectively model people or objects. Furthermore, ordinary cameras inside the cabin cannot operate in low-light conditions. In-cabin 2D cameras cannot accurately perceive the relative positions of people or objects within the cabin, resulting in low accuracy. The interaction between people and objects inside the cabin is also poorly defined.

Summary of the Invention

[0006] One objective of this invention is to propose a method for human motion recognition that can accurately detect human motion categories under both normal and low light conditions, and improves detection accuracy by utilizing human-object interaction modeling.

[0007] To achieve the above objectives, a first aspect of the present invention provides a method for human action recognition, applied to a vehicle. The method includes: acquiring an RGB image and a depth image; fusing the RGB image and the depth image to obtain a fused image; performing target detection on the fused image to obtain detection bounding box information of the detected targets, wherein the detected targets include human bodies and objects; obtaining three-dimensional coordinate information of the detected targets based on the detection bounding box information and the depth image; performing human skeletal keypoint detection on the depth image to obtain three-dimensional coordinate information of the human skeletal keypoints; and identifying the human action category based on the three-dimensional coordinate information of the detected targets and the three-dimensional coordinate information of the human skeletal keypoints.

[0008] In addition, the human motion recognition method proposed in the above embodiments of the present invention may also have the following additional technical features:

[0009] According to an embodiment of the present invention, the fusion processing of the RGB image and the depth image includes: obtaining the intrinsic and extrinsic parameters of the RGB acquisition device to obtain a first intrinsic and extrinsic parameter, and obtaining the intrinsic and extrinsic parameters of the depth camera to obtain a second intrinsic and extrinsic parameter; obtaining the geometric positional relationship between the RGB acquisition device and the depth camera; and performing fusion processing on the RGB image and the depth image according to the first intrinsic and extrinsic parameter, the second intrinsic and extrinsic parameter, and the geometric positional relationship.

[0010] According to one embodiment of the present invention, the detection box information includes two-dimensional coordinate information of the detected target, and the step of obtaining three-dimensional coordinate information of the detected target based on the detection box information and the depth image includes: obtaining the correspondence between the fused image and the depth image; obtaining the depth information of the detected target based on the two-dimensional coordinate information of the detected target and the correspondence; and obtaining the three-dimensional coordinate information of the detected target based on the two-dimensional coordinate information and the depth information.

[0011] According to one embodiment of the present invention, a key point detection model is used to detect human skeleton key points in the depth image. The number of human skeleton key points is N. The key point detection model includes N feature layers. The N feature layers are used to regress the depth information of the key points, where N is an integer greater than 1.

[0012] According to one embodiment of the present invention, the key point detection model further includes a fully connected layer, and the N feature layers are connected to the fully connected layer. The fully connected layer is used to regress the three-dimensional coordinate information of the human skeleton key points based on the depth information of the key points.

[0013] According to one embodiment of the present invention, identifying the human action category based on the three-dimensional coordinate information of the detected target and the three-dimensional coordinate information of the human skeletal key points includes: using a skeletal key point action recognition network to obtain a preliminary classification structure of the human action based on the sequence of three-dimensional coordinate information of the human skeletal key points; obtaining the relative position information of the human skeletal key points and the object based on the three-dimensional coordinate information of the detected target and the three-dimensional coordinate information of the human skeletal key points; and obtaining the human action category based on the relative position information and the preliminary classification structure.

[0014] According to one embodiment of the present invention, the object includes at least one of the vehicle's steering wheel, door handle, window, and center console, and the human body action category includes at least one of the following: whether the driver's hands have left the steering wheel, whether the passenger's hands have approached the door handle, whether the passenger's or driver's hands have extended out of the window, and whether the passenger's or driver's hands have touched the center console.

[0015] To achieve the above objectives, a second aspect of the present invention provides a human motion recognition device applied to a vehicle. The device includes: an acquisition module for acquiring the RGB image and the depth image; a fusion module for fusing the RGB image and the depth image to obtain a fused image; a detection module for performing target detection on the fused image to obtain detection bounding box information of the detected targets, wherein the detected targets include human bodies and objects, and obtaining three-dimensional coordinate information of the detected targets based on the detection bounding box information and the depth image, and performing human skeletal keypoint detection on the depth image to obtain three-dimensional coordinate information of the human skeletal keypoints; and a recognition module for recognizing the human motion category based on the three-dimensional coordinate information of the detected targets and the three-dimensional coordinate information of the human skeletal keypoints.

[0016] To achieve the above objectives, a third aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, it implements the human motion recognition method as described above.

[0017] To achieve the above objectives, a fourth aspect of the present invention provides a vehicle including the human motion recognition device as described above.

[0018] The human motion recognition method, apparatus, storage medium, and vehicle of this invention acquire RGB images using an RGB acquisition device and depth images using a depth camera. The RGB and depth images are then fused. The fused image is input into a deep neural network model for target detection, obtaining detection bounding box information for the detected target. Based on the detection bounding box information and the depth image, the three-dimensional coordinate information of the detected target is obtained. Human skeletal keypoint detection is performed on the depth image to obtain the three-dimensional coordinate information of the human skeletal keypoints. Based on the three-dimensional coordinate information of the detected target and the three-dimensional coordinate information of the human skeletal keypoints, the human motion category is identified. This human motion recognition method can accurately detect human motion categories under both normal and low-light conditions, and improves detection accuracy by utilizing human-object interaction modeling.

Attached Figure Description

[0019] Figure 1 is a flowchart of a human motion recognition method according to an embodiment of the present invention;

[0020] Figure 2 is a flowchart of the fusion of RGB images and depth images according to an embodiment of the present invention;

[0021] Figure 3 is a schematic diagram of a deep neural network model according to an embodiment of the present invention;

[0022] Figure 4 is a flowchart of obtaining the three-dimensional coordinate information of a detected target according to an embodiment of the present invention;

[0023] Figure 5 is a schematic diagram of a human skeleton key point detection model according to an embodiment of the present invention;

[0024] Figure 6 is a schematic diagram of the feature layer in a human skeleton key point detection model according to an embodiment of the present invention;

[0025] Figure 7 is a flowchart of identifying human motion categories based on the three-dimensional coordinate information of the detected target and the three-dimensional coordinate information of the human skeleton key points according to an embodiment of the present invention;

[0026] Figure 8 is a schematic diagram of the structure of a human motion recognition device according to an embodiment of the present invention;

[0027] Figure 9 is a schematic diagram of the structure of a vehicle according to an embodiment of the present invention.

Detailed Implementation

[0028] Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain the present invention, and should not be construed as limiting the present invention.

[0029] The method, apparatus, storage medium, and vehicle for human motion recognition according to embodiments of the present invention will now be described in detail with reference to the accompanying drawings and specific implementation methods.

[0030] Figure 1 is a flowchart of a human motion recognition method according to an embodiment of the present invention.

[0031] In embodiments of the present invention, the human motion recognition method is applied to a vehicle. As shown in Figure 1, the method includes:

[0032] S1, acquire an RGB image and a depth image.

[0033] Specifically, RGB images can be obtained by an RGB acquisition device, and depth images by a depth camera. However, due to the limited space inside the vehicle cabin, the number of cameras inside is also limited. To better model the human body and objects inside the vehicle cabin, this invention can install both an RGB acquisition device and a depth camera in the vehicle. The RGB acquisition device acquires RGB images of the vehicle cabin. The depth camera can be an iToF camera, which acquires depth images of the vehicle cabin. Combining the RGB acquisition device and the depth camera ensures that the vehicle can accurately detect human movements under both normal and low-light conditions.

[0034] More specifically, before acquiring the RGB images from the RGB acquisition device and the depth images from the depth camera, the positions of the RGB acquisition device and the depth camera need to be fixed. The camera's fixed position inside the cabin needs to be adjusted according to the camera's parameters, ideally ensuring that the camera's field of view covers the areas to be detected within the cabin, including the driver and objects near the driver. The acquired RGB images from the RGB acquisition device and the depth images from the depth camera need to be time-aligned. After acquiring the RGB images from the RGB acquisition device and the depth images from the depth camera, data fusion processing is performed on the RGB images and the depth images.
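The time alignment mentioned above can be illustrated with a minimal sketch. It assumes each frame carries a capture timestamp in seconds and pairs every RGB frame with the nearest depth frame; the function name `pair_frames`, the 20 ms tolerance, and the sample timestamps are illustrative assumptions, not details from the patent.

```python
from bisect import bisect_left

def pair_frames(rgb_ts, depth_ts, max_gap=0.02):
    """Pair each RGB timestamp with the nearest depth timestamp (seconds).

    Both lists must be sorted ascending. Pairs whose gap exceeds max_gap
    are dropped. Returns a list of (rgb_index, depth_index) tuples.
    """
    pairs = []
    for i, t in enumerate(rgb_ts):
        j = bisect_left(depth_ts, t)
        # Candidates: the depth frame just before and just after t.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(depth_ts)]
        if not candidates:
            continue
        best = min(candidates, key=lambda k: abs(depth_ts[k] - t))
        if abs(depth_ts[best] - t) <= max_gap:
            pairs.append((i, best))
    return pairs

# Hypothetical timestamps: the third RGB frame has no depth frame close enough.
rgb_ts = [0.000, 0.033, 0.066, 0.100]
depth_ts = [0.001, 0.034, 0.110]
pairs = pair_frames(rgb_ts, depth_ts)
```

Unpaired frames are simply skipped here; a real system might instead use hardware triggering so that both sensors expose at the same instant.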

[0035] S2, perform fusion processing on the RGB image and the depth image to obtain a fused image.

[0036] Specifically, an RGB acquisition device acquires RGB images, a depth camera acquires depth images, and the RGB and depth images are fused to obtain a fused image. Before fusing the RGB and depth images, the RGB acquisition device and the depth camera need to be calibrated.

[0037] In embodiments of the present invention, as shown in Figure 2, the fusion processing of the RGB image and the depth image includes:

[0038] S21, acquire the intrinsic and extrinsic parameters of the RGB acquisition device to obtain the first intrinsic and extrinsic parameters, and acquire the intrinsic and extrinsic parameters of the depth camera to obtain the second intrinsic and extrinsic parameters.

[0039] S22, acquire the geometric positional relationship between the RGB acquisition device and the depth camera.

[0040] S23, perform fusion processing on the RGB image and the depth image based on the first intrinsic and extrinsic parameters, the second intrinsic and extrinsic parameters, and the geometric positional relationship.

[0041] Specifically, the Zhang Zhengyou calibration method can be used to calibrate the RGB acquisition device and the depth camera. First, the intrinsic and extrinsic parameters of the RGB acquisition device are obtained according to the calibration algorithm to obtain the first intrinsic and extrinsic parameters, and the intrinsic and extrinsic parameters of the depth camera are obtained to obtain the second intrinsic and extrinsic parameters. The intrinsic and extrinsic parameters of the RGB acquisition device and the depth camera are then calculated using the following formula from the Zhang Zhengyou calibration method:

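[0042] $$ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = M_1 M_2 \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} $$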

[0043] Where s is the scale factor, (u, v) represents the pixel coordinates of a point in the image in the pixel coordinate system, (X, Y, Z) represents the physical coordinates of that point in the image in the world coordinate system, M1 represents the intrinsic parameter matrix of the camera, and M2 represents the extrinsic parameter matrix of the camera. By calibrating the RGB acquisition device and the depth camera, the corresponding intrinsic and extrinsic parameter matrices of the RGB acquisition device and the depth camera are calculated, thereby obtaining the transformation relationship between the world coordinate system and the pixel coordinate system in the RGB acquisition device and the depth camera.
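As a worked illustration of the relationship between world coordinates and pixel coordinates, the sketch below projects one point using the standard pinhole form s·[u, v, 1]ᵀ = M1·M2·[X, Y, Z, 1]ᵀ. The numeric values of M1 and M2 are hypothetical; in practice they would come from calibration.

```python
import numpy as np

# Hypothetical intrinsic matrix M1 (3x3): focal lengths and principal point in pixels.
M1 = np.array([[600.0, 0.0, 320.0],
               [0.0, 600.0, 240.0],
               [0.0, 0.0, 1.0]])

# Hypothetical extrinsic matrix M2 (3x4) = [R | t]: identity rotation and a
# 0.1 m translation along the camera x axis.
R = np.eye(3)
t = np.array([[0.1], [0.0], [0.0]])
M2 = np.hstack([R, t])

def project(world_xyz):
    """Project a world point (metres) to pixel coordinates; returns (u, v, s)."""
    Xh = np.append(np.asarray(world_xyz, dtype=float), 1.0)  # homogeneous coordinates
    p = M1 @ M2 @ Xh          # equals s * [u, v, 1]
    s = p[2]                  # scale factor, here the depth in the camera frame
    return p[0] / s, p[1] / s, s

u, v, s = project([0.2, -0.1, 1.5])
```

With these assumed matrices the point (0.2, -0.1, 1.5) lands at pixel (440, 200) with scale factor 1.5. Lens distortion, which calibration also estimates, is ignored in this sketch.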

[0044] More specifically, the geometric positional relationship between the RGB acquisition device and the depth camera is obtained. The RGB image and the depth image are then fused by combining the transformation relationship between the world coordinate system and the pixel coordinate system in the RGB acquisition device and the depth camera, and the geometric positional relationship between the RGB acquisition device and the depth camera, to obtain a fused image.

[0045] It should be noted that the geometric positional relationship between the RGB acquisition device and the depth camera is the geometric positional relationship between the RGB acquisition device and the depth camera in the world coordinate system. By transforming the world coordinate system and pixel coordinate system in the RGB acquisition device and the depth camera and the geometric positional relationship between the RGB acquisition device and the depth camera, the pixel correspondence between the RGB acquisition device and the depth camera can be obtained. Based on the pixel correspondence between the RGB acquisition device and the depth camera, the RGB image and the depth image are fused to obtain the fused image.
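One way to realise the pixel correspondence described above is sketched below: each depth pixel is back-projected to 3D, moved into the RGB camera frame, and projected into the RGB image to pick up its colour. The patent does not specify the layout of the fused image; stacking colour and depth into a four-channel array is an assumption made here for illustration, as are the function name `fuse_rgbd` and all parameter names.

```python
import numpy as np

def fuse_rgbd(rgb, depth, K_rgb, K_d, R, t):
    """Return an (H, W, 4) array: RGB colour sampled for each depth pixel, plus depth.

    rgb:   (Hr, Wr, 3) colour image
    depth: (H, W) depth in metres, 0 where invalid
    K_rgb, K_d: 3x3 intrinsic matrices of the RGB device and the depth camera
    R, t:  rotation (3x3) and translation (3,) from the depth frame to the RGB frame
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    z = depth.astype(float)
    # Back-project depth pixels to 3D points in the depth camera frame.
    x = (u - K_d[0, 2]) * z / K_d[0, 0]
    y = (v - K_d[1, 2]) * z / K_d[1, 1]
    pts = np.stack([x, y, z], axis=-1) @ R.T + t      # now in the RGB frame
    valid = (z > 0) & (pts[..., 2] > 0)
    zc = np.where(valid, pts[..., 2], 1.0)            # avoid division by zero
    # Project into the RGB image and round to the nearest pixel.
    ur = np.rint(K_rgb[0, 0] * pts[..., 0] / zc + K_rgb[0, 2]).astype(int)
    vr = np.rint(K_rgb[1, 1] * pts[..., 1] / zc + K_rgb[1, 2]).astype(int)
    valid &= (ur >= 0) & (ur < rgb.shape[1]) & (vr >= 0) & (vr < rgb.shape[0])
    fused = np.zeros((H, W, 4), dtype=float)
    fused[valid, :3] = rgb[vr[valid], ur[valid]]
    fused[..., 3] = z
    return fused

# Sanity check with two identical, co-located cameras: colours map to themselves.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
rgb = np.arange(4 * 4 * 3).reshape(4, 4, 3)
depth = np.ones((4, 4))
fused = fuse_rgbd(rgb, depth, K, K, np.eye(3), np.zeros(3))
```

Pixels with no valid depth, or that project outside the RGB image, keep zero colour in this sketch; occlusion handling is omitted.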

[0046] S3, perform target detection on the fused image to obtain the detection bounding box information of the detected targets, where the detected targets include human bodies and objects.

[0047] Specifically, object detection algorithms can be used to detect objects in the fused image, obtaining the bounding box information of the detected objects in the fused image. The object detection algorithm first identifies the objects in the fused image, and then determines the coordinates of the center point of the detected object and the length and width of the bounding box through the bounding box. Detected objects include human figures and other objects in the fused image.

[0048] It should be noted that the coordinates of the center point of the detected target determined by the detection box are two-dimensional coordinates, that is, the pixel coordinates of the center point of the detected target. This invention can utilize a deep neural network model to implement the target detection algorithm. As shown in Figure 3, the deep neural network model includes convolutional layers, pooling layers, upsampling layers, and fully connected layers. The fused image is input into the deep neural network model, which then uses an object detection algorithm to detect human figures and other objects in the fused image, outputting object detection bounding boxes.

[0049] Specifically, the target detection algorithm can be any two-dimensional target detection algorithm. This invention uses the YOLOv5 target detection algorithm for detection. The YOLOv5 network structure is shown in Figure 3. After the fused image is input, the deep neural network model performs a series of operations on it, including convolution, pooling, upsampling, and fully connected layers, and finally outputs two-dimensional target detection bounding boxes for the human body and objects. These bounding boxes include the two-dimensional pixel coordinates of the target's center point and the length and width of the detection box. By combining the bounding box information with the intrinsic and extrinsic parameters of the RGB acquisition device and the depth camera obtained from the calibration described above, the two-dimensional bounding box information of the detected targets can be reconstructed into three-dimensional coordinate information in the world coordinate system.

[0050] S4. Based on the detection box information and the depth image, obtain the three-dimensional coordinate information of the detected target.

[0051] Specifically, based on the detection bounding box information of the detected target in the fused image obtained by the target detection algorithm, and combined with the correspondence between the fused image and the depth image, the two-dimensional coordinate information of the detected target can be converted into three-dimensional coordinate information in the world coordinate system.

[0052] In embodiments of the present invention, the detection box information includes the two-dimensional coordinate information of the detected target. As shown in Figure 4, obtaining the three-dimensional coordinate information of the detected target based on the detection bounding box information and the depth image includes:

[0053] S41, obtain the correspondence between the fused image and the depth image.

[0054] S42, obtain the depth information of the detected target based on the two-dimensional coordinate information of the detected target and the correspondence.

[0055] S43, based on the two-dimensional coordinate information and depth information of the detected target, obtain the three-dimensional coordinate information of the detected target.

[0056] Specifically, based on the calibration results of the RGB acquisition device and the depth camera described above, the correspondence between the fused image and the depth image is obtained. Then, the depth information of the detected target can be obtained based on the two-dimensional coordinate information of the target and the correspondence between the fused image and the depth image. The detected targets include human bodies and objects in the fused image. Based on the depth information of the detected target and the two-dimensional information of the detected target obtained by the target detection algorithm, the detected target is restored to the world coordinate system to obtain the three-dimensional coordinate information of the detected target. The above target detection algorithm can be used to obtain the three-dimensional coordinates of objects within the field of view. For example, the positions of common objects in a car cabin can be measured in advance, including but not limited to the steering wheel, door handles, windows, and center console. After obtaining the position information of the objects in the car cabin, the three-dimensional coordinate information of the human skeleton keypoints is detected using a human skeleton keypoint detection algorithm.
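Steps S41 to S43 can be sketched as follows, assuming the depth image is already pixel-aligned with the fused image and the detection box is given in centre format. The median over a small window around the centre is an illustrative choice for handling holes in the depth data, not something the patent prescribes; `bbox_center_to_3d` and its parameters are hypothetical names.

```python
import numpy as np

def bbox_center_to_3d(box, depth, K, patch=2):
    """Convert a 2D detection box to a 3D point in the camera frame.

    box:   (cx, cy, w, h) in pixels, centre format
    depth: (H, W) depth image in metres, aligned with the fused image
    K:     3x3 intrinsic matrix
    patch: half-size of the window around the centre used for a median depth
    """
    cx, cy = int(round(box[0])), int(round(box[1]))
    H, W = depth.shape
    window = depth[max(cy - patch, 0):min(cy + patch + 1, H),
                   max(cx - patch, 0):min(cx + patch + 1, W)]
    valid = window[window > 0]
    if valid.size == 0:
        return None                       # no usable depth at this location
    z = float(np.median(valid))           # median is robust to holes and noise
    x = (cx - K[0, 2]) * z / K[0, 0]
    y = (cy - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])

# Hypothetical intrinsics and a flat depth map 0.8 m from the camera.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 0.8)
point = bbox_center_to_3d((420, 140, 60, 40), depth, K)
```

The result is expressed in the camera frame; applying the inverse of the extrinsic transform would move it into the world coordinate system used in the text.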

[0057] S5, perform human skeleton key point detection on the depth image to obtain the three-dimensional coordinate information of the human skeleton key points.

[0058] Specifically, human skeletal keypoint detection uses depth images as input. These images are then stitched together and fed into a deep neural network model. The 3D coordinates of the keypoints are regressed end-to-end. Conventional human skeletal keypoint detection models take RGB or grayscale images as input and output 2D coordinates. This invention, to improve the accuracy of cabin modeling, uses depth images acquired by a depth camera as input and modifies the deep neural network model to accurately output the 3D coordinates of the keypoints.

[0059] In an embodiment of the present invention, a key point detection model is used to detect human skeleton key points in a depth image. The number of human skeleton key points is N. The key point detection model includes N feature layers, which are used to regress the depth information of the key points, where N is an integer greater than 1.

[0060] Specifically, the depth image is input into the key point detection model to detect the human skeleton key points. The detection can use, for example, the hourglass structure model shown in Figure 5, which expands the feature dimension through downsampling, preserves the key point coordinate information through upsampling and skip connections, and finally regresses the two-dimensional coordinates of the skeletal key points through heatmaps. Conventional hourglass networks only detect the two-dimensional coordinates of objects and the human body. In this invention, a feature layer is added for each skeletal key point to regress the depth information of that key point.

[0061] In an embodiment of the present invention, the key point detection model further includes a fully connected layer, with N feature layers connected to the fully connected layer. The fully connected layer is used to regress the three-dimensional coordinate information of human skeletal key points based on the depth information of the key points.

[0062] Specifically, the number of human skeleton key points is N. As shown in Figure 6, N feature layers are added after the human skeleton key point model to regress the depth information of the key points. After concatenation and a fully connected layer, the three-dimensional coordinates of the human skeleton key points are regressed, which supervises the depth information. Finally, combined with the two-dimensional coordinate information obtained from the fused image, the three-dimensional coordinate information of the human skeleton key points is obtained end to end.
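The data flow of such a head can be illustrated at the level of array shapes. The sketch below is not the patented network: it uses random stand-ins for the backbone outputs and untrained random weights, and it assumes N = 17 key points, a softmax-weighted readout of each heatmap, and a single fully connected layer. Its only purpose is to show one depth feature layer per key point feeding a concatenation and a fully connected layer that outputs N three-dimensional coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, W = 17, 64, 64                 # N key points; 17 is an assumed value

# Stand-ins for backbone outputs: one 2D heatmap and one depth feature layer per key point.
heatmaps = rng.random((N, H, W))
depth_feats = rng.random((N, H, W))

def soft_weights(hm):
    """Normalise each heatmap into a spatial probability map (softmax)."""
    e = np.exp(hm - hm.max(axis=(1, 2), keepdims=True))
    return e / e.sum(axis=(1, 2), keepdims=True)

w = soft_weights(heatmaps)                                   # (N, H, W)
ys, xs = np.mgrid[0:H, 0:W]
uv = np.stack([(w * xs).sum(axis=(1, 2)),
               (w * ys).sum(axis=(1, 2))], axis=1)           # (N, 2) expected pixel location
d = (w * depth_feats).sum(axis=(1, 2))[:, None]              # (N, 1) per-key-point depth

# Concatenate and pass through one fully connected layer (untrained, random weights).
features = np.concatenate([uv, d], axis=1).reshape(-1)       # (3N,)
W_fc = rng.standard_normal((3 * N, 3 * N)) * 0.01
b_fc = np.zeros(3 * N)
xyz = (W_fc @ features + b_fc).reshape(N, 3)                 # (N, 3) 3D coordinates
```

In a trained model the fully connected output would be supervised with ground-truth three-dimensional key point coordinates, which in turn trains the per-key-point depth layers.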

[0063] It should be noted that the human skeleton key point detection network structure is not limited to the hourglass model and can be replaced by any regression model. Ultimately, after training, the three-dimensional coordinate information of the human skeleton can be obtained.

[0064] S6, identify the human action category based on the three-dimensional coordinate information of the detected target and the three-dimensional coordinate information of the human skeleton key points.

[0065] Specifically, by combining the three-dimensional coordinate information of the human skeleton key points obtained from the above human skeleton key point detection model with the three-dimensional coordinate information of the detection target, the human movement category can be identified.

[0066] In embodiments of the present invention, as shown in Figure 7, identifying the human action category based on the three-dimensional coordinate information of the detected target and the three-dimensional coordinate information of the human skeleton key points includes:

[0067] S61, use the skeletal key point action recognition network to obtain a preliminary classification structure of the human action based on the sequence of three-dimensional coordinate information of the human skeleton key points.

[0068] S62, based on the three-dimensional coordinate information of the detected target and the three-dimensional coordinate information of the key points of the human skeleton, obtain the relative position information of the key points of the human skeleton and the object.

[0069] S63, based on relative position information and preliminary classification structure, obtain the categories of human movements.

[0070] Specifically, the 3D coordinate information of the human skeletal keypoints obtained through the skeletal keypoint detection model over a period of time is input into a skeletal keypoint action recognition network to perform preliminary classification of human actions, resulting in a preliminary classification structure. Preliminary classifications include actions that involve only the limbs, such as bringing a hand close to the face or reaching out. Skeletal keypoint action recognition networks include, but are not limited to, ST-GCN and PoseC3D. Taking the ST-GCN algorithm as an example, the data of human skeletal keypoints is encoded to form a graph structure. Then, graph convolution and temporal convolution are used to extract the spatial and temporal features of the keypoints. Finally, a simple classification network outputs a score for each type of action, and the final classification category is determined.
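The graph encoding and the two kinds of convolution can be sketched in simplified form. This is not the ST-GCN reference implementation: the five-joint skeleton, the single block, the averaging temporal kernel, the random weights, and the four output classes are all assumptions chosen to keep the example small.

```python
import numpy as np

# Hypothetical 5-joint upper-body skeleton: 0 neck, 1 shoulder, 2 elbow, 3 wrist, 4 head.
edges = [(0, 1), (1, 2), (2, 3), (0, 4)]
V = 5

A = np.eye(V)                              # adjacency with self-loops
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
A_hat = D_inv_sqrt @ A @ D_inv_sqrt        # symmetric normalisation

def st_block(x, W_s, k_t):
    """One simplified spatial-temporal block.

    x:   (T, V, C_in) sequence of 3D key point coordinates
    W_s: (C_in, C_out) weights for the spatial graph convolution
    k_t: (K,) temporal kernel, applied along T for each joint and channel
    """
    spatial = np.einsum("uv,tvc,cd->tud", A_hat, x, W_s)   # aggregate neighbouring joints
    pad = len(k_t) // 2
    padded = np.pad(spatial, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    temporal = sum(k * padded[i:i + x.shape[0]] for i, k in enumerate(k_t))
    return np.maximum(temporal, 0.0)                        # ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal((30, V, 3))        # 30 frames of 3D key points
feat = st_block(x, rng.standard_normal((3, 8)), np.ones(3) / 3)
scores = feat.mean(axis=(0, 1)) @ rng.standard_normal((8, 4))   # 4 hypothetical classes
predicted = int(np.argmax(scores))
```

A full network stacks several such blocks with learned weights; the class with the highest score is taken as the preliminary classification.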

[0071] More specifically, based on the 3D coordinate information of the target and the 3D coordinate information of key points on the human skeleton, the key points on the human skeleton and the target can be modeled into the same 3D coordinate space. In this space, the relative position information between the key points on the human skeleton and the object can be determined through preset logic. Then, based on the relative position information between the key points on the human skeleton and the object and the preliminary classification structure, the human movement is further subdivided and judged to obtain the human movement category.

[0072] In embodiments of the present invention, the objects include at least one of the vehicle's steering wheel, door handle, window, and center console, and the categories of human actions include at least one of: whether the driver's hands have left the steering wheel, whether a passenger's hand has approached the door handle, whether a passenger's or the driver's hand has extended out of the window, and whether a passenger's or the driver's hand has touched the center console.

[0073] Specifically, the objects to be detected can include at least one of the vehicle's steering wheel, door handles, windows, and center console. The three-dimensional coordinates of these fixed objects within the vehicle cabin are detected in advance using object detection algorithms, which allows better modeling of their interaction with human actions. Finally, the human actions are further subdivided, yielding at least one of the following categories: whether the driver's hands have left the steering wheel, whether a passenger's hand is near the door handle, whether a passenger's or the driver's hand is extended out of the window, and whether a passenger's or the driver's hand is touching the center console.
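As a non-limiting illustration of how pre-detected fixed objects might support these categories, the sketch below stores each object as an axis-aligned 3D box and tests hand keypoints against it with a point-to-box distance. The box coordinates, the window plane `WINDOW_PLANE_X`, the 0.05 m touch margin, and all names are assumptions made for illustration, not values from the specification.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Box3D:
    """Axis-aligned 3D box for a fixed cabin object, in the shared 3D frame (metres)."""
    lo: tuple
    hi: tuple

    def distance_to(self, point):
        # Per-axis gap to the box; zero when the point lies within the box on that axis.
        gaps = [max(l - p, 0.0, p - h) for p, l, h in zip(point, self.lo, self.hi)]
        return math.sqrt(sum(g * g for g in gaps))

# Hypothetical positions, stored once because these objects do not move.
CABIN = {
    "steering_wheel": Box3D((-0.55, -0.20, 0.40), (-0.15, 0.20, 0.55)),
    "center_console": Box3D((-0.10, -0.45, 0.35), (0.10, -0.25, 0.80)),
}
WINDOW_PLANE_X = -0.80  # assumed driver-side window plane; x below this is outside

def driver_state(left_hand, right_hand, touch_margin=0.05):
    """Evaluate three of the categories for the driver's two hand keypoints."""
    hands = (left_hand, right_hand)
    wheel, console = CABIN["steering_wheel"], CABIN["center_console"]
    return {
        "hands_off_wheel": all(wheel.distance_to(h) > touch_margin for h in hands),
        "hand_out_of_window": any(h[0] < WINDOW_PLANE_X for h in hands),
        "touching_console": any(console.distance_to(h) <= touch_margin for h in hands),
    }

print(driver_state((-0.30, 0.00, 0.50), (0.00, -0.30, 0.50)))
```

Using boxes rather than single points lets a hand anywhere on the rim of the steering wheel count as contact, which a centre-point distance would miss.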

[0074] More specifically, the system can determine whether a human action is dangerous by classifying it. If it is dangerous, a corresponding prompt can be output, such as a voice message.
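As a non-limiting illustration, the mapping from an action category to a prompt can be sketched as a lookup table. The category names, the prompt wording, and the `speak` callback (a stand-in for a voice output interface) are assumptions made for illustration.

```python
# Hypothetical categories treated as dangerous, with illustrative prompt texts.
DANGER_PROMPTS = {
    "hands_off_wheel": "Please keep your hands on the steering wheel.",
    "hand_out_of_window": "Please keep your hands inside the vehicle.",
    "hand_near_door_handle": "Please do not open the door while the vehicle is moving.",
}

def prompt_for(category, speak=print):
    """Emit a prompt if the category is listed as dangerous; return whether one was emitted."""
    message = DANGER_PROMPTS.get(category)
    if message is None:
        return False
    speak(message)  # `speak` stands in for a text-to-speech call
    return True

prompt_for("hands_off_wheel")
```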

[0075] The human motion recognition method of this invention acquires RGB images using an RGB acquisition device and depth images using a depth camera. Combining the RGB acquisition device with the depth camera enables accurate detection of human action categories under both normal and low-light conditions. The RGB and depth images are fused, and the fused image is input into a deep neural network model for target detection to obtain the detection box information of the detected target. Based on the detection box information and the depth image, the three-dimensional coordinate information of the detected target is obtained. Human skeletal keypoint detection is performed on the depth image to obtain the three-dimensional coordinate information of the human skeletal keypoints. Interactive modeling is performed based on the three-dimensional coordinate information of the detected target and of the human skeletal keypoints to obtain a preliminary classification result of the human action. Finally, by combining the relative position information between the human skeletal keypoints and the object, the human action category is further subdivided. This human motion recognition method can therefore improve the accuracy of human action detection.

[0076] The present invention also proposes a device for human motion recognition.

[0077] In embodiments of the present invention, as shown in Figure 8, the human motion recognition device 100 is applied to a vehicle equipped with an RGB acquisition device and a depth camera. The human motion recognition device 100 includes: an acquisition module 10, used to acquire the RGB images captured by the RGB acquisition device and the depth images captured by the depth camera; a fusion module 20, used to fuse the RGB images and the depth images to obtain a fused image; a detection module 30, used to perform target detection on the fused image to obtain detection box information of the detected targets, wherein the detected targets include human bodies and objects, to obtain the three-dimensional coordinate information of the detected targets based on the detection box information and the depth image, and to perform human skeleton key point detection on the depth image to obtain the three-dimensional coordinate information of the human skeleton key points; and a recognition module 40, used to recognize the human action category based on the three-dimensional coordinate information of the detected targets and the three-dimensional coordinate information of the human skeleton key points.
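As a non-limiting illustration of how modules 10 to 40 might be composed in software, the sketch below injects each module as a callable and chains them in one processing step. The class name, the callable signatures, and the stub modules in the usage example are assumptions made for illustration; the specification does not prescribe a software interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

@dataclass
class HumanMotionRecognitionDevice:
    """Wires modules 10-40 together; each module is a callable (hypothetical interface)."""
    acquire: Callable[[], Tuple[Any, Any]]            # acquisition module 10 -> (rgb, depth)
    fuse: Callable[[Any, Any], Any]                   # fusion module 20 -> fused image
    detect: Callable[[Any, Any], Tuple[Dict, Dict]]   # detection module 30 -> (targets_3d, keypoints_3d)
    recognize: Callable[[Dict, Dict], str]            # recognition module 40 -> action category

    def step(self) -> str:
        rgb, depth = self.acquire()
        fused = self.fuse(rgb, depth)
        # Detection receives the fused image (for boxes) and the depth image
        # (for 3D coordinates and skeleton key points).
        targets_3d, keypoints_3d = self.detect(fused, depth)
        return self.recognize(targets_3d, keypoints_3d)

# Stub modules standing in for real cameras and networks.
device = HumanMotionRecognitionDevice(
    acquire=lambda: ("rgb", "depth"),
    fuse=lambda rgb, depth: (rgb, depth),
    detect=lambda fused, depth: ({"steering_wheel": (0, 0, 1)}, {"right_hand": (0, 0, 1)}),
    recognize=lambda t, k: "hands_on_wheel" if t["steering_wheel"] == k["right_hand"] else "hands_off_wheel",
)
print(device.step())
```

Injecting the modules as callables keeps the data flow of the device visible in one place and lets each module be tested or replaced on its own.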

[0078] It should be noted that other specific embodiments of the human motion recognition device of the present invention can be found in the specific embodiments of the human motion recognition method of the above embodiments of the present invention.

[0079] The present invention also proposes a computer-readable storage medium.

[0080] In an embodiment of the present invention, a computer program is stored on a computer-readable storage medium, and when the computer program is executed by a processor, it implements the human motion recognition method as described above.

[0081] The present invention also proposes a vehicle.

[0082] In embodiments of the present invention, as shown in Figure 9, the vehicle 1000 includes the human motion recognition device 100 described above.

[0083] The human motion recognition method, device, storage medium, and vehicle of this invention acquire RGB images using an RGB acquisition device and depth images using a depth camera. Combining the RGB acquisition device with the depth camera enables the vehicle to accurately detect human action categories under both normal and low-light conditions. The RGB images and depth images are fused, and the fused image is input into a deep neural network model for target detection to obtain the detection box information of the detected target. Then, based on the detection box information and the depth image, the three-dimensional coordinate information of the detected target is obtained. Human skeletal keypoint detection is performed on the depth image to obtain the three-dimensional coordinate information of the human skeletal keypoints. Interactive modeling is performed based on the three-dimensional coordinate information of the detected target and of the human skeletal keypoints to obtain a preliminary classification result of the human action. Finally, by combining the relative position information between the human skeletal keypoints and the object, the human action category is further subdivided. The accuracy of human action detection can thereby be improved.

[0084] It should be noted that the logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection having one or more wires (electronic device), a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Alternatively, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in a computer memory.

[0085] It should be understood that various parts of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0086] In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

[0087] In the description of this invention, it should be understood that the terms "center," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," and "circumferential" indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are used only for the convenience of describing this invention and simplifying the description, and are not intended to indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on this invention.

[0088] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this invention, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified.

[0089] In this invention, unless otherwise explicitly specified and limited, the terms "installation," "connection," "linking," and "fixing," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two components or the interaction between two components, unless otherwise explicitly limited. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.

[0090] In this invention, unless otherwise explicitly specified and limited, a first feature being "above" or "below" a second feature can mean that the first feature is in direct contact with the second feature, or that the first feature is in indirect contact with the second feature through an intermediate medium. Furthermore, a first feature being "above," "over," or "on top of" a second feature can mean that the first feature is directly above or diagonally above the second feature, or simply that the first feature is at a higher level than the second feature. A first feature being "below," "beneath," or "under" a second feature can mean that the first feature is directly below or diagonally below the second feature, or simply that the first feature is at a lower level than the second feature.

[0091] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.

Claims

1. A method for human motion recognition, characterized in that the method is applied to a vehicle and comprises: acquiring an RGB image and a depth image; fusing the RGB image and the depth image to obtain a fused image; performing target detection on the fused image to obtain detection box information of the detected targets, wherein the detected targets include human bodies and objects; obtaining the three-dimensional coordinate information of the detected targets based on the detection box information and the depth image; performing human skeleton key point detection on the depth image to obtain the three-dimensional coordinate information of the human skeleton key points; and identifying the human action category based on the three-dimensional coordinate information of the detected targets and the three-dimensional coordinate information of the human skeleton key points; wherein a key point detection model is used to perform the human skeleton key point detection on the depth image, the number of human skeleton key points is N, the key point detection model includes N feature layers, the N feature layers are used to regress the depth information of the key points, and N is an integer greater than 1.

2. The method for human motion recognition according to claim 1, characterized in that fusing the RGB image and the depth image comprises: acquiring the intrinsic and extrinsic parameters of the RGB acquisition device to obtain first intrinsic and extrinsic parameters, and acquiring the intrinsic and extrinsic parameters of the depth camera to obtain second intrinsic and extrinsic parameters; obtaining the geometric positional relationship between the RGB acquisition device and the depth camera; and fusing the RGB image and the depth image based on the first intrinsic and extrinsic parameters, the second intrinsic and extrinsic parameters, and the geometric positional relationship.

3. The method for human motion recognition according to claim 1, characterized in that the detection box information includes the two-dimensional coordinate information of the detected target, and obtaining the three-dimensional coordinate information of the detected target based on the detection box information and the depth image comprises: obtaining the correspondence between the fused image and the depth image; obtaining the depth information of the detected target based on the two-dimensional coordinate information of the detected target and the correspondence; and obtaining the three-dimensional coordinate information of the detected target based on the two-dimensional coordinate information and the depth information of the detected target.

4. The method for human motion recognition according to claim 1, characterized in that the key point detection model further includes a fully connected layer, the N feature layers are connected to the fully connected layer, and the fully connected layer is used to regress the three-dimensional coordinate information of the human skeleton key points based on the depth information of the key points.

5. The method for human motion recognition according to claim 1, characterized in that identifying the human action category based on the three-dimensional coordinate information of the detected target and the three-dimensional coordinate information of the human skeleton key points comprises: obtaining, using a skeletal keypoint action recognition network, a preliminary classification result of the human action based on the sequence of three-dimensional coordinate information of the human skeleton key points; obtaining the relative position information between the human skeleton key points and the object based on the three-dimensional coordinate information of the detected target and the three-dimensional coordinate information of the human skeleton key points; and obtaining the human action category based on the relative position information and the preliminary classification result.

6. The method for human motion recognition according to any one of claims 1-5, characterized in that the objects include at least one of the vehicle's steering wheel, door handle, window, and center console, and the categories of human actions include at least one of: whether the driver's hands have left the steering wheel, whether a passenger's hand has approached the door handle, whether a passenger's or the driver's hand has extended out of the window, and whether a passenger's or the driver's hand has touched the center console.

7. A device for human motion recognition, characterized in that the device is applied to a vehicle and comprises: an acquisition module, used to acquire an RGB image and a depth image; a fusion module, used to fuse the RGB image and the depth image to obtain a fused image; a detection module, used to perform target detection on the fused image to obtain detection box information of the detected targets, wherein the detected targets include human bodies and objects, to obtain the three-dimensional coordinate information of the detected targets based on the detection box information and the depth image, and to perform human skeleton key point detection on the depth image to obtain the three-dimensional coordinate information of the human skeleton key points; and a recognition module, used to identify the human action category based on the three-dimensional coordinate information of the detected targets and the three-dimensional coordinate information of the human skeleton key points; wherein the detection module is further configured to use a key point detection model to perform the human skeleton key point detection on the depth image, the number of human skeleton key points is N, the key point detection model includes N feature layers, the N feature layers are used to regress the depth information of the key points, and N is an integer greater than 1.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the method for human motion recognition according to any one of claims 1-6.

9. A vehicle, characterized in that the vehicle includes the device for human motion recognition according to claim 7.