A gesture action recognition method and device, electronic equipment, and storage medium
By processing radar echo signals, the human tracking body and gesture target are separated, and feature vectors are generated for gesture recognition. This solves the problem of separating gesture targets from body targets, improves recognition accuracy and reliability, and adapts to complex environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN UNIV
- Filing Date
- 2024-04-01
- Publication Date
- 2026-06-12
Figure CN118072399B_ABST
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and in particular to a method, apparatus, electronic device, and storage medium for recognizing gestures.
Background Technology
[0002] With the rapid development of electronic information technology and the popularization of smart devices, a variety of human-computer interaction methods have emerged, such as human posture, facial expressions, voice interaction, and gesture recognition. Among them, gesture recognition is an important human-computer interaction method that can greatly improve the user experience and convenience.
[0003] Gesture recognition in existing technologies typically includes three methods: wearable device-based gesture recognition, computer vision-based gesture recognition, and radar sensor-based gesture recognition. Wearable device-based gesture recognition uses wearable devices integrating various sensors to perform gesture recognition; however, this method requires users to wear the device for extended periods, resulting in high costs and a poor user experience. Computer vision-based gesture recognition uses various camera sensors, such as monocular cameras, binocular cameras, depth cameras, and infrared cameras, to capture static images or dynamic videos of gestures. However, this method is significantly affected by environmental and lighting conditions and requires substantial computing resources. Radar sensor-based gesture recognition uses radar sensors to emit modulated electromagnetic wave signals and demodulates and processes the reflected echo signals to obtain information about the distance, speed, and angle of the gesture target. Compared to wearable device-based and computer vision-based gesture recognition, radar sensor-based gesture recognition is largely unaffected by environmental and lighting conditions, exhibits strong anti-interference capabilities, and is low-cost. However, in radar sensor-based gesture recognition, the gesture target and the body target often coexist in the radar's detection area. In this case, it is difficult to separate the gesture target from the body target. At the same time, body movements can be misjudged as gesture movements, which reduces the accuracy of gesture recognition.
Summary of the Invention
[0004] This invention provides a method, apparatus, electronic device, and storage medium for recognizing gestures, in order to improve the accuracy and reliability of gesture recognition.
[0005] In a first aspect, embodiments of the present invention provide a method for recognizing gestures, the method comprising:
[0006] determining human target point cloud data, and determining a human tracking body based on the human target point cloud data;
[0007] if it is determined, based on the human tracking body, that the human target action condition is met, determining action point cloud data and stable point cloud data based on the human target point cloud data of each frame that meets the human target action condition;
[0008] determining an action detection feature vector and a gesture recognition feature vector based on the action point cloud data and the stable point cloud data;
[0009] if it is determined, based on the action detection feature vector, that the dynamic action condition is met, performing body action recognition and gesture action recognition based on the gesture recognition feature vector.
[0010] In a second aspect, embodiments of the present invention also provide a gesture recognition device, the device comprising:
[0011] a human tracking body determination module, used to determine human target point cloud data and determine the human tracking body based on the human target point cloud data;
[0012] a point cloud data separation module, used to determine the action point cloud data and the stable point cloud data based on the human target point cloud data of each frame that meets the human target action condition, if the human target action condition is determined to be met based on the human tracking body;
[0013] a feature vector determination module, used to determine the action detection feature vector and the gesture recognition feature vector based on the action point cloud data and the stable point cloud data;
[0014] a gesture recognition module, used to perform body action recognition and gesture action recognition based on the gesture recognition feature vector, if the dynamic action condition is determined to be met based on the action detection feature vector.
[0015] In a third aspect, embodiments of the present invention also provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement a gesture recognition method as described in any of the embodiments of the present invention.
[0016] In a fourth aspect, embodiments of the present invention also provide a storage medium storing computer-executable instructions which, when executed by a computer processor, perform a gesture recognition method as described in any of the embodiments of the present invention.
[0017] The technical solution of this invention determines the human tracking body from the human target point cloud data, evaluates the human tracking body to obtain the human target point cloud data of each frame that meets the human target action condition, and determines action point cloud data and stable point cloud data from those frames. Based on the action point cloud data and the stable point cloud data, it determines an action detection feature vector and a gesture recognition feature vector. If the action detection feature vector shows that the dynamic action condition is met, body action recognition and gesture action recognition are performed based on the gesture recognition feature vector. This method accommodates the simultaneous presence of a human body and a gesture within the detection area while the person moves freely within that area. It separates the gesture action target from the human target and can still recognize normal gesture actions under the interference of body actions, which reduces action misjudgments in gesture recognition and improves its overall accuracy and reliability. In addition, the method has the following advantages: it is non-contact; it does not infringe on personal privacy; it is not affected by light or dust; and it is small, easy to integrate, low in power consumption, and can be embedded in a device.
[0018] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.
Brief Description of the Drawings
[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a flowchart of a gesture recognition method provided in Embodiment 1 of the present invention;
[0021] Figure 2 is a flowchart of another gesture recognition method provided in Embodiment 1 of the present invention;
[0022] Figure 3 is a schematic diagram of the structure of a gesture recognition device provided in Embodiment 2 of the present invention;
[0023] Figure 4 is a schematic diagram of the structure of an electronic device provided in Embodiment 3 of the present invention.
Detailed Implementation
[0024] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0025] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0026] Embodiment 1
[0027] Figure 1 is a flowchart of a gesture recognition method provided in Embodiment 1 of the present invention. This embodiment is applicable to situations where gestures are recognized. The method can be executed by a gesture recognition device, which can be implemented in hardware and/or software. The gesture recognition device can be configured in an electronic device with network connectivity.
[0028] As shown in Figure 1, the method includes:
[0029] S110. Determine the human target point cloud data and determine the human tracking body based on the human target point cloud data.
[0030] Among them, the human target point cloud data is the point cloud data of the human body within the radar detection range obtained by radar scanning.
[0031] The radar device scans the human body in front of it, preprocesses the human body point cloud data obtained from the scan, uses the preprocessed human body point cloud data as human body target point cloud data, and filters the human body target point cloud data to obtain the human body tracking body.
[0032] Furthermore, the radar continuously transmits electromagnetic wave signals into space. These signals are scattered or reflected by objects and received by the radar receiver. After passing through a signal amplifier, a mixer, and ADC sampling, a discrete echo signal containing range, time, and angle dimensions is obtained. The echo received by the radar module can be represented as $y(m,n,k)$, where $m$ is the slow-time dimension, representing the m-th linear frequency modulated continuous wave signal; $n$ is the fast-time dimension, representing the n-th sampling point; and $k$ is the antenna dimension, representing the received signal of the k-th channel. A Fast Fourier Transform is performed on the received radar echo along the fast-time dimension to obtain the range, slow-time, and antenna dimension signal $y(m,r,k)$, where $r \in [1,N]$ indexes the range cells. In real-world environments, radar echo signals contain not only hand gesture target signals, but also micro-dynamic interference target signals, static objects and background clutter, transmitter leakage signals, and noise. Therefore, the echo signal must be preprocessed, i.e., clutter must be suppressed. Algorithms such as pre-recorded background clutter subtraction, moving averaging, and singular value decomposition can be used to obtain the clutter-suppressed signal $y_p(m,r,k)$.
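The range FFT and clutter suppression described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the synthetic echo, and the use of a plain slow-time mean subtraction (a simplified stand-in for the moving-average option) are assumptions made here.

```python
import numpy as np

def range_fft(y):
    """FFT along the fast-time axis (axis 1) of an echo cube y[m, n, k]."""
    return np.fft.fft(y, axis=1)

def suppress_static_clutter(y_r):
    """Subtract the slow-time mean of every range cell and channel.

    Assumption: static clutter is constant over the chirps of one frame,
    so removing the mean over axis 0 cancels it.
    """
    return y_r - y_r.mean(axis=0, keepdims=True)

# Synthetic echo (hypothetical values): M chirps, N samples, K channels.
M, N, K = 16, 64, 4
n = np.arange(N)
m = np.arange(M)[:, None]
static = np.exp(2j * np.pi * 5 * n / N)[None, :]        # static reflector, range cell 5
moving = np.exp(2j * np.pi * (20 * n / N + 0.25 * m))   # moving target, range cell 20
y = np.repeat((static + moving)[:, :, None], K, axis=2)

y_r = range_fft(y)                  # range, slow-time, antenna signal
y_p = suppress_static_clutter(y_r)  # clutter-suppressed signal
```

After suppression, the static reflector in range cell 5 vanishes while the moving target in range cell 20 keeps its energy.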
[0033] Furthermore, the human target point cloud data is preprocessed to obtain the human tracking body, including angle estimation, target detection, velocity estimation, and coordinate transformation.
[0034] Furthermore, the angle estimation applies a super-resolution angle estimation algorithm (2D-MUSIC) to the clutter-suppressed signal $y_p(m,r,k)$ to estimate the angle of arrival. The specific formulas are as follows:
[0035]
[0036]
[0037]
[0038] Among them, $R_{kk,r}$ is the constructed signal covariance matrix, $M$ is the number of chirp waveforms in a frame of human target point cloud data, $U_s$ is the eigenvector matrix of the signal subspace, $\Sigma_s$ is the eigenvalue matrix of the signal subspace, $U_n$ is the eigenvector matrix of the noise subspace, $\Sigma_n$ is the eigenvalue matrix of the noise subspace, $a(\theta,\varphi)$ is the two-dimensional steering vector of the array elements, $\theta$ represents the target azimuth angle, and $\varphi$ represents the target pitch angle.
[0039] Among them, the angle estimation algorithm (2D-Multiple Signal Classification, 2D-MUSIC) is a subspace decomposition-based algorithm. It utilizes the orthogonality of the signal subspace and noise subspace to construct a spatial spectrum function and estimate the signal parameters through spectral peak search.
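For reference, a textbook form of the 2D-MUSIC computation can be written as below. This is a sketch under assumptions rather than the original formulas: the snapshot vector $\mathbf{y}_p(m,r)$ (the channel samples of chirp $m$ in range cell $r$) and the symbol $\varphi$ for the pitch angle are notational choices made here.

```latex
\begin{align}
R_{kk,r} &= \frac{1}{M}\sum_{m=1}^{M} \mathbf{y}_p(m,r)\,\mathbf{y}_p^{H}(m,r) \\
R_{kk,r} &= U_s \Sigma_s U_s^{H} + U_n \Sigma_n U_n^{H} \\
P_{\mathrm{MUSIC}}(\theta,\varphi) &= \frac{1}{a^{H}(\theta,\varphi)\,U_n U_n^{H}\,a(\theta,\varphi)}
\end{align}
```

The peaks of $P_{\mathrm{MUSIC}}(\theta,\varphi)$ over the searched angle grid give the estimated azimuth and pitch angles, because the steering vector of a true target direction is orthogonal to the noise subspace.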
[0040] Furthermore, target detection can use a constant false alarm rate (CFAR) detector. The principle of human target detection is as follows: for each dimension, the L reference cells on the left side and the L reference cells on the right side of the cell under test are averaged separately, the two averages are compared, and the smaller one is taken as the clutter background level of the cell under test. The detection threshold $U_0$ is determined from this background level and a threshold parameter $A$. When the value of the cell under test is greater than $U_0$, a human target is considered to be present; when it is less than $U_0$, no human target is considered to be present. Adjusting the threshold parameter $A$ changes the threshold $U_0$ and thereby controls the false alarm rate. Since the position of a person within the detection area is uncertain, $A$ is set as an adaptive parameter that changes with the range cell having the highest energy. The detected human targets form a three-dimensional point cloud containing the range $r$, the azimuth angle $\theta$, and the pitch angle $\varphi$ of each target, and the signal-to-noise ratio (SNR) of each target is also estimated.
[0041] Among them, constant false alarm rate detector (CFAR) detection refers to a common form of adaptive algorithm used in radar systems to detect target echoes in the context of noise, clutter and interference.
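The smaller-of-two-averages rule described above corresponds to what is commonly called smallest-of CFAR. The following one-dimensional sketch illustrates it; the guard cells, the multiplicative threshold form `scale * background`, the fixed (non-adaptive) `scale`, and all names are assumptions made for illustration.

```python
import numpy as np

def so_cfar_1d(power, num_ref, num_guard, scale):
    """Smallest-of CFAR along one dimension; returns a boolean detection mask."""
    power = np.asarray(power, dtype=float)
    detections = np.zeros(power.shape, dtype=bool)
    margin = num_ref + num_guard
    for i in range(margin, len(power) - margin):
        left = power[i - num_guard - num_ref : i - num_guard]
        right = power[i + num_guard + 1 : i + num_guard + 1 + num_ref]
        # The smaller of the two side averages is the clutter background level.
        background = min(left.mean(), right.mean())
        threshold = scale * background  # assumed form of the detection threshold
        detections[i] = power[i] > threshold
    return detections

# Hypothetical power profile: flat noise floor with one strong target.
profile = np.ones(40)
profile[20] = 30.0
hits = so_cfar_1d(profile, num_ref=4, num_guard=1, scale=5.0)
```

Raising `scale` lowers the false alarm rate at the cost of missing weaker targets, which mirrors the role of the threshold parameter in the text.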
[0042] Furthermore, velocity estimation is performed on the three-dimensional point cloud. To improve velocity resolution, a sparse recovery algorithm, orthogonal matching pursuit (OMP), is used. In OMP velocity estimation, for a given point, the inner product of the residual and the spatial frequency vector can be expressed as:
[0043] $p = a^{H}\gamma$;
[0044] Where a is the spatial frequency steering vector and γ is the residual vector. Calculate the maximum value of the inner product p, update the residuals based on its corresponding index, and calculate the angle-Doppler image:
[0045]
[0046]
[0047] Among them, $a^{(q+1)}$ is the spatial frequency steering vector corresponding to the maximum value of the inner product, and $x$ is the generated point cloud target data. The calculated least-squares solution represents the angle-Doppler image. The target velocity $v$ is obtained from the angle-Doppler image and added to the three-dimensional point cloud set as an additional velocity dimension, resulting in a four-dimensional point cloud set.
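The iteration described above (inner product, selection of the maximum, least-squares solution, residual update) follows the generic orthogonal matching pursuit scheme, sketched below. The dictionary of Doppler steering vectors, the grid, and the test signal are hypothetical choices for illustration; the original formulas may differ in detail.

```python
import numpy as np

def omp(dictionary, y, num_iters):
    """Orthogonal matching pursuit: find sparse x with y ~= dictionary @ x."""
    residual = y.astype(complex).copy()
    support = []
    x = np.zeros(dictionary.shape[1], dtype=complex)
    for _ in range(num_iters):
        p = dictionary.conj().T @ residual       # inner products p = a^H * gamma
        idx = int(np.argmax(np.abs(p)))          # steering vector with maximum inner product
        if idx not in support:
            support.append(idx)
        sub = dictionary[:, support]
        coeffs, *_ = np.linalg.lstsq(sub, y, rcond=None)  # least-squares solution on the support
        residual = y - sub @ coeffs              # residual update
    x[support] = coeffs
    return x

# Hypothetical dictionary: unit-norm steering vectors on a normalized Doppler grid.
M = 32
grid = np.arange(-16, 16) / 32.0
chirps = np.arange(M)[:, None]
A = np.exp(2j * np.pi * chirps * grid[None, :]) / np.sqrt(M)
y = 3.0 * A[:, 20] + 1.5 * A[:, 5]               # two components at grid indices 20 and 5
x_hat = omp(A, y, num_iters=2)
```

The nonzero entries of `x_hat` mark the recovered Doppler bins, from which a velocity per point could be read off.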
[0048] Furthermore, coordinate transformation involves defining the activity space area and retaining only the four-dimensional point cloud data of the human body within that area. Since the four-dimensional point cloud data generated through the above steps is in spherical coordinates, it is converted to Cartesian coordinates to obtain more intuitive spatial information, resulting in a six-dimensional point cloud set $\mathrm{PointCloud}(x, y, z, v_x, v_y, v_z)$.
[0049]
[0050] Where $r$ is the distance; $\theta$ is the azimuth angle; $\varphi$ is the pitch angle; $v$ is the velocity.
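One possible spherical-to-Cartesian conversion is sketched below. The axis convention (radar boresight along the y-axis, azimuth measured from boresight in the horizontal plane, pitch measured upward from that plane) and the projection of the measured velocity along the line of sight are assumptions made here, not taken from the source.

```python
import numpy as np

def spherical_to_cartesian(r, azimuth, pitch, v):
    """Convert (r, azimuth, pitch, v) points to (x, y, z, v_x, v_y, v_z)."""
    # Unit line-of-sight vector under the assumed axis convention.
    ux = np.cos(pitch) * np.sin(azimuth)
    uy = np.cos(pitch) * np.cos(azimuth)
    uz = np.sin(pitch)
    unit = np.stack([ux, uy, uz], axis=-1)
    position = r[..., None] * unit
    velocity = v[..., None] * unit  # velocity treated as radial (assumption)
    return np.concatenate([position, velocity], axis=-1)

points = spherical_to_cartesian(
    r=np.array([2.0, 1.0]),
    azimuth=np.array([0.0, np.pi / 2]),
    pitch=np.array([0.0, 0.0]),
    v=np.array([0.5, -1.0]),
)
```

With this convention the first point lies 2 m straight ahead of the radar and the second lies 1 m to the side.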
[0051] Furthermore, the human tracking body is determined from the human target point cloud data as follows: the z-axis data of the human target point cloud data is ignored, compressing the three-dimensional coordinates into two dimensions, and the result is combined with the target signal-to-noise ratio (SNR) obtained in the target detection above to obtain the two-dimensional point cloud information PointCloud2D(x,y,SNR). This two-dimensional information is used to perform two-dimensional Kalman filter human target tracking, which includes four parts: target position prediction, association of point cloud and target data, target position update, and track start and end.
[0052] First, the position of the tracked target obtained from the point cloud data of historical frames is predicted a priori based on the constant velocity model. If the number of tracked targets obtained previously is 0, this step is skipped. Then, the two-dimensional human target point cloud data and the tracked target point cloud data are associated based on distance, velocity difference, number of points, and signal-to-noise ratio association rules. That is, the two-dimensional human target point cloud data obtained in the current frame is associated with the tracked targets identified from the human target point cloud data of historical frames, and the position of the tracked target in consecutive frames is updated. For human target point cloud data that is not associated in the current frame, it indicates that it may be generated by a new human target. Based on the number and density of human target point clouds, it is determined whether to establish a new human target and obtain the position of the newly established human target in the current frame. If a human target is not associated with human target point cloud data for several consecutive frames, it is considered that the human target has left the radar detection area. Furthermore, among all the tracked targets obtained, those that meet the preset first distance condition are the target human targets to be determined.
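The prediction and update steps of such a tracker can be sketched with a standard constant-velocity Kalman filter. The class name, the noise settings, and the use of a noise-free centroid as the measurement are assumptions for illustration; data association and track start and end are omitted.

```python
import numpy as np

class ConstantVelocityKalman2D:
    """Minimal 2D Kalman filter with state [x, y, v_x, v_y]."""

    def __init__(self, x, y, dt=0.1, process_var=1.0, meas_var=0.05):
        self.state = np.array([x, y, 0.0, 0.0])
        self.cov = np.eye(4)
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)   # constant-velocity model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # only position is measured
        self.Q = process_var * np.eye(4)
        self.R = meas_var * np.eye(2)

    def predict(self):
        """A priori position prediction from the constant-velocity model."""
        self.state = self.F @ self.state
        self.cov = self.F @ self.cov @ self.F.T + self.Q
        return self.state[:2]

    def update(self, measurement):
        """Correct the prediction with the associated point cloud position."""
        innovation = measurement - self.H @ self.state
        S = self.H @ self.cov @ self.H.T + self.R
        K = self.cov @ self.H.T @ np.linalg.inv(S)
        self.state = self.state + K @ innovation
        self.cov = (np.eye(4) - K @ self.H) @ self.cov

kf = ConstantVelocityKalman2D(x=0.0, y=2.0, dt=0.1)
for frame in range(1, 51):
    truth = np.array([0.1 * frame, 2.0])  # hypothetical target walking along x at 1 m/s
    kf.predict()
    kf.update(truth)                      # noise-free centroid used as the measurement
```

In a full tracker, the measurement passed to `update` would come from the point cloud associated with the track in the current frame.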
[0053] S120. If the human target action conditions are met based on the human body tracking body, then the action point cloud data and stable point cloud data are determined based on the human target point cloud data of each frame that meets the human target action conditions.
[0054] Among them, the human target action condition is the condition for determining whether the human target has made an action.
[0055] Motion point cloud data refers to the point cloud data corresponding to the parts of the human body that show changes in motion when the human body being tracked moves; stable point cloud data refers to the point cloud data corresponding to the parts of the human body that do not show changes in motion or show only minor changes in motion when the target human body in the human body being tracked moves.
[0056] If the point cloud data of the human body being tracked meets the conditions for the human target action, it indicates that the target human body has performed an action. At this time, based on the human target point cloud data of each frame that meets the conditions for the human target action, the action point cloud data and the stable point cloud data are determined, thus realizing the separation of action point cloud data and stable point cloud data, which facilitates subsequent gesture recognition based on the action point cloud data.
[0057] Optionally, determining, based on the human tracking body, that the human target action condition is met includes steps A1-A3:
[0058] Step A1: Identify the target human tracking body that meets the first preset distance condition with respect to the radar.
[0059] The first preset distance condition can be that the human tracking body is the one closest to the radar, in which case the human tracking body closest to the radar is selected as the target human tracking body; or it can be that the human tracking body is the one closest to the radar and its distance is less than or equal to a distance threshold. This embodiment also supports gesture recognition for multiple people. Correspondingly, the first preset distance condition can be that the distance to the radar is less than or equal to the distance threshold, in which case every human tracking body within that distance can be selected as a target human tracking body, and subsequent gesture recognition is performed on each target human tracking body separately. This embodiment does not limit the specific content of the first preset distance condition.
[0060] Step A2: Determine the target human body spatial cube based on the target human body tracking body.
[0061] The target human body space cube contains all the human body target point cloud data of the target human body tracking body.
[0062] A target human body spatial cube is established with the position of the target human tracking body as its center. This cube includes all human target point cloud data $\mathrm{PointCloud}(x, y, z, v_x, v_y, v_z)$ corresponding to the target human tracking body. The side lengths $\mathrm{Cube}(a, b, c)$ of the target human body spatial cube are calculated based on the human target point cloud data.
[0063] The target human body space cube can be the minimum bounding cube of the human target point cloud data corresponding to the target human tracking body. It is obtained by calculating the normal vector of each human target point cloud datum and projecting all human target point cloud data of the human tracking body onto a two-dimensional surface.
[0064] The two-dimensional surface is obtained as follows: the bounding box algorithm is used to calculate the minimum bounding cube of all human target point cloud data sets. The origin is defined as the lower left rear vertex of the minimum bounding cube. The x-axis and y-axis are parallel to the side of the minimum bounding cube, and the z-axis is perpendicular to the bottom and upward. The z-axis direction is defined as the main mapping direction. The xy-plane is the two-dimensional surface.
[0065] Among them, the bounding box algorithm is an algorithm for finding the optimal bounding space of a discrete point set. The basic idea is to use a geometric object with a slightly larger volume and simpler characteristics (called a bounding box) to approximate a complex geometric object.
[0066] Step A3: If it is determined that the side length of the target human body space cube in the current frame is greater than or equal to the side length threshold, then the current frame is determined to meet the human body target action condition.
[0067] The side length threshold can be set based on empirical data, or it can be determined based on the average side length of the target human body space cube in each frame.
[0068] Furthermore, the side length threshold can be determined as follows: the side lengths of the target human body space cube are averaged over the T frames before the current frame and over the T frames after it, the two averages are compared, and the smaller one is used as the background level for the side length of the target human body space cube in the current frame. The detection threshold is then determined from this background level and a parameter W, where W is assigned different values depending on the size of the background level.
[0069] If the side length of the target human body space cube in the current frame is determined to be greater than or equal to the side length threshold, then the target human body space cube in the current frame is considered to have undergone a sudden change, that is, the target human body has performed a movement, indicating that the current frame meets the human body target movement condition.
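Steps A2 and A3 can be sketched as follows. The axis-aligned bounding box, the multiplicative threshold `W * background`, and the constant `W` are assumptions made for illustration; the text only states that the threshold depends on the background level and on W, and that W varies with the background level.

```python
import numpy as np

def cube_side_lengths(points_xyz):
    """Side lengths (a, b, c) of the axis-aligned bounding cuboid of one frame."""
    return points_xyz.max(axis=0) - points_xyz.min(axis=0)

def action_frames(side, T, W):
    """Flag frames whose cube side length reaches the adaptive threshold."""
    side = np.asarray(side, dtype=float)
    flags = np.zeros(len(side), dtype=bool)
    for i in range(T, len(side) - T):
        before = side[i - T:i].mean()          # average over the T previous frames
        after = side[i + 1:i + 1 + T].mean()   # average over the T following frames
        background = min(before, after)        # smaller average is the background level
        flags[i] = side[i] >= W * background   # assumed threshold form
    return flags

# Hypothetical data: one frame of points, then a per-frame side-length series.
frame_points = np.array([[0.0, 0.0, 0.0], [0.4, 0.3, 1.7]])
sides_abc = cube_side_lengths(frame_points)

series = [0.5] * 10
series[5] = 1.2                                # arm extended in frame 5
flags = action_frames(series, T=3, W=1.5)
```

Only frame 5, where the cube suddenly widens, is flagged as meeting the human target action condition.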
[0070] Optionally, based on the human target point cloud data of each frame that meets the human target action conditions, determine the motion point cloud data and the stable point cloud data, including steps B1-B3:
[0071] Step B1: Determine the target human space cube for each frame that satisfies the target human action conditions.
[0072] Step B2: If the difference between the side length of the target human space cube corresponding to the target frame and the side length of the target human space cube corresponding to the adjacent frame of the target frame is greater than or equal to a preset difference threshold, then the target human space cube corresponding to the target frame is deleted.
[0073] The preset difference threshold can be set based on experience; or it can be set based on the product of the average side length of the target human body space cube in each frame and a preset ratio.
[0074] The target frame is any one of the frames satisfying the human target action condition whose target human body space cube is currently being checked.
[0075] The adjacent frames of the target frame can be either the two adjacent frames of the target frame or one of the two adjacent frames of the target frame.
[0076] Calculate the side length of the target human body space cube corresponding to the target frame, and subtract the side length of the target human body space cube corresponding to the target frame from the side length of the target human body space cube corresponding to the adjacent frame. If the difference is greater than or equal to the preset difference threshold, the target frame is considered to be greatly affected by clutter, and the target human body space cube corresponding to the target frame is deleted.
[0077] In this embodiment, the target human space cube of the complete action frame is obtained under multi-frame detection, and clutter frames are deleted according to the fluctuation of the side length of the cube in adjacent frames. This can eliminate the influence of clutter on gesture recognition and improve the accuracy of gesture recognition.
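The clutter-frame deletion of step B2 can be sketched as below. Comparing against both adjacent frames and using the absolute difference are choices made here for illustration; the text also allows comparing against only one adjacent frame.

```python
def drop_clutter_frames(side_lengths, diff_threshold):
    """Return the indices of frames kept after removing clutter frames.

    A frame is treated as clutter when its cube side length differs from
    every available adjacent frame by at least diff_threshold.
    """
    kept = []
    n = len(side_lengths)
    for i, side in enumerate(side_lengths):
        neighbours = [side_lengths[j] for j in (i - 1, i + 1) if 0 <= j < n]
        is_clutter = bool(neighbours) and all(
            abs(side - other) >= diff_threshold for other in neighbours
        )
        if not is_clutter:
            kept.append(i)
    return kept

# Hypothetical side lengths in metres; frame 3 is an isolated spike.
kept = drop_clutter_frames([1.0, 1.05, 1.1, 2.4, 1.15, 1.2], diff_threshold=0.5)
```

Requiring the jump relative to both neighbours keeps the frames next to a spike, which differ strongly from the spike but not from their other neighbour.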
[0078] Step B3: Based on the human target point cloud data corresponding to the other target human space cubes after deletion, determine the motion point cloud data and the stable point cloud data.
[0079] Optionally, determine the motion point cloud data and the stable point cloud data, including steps C1-C3:
[0080] Step C1: Determine the human target point cloud data at each height layer that meets the second preset distance condition with respect to the radar.
[0081] Meeting the second preset distance condition can be achieved by the distance to the radar being less than or equal to a second preset distance threshold; it can also be achieved by the distance to a preset face of the target human body space cube being less than or equal to a second preset distance threshold, wherein the preset face of the target human body space cube is the face of the target human body space cube closest to the radar; or it can be achieved by layering the human target point cloud data along the height direction, sorting the distances between the human target point cloud data at each height layer and the radar or the preset faces of the target human body space cube in ascending order, and selecting the human target point cloud data at the height layers with the smallest preset distances as the human target point cloud data that meets the second preset distance condition.
[0082] Step C2: For the human target point cloud data at the target height layer, determine the human target point cloud data that meets the third preset distance condition with the radar as the motion point cloud data.
[0083] The condition of satisfying the third preset distance can mean that the distance between the target and the radar or the distance between the target and the preset face of the target human body space cube is less than the third preset distance threshold; or it can mean that the point cloud data of each human body target at the target height layer are sorted in ascending order of distance between the target and the radar or the preset face of the target human body space cube, and the preset number or preset proportion of human body target point cloud data with the smallest distance are taken as the human body target point cloud data at the target height layer that satisfy the third preset distance condition.
[0084] Understandably, during gesture recognition, the radar is typically positioned facing the front of the human target. Therefore, the point cloud data corresponding to the human target's hand is usually point cloud data that is close to the radar or a preset face of the target's human body spatial cube. Therefore, this embodiment first layers the human target point cloud data along the height direction to obtain human target point cloud data at each height layer that meets the second preset distance condition. Then, the human target point cloud data at each height layer that meets the second preset distance condition is further segmented according to the distance to the radar or the distance to a preset face of the target's human body spatial cube to obtain human target point cloud data that meets the third preset distance condition, which is considered to be the point cloud data corresponding to the human target's hand. The technical solution of this embodiment can effectively and accurately segment action point cloud data, thereby providing a data foundation for subsequent gesture recognition.
[0085] Step C3: Use the human target point cloud data other than the motion point cloud data as stable point cloud data.
[0086] The human target point cloud data other than the motion point cloud data in the human target point cloud data of each height layer are used as stable point cloud data.
[0087] For example, during the time period in which the action occurs, the target human spatial cube is layered by height, and distance judgment rules are created: for coordinate data facing the radar direction, at a certain height, if the distance between the coordinates of the human target point cloud data in that direction and the face of the target human spatial cube closest to the radar is small, then that height is determined to be one of the heights of the action point cloud data. All height layers that meet the rules constitute the overall height of the action point cloud data. Then, based on the scene requirements and the characteristics of effective gesture actions facing the radar, for each height, the top 20% of the human target point cloud data with the smallest distance to the face of the target human spatial cube closest to the radar can be taken as the action point cloud data for potential gesture action recognition, and the rest are stable point cloud data.
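The layering and nearest-fraction selection in this example can be sketched as follows. The axis convention (radar looking along +y, z as height), the parameter names, and the default values other than the 20% ratio are assumptions made for illustration.

```python
import numpy as np

def split_action_stable(points, layer_height=0.2, layer_gap=0.3, ratio=0.2):
    """Split points (N, 3: x, y, z) into action and stable point clouds."""
    front_face = points[:, 1].min()        # cube face closest to the radar
    depth = points[:, 1] - front_face      # distance of each point to that face
    layer = np.floor((points[:, 2] - points[:, 2].min()) / layer_height).astype(int)
    action = np.zeros(len(points), dtype=bool)
    for layer_id in np.unique(layer):
        idx = np.flatnonzero(layer == layer_id)
        if depth[idx].min() > layer_gap:   # no point of this layer is near the front face
            continue
        count = max(1, int(round(ratio * len(idx))))
        nearest = idx[np.argsort(depth[idx])[:count]]  # closest fraction of the layer
        action[nearest] = True
    return points[action], points[~action]

# Hypothetical frame: a torso at y = 2.0 m and one hand point reaching to y = 1.5 m.
torso = np.array([[x, 2.0, z] for z in (0.0, 0.7, 1.2, 1.7)
                  for x in (-0.2, -0.1, 0.1, 0.2)])
hand = np.array([[0.0, 1.5, 1.3]])
action_pts, stable_pts = split_action_stable(np.vstack([torso, hand]), layer_height=0.5)
```

Here only the height layer containing the hand has a point near the front face, so the hand point becomes action point cloud data and the torso points remain stable point cloud data.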
[0088] S130. Based on the motion point cloud data and the stable point cloud data, determine the motion detection feature vector and the gesture recognition feature vector.
[0089] Velocity and position features are extracted from motion point cloud data and stable point cloud data, and motion detection feature vectors and gesture recognition feature vectors are generated based on the velocity and position features.
[0090] Optionally, based on the action point cloud data and the stable point cloud data, determine the action detection feature vector and the gesture recognition feature vector, including steps D1-D2:
[0091] Step D1: Determine the action detection feature vector using the following formula:
[0092] $F_1 = [\max(v_{gx}), \max(v_{gz}), \min(v_{gx}), \min(v_{gz}), (v_{bx} - v_{gx}), (v_{bz} - v_{gz})]$;
[0093] Where $F_1$ represents the action detection feature vector, $v_{gx}$ represents the projection of the velocity of the action point cloud data onto the x-axis, $v_{gz}$ represents the projection of the velocity of the action point cloud data onto the z-axis, $v_{bx}$ represents the projection of the velocity of the stable point cloud data onto the x-axis, and $v_{bz}$ represents the projection of the velocity of the stable point cloud data onto the z-axis.
[0094] Step D2: Determine the gesture recognition feature vector using the following formula:
[0095] $F_2 = [v_{bx}, v_{bz}, (v_b - v_g), (H_g - H_b)]$;
[0096] Where $F_2$ represents the gesture recognition feature vector, $v_b$ represents the velocity of the stable point cloud data, $v_g$ represents the velocity of the action point cloud data, $H_g$ represents the height of the action point cloud data, and $H_b$ represents the height of the stable point cloud data.
[0097] For example, velocity and position features are extracted from the action point cloud data and the stable point cloud data. From the velocity information $(v_b, v_{bx}, v_{bz})$ of the stable point cloud data, the velocity information $(v_g, v_{gx}, v_{gz})$ of the action point cloud data, the height information $H_b$ of the stable point cloud data, and the height information $H_g$ of the action point cloud data, the feature vector $F_1 = [\max(v_{gx}), \max(v_{gz}), \min(v_{gx}), \min(v_{gz}), (v_{bx} - v_{gx}), (v_{bz} - v_{gz})]$ and the feature vector $F_2 = [v_{bx}, v_{bz}, (v_b - v_g), (H_g - H_b)]$ are obtained.
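As a rough illustration of how F1 and F2 might be assembled for one frame, the sketch below computes both vectors from per-point values. The text does not say how per-point quantities are aggregated into a single value, so this sketch assumes the arithmetic mean for every term outside the max/min; the dictionary keys `vx`, `vz`, `v`, `h` and the function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def feature_vectors(action_pts, stable_pts):
    """Return (F1, F2) for one frame.

    Each point is a dict with keys vx, vz (velocity projections), v (speed)
    and h (height); this layout is an assumption made for illustration.
    """
    v_gx = [p["vx"] for p in action_pts]
    v_gz = [p["vz"] for p in action_pts]

    # Assumption: single-valued terms are per-cloud means.
    m_gx, m_gz = mean(v_gx), mean(v_gz)
    m_bx = mean([p["vx"] for p in stable_pts])
    m_bz = mean([p["vz"] for p in stable_pts])
    m_g = mean([p["v"] for p in action_pts])
    m_b = mean([p["v"] for p in stable_pts])
    h_g = mean([p["h"] for p in action_pts])
    h_b = mean([p["h"] for p in stable_pts])

    f1 = [max(v_gx), max(v_gz), min(v_gx), min(v_gz), m_bx - m_gx, m_bz - m_gz]
    f2 = [m_bx, m_bz, m_b - m_g, h_g - h_b]
    return f1, f2


if __name__ == "__main__":
    hand = [{"vx": 1.0, "vz": -1.0, "v": 2.0, "h": 1.5},
            {"vx": 3.0, "vz": 0.5, "v": 4.0, "h": 1.5}]
    body = [{"vx": 0.5, "vz": 0.0, "v": 0.5, "h": 1.0},
            {"vx": 0.5, "vz": 0.0, "v": 0.5, "h": 1.0}]
    print(feature_vectors(hand, body))
```

Computing one F2 per frame yields the feature sequence that is later resampled and passed to the recognition model.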
[0098] S140. If the dynamic action condition is determined to be met based on the action detection feature vector, then body action recognition and gesture action recognition are performed based on the gesture recognition feature vector.
[0099] The action is judged based on the action detection feature vector. If the action detection feature vector meets the dynamic action condition, the human target's action is a valid dynamic action, and body action recognition and gesture action recognition can then be performed based on the gesture recognition feature vector. If the action detection feature vector does not meet the dynamic action condition, the human target's action is an invalid static action, and no further body action recognition or gesture action recognition is needed.
[0100] Static movements typically refer to passive physical movements performed while the body is relatively still. Dynamic movements refer to actions performed by the body during motion. These movements usually involve changes in the body's position, posture, speed, and force.
[0101] The motion detection feature vector is used to detect motion based on dynamic and static motion characteristics.
[0102] Furthermore, based on the characteristics of static actions, such as subtle changes in action or actions depending on a certain medium, the action detection feature vector is used for detection. The "certain medium" includes objects such as massage props and walls.
[0103] Furthermore, the action detection feature vector is used to detect dynamic actions based on their characteristics, such as the action being completed independently by the human target without relying on an external medium, and the action having a large range of change.
[0104] If the action detection feature vector passes the dynamic action detection test, then the corresponding action point cloud data and stable point cloud data are considered to meet the dynamic action condition, meaning the target person's action is a dynamic action. Optionally, body action recognition and gesture action recognition are performed based on the gesture recognition feature vector, including:
[0105] The gesture recognition feature vector is input into the action recognition model pre-trained based on a deep neural network to perform body action recognition and gesture action recognition, and the body action recognition result and gesture action recognition result output by the action recognition model are obtained.
[0106] The gesture recognition feature vector is input into the action recognition model, which then recognizes the gesture recognition feature vector to obtain the body action recognition result and gesture action recognition result output by the action recognition model.
[0107] Furthermore, before performing action recognition, the gesture recognition feature vector is first resampled.
[0108] Furthermore, the action recognition model can be pre-trained based on gesture feature vectors and their corresponding action results, i.e., whether the gesture feature vectors represent body movements or hand gestures. Specifically, the gesture feature vectors and their corresponding action results are input into the action recognition model. Based on the action results output by the action recognition model, the accuracy of the action recognition model is determined. The model is then iteratively trained until its accuracy is greater than or equal to a preset accuracy threshold, at which point training stops.
[0109] Furthermore, the action recognition model can employ a Long Short-Term Memory (LSTM) neural network. An LSTM neural network is a commonly used neural network model for processing sequential data. It is an improved form of Recurrent Neural Network (RNN) designed to address the vanishing and exploding gradient problems. LSTM units control the flow of information through gating mechanisms and therefore handle long sequences better. An LSTM unit mainly consists of three gates: an input gate, a forget gate, and an output gate. These gates determine whether to retain or update the cell state based on the input signal and historical information. For example, after the gesture recognition feature vector is resampled, its feature sequence first passes through an LSTM layer with 64 hidden neurons and then through a fully connected softmax layer, whose output is the recognition result. During network training, dropout is added between the LSTM layer and the fully connected layer to prevent overfitting.
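For reference, the gating described above is commonly written as the following update equations. This is the standard textbook LSTM formulation and is given only as an illustration; the source does not state which LSTM variant is used. Here $x_t$ is assumed to be the resampled gesture recognition feature vector at time step $t$, $h_t \in \mathbb{R}^{64}$ is the hidden state, $c_t$ is the cell state, $T$ is the last time step, and the weight names $W$, $U$, $b$ are notation introduced for this sketch.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t) \\
\hat{y} &= \operatorname{softmax}(W_y h_T + b_y) && \text{(fully connected softmax layer)}
\end{aligned}
```

The forget gate $f_t$ decides how much of the previous cell state is retained, and the input gate $i_t$ decides how much of the candidate state is written, which matches the "retain or update" behaviour described in the text.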
[0110] For example, as shown in Figure 2, clutter suppression is first performed on the received radar echo data. The angle of the clutter-suppressed data is then estimated using a 2D-MUSIC algorithm to obtain a two-dimensional point cloud, and CFAR is used to detect targets on the two-dimensional point cloud. Velocity estimation is performed on the three-dimensional point cloud obtained after target detection, and the resulting four-dimensional point cloud is transformed to obtain human target point cloud data. A human tracking body is then established based on the human target point cloud data, and a target human body spatial cube is constructed based on the human tracking body. The target human body spatial cube is used to determine the action point cloud data and the stable point cloud data. Specifically, abrupt-change (mutation) detection is performed on the target human body spatial cube to separate stable and abrupt signals, yielding the action point cloud data and the stable point cloud data. Velocity and position features are extracted from the action point cloud data and the stable point cloud data to generate the action detection feature vector and the gesture recognition feature vector. Static action detection is then performed on these feature vectors. The gesture recognition feature vector that passes static action detection is input into an LSTM neural network to determine the type of action to which it belongs.
[0111] For example, in a gesture extraction scenario, the radar is side-mounted on a wall at a height of 1.35m, with a radar frame rate of 40Hz and a distance resolution of 2.37cm. Multiple sets of sample actions from multiple human targets are collected for normal gestures and different body movements (such as squatting, bending over, picking up objects, leaning forward, walking randomly, and performing actions with one's back to the radar), and the action recognition model is trained. Multiple sets of gesture actions and multiple sets of body movements are then tested on the test human targets. Test results show that the technical solution of this embodiment achieves a 99.2% accuracy rate in detecting gesture actions, and a 98% accuracy rate in detecting gesture actions within body movements.
[0112] In the technical solution of this invention, the human tracking body is determined from the human target point cloud data; the human tracking body is then evaluated to obtain the human target point cloud data of each frame that meets the human target action condition, and the action point cloud data and stable point cloud data are determined from that per-frame data. The action detection feature vector and the gesture recognition feature vector are determined from the action point cloud data and the stable point cloud data. If the action detection feature vector indicates that the dynamic action condition is met, body action recognition and gesture action recognition are performed based on the gesture recognition feature vector. This method works when a human body and a gesture are both present within the detection area and the person moves freely within that area. It separates the gesture target from the human body target and still recognizes normal gesture actions under interference from body actions, which greatly reduces action misjudgments in gesture recognition and improves its overall accuracy and reliability. In addition, the method is non-contact, does not infringe on personal privacy, is unaffected by light or dust, and is small, easy to integrate, low in power consumption, and embeddable in a device.
[0113] Example 2
[0114] Figure 3 is a schematic diagram of the structure of a gesture recognition device provided in Embodiment 2 of the present invention. As shown in Figure 3, the device includes:
[0115] Human body tracking object determination module 210: used to determine human target point cloud data, and determine human body tracking object based on the human target point cloud data;
[0116] Point cloud data separation module 220: If it is determined from the human body tracking body that the human target action conditions are met, then based on the human target point cloud data of each frame that meets the human target action conditions, determine the action point cloud data and the stable point cloud data.
[0117] Feature vector determination module 230: used to determine the action detection feature vector and the gesture recognition feature vector based on the action point cloud data and the stable point cloud data;
[0118] Gesture recognition module 240: If it is determined from the action detection feature vector that the dynamic action condition is met, then body action recognition and gesture recognition are performed based on the gesture recognition feature vector.
[0119] Optionally, the point cloud data separation module 220 includes:
[0120] Target Human Body Tracking Entity Determination Unit: Used to determine target human body tracking entities that meet the first preset distance condition with the radar;
[0121] Target human body spatial cube determination unit: used to determine the target human body spatial cube based on the target human body tracking body;
[0122] The target human body space cube includes all human target point cloud data of the target human body tracking body;
[0123] Human target action condition determination unit: If the side length of the target human spatial cube in the current frame is greater than or equal to the side length threshold, then the current frame satisfies the human target action condition.
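The cube construction and side-length check performed by these units can be sketched as below. This is a minimal illustration under stated assumptions: the cube is taken to be the axis-aligned bounding box of the tracked points, and the side compared with the threshold is assumed to be the one along the radar-facing axis (index 1 here). The source specifies neither choice, and the function names and threshold value are hypothetical.

```python
def bounding_cube(points):
    """Return (mins, sides) of the axis-aligned box enclosing all points.

    points: non-empty list of (x, y, z) tuples belonging to one tracking body.
    """
    mins = tuple(min(p[i] for p in points) for i in range(3))
    maxs = tuple(max(p[i] for p in points) for i in range(3))
    sides = tuple(maxs[i] - mins[i] for i in range(3))
    return mins, sides


def meets_action_condition(points, side_threshold, axis=1):
    """True if the cube side along `axis` reaches the threshold.

    Assumption: axis 1 points toward the radar, so an extended arm
    lengthens this side of the cube.
    """
    _, sides = bounding_cube(points)
    return sides[axis] >= side_threshold


if __name__ == "__main__":
    standing = [(0.0, 2.0, 0.0), (0.4, 2.25, 1.7)]
    reaching = standing + [(0.2, 1.5, 1.2)]  # hand extended toward the radar
    print(meets_action_condition(standing, 0.5))
    print(meets_action_condition(reaching, 0.5))
```

Because the box encloses every point of the tracking body, a single extended hand is enough to grow the relevant side past the threshold.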
[0124] Optionally, the point cloud data separation module 220 includes:
[0125] Target human body space cube determination unit for each frame: used to determine the target human body space cube for each frame that meets the human body target action conditions;
[0126] Noise reduction unit: If the difference between the side length of the target human space cube corresponding to the target frame and the side length of the target human space cube corresponding to the adjacent frame of the target frame is greater than or equal to a preset difference threshold, then the target human space cube corresponding to the target frame will be deleted.
[0127] Point cloud data determination unit: used to determine motion point cloud data and stable point cloud data based on the human target point cloud data corresponding to the other target human space cubes after deletion.
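The noise reduction step above can be illustrated with a short filter over the per-frame cube side lengths. The text does not say which adjacent frame is compared, so this sketch assumes a frame is deleted only when its side length differs from every existing neighbour by at least the threshold, which removes isolated spikes while keeping the first frame of a sustained change. The function name and sample values are hypothetical.

```python
def drop_abrupt_frames(side_lengths, diff_threshold):
    """Return the indices of the frames that are kept.

    side_lengths: cube side length per frame, in frame order.
    Assumption: a frame is deleted only if it differs from all of its
    adjacent frames by at least diff_threshold.
    """
    kept = []
    last = len(side_lengths) - 1
    for i, side in enumerate(side_lengths):
        neighbours = [side_lengths[j] for j in (i - 1, i + 1) if 0 <= j <= last]
        is_spike = bool(neighbours) and all(
            abs(side - n) >= diff_threshold for n in neighbours
        )
        if not is_spike:
            kept.append(i)
    return kept


if __name__ == "__main__":
    sides = [0.5, 0.5, 1.5, 0.5, 0.5]  # frame 2 is an isolated spike
    print(drop_abrupt_frames(sides, 0.5))
```

Comparing against both neighbours is a design choice: comparing against only one would also delete the valid frames on either side of a spike.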
[0128] Optionally, the point cloud data separation module 220 includes:
[0129] Height-layer human target point cloud data determination unit: used to determine the human target point cloud data at each height layer that meets the second preset distance condition with the radar;
[0130] Action point cloud data determination unit: used to determine, among the human target point cloud data at the target height layer, the human target point cloud data that meets the third preset distance condition with the radar as action point cloud data;
[0131] Stable point cloud data determination unit: used to take the remaining human target point cloud data, excluding the action point cloud data, as stable point cloud data.
[0132] Optionally, the feature vector determination module 230 includes:
[0133] Action detection feature vector determination unit: used to determine the action detection feature vector using the following formula: $F_1 = [\max(v_{gx}), \max(v_{gz}), \min(v_{gx}), \min(v_{gz}), (v_{bx} - v_{gx}), (v_{bz} - v_{gz})]$;
[0134] Where $F_1$ represents the action detection feature vector, $v_{gx}$ represents the projection of the velocity of the action point cloud data onto the x-axis, $v_{gz}$ represents the projection of the velocity of the action point cloud data onto the z-axis, $v_{bx}$ represents the projection of the velocity of the stable point cloud data onto the x-axis, and $v_{bz}$ represents the projection of the velocity of the stable point cloud data onto the z-axis.
[0135] Gesture recognition feature vector determination unit: used to determine the gesture recognition feature vector using the following formula: $F_2 = [v_{bx}, v_{bz}, (v_b - v_g), (H_g - H_b)]$;
[0136] Where $F_2$ represents the gesture recognition feature vector, $v_b$ represents the velocity of the stable point cloud data, $v_g$ represents the velocity of the action point cloud data, $H_g$ represents the height of the action point cloud data, and $H_b$ represents the height of the stable point cloud data.
[0137] Optionally, the gesture recognition module 240 includes:
[0138] Dynamic motion detection unit: used to perform dynamic motion detection based on the motion detection feature vector;
[0139] Dynamic action condition judgment unit: If it is determined that the action detection feature vector can pass dynamic action detection, then it is determined that the dynamic action condition is met.
[0140] Optionally, the gesture recognition module 240 includes:
[0141] Recognition Result Determination Unit: Used to input the gesture recognition feature vector into the action recognition model pre-trained based on a deep neural network, perform body action recognition and gesture action recognition, and obtain the body action recognition result and gesture action recognition result output by the action recognition model.
[0142] The gesture recognition device provided in this embodiment of the invention can execute the gesture recognition method provided in any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
[0143] Example 3
[0144] Figure 4 is a schematic diagram of an electronic device provided in Embodiment 3 of the present invention. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices (such as helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the invention described and/or claimed herein.
[0145] As shown in Figure 4, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor. The processor 11 can perform various appropriate actions and processes based on the computer program stored in the ROM 12 or loaded from storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
[0146] Multiple components in electronic device 10 are connected to I / O interface 15, including: input unit 16, such as keyboard, mouse, etc.; output unit 17, such as various types of displays, speakers, etc.; storage unit 18, such as disk, optical disk, etc.; and communication unit 19, such as network card, modem, wireless transceiver, etc. Communication unit 19 allows electronic device 10 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0147] Processor 11 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. Processor 11 performs the various methods and processes described above, such as gesture recognition methods.
[0148] In some embodiments, the gesture recognition method may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and / or installed on electronic device 10 via ROM 12 and / or communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the gesture recognition method described above may be performed. Alternatively, in other embodiments, processor 11 may be configured to perform the gesture recognition method by any other suitable means (e.g., by means of firmware).
[0149] Various embodiments of the systems and techniques described above can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0150] Computer programs used to implement the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the computer programs cause the functions / operations specified in the flowcharts and / or block diagrams to be performed. The computer programs may be executed entirely on a machine, partially on a machine, or as a standalone software package, partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0151] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination thereof. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
[0152] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0153] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or computing systems that include middleware components (e.g., application servers), or computing systems that include frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
[0154] A computing system can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service system to address the shortcomings of traditional physical hosts and VPS services, such as high management difficulty and weak business scalability.
[0155] It should be understood that the various forms of processes shown above can be used, with steps reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this invention can be achieved, and this is not limited herein.
[0156] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.
Claims
1. A gesture action recognition method, characterized in that the method includes:
determining human target point cloud data, and determining a human tracking body based on the human target point cloud data;
if it is determined, based on the human tracking body, that the human target action condition is met, determining action point cloud data and stable point cloud data based on the human target point cloud data of each frame that meets the human target action condition;
determining an action detection feature vector and a gesture recognition feature vector based on the action point cloud data and the stable point cloud data;
if it is determined, based on the action detection feature vector, that the dynamic action condition is met, performing body action recognition and gesture action recognition based on the gesture recognition feature vector;
wherein determining, based on the human tracking body, that the human target action condition is met includes:
determining a target human tracking body that meets the first preset distance condition with the radar;
determining a target human body spatial cube based on the target human tracking body, the target human body spatial cube including all human target point cloud data of the target human tracking body;
if it is determined that the side length of the target human body spatial cube in the current frame is greater than or equal to the side length threshold, determining that the current frame meets the human target action condition;
wherein determining the action point cloud data and the stable point cloud data based on the human target point cloud data of each frame that meets the human target action condition includes:
determining the target human body spatial cube for each frame that meets the human target action condition;
if the difference between the side length of the target human body spatial cube corresponding to a target frame and the side length of the target human body spatial cube corresponding to the adjacent frame of the target frame is greater than or equal to a preset difference threshold, deleting the target human body spatial cube corresponding to the target frame;
determining the action point cloud data and the stable point cloud data based on the human target point cloud data corresponding to the target human body spatial cubes remaining after deletion;
wherein determining the action detection feature vector and the gesture recognition feature vector based on the action point cloud data and the stable point cloud data includes:
determining the action detection feature vector using the following formula: $F_1 = [\max(v_{gx}), \max(v_{gz}), \min(v_{gx}), \min(v_{gz}), (v_{bx} - v_{gx}), (v_{bz} - v_{gz})]$;
where $F_1$ represents the action detection feature vector, $v_{gx}$ represents the projection of the velocity of the action point cloud data onto the x-axis, $v_{gz}$ represents the projection of the velocity of the action point cloud data onto the z-axis, $v_{bx}$ represents the projection of the velocity of the stable point cloud data onto the x-axis, and $v_{bz}$ represents the projection of the velocity of the stable point cloud data onto the z-axis;
determining the gesture recognition feature vector using the following formula: $F_2 = [v_{bx}, v_{bz}, (v_b - v_g), (H_g - H_b)]$;
where $F_2$ represents the gesture recognition feature vector, $v_b$ represents the velocity of the stable point cloud data, $v_g$ represents the velocity of the action point cloud data, $H_g$ represents the height of the action point cloud data, and $H_b$ represents the height of the stable point cloud data.
2. The method according to claim 1, characterized in that determining the action point cloud data and the stable point cloud data includes: determining the human target point cloud data at each height layer that meets the second preset distance condition with the radar; for the human target point cloud data at a target height layer, determining the human target point cloud data that meets the third preset distance condition with the radar as action point cloud data; and taking the human target point cloud data other than the action point cloud data as stable point cloud data.
3. The method according to claim 1, characterized in that determining that the dynamic action condition is met based on the action detection feature vector includes: performing dynamic action detection based on the action detection feature vector; and if it is determined that the action detection feature vector passes dynamic action detection, determining that the dynamic action condition is met.
4. The method according to claim 1, characterized in that performing body action recognition and gesture action recognition based on the gesture recognition feature vector includes: inputting the gesture recognition feature vector into an action recognition model pre-trained based on a deep neural network to perform body action recognition and gesture action recognition, and obtaining the body action recognition result and the gesture action recognition result output by the action recognition model.
5. A gesture recognition device, characterized in that the device includes:
a human tracking body determination module, used to determine human target point cloud data and determine a human tracking body based on the human target point cloud data;
a point cloud data separation module, used to determine, if it is determined based on the human tracking body that the human target action condition is met, action point cloud data and stable point cloud data based on the human target point cloud data of each frame that meets the human target action condition;
a feature vector determination module, used to determine an action detection feature vector and a gesture recognition feature vector based on the action point cloud data and the stable point cloud data;
a gesture recognition module, used to perform body action recognition and gesture action recognition based on the gesture recognition feature vector if it is determined from the action detection feature vector that the dynamic action condition is met;
wherein the point cloud data separation module includes:
a target human tracking body determination unit, used to determine a target human tracking body that meets the first preset distance condition with the radar;
a target human body spatial cube determination unit, used to determine a target human body spatial cube based on the target human tracking body, the target human body spatial cube including all human target point cloud data of the target human tracking body;
a human target action condition determination unit, used to determine that the current frame meets the human target action condition if the side length of the target human body spatial cube in the current frame is greater than or equal to the side length threshold;
wherein the point cloud data separation module further includes:
a per-frame target human body spatial cube determination unit, used to determine the target human body spatial cube for each frame that meets the human target action condition;
a noise reduction unit, used to delete the target human body spatial cube corresponding to a target frame if the difference between the side length of the target human body spatial cube corresponding to the target frame and the side length of the target human body spatial cube corresponding to the adjacent frame of the target frame is greater than or equal to a preset difference threshold;
a point cloud data determination unit, used to determine the action point cloud data and the stable point cloud data based on the human target point cloud data corresponding to the target human body spatial cubes remaining after deletion;
wherein the feature vector determination module includes:
an action detection feature vector determination unit, used to determine the action detection feature vector using the following formula: $F_1 = [\max(v_{gx}), \max(v_{gz}), \min(v_{gx}), \min(v_{gz}), (v_{bx} - v_{gx}), (v_{bz} - v_{gz})]$;
where $F_1$ represents the action detection feature vector, $v_{gx}$ represents the projection of the velocity of the action point cloud data onto the x-axis, $v_{gz}$ represents the projection of the velocity of the action point cloud data onto the z-axis, $v_{bx}$ represents the projection of the velocity of the stable point cloud data onto the x-axis, and $v_{bz}$ represents the projection of the velocity of the stable point cloud data onto the z-axis;
a gesture recognition feature vector determination unit, used to determine the gesture recognition feature vector using the following formula: $F_2 = [v_{bx}, v_{bz}, (v_b - v_g), (H_g - H_b)]$;
where $F_2$ represents the gesture recognition feature vector, $v_b$ represents the velocity of the stable point cloud data, $v_g$ represents the velocity of the action point cloud data, $H_g$ represents the height of the action point cloud data, and $H_b$ represents the height of the stable point cloud data.
6. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the gesture recognition method as described in any one of claims 1-4.
7. A storage medium for storing computer-executable instructions, characterized in that, The computer-executable instructions, when executed by a computer processor, are used to perform the gesture recognition method as described in any one of claims 1-4.