A human skeleton estimation method, system and product based on radar

By generating a four-dimensional point cloud of the target and dividing it into human body topological point cloud blocks, the method addresses the limited accuracy and robustness of radar-based human pose estimation. It improves the accuracy and robustness of human pose prediction while operating stably around the clock in all weather conditions, making it suitable for locations with high security and privacy requirements.

CN121817844B · Active · Publication Date: 2026-06-19 · SHENZHEN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENZHEN UNIV
Filing Date
2026-03-11
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing radar-based human pose estimation methods have poor accuracy and robustness, especially in millimeter-wave radar point clouds which are naturally sparse, noisy, and have discontinuous shapes in local areas, making it difficult to construct implicit human topological priors and associate different joint regions.

Method used

Target detection is performed on radar echo data to generate a four-dimensional point cloud of the target. The point cloud is divided into human body topology point cloud blocks based on the similarity of the points' velocity directions, and the blocks are input into a human skeleton estimation network to estimate the human skeleton. This constructs a human body topology prior and enhances the model's ability to model the spatial relationships between key body parts.

Benefits of technology

It achieves improved accuracy and robustness in human posture prediction during stable operation around the clock, is suitable for places with high security and privacy requirements, adapts to diverse and natural daily human behaviors, and has privacy-friendly characteristics.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN121817844B_ABST

Abstract

This application provides a radar-based human skeleton estimation method, system, and product. The method includes: performing target detection on acquired radar echo data and generating a four-dimensional point cloud of the target containing distance, velocity, and angle information based on the target detection results; dividing the target four-dimensional point cloud into multiple human topological point cloud blocks based on the similarity of the velocity directions of each point in the target four-dimensional point cloud; and inputting the human topological point cloud blocks into a human skeleton estimation network for human skeleton estimation to obtain the human skeleton estimation result. The human skeleton estimation method proposed in this application can construct human topological priors from sparse millimeter-wave radar point clouds, extract structure-aware point cloud block features, enhance the model's ability to model the spatial relationships of key human body parts, adapt to diverse and natural daily human behaviors in non-intrusive monitoring scenarios, achieve more generalizable human skeleton estimation, and effectively improve the accuracy and robustness of human posture prediction.

Description

Technical Field

[0001] This application relates to the field of radar technology, and in particular to a radar-based method, system, and product for estimating human skeletons.

Background Technology

[0002] With the increasing demand for routine health monitoring, abnormal behavior recognition, and fall warning, continuous and natural perception and analysis of human posture and movement in the home environment can enable early identification of risks such as declining mobility and gait abnormalities, and provide timely warnings and auxiliary responses in emergencies. However, due to limitations in professional care costs and the increasing number of people living alone, there is an urgent need for a long-term monitoring technology that can ensure safety while protecting privacy. Among these technologies, human skeleton estimation, as a crucial foundational technology for describing human posture and movement, has significant application value in scenarios such as behavioral understanding, health assessment, and safety monitoring.

[0003] Currently, there are many types of sensors in use, mainly including: wearable devices, which collect data from inertial measurement units and other sources by wearing sensors on specific parts of the human body to infer the human skeleton, but these rely on long-term wear by the user and are susceptible to calibration errors and drift; visual and infrared sensors, which capture images of the human appearance or thermal distribution information and perform key point detection and skeleton reconstruction, but are highly sensitive to lighting, occlusion, and environmental conditions, and pose privacy risks; Wi-Fi radio frequency devices, which estimate the human skeleton by mapping wireless channel information to human spatial structure information, but are limited by the bandwidth and spatial resolution of the communication system itself, and are susceptible to multipath effects and environmental changes; and millimeter-wave radar, which transmits millimeter waves and receives their reflected signals, using distance, Doppler, and angle information to obtain the distribution of human body scattering points, and has advantages such as non-contact, anti-occlusion, and privacy-friendly. However, millimeter-wave radar point clouds are naturally sparse, noisy, and have discontinuous shapes in local areas, lacking explicit geometric and topological connections between body parts, making it difficult to construct implicit human topological priors without explicit skeleton labels and to use point cloud blocks to model the relationships between different joint regions of the human body topology.

Summary of the Invention

[0004] The purpose of this application is to provide a radar-based human skeleton estimation method, system, and product, which can at least solve the problems of poor accuracy and robustness of radar-based human pose estimation methods in related technologies.

[0005] To address the aforementioned technical problems, the first aspect of this application provides a radar-based human skeleton estimation method, comprising:

[0006] Target detection is performed on the collected radar echo data, and a four-dimensional point cloud of the target containing distance, velocity and angle information is generated based on the target detection results.

[0007] Based on the similarity of the velocity directions of each point in the target four-dimensional point cloud, the target four-dimensional point cloud is divided into multiple human body topology point cloud blocks;

[0008] The human body topology point cloud blocks are input into the human skeleton estimation network to perform human skeleton estimation and obtain the human skeleton estimation result.

[0009] A second aspect of this application provides a radar-based human skeleton estimation system, comprising:

[0010] The point cloud generation module is used to perform target detection on the collected radar echo data and generate a four-dimensional point cloud of the target containing distance, velocity and angle information based on the target detection results.

[0011] The point cloud partitioning module is used to divide the target four-dimensional point cloud into multiple human body topology point cloud blocks based on the similarity of the velocity directions of each point in the target four-dimensional point cloud.

[0012] The skeleton estimation module is used to input the human body topology point cloud blocks into the human body skeleton estimation network to perform human body skeleton estimation and obtain the human body skeleton estimation result.

[0013] A third aspect of this application provides a radar, including a memory and a processor, wherein the processor is configured to execute a computer program stored in the memory, and when the processor executes the computer program, it implements the steps in the human skeleton estimation method described in the first aspect of the embodiments of this application.

[0014] The fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the human skeleton estimation method described in the first aspect of the embodiments of this application.

[0015] As can be seen from the above, the embodiments of this application first perform target detection on the collected radar echo data, and generate a target four-dimensional point cloud containing distance information, velocity information and angle information based on the target detection results. Then, based on the similarity of the velocity direction of each point in the target four-dimensional point cloud, the target four-dimensional point cloud is divided into multiple human body topology point cloud blocks. Finally, the human body topology point cloud blocks are input into the human skeleton estimation network to perform human skeleton estimation and obtain the human skeleton estimation result. Compared with human skeleton estimation methods based on other sensors, the proposed method is privacy-friendly and unaffected by lighting and weather conditions, enabling stable operation around the clock and making it suitable for environments with high security and privacy requirements. Furthermore, this method can construct a human topological prior from sparse millimeter-wave radar point clouds, extracting structure-aware point cloud block features to enhance the model's ability to model the spatial relationships of key human body parts. Moreover, whereas existing methods often rely on datasets with limited behavior types and scale, this method can adapt to diverse and natural daily human behaviors in non-intrusive monitoring scenarios, achieving more generalizable human skeleton estimation and effectively improving the accuracy and robustness of human posture prediction.

[0016] It should be understood that the description in this section is not intended to identify key or important features of this application, nor is it intended to limit the scope of this application. Other features of this application will become readily apparent from the following description.

Attached Figure Description

[0017] To more clearly illustrate the related technologies or the technical solutions in the embodiments of this application, the drawings used in the description of the related technologies or the embodiments of this application will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and not all embodiments. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a flowchart of the human skeleton estimation method provided in an embodiment of this application;

[0019] Figure 2 is a ground-truth skeleton image captured by a depth camera while a human body performs a kicking motion, in the human skeleton estimation method provided in an embodiment of this application;

[0020] Figure 3 is a diagram of the three-dimensional point cloud clustering result when a human body performs a kicking motion, in the human skeleton estimation method provided in an embodiment of this application;

[0021] Figure 4 is a ground-truth skeleton image captured by a depth camera while a human body raises both hands, in the human skeleton estimation method provided in an embodiment of this application;

[0022] Figure 5 is a diagram of the three-dimensional point cloud clustering result when a human body performs a hand-raising action, in the human skeleton estimation method provided in an embodiment of this application;

[0023] Figure 6 is a schematic diagram illustrating the concept of the loss constraints in the human skeleton estimation method provided in an embodiment of this application;

[0024] Figure 7 is a schematic diagram of the human skeleton estimation network framework in the human skeleton estimation method provided in an embodiment of this application;

[0025] Figure 8 is a detailed flowchart of the human skeleton estimation method provided in an embodiment of this application;

[0026] Figure 9 is a schematic diagram of the program modules of the human skeleton estimation system provided in an embodiment of this application;

[0027] Figure 10 is a block diagram of a radar provided in an embodiment of this application;

[0028] Figure 11 is a block diagram of a computer-readable storage medium provided in an embodiment of this application.

Detailed Implementation

[0029] To make the objectives, technical solutions, and advantages of this application more apparent and understandable, this application will be clearly and completely described below in conjunction with its embodiments and accompanying drawings. Throughout, the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions. It should be understood that the various embodiments of this application described below are merely illustrative of this application and are not intended to limit this application. That is, all other embodiments obtained by those skilled in the art based on the various embodiments of this application without creative effort are within the scope of protection of this application. Furthermore, the technical features involved in the various embodiments of this application described below can be combined with each other as long as they do not conflict with each other.

[0030] With the increasing demand for routine health monitoring, abnormal behavior recognition, and fall warning, continuous and natural perception and analysis of human posture and movement in the home environment can enable early identification of risks such as declining mobility and gait abnormalities, and provide timely warnings and auxiliary responses in emergencies. However, due to limitations in professional care costs and the increasing number of people living alone, there is an urgent need for a long-term monitoring technology that can ensure safety while protecting privacy. Among these technologies, human skeleton estimation, as a crucial foundational technology for describing human posture and movement, has significant application value in scenarios such as behavioral understanding, health assessment, and safety monitoring.

[0031] Currently, there are many types of sensors in use, mainly including: wearable devices, which collect data from inertial measurement units and other sources by wearing sensors on specific parts of the human body to infer the human skeleton, but these rely on long-term wear by the user and are susceptible to calibration errors and drift; visual and infrared sensors, which capture images of the human appearance or thermal distribution information and perform key point detection and skeleton reconstruction, but are highly sensitive to lighting, occlusion, and environmental conditions, and pose privacy risks; Wi-Fi radio frequency devices, which estimate the human skeleton by mapping wireless channel information to human spatial structure information, but are limited by the bandwidth and spatial resolution of the communication system itself, and are susceptible to multipath effects and environmental changes; and millimeter-wave radar, which transmits millimeter waves and receives their reflected signals, using distance, Doppler, and angle information to obtain the distribution of human body scattering points, and has advantages such as non-contact, anti-occlusion, and privacy-friendly. However, millimeter-wave radar point clouds are naturally sparse, noisy, and have discontinuous shapes in local areas, lacking explicit geometric and topological connections between body parts, making it difficult to construct implicit human topological priors without explicit skeleton labels and to use point cloud blocks to model the relationships between different joint regions of the human body topology.

[0032] In summary, this application provides a method for estimating the human skeleton. Referring to Figure 1, which is a flowchart of the radar-based human skeleton estimation method provided in an embodiment of this application, the human skeleton estimation method includes the following steps 101 to 103.

[0033] Step 101: Detect targets in the collected radar echo data and generate a four-dimensional point cloud of the target containing distance, velocity and angle information based on the target detection results.

[0034] First, radar echo data is collected and target detection is performed. Effective signal components related to the human target are then selected, and a four-dimensional point cloud of the target is generated based on this detection result. The four-dimensional point cloud contains at least distance, velocity, and angle information, comprehensively depicting the position and motion characteristics of the human target in space.

[0035] In some embodiments of this application, before performing target detection on the collected radar echo data, the method further includes: performing signal preprocessing on the collected radar echo signal to obtain a multidimensional echo signal containing information in the range dimension, velocity dimension, and angle dimension.

[0036] Specifically, a multi-channel millimeter-wave radar continuously transmits electromagnetic waves into the monitoring space and receives the echo signals reflected from objects. After signal preprocessing such as amplification, mixing, and analog-to-digital conversion, a multidimensional echo signal containing range, velocity, and angle information is obtained. Power spectrum analysis is then performed on the multidimensional echo signal, and a target detection algorithm such as constant false alarm rate (CFAR) detection is used to determine whether a target is present and to obtain parameters such as its range and velocity. Angle and distance information is then obtained for each detected valid target point.

[0037] Furthermore, in some embodiments of this application, preprocessing the acquired radar echo signal to obtain a multidimensional echo signal containing range, velocity, and angle dimensions includes: preprocessing the acquired raw radar echo signal to obtain a first discrete echo signal $s_1(n, m, k)$, where $n$ is the fast-time dimension, denoting the $n$-th sampling point; $m$ is the slow-time dimension, denoting the $m$-th linear frequency modulated continuous wave (chirp) signal; and $k$ is the antenna dimension, denoting the received signal of the $k$-th channel; performing a fast Fourier transform (FFT) on the first discrete echo signal along the fast-time dimension to obtain a second discrete echo signal $s_2(r, m, k)$, which is a range-slow time-antenna signal, where $r$ denotes range cell sampling; performing range-dimension clutter suppression and coherent accumulation on the second discrete echo signal to obtain an initial multidimensional echo signal $s_3(r, m, k)$; and performing an FFT on the initial multidimensional echo signal along the slow-time dimension to obtain the final multidimensional echo signal $S(r, d, k)$, which is a range-Doppler-antenna signal, where $d$ denotes Doppler cell sampling.

[0038] Specifically, the radar first continuously transmits electromagnetic wave signals into space, and the signals reflected by objects are received by the radar receiver. After amplification, mixing, and sampling by the ADC, a discrete echo signal with fast-time, slow-time, and antenna dimensions (which carry the range, velocity, and angle information, respectively) is obtained; this is the first discrete echo signal.

[0039] The first discrete echo signal can be represented as $s_1(n, m, k)$, where $n$ is the fast-time dimension, denoting the $n$-th sampling point; $m$ is the slow-time dimension, denoting the $m$-th chirp; and $k$ is the antenna dimension, denoting the received signal of the $k$-th channel.

[0040] Then, a fast Fourier transform is performed on the first discrete echo signal along the fast-time dimension, converting the signal from the time domain to the frequency domain and yielding the second discrete echo signal. The second discrete echo signal, also called the range-slow time-antenna signal, can be represented as $s_2(r, m, k)$, where $r$ denotes range cell sampling.

[0041] Furthermore, range-dimension clutter suppression and coherent accumulation are performed on the second discrete echo signal. Environmental noise and non-target echo interference are filtered out using methods such as moving-average algorithms or time-domain accumulation, yielding the multidimensional signal after clutter suppression and coherent accumulation, i.e., the initial multidimensional echo signal, which can be expressed as $s_3(r, m, k)$.

[0042] Finally, a fast Fourier transform is performed on the initial multidimensional echo signal along the slow-time dimension to extract the target velocity information, yielding the final multidimensional echo signal, i.e., the range-Doppler-antenna signal, which can be expressed as $S(r, d, k)$, where $d$ denotes Doppler cell sampling.
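The processing chain in [0038] to [0042] can be sketched with NumPy as below. This is a minimal illustration under stated assumptions, not the application's implementation: the raw cube is assumed to be indexed as (fast time, slow time, antenna), windowing is omitted, and clutter suppression is approximated by subtracting the slow-time mean of each range cell (a common static-clutter filter chosen here for illustration). The slow-time FFT itself acts as the coherent accumulation across chirps. The function name `range_doppler_cube` is hypothetical.

```python
import numpy as np


def range_doppler_cube(raw: np.ndarray) -> np.ndarray:
    """Map a raw cube raw[n, m, k] (fast time, slow time, antenna) to S[r, d, k].

    Illustrative sketch: no windowing, and clutter suppression is approximated
    by slow-time mean subtraction (an assumption, not the patented method).
    """
    # Fast-time FFT: each fast-time sample vector becomes a range profile.
    s2 = np.fft.fft(raw, axis=0)
    # Static clutter suppression: remove the slow-time mean of every
    # (range cell, antenna) pair, which cancels stationary (zero-Doppler) echoes.
    s3 = s2 - s2.mean(axis=1, keepdims=True)
    # Slow-time FFT: coherent integration across chirps gives the Doppler axis;
    # fftshift places zero Doppler at index M // 2.
    return np.fft.fftshift(np.fft.fft(s3, axis=1), axes=1)
```

For a simulated mover in range bin 5 with 3 Doppler cycles over 32 chirps, the peak of the antenna-summed power lands at range index 5 and shifted Doppler index 3 + 32 // 2 = 19, while a stationary reflector is cancelled by the mean subtraction.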

[0043] It is understood that the multi-channel millimeter-wave radar selected in the embodiments of this application can be either a single-transmitter-multiple-receiver radar or a multi-transmitter-multiple-receiver radar, and this application does not make any specific limitation on it.

[0044] In this embodiment of the application, performing target detection on the collected radar echo data and generating a four-dimensional point cloud of the target containing range, velocity, and angle information based on the target detection results includes: performing fast Fourier transform (FFT) processing on single-frame radar echo data to obtain short-timescale point cloud information; performing non-uniform fast Fourier transform (NUFFT) processing on multi-frame radar echo data to obtain long-timescale point cloud information; and performing target detection based on the short-timescale and long-timescale point cloud information, and generating the four-dimensional point cloud of the target containing range, velocity, and angle information based on the target detection results.

[0045] Specifically, considering that the motion states of various skeletal joints in the human body not only differ at the same moment but also exhibit significant dynamic changes at different time scales, if the same observation time window and coherent integration strategy are used for point cloud generation for both fast-moving and slow-moving joints, it is easy to cause Doppler information broadening of high-speed moving joints or feature submersion of low-speed moving joints, resulting in the loss of information for some key joints. Therefore, this application introduces a multi-time-scale point cloud generation mechanism.
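As a rough worked illustration of this trade-off (the 77 GHz carrier, 128 chirps of 50 µs, 0.4 s multi-frame span, and 10 m/s² joint acceleration below are assumed values, not parameters of this application), the standard FMCW velocity resolution is set by the observation time $T_{\text{obs}}$:

```latex
\begin{aligned}
\Delta v &= \frac{\lambda}{2\,T_{\text{obs}}}, \qquad
\lambda = \frac{c}{f_c} \approx \frac{3\times 10^{8}}{77\times 10^{9}}\ \text{m} \approx 3.9\ \text{mm},\\
T_{\text{obs}} &= 128 \times 50\ \mu\text{s} = 6.4\ \text{ms}
  \;\Rightarrow\; \Delta v \approx \frac{3.9\times 10^{-3}}{1.28\times 10^{-2}} \approx 0.30\ \text{m/s},\\
T_{\text{obs}} &\approx 0.4\ \text{s}
  \;\Rightarrow\; \Delta v \approx \frac{3.9\times 10^{-3}}{0.8} \approx 0.005\ \text{m/s},\\
a\,T_{\text{obs}} &= 10\ \text{m/s}^2 \times 6.4\ \text{ms} = 0.064\ \text{m/s} < 0.30\ \text{m/s},
  \qquad 10\ \text{m/s}^2 \times 0.4\ \text{s} = 4\ \text{m/s} \gg 0.005\ \text{m/s}.
\end{aligned}
```

Under these assumed numbers, the short window keeps an accelerating limb inside one Doppler cell, while the long window resolves slow motion about 60 times more finely but would smear the same limb across hundreds of cells.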

[0046] First, short-timescale point cloud generation is performed. For radar echo data acquired in a single frame, down-conversion and analog-to-digital conversion produce a three-dimensional data cube, and the focus is on capturing the features of fast-moving body parts (such as arm swings and leg kicks). A fast Fourier transform (FFT) is applied directly along the Doppler dimension of this single frame. This quickly extracts point cloud information reflecting the instantaneous motion of the target and avoids the Doppler broadening that an excessively long observation window would cause for fast-moving joints, yielding the short-timescale point cloud information. Its core advantage is the accurate capture of instantaneous dynamic features.

[0047] Secondly, long-timescale point cloud generation is performed. For radar echo data acquired over multiple consecutive frames, the features of slowly moving body parts (such as trunk sway, or the lower limbs during slow walking) are easily submerged in noise, so the Doppler resolution must be increased to better characterize motion continuity. Therefore, the Doppler dimensions of the consecutive frames are fused, and a non-uniform fast Fourier transform (NUFFT) is applied for signal analysis. Because the fused slow-time samples are not uniformly spaced (typically because consecutive frames are separated by idle intervals), the NUFFT accommodates this non-uniform sampling, improving the detection of low-speed moving targets and yielding the long-timescale point cloud information.
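The two timescales can be contrasted with the sketch below, which assumes the slow-time samples of one range/antenna cell are available together with their timestamps. For clarity, the long-timescale spectrum is computed as a direct non-uniform DFT: this is the sum that an NUFFT library approximates quickly, used here as a readable stand-in and not as the application's NUFFT implementation. `short_timescale_doppler` and `long_timescale_doppler` are hypothetical names.

```python
import numpy as np


def short_timescale_doppler(slow_time: np.ndarray) -> np.ndarray:
    """Single-frame Doppler spectrum: plain FFT over uniformly spaced chirps."""
    return np.fft.fftshift(np.fft.fft(slow_time))


def long_timescale_doppler(samples: np.ndarray, times: np.ndarray,
                           velocities: np.ndarray, wavelength: float) -> np.ndarray:
    """Multi-frame Doppler spectrum from non-uniformly timed slow-time samples.

    Direct non-uniform DFT: X(v) = sum_t x(t) * exp(-j*4*pi*v*t / wavelength),
    evaluated on a chosen radial-velocity grid. An NUFFT approximates this
    same sum much faster; the O(len(velocities) * len(times)) form is kept
    here for readability.
    """
    # Doppler frequency is 2v / wavelength, so the phase at time t is 4*pi*v*t / wavelength.
    kernel = np.exp(-4j * np.pi * np.outer(velocities, times) / wavelength)
    return kernel @ samples
```

Usage, with four jittered frames of 32 chirps each (assumed timing): the long-timescale spectrum peaks at the simulated 0.12 m/s, which a single 1.6 ms frame could not separate from nearby velocities.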

[0048] Finally, target detection is performed based on the two types of point cloud information mentioned above, generating a four-dimensional point cloud of the target. Short-timescale point cloud information focuses on instantaneous motion features, while long-timescale point cloud information emphasizes high-resolution steady-state features; the two complement each other to form a comprehensive target feature representation. By jointly performing target detection on the two types of point cloud information, effective signal components related to the human target are screened out, environmental noise and clutter interference are eliminated, and finally, a four-dimensional point cloud of the target is obtained, simultaneously containing radial distance, velocity, and angle information. This four-dimensional point cloud of the target can accurately depict the dynamic changes of fast-moving joints in the human body and clearly present the spatial position of slow-moving joints. By fusing four-dimensional point clouds generated at different time scales, a multi-timescale point cloud representation that takes into account both fast and slow motion features can be obtained.

[0049] In this embodiment, performing target detection based on the short-timescale and long-timescale point cloud information and generating a four-dimensional point cloud containing range, velocity, and angle information based on the target detection results includes: performing target detection on the short-timescale and long-timescale point cloud information using a constant false alarm rate (CFAR) algorithm to obtain target detection results, wherein the target detection results include valid target points; extracting the range-Doppler index of the valid target points, wherein the range-Doppler index includes the radial range and radial velocity of the valid target points; performing beamforming on the short-timescale and long-timescale point cloud information corresponding to the range-Doppler index to extract the angle index of the valid target points, wherein the angle index includes the azimuth and elevation of the valid target points; and establishing a spatial coordinate system based on the radar installation method and obtaining the four-dimensional coordinates of the valid target points from the range-Doppler index and the angle index, thereby generating a four-dimensional point cloud containing range, velocity, and angle information.

[0050] First, valid target points are screened. For both the generated short-timescale and long-timescale point clouds, a constant false alarm rate (CFAR) detection algorithm is used for target detection. This algorithm dynamically generates a detection threshold by statistically estimating the noise power in the local neighborhood of each cell. Points whose signal power exceeds the threshold are identified as valid target points, while points below the threshold are treated as environmental noise or clutter and discarded. This operation filters a set of valid target points related to the human target from the point clouds at both timescales.

[0051] Secondly, the range-Doppler index of the effective target points is extracted. The range-Doppler index is a key parameter characterizing the core motion and position information of the effective target points. Its extraction is based on the preprocessing results of two types of point clouds: short-timescale point clouds and long-timescale point clouds are processed by range dimension FFT and Doppler dimension FFT / NUFFT to form feature maps containing distance and velocity information. The radial distance information and radial velocity information corresponding to each effective target point are directly extracted from this map, thus completing the construction of the range-Doppler index.
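Converting a range-Doppler index into physical radial range and radial velocity uses the standard FMCW resolution formulas. The sketch below assumes the Doppler axis has been `fftshift`-ed (so bin `num_chirps // 2` is 0 m/s), that `bandwidth_hz` is the bandwidth swept during ADC sampling, and that chirps are uniformly spaced within a frame; the helper names are hypothetical, and the sign of the velocity (approaching or receding) depends on the mixer and FFT conventions of the actual radar.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s


def index_to_range(range_idx: int, bandwidth_hz: float) -> float:
    """Radial range of a range bin: index times the resolution c / (2B)."""
    return range_idx * SPEED_OF_LIGHT / (2.0 * bandwidth_hz)


def index_to_velocity(doppler_idx: int, num_chirps: int,
                      chirp_period_s: float, wavelength_m: float) -> float:
    """Radial velocity of an fftshift-ed Doppler bin (bin num_chirps // 2 is 0 m/s)."""
    # Velocity resolution is wavelength / (2 * observation time).
    resolution = wavelength_m / (2.0 * num_chirps * chirp_period_s)
    return (doppler_idx - num_chirps // 2) * resolution
```

With an assumed 4 GHz sweep, range bin 10 maps to about 0.375 m; with 32 chirps of 50 µs at a 3.9 mm wavelength, shifted Doppler bin 19 maps to 3 × 1.21875 ≈ 3.66 m/s.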

[0052] Next, angle estimation is performed based on the range-Doppler index. For each valid target point, the signal corresponding to the antenna channel dimension is extracted from the 3D data cube signals of the short-timescale and long-timescale point clouds according to its range-Doppler index. The antenna channel signal is processed using a digital beamforming (DBF) algorithm. By weighted synthesis of the received signals from multiple antennas, the direction of arrival of the valid target point is estimated, thus obtaining an angle index containing azimuth and elevation information. This angle index, in conjunction with the range-Doppler index, comprehensively characterizes the spatial position and motion state of the valid target point.
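A conventional (Bartlett) digital beamformer for one valid target point can be sketched as a grid search over steering vectors, keeping the angle pair with the largest output power. The array geometry is an assumption, not taken from this application: a uniform rectangular virtual array with half-wavelength spacing along x (azimuth) and z (elevation), with azimuth measured from boresight in the horizontal plane and elevation from the horizontal. `steering` and `dbf_angle` are hypothetical names.

```python
import numpy as np


def steering(az: float, el: float, n_x: int, n_z: int) -> np.ndarray:
    """Steering vector of an n_x-by-n_z planar array with half-wavelength spacing."""
    p = np.arange(n_x)[:, None]  # element index along x (azimuth axis)
    q = np.arange(n_z)[None, :]  # element index along z (elevation axis)
    # Path difference per element is d * (direction cosine); with d = lambda/2
    # the phase is pi * (p * cos(el) * sin(az) + q * sin(el)).
    phase = np.pi * (p * np.cos(el) * np.sin(az) + q * np.sin(el))
    return np.exp(1j * phase).ravel()


def dbf_angle(snapshot: np.ndarray, n_x: int, n_z: int,
              az_grid: np.ndarray, el_grid: np.ndarray) -> tuple:
    """Return the (azimuth, elevation) grid point maximising |w^H x|^2."""
    best, best_power = (float(az_grid[0]), float(el_grid[0])), -1.0
    for az in az_grid:
        for el in el_grid:
            # np.vdot conjugates its first argument, giving the beamformer output w^H x.
            power = abs(np.vdot(steering(az, el, n_x, n_z), snapshot)) ** 2
            if power > best_power:
                best, best_power = (float(az), float(el)), power
    return best
```

Replacing the inner power computation with a MUSIC or MVDR spectrum, both named in [0056], would keep the same grid-search structure.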

[0053] Finally, a spatial coordinate system is established and a target four-dimensional point cloud is generated. Based on the radar's installation method (e.g., side-mounted on a bracket or wall), a three-dimensional spatial coordinate system adapted to the monitoring scenario is established. The range-Doppler index (radial range, radial velocity) and angle index (azimuth, elevation) of each effective target point are substituted into this coordinate system, and the four-dimensional coordinates of the effective target points are calculated through coordinate transformation. The four-dimensional coordinates of all effective target points are integrated to generate a target four-dimensional point cloud containing range, velocity, and angle information. This target four-dimensional point cloud combines the advantages of two types of time-scale point clouds, and through precise index extraction and angle estimation, the accuracy of information in each dimension is ensured.
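The coordinate transformation can be sketched as follows. The conventions are illustrative assumptions: the radar frame has y along boresight, x to the right, and z up; the radar is pitched downward by `tilt` about the x axis (for example 45° for a tilted side mount) and mounted at height `mount_height`; the fourth coordinate keeps the measured radial velocity. `Point4D` and `to_point4d` are hypothetical names.

```python
import math
from typing import NamedTuple


class Point4D(NamedTuple):
    x: float  # lateral offset (m)
    y: float  # horizontal distance along the direction the radar faces (m)
    z: float  # height above the floor (m)
    v: float  # measured radial velocity (m/s), kept as the fourth dimension


def to_point4d(rng: float, vel: float, az: float, el: float,
               mount_height: float = 0.0, tilt: float = 0.0) -> Point4D:
    """Convert (range, radial velocity, azimuth, elevation) to a world-frame 4D point.

    Angles are in radians; tilt is the downward pitch of the radar boresight.
    """
    # Spherical to Cartesian in the radar frame (y = boresight, z = radar "up").
    x_r = rng * math.cos(el) * math.sin(az)
    y_r = rng * math.cos(el) * math.cos(az)
    z_r = rng * math.sin(el)
    # Undo the downward tilt (rotation about x), then add the mounting height.
    y_w = y_r * math.cos(tilt) + z_r * math.sin(tilt)
    z_w = -y_r * math.sin(tilt) + z_r * math.cos(tilt) + mount_height
    return Point4D(x_r, y_w, z_w, vel)
```

For example, a radar mounted 2 m high and tilted down 45° sees a boresight point at range √2 m one metre ahead of it at a height of 1 m.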

[0054] Specifically, in some embodiments of this application, target detection is performed on the multidimensional echo signal, including: acquiring the Doppler power spectrum of the multidimensional echo signal; extracting peaks from the Doppler power spectrum using a constant false alarm rate algorithm; and performing target detection on the extracted peak points to obtain the target detection result, wherein the target detection result includes valid target points.

[0055] Specifically, the multidimensional echo signal contains range-dimension, velocity-dimension, and antenna-dimension information. Taking the power spectrum of the range-Doppler-antenna signal yields the Doppler power spectrum, which directly reveals whether a target is present within the radar detection range, together with its range and velocity. The peaks in the power spectrum are then extracted using a constant false alarm rate (CFAR) algorithm, such as cell-averaging CFAR (CA-CFAR) or order-statistic CFAR (OS-CFAR). By setting a detection threshold matched to the intensity of the environmental clutter, peak points corresponding to true targets are retained while noise and interference are rejected. The retained points are the valid target points, which establish the presence of the target and its preliminary position and velocity parameters.
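A one-dimensional CFAR detector along the Doppler axis illustrates the CA-CFAR and OS-CFAR variants named above. The sketch assumes its input is a real power profile for one range cell (for example, the squared magnitude of the range-Doppler-antenna signal summed over antennas, an assumed non-coherent integration). The guard and training sizes, the scale factor, the default order-statistic rank, and the name `cfar_1d` are illustrative choices, not values from this application.

```python
import numpy as np


def cfar_1d(power: np.ndarray, guard: int = 2, train: int = 8,
            scale: float = 5.0, mode: str = "ca", os_rank=None) -> np.ndarray:
    """Return a boolean mask of cells whose power exceeds scale * local noise."""
    n = len(power)
    hits = np.zeros(n, dtype=bool)
    for i in range(n):
        # Training cells on both sides of the cell under test, skipping guard cells.
        left = power[max(0, i - guard - train):max(0, i - guard)]
        right = power[i + guard + 1:i + guard + 1 + train]
        cells = np.concatenate([left, right])
        if cells.size == 0:
            continue
        if mode == "ca":
            # Cell averaging: the noise estimate is the mean training power.
            noise = cells.mean()
        else:
            # Order statistic: the k-th smallest training cell, which is less
            # biased by a neighbouring strong target than the mean.
            k = os_rank if os_rank is not None else int(0.75 * cells.size)
            noise = np.sort(cells)[min(k, cells.size - 1)]
        hits[i] = power[i] > scale * noise
    return hits
```

On a flat unit noise floor with spikes of 100 and 50 at indices 20 and 40, both variants flag exactly those two cells.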

[0056] Specifically, the range-Doppler index of each valid target point is extracted first; it corresponds to the range and velocity characteristics of the target in the multidimensional echo signal. The antenna-dimension signal at that range-Doppler index is then retrieved and beamformed, and the index of the peak of the beamformed power spectrum is taken as the angle index of the valid point. The range, velocity, angle and energy of every valid point of the target are thus known, enabling precise target localization. Commonly used beamforming methods include Digital Beamforming (DBF), Multiple Signal Classification (MUSIC), and the Minimum Variance Distortionless Response (MVDR) algorithm, also known as the Capon algorithm. Finally, a spatial coordinate system is established according to the installation method of the millimeter-wave radar; the coordinate system differs between installation methods, common ones being 45° side mounting, side mounting and top mounting. In this coordinate system, the coordinates of all valid points of the target, i.e. of all point cloud imaging points, are obtained from their range and angle information. Together these coordinate points form the target point cloud, which directly represents the spatial distribution of the human body within the radar detection range.
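For the DBF option, the angle index can be obtained by a zero-padded FFT across the antenna dimension at the detected range-Doppler cell. The sketch below assumes a uniform linear array with half-wavelength element spacing and estimates a single angle (azimuth); estimating elevation as well would require a two-dimensional array and is not shown. `dbf_azimuth` is a hypothetical name.

```python
import numpy as np

def dbf_azimuth(snapshot, n_fft=256, spacing_wl=0.5):
    """FFT beamforming on a uniform linear array.

    snapshot: complex antenna-dimension samples at one range-Doppler index.
    spacing_wl: element spacing in wavelengths (half-wavelength assumed).
    Returns (angle in degrees, peak bin used as the angle index).
    """
    spectrum = np.fft.fftshift(np.fft.fft(snapshot, n_fft))
    power = np.abs(spectrum) ** 2
    peak_bin = int(np.argmax(power))                  # angle index
    spatial_freq = (peak_bin - n_fft // 2) / n_fft    # cycles per element
    sin_theta = np.clip(spatial_freq / spacing_wl, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta))), peak_bin

# Synthetic 8-element snapshot for a single target at +30 degrees
m = np.arange(8)
x = np.exp(1j * 2 * np.pi * 0.5 * m * np.sin(np.radians(30.0)))
angle, idx = dbf_azimuth(x)
```

MUSIC and MVDR would replace the FFT spectrum with a subspace or minimum-variance pseudo-spectrum, but the peak-index step stays the same.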

[0057] Step 102: Based on the similarity of the velocity directions of each point in the target four-dimensional point cloud, divide the target four-dimensional point cloud into multiple human body topology point cloud blocks.

[0058] Specifically, based on the similarity of the velocity directions of each point in the target four-dimensional point cloud, the target four-dimensional point cloud is topologically divided to obtain multiple human body topological point cloud blocks. During human movement, scattering points from the same limb have consistent velocity direction characteristics, while scattering points from different limbs have significantly different velocity directions. This method utilizes this human kinematic characteristic, and by measuring the similarity of the velocity directions of each point, adaptively divides the discrete sparse point cloud into point cloud blocks with implicit topological associations, thereby constructing a priori human body topology and solving the technical problem of sparse point clouds lacking explicit body part connections.

[0059] In this embodiment of the application, based on the similarity of the velocity directions of each point in the target four-dimensional point cloud, the target four-dimensional point cloud is divided into multiple human body topological point cloud blocks, including: performing unit normalization processing on the radial velocity information of the effective target points to obtain normalized velocity vectors; constructing a distance metric based on the cosine similarity of each normalized velocity vector, and performing clustering processing on the effective target points according to the distance metric to obtain clustering processing results; and dividing the effective target points based on the clustering processing results to obtain multiple human body topological point cloud blocks.

[0060] Specifically, for the four-dimensional point cloud data generated at multiple time scales, this application further introduces a human body topology modeling mechanism. Inspired by human kinematics and by the directional consistency of the velocity vectors of the scattering points in the radar point cloud, the human body point cloud is topologically divided by measuring the similarity between velocity directions. Cosine similarity is used as the measure of velocity direction similarity, and the dynamic human body point cloud is adaptively divided into five main topological regions corresponding to the main torso, left upper limb, right upper limb, left lower limb and right lower limb. For example, Figure 2 shows the real skeleton captured by a depth camera during a human kicking motion, and Figure 3 shows the corresponding 3D point cloud clustering result; Figure 4 shows the real skeleton captured by a depth camera when a human raises both hands, and Figure 5 shows the corresponding 3D point cloud clustering result.

[0061] In detail, firstly, unit normalization processing of radial velocity information is performed. For the selected valid target points, the radial velocity information of each point is extracted and a velocity vector is constructed. This velocity vector is then normalized. The core purpose of this operation is to eliminate the differences in velocity amplitude (i.e., speed of movement) between different valid target points, retaining only the velocity direction information. When the human body moves, valid target points of the same limb (such as the left upper limb or the main trunk) naturally have the same direction of movement, while valid target points of different limbs have significantly different directions of movement. Normalization processing can highlight this core feature, providing a directional basis for subsequent similarity judgment, and finally obtaining a normalized velocity vector with a constant magnitude of 1 and a direction consistent with the original velocity vector.

[0062] Secondly, a distance metric is constructed based on cosine similarity, and clustering is performed. For the normalized velocity vectors of all valid target points, the cosine similarity between any two vectors is calculated. This similarity quantifies the degree of overlap in the velocity directions of two valid target points. A similarity closer to 1 indicates that the velocity directions of the two points are more consistent, and they are likely to belong to the same limb part; a similarity closer to 0 or -1 indicates a greater difference in velocity directions, and they are likely to belong to different limb parts. To adapt to the distance-minimization clustering logic of the clustering algorithm, cosine similarity is converted into a distance metric, and then the K-means clustering algorithm is used to cluster all valid target points, grouping valid target points with consistent velocity directions into the same cluster, thus obtaining the clustering results.
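The normalization and cosine-distance clustering of [0061] and [0062] can be sketched as a spherical (cosine) K-means. How the per-point velocity vector is built from the radial velocity is not detailed in this section, so the function simply takes one velocity vector per point as input; the deterministic farthest-point initialisation and the iteration limit are likewise assumptions. For unit vectors the cosine distance 1 − cos θ equals half the squared Euclidean distance, which is why K-means on normalized vectors groups points by velocity direction.

```python
import numpy as np

def normalize_rows(v, eps=1e-12):
    """Unit-normalise each row so that only the direction is kept."""
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    return v / np.maximum(norms, eps)

def cosine_kmeans(velocities, k=5, iters=50):
    """Cluster points by velocity direction using the cosine distance 1 - cos."""
    u = normalize_rows(np.asarray(velocities, dtype=float))
    # Farthest-point initialisation under the cosine distance (assumed, deterministic)
    centers = [u[0]]
    for _ in range(1, k):
        dist = np.min(1.0 - u @ np.array(centers).T, axis=1)
        centers.append(u[np.argmax(dist)])
    centers = np.array(centers)
    labels = None
    for _ in range(iters):
        new_labels = np.argmin(1.0 - u @ centers.T, axis=1)   # nearest centre by cosine distance
        if labels is not None and np.array_equal(new_labels, labels):
            break                                              # assignments stable: converged
        labels = new_labels
        for c in range(k):
            members = u[labels == c]
            if len(members) > 0:                               # centre = normalised mean direction
                centers[c] = normalize_rows(members.sum(axis=0, keepdims=True))[0]
    return labels

# Two groups of points moving in clearly different directions
vel = np.array([[1, 0.1, 0], [2, 0.1, 0], [1, -0.1, 0],
                [0, 1, 0.1], [0.1, 3, 0], [-0.1, 1, 0]], dtype=float)
labels = cosine_kmeans(vel, k=2)
```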

[0063] Finally, the human body topological point cloud is divided into blocks based on the clustering results. After clustering, each cluster corresponds to a human body part with implicit topological associations (such as the main torso cluster, the left upper limb cluster, etc.), and each cluster is treated as an independent human body topological point cloud block, thus completing the structured division of the target four-dimensional point cloud.

[0064] It should be understood that, to avoid limiting the model's expressive power with manually fixed body part labels, this application assigns each point cloud cluster only an independent encoding identifier, without imposing hard constraints on which human body part it corresponds to. The underlying topological semantic relationships are learned adaptively by the subsequent network during training.

[0065] In addition, in some embodiments of this application, to unify the data into fixed-size tensors suitable as network input, human topology point cloud blocks with fewer points than the preset standard are filled up to the preset number of points N by linear interpolation between pairs of points, performed in proportion to the original number of points in each cluster, so that every human topology point cloud block becomes a fixed-size tensor.
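One possible reading of the interpolation padding in [0065] is sketched below: new points are placed at random positions on segments between randomly chosen pairs of existing points until the block holds N points. The pair selection, the random interpolation ratio, and the truncation of blocks that already have N or more points are assumptions, and the proportional allocation across clusters mentioned in the text is not modelled. `pad_block` is a hypothetical helper.

```python
import numpy as np

def pad_block(points, n_target, seed=0):
    """Pad an (M, D) point block to (n_target, D) by linear interpolation between point pairs.

    Blocks with M >= n_target are truncated (an assumption; the text only covers M < n_target).
    """
    pts = np.asarray(points, dtype=float)
    m = len(pts)
    if m == 0:
        raise ValueError("cannot interpolate an empty block")
    if m >= n_target:
        return pts[:n_target].copy()
    rng = np.random.default_rng(seed)
    n_new = n_target - m
    i = rng.integers(0, m, size=n_new)            # first endpoint of each segment
    j = rng.integers(0, m, size=n_new)            # second endpoint of each segment
    t = rng.random((n_new, 1))                    # interpolation ratio in [0, 1)
    new_pts = pts[i] + t * (pts[j] - pts[i])      # points on the segment p_i -> p_j
    return np.vstack([pts, new_pts])

block = np.random.default_rng(4).normal(size=(5, 4))   # 5 points of (x, y, z, v)
padded = pad_block(block, 32)
```

Because each new point is a convex combination of two existing points, the padded block never extends beyond the bounding box of the original points.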

[0066] Step 103: Input the human body topology point cloud blocks into the human skeleton estimation network to perform human skeleton estimation and obtain the human skeleton estimation results.

[0067] Specifically, the obtained human body topological point cloud blocks are input into a pre-trained human skeleton estimation network. The network learns and models the structured features of the topological point cloud blocks, and finally outputs the human skeleton estimation result. The human skeleton estimation network can fully explore the local topological information and global pose correlation contained in the point cloud blocks, improve the ability to depict the spatial relationships of key human joints, and thus ensure the accuracy and robustness of skeleton estimation.

[0068] In this embodiment, the human skeleton estimation network includes a point cloud block topology modeling module, a multi-level feature fusion module, and a skeleton regression module connected in sequence. The point cloud block topology modeling module is used to dynamically model the topological structure of the input human topological point cloud block and output point-level features with enhanced topological semantics. The multi-level feature fusion module is used to adaptively fuse point-level features with enhanced topological semantics, original point cloud geometric features, and global context features to output point cloud fused features. The skeleton regression module is used to map the point cloud fused features to human skeleton joint coordinates to obtain the human skeleton estimation result.

[0069] First, the point cloud block topology modeling module performs dynamic topology modeling. The input to this module is the partitioned human body topology point cloud blocks. While these point cloud blocks already possess implicit human body part relationships, they still lack explicit topological structural features. The module dynamically models the discrete point cloud blocks by mining the spatial geometric relationships and semantic associations within and between each block, strengthening the structured relationships between points and clusters, and between clusters. The final output is point-level features with enhanced topological semantics. These features retain the local geometric information of individual points while incorporating global topological constraints of human body parts, solving the technical challenge of depicting human body structural relationships using sparse point clouds.

[0070] Secondly, the multi-level feature fusion module performs adaptive feature fusion. This module receives three types of input features: first, point-level features with enhanced topological semantics output by the point cloud block topology modeling module; second, original point cloud geometric features directly extracted from the human body topological point cloud block, reflecting the spatial location attributes of the points; and third, pose-encoded global context features obtained through global pooling, reflecting the overall pose trend of the human body. Since the three types of features correspond to different semantic levels—local topology, original geometry, and global pose—the module balances the contribution of each type of feature through an adaptive fusion mechanism, avoiding the limitations of single feature representation, and ultimately outputting point cloud fusion features with more comprehensive information and stronger representational capabilities.

[0071] Finally, the skeleton regression module performs human skeleton joint coordinate mapping. This module takes the point cloud fusion features output by the multi-level feature fusion module as input, and transforms the high-dimensional fusion features into three-dimensional coordinates of human skeleton joints through nonlinear mapping and regression calculation. The role of this module is to establish the mapping relationship between fusion features and the spatial position of human joints. Through network training, it learns the correspondence between features and joint coordinates, and finally outputs a skeleton estimation result that accurately reflects human posture.

[0072] In this embodiment, the point cloud block topology modeling module includes: a geometry-aware soft clustering unit, used to classify points in the human body topology point cloud block into clusters using a soft allocation method and aggregate them to obtain cluster-level features; a cluster-level spatial geometry-aware attention unit, used to enhance cluster-level features based on the spatial geometric relationship between clusters through an attention mechanism; and a feature diffusion unit, used to diffuse the enhanced cluster-level features back to the point level based on the spatial proximity between points and clusters, generating point-level features with enhanced topological semantics.

[0073] Specifically, due to the characteristics of millimeter-wave radar point clouds, such as high sparsity, numerous noise points, and unstable human body scattering structure, it is not always possible to reliably divide the human body into a fixed number of clearly defined local topological regions in a single frame of point cloud. If a hard allocation method is used to cluster the point cloud, it is easy to cause local structural breaks or misallocation, thereby affecting the stability of subsequent feature modeling.

[0074] To address this, this application proposes a geometry-aware soft clustering method. Instead of making discrete decisions on the attribution relationship between point cloud samples and clusters, this method introduces a learnable adaptive parameter σ to assign continuous soft attribution weights to each point based on spatial geometric distance, thereby achieving flexible modeling of local structures of the human body.

[0075] In detail, the method includes:

[0076] (1) Initial input

[0077] The initial input consists of three parts: the point-level feature matrix $\mathbf{F}=\{\mathbf{f}_n\}_{n=1}^{N}$ extracted by multi-scale dilated convolution; the corresponding set of three-dimensional spatial coordinates $\mathbf{P}=\{\mathbf{p}_n\}_{n=1}^{N}$; and the initial point-level cluster labels obtained in Step 102. These are used for the subsequent topology modeling and feature propagation.

[0078] (2) Cluster center initialization

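[0079] Based on the initial coarse cluster labels, the cluster centers of the point cloud are initialized; the geometric center of the c-th cluster is:

[0080] $\boldsymbol{\mu}_c = \dfrac{\sum_{n=1}^{N} \mathbb{1}(l_n = c)\,\mathbf{p}_n}{\sum_{n=1}^{N} \mathbb{1}(l_n = c)}$,

[0081] where $\mathbf{p}_n$ is the three-dimensional coordinate of the n-th point, $l_n$ is its initial cluster label, and $\mathbb{1}(\cdot)$ is an indicator function indicating whether the n-th point is assigned to the c-th cluster: it equals 1 when point n belongs to cluster c and 0 otherwise.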

[0082] (3) Calculation of distance-based soft-assignment weights

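[0083] To avoid the instability caused by hard assignment, a learnable scale parameter is introduced, and the soft membership weight is calculated from the spatial distance between each point and each cluster center:

[0084] $w_{nc} = \dfrac{\exp\left(-\lVert \mathbf{p}_n - \boldsymbol{\mu}_c \rVert_2^2 / 2\sigma_c^2\right)}{\sum_{c'=1}^{C} \exp\left(-\lVert \mathbf{p}_n - \boldsymbol{\mu}_{c'} \rVert_2^2 / 2\sigma_{c'}^2\right)}$,

[0085] where $\mathbf{p}_n$ is the coordinate of the n-th point, $\boldsymbol{\mu}_c$ is the geometric center of the c-th cluster, $C$ is the number of clusters, and $\sigma_c$ is an adaptive standard deviation predicted dynamically from the geometric features within each cluster by a small multi-layer perceptron (MLP). This allows the model to adjust how strict the clustering is according to the actual distribution of the point cloud, improving its robustness to noise and non-rigid deformation.

[0086] (4) Cluster-level feature aggregation:

[0087] Based on the soft-assignment weights, the point-level features are weighted and aggregated to construct the cluster-level feature representation:

[0088] $\mathbf{z}_c = \dfrac{\sum_{n=1}^{N} w_{nc}\,\mathbf{f}_n}{\sum_{n=1}^{N} w_{nc}}$,

[0089] where $\mathbf{f}_n$ is the feature of the n-th point. The cluster centers are updated at the same time:

[0090] $\boldsymbol{\mu}_c \leftarrow \dfrac{\sum_{n=1}^{N} w_{nc}\,\mathbf{p}_n}{\sum_{n=1}^{N} w_{nc}}$.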
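The soft clustering steps (2) to (4) can be sketched in NumPy as follows. The sketch assumes a Gaussian kernel in the point-to-center distance, normalized over clusters, for the soft weights, and it takes σ_c as a fixed input, whereas in the application σ_c is predicted by a small MLP. All function names are hypothetical.

```python
import numpy as np

def initial_centers(coords, labels, n_clusters):
    """Geometric centre of each initial (hard) cluster, i.e. the indicator-weighted mean."""
    centers = np.zeros((n_clusters, coords.shape[1]))
    for c in range(n_clusters):
        mask = labels == c                      # indicator 1(l_n = c)
        if mask.any():
            centers[c] = coords[mask].mean(axis=0)
    return centers

def soft_assign(coords, centers, sigma):
    """Soft membership weights: Gaussian in distance, normalised over clusters (assumed form)."""
    d2 = ((coords[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # (N, C)
    logits = -d2 / (2.0 * sigma[None, :] ** 2)
    logits -= logits.max(axis=1, keepdims=True)                          # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def aggregate(features, coords, w, eps=1e-8):
    """Weight-normalised cluster features and updated cluster centres."""
    mass = w.sum(axis=0)[:, None] + eps          # total soft membership per cluster, (C, 1)
    cluster_feats = (w.T @ features) / mass      # (C, D)
    new_centers = (w.T @ coords) / mass          # (C, 3)
    return cluster_feats, new_centers

# Two well-separated synthetic blobs with matching initial labels
rng = np.random.default_rng(0)
P = np.vstack([rng.normal(0.0, 0.05, (20, 3)), rng.normal(1.0, 0.05, (20, 3))])
F = rng.normal(size=(40, 8))
labels0 = np.array([0] * 20 + [1] * 20)
mu = initial_centers(P, labels0, 2)
W = soft_assign(P, mu, sigma=np.array([0.2, 0.2]))
Fc, mu_new = aggregate(F, P, W)
```

With well-separated blobs the soft weights become almost one-hot and the updated centers coincide with the initial ones; a larger σ_c spreads membership across neighbouring clusters, which is the flexibility the text attributes to soft assignment.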

[0091] Secondly, the cluster-level spatial geometry-aware attention unit enhances the cluster-level features. This unit takes as input the cluster-level features output by the geometry-aware soft clustering unit and the updated cluster center coordinates, and its core function is to mine the topological relationships between the human body parts represented by the clusters. Specifically, this includes:

[0092] (1) Modeling of inter-cluster geometric relationships

[0093] For any pair of clusters $(i, j)$, a geometric affinity weight based on the inter-cluster spatial distance is introduced to characterize the spatial proximity between the topological parts of the human body:

[0094] $g_{ij} = \phi\left(\lVert \boldsymbol{\mu}_i - \boldsymbol{\mu}_j \rVert_2\right)$,

[0095] where $\boldsymbol{\mu}_i$ and $\boldsymbol{\mu}_j$ are the centers of clusters i and j, and $\phi(\cdot)$ maps the Euclidean distance between the clusters to a geometric affinity weight in the interval [0, 1]; the closer two clusters are, the larger their affinity weight.

[0096] (2) Constructing attention weights by introducing geometric priors

[0097] In the h-th head of the multi-head attention mechanism, the feature vector of each cluster is linearly mapped to obtain the transformed cluster feature:

[0098] $\mathbf{z}^{(h)}_i = \mathbf{W}^{(h)} \mathbf{z}_i$, where $\mathbf{z}_i$ is the feature of the i-th cluster.

[0099] Subsequently, the source cluster features, target cluster features, and their spatial orientation information are concatenated to construct a joint representation for attention modeling:

[0100] ,

[0101] Based on the above joint representation, the unnormalized geometry-enhanced attention score of the i-th cluster with respect to the j-th cluster is computed:

[0102] ,

[0103] where $\mathbf{W}^{(h)}$ is the feature mapping matrix of the h-th attention head, and the score is computed with a learnable attention weight vector;

[0104] The geometric affinity weight acts as a subtractive modulation term that suppresses inter-cluster relationships inconsistent with the spatial structure of the human body. By explicitly introducing a distance-based geometric prior at the attention scoring stage, the transmission of erroneous information between distant, weakly associated clusters is effectively reduced, and the local topology of the human body is modeled more stably.

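[0105] On this basis, the unnormalized scores $e^{(h)}_{ij}$ are normalized over the clusters j to obtain the attention weights of the h-th attention head:

[0106] $\alpha^{(h)}_{ij} = \dfrac{\exp\left(e^{(h)}_{ij}\right)}{\sum_{k=1}^{C} \exp\left(e^{(h)}_{ik}\right)}$,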

[0107] (3) Cluster-level feature update based on multi-head attention

[0108] Based on the attention weights, the features of neighboring clusters are weighted and aggregated to obtain the output of the h-th attention head:

[0109] $\mathbf{o}^{(h)}_i = \sum_{j=1}^{C} \alpha^{(h)}_{ij}\,\mathbf{z}^{(h)}_j$,

[0110] where $\alpha^{(h)}_{ij}$ is the attention weight of cluster i on cluster j, $\mathbf{z}^{(h)}_j$ is the transformed feature of the j-th cluster, and C is the number of human topological point cloud clusters. The outputs of all attention heads are then concatenated and mapped back to the original feature dimension to obtain the cluster-level topology-enhanced features:

[0111] ,

[0112] where the output projection matrix maps the concatenated head outputs back to the original feature dimension, and LN denotes layer normalization.
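The cluster-level attention unit can be sketched as a graph-attention style layer. Because the joint-representation and scoring equations are not reproduced in this text, the sketch makes several assumptions: the orientation term is the center offset μ_j − μ_i, the learned distance-to-affinity mapping is replaced by a fixed Gaussian kernel exp(−d²/τ), the subtractive modulation is implemented as λ(1 − g_ij) so that distant clusters are penalized, and layer normalization has no learnable gain or bias. Parameter names and shapes are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalisation over the last axis, without learnable gain/bias."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def geo_attention(feats, centers, params, tau=0.5, lam=1.0):
    """Multi-head cluster attention with a distance-based penalty (assumed form).

    feats: (C, D) cluster features; centers: (C, 3) cluster centres.
    params: 'W' (H, D, Dh) per-head mappings, 'a' (H, 2*Dh + 3) score vectors,
            'Wo' (H*Dh, D) output projection.
    """
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    g = np.exp(-dist ** 2 / tau)                       # affinity in (0, 1], larger when closer
    rel = centers[None, :, :] - centers[:, None, :]    # rel[i, j] = mu_j - mu_i
    heads = []
    for W, a in zip(params["W"], params["a"]):
        q = feats @ W                                  # transformed cluster features, (C, Dh)
        C = q.shape[0]
        joint = np.concatenate(
            [np.repeat(q[:, None, :], C, 1), np.repeat(q[None, :, :], C, 0), rel], axis=-1)
        e = leaky_relu(joint) @ a - lam * (1.0 - g)    # score minus geometric penalty, (C, C)
        e -= e.max(axis=1, keepdims=True)
        alpha = np.exp(e)
        alpha /= alpha.sum(axis=1, keepdims=True)      # softmax over j
        heads.append(alpha @ q)                        # weighted sum of neighbour features
    return layer_norm(np.concatenate(heads, axis=-1) @ params["Wo"])

rng = np.random.default_rng(1)
C, D, Dh, H = 5, 16, 8, 2
params = {"W": rng.normal(size=(H, D, Dh)) * 0.1,
          "a": rng.normal(size=(H, 2 * Dh + 3)) * 0.1,
          "Wo": rng.normal(size=(H * Dh, D)) * 0.1}
out = geo_attention(rng.normal(size=(C, D)), rng.normal(size=(C, 3)), params)
```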

[0113] Furthermore, to avoid the loss of point-level details due to cluster-level modeling, this application further diffuses cluster-level semantic information back to the point-level space. This mechanism uses the spatial distance between a point and the geometric centers of each cluster as a basis, and measures the influence of different clusters on that point using a soft-weighting method, thereby achieving weighted fusion of multi-cluster information. Specifically, this includes:

[0114] (1) Calculation of point-cluster spatial weights:

[0115] For any point n and cluster c, the diffusion weight is calculated from the Euclidean distance between them in three-dimensional space and is defined as follows:

[0116] $\beta_{nc} = \dfrac{\exp\left(-\lVert \mathbf{p}_n - \boldsymbol{\mu}_c \rVert_2^2 / 2\sigma_c^2\right)}{\sum_{c'=1}^{C} \exp\left(-\lVert \mathbf{p}_n - \boldsymbol{\mu}_{c'} \rVert_2^2 / 2\sigma_{c'}^2\right)}$,

[0117] where $\mathbf{p}_n$ is the spatial coordinate of the n-th point, $\boldsymbol{\mu}_c$ is the geometric center of the c-th cluster, and $\sigma_c$ is the scale control parameter, kept identical to the corresponding parameter of the geometry-aware soft clustering module to ensure consistent and stable spatial weights.

[0118] (2) Diffusion of cluster-level features to point-level features

[0119] Based on the above weights, the cluster-level features are diffused into the point-level space to obtain the enhanced point features:

[0120] $\tilde{\mathbf{f}}_n = \sum_{c=1}^{C} \beta_{nc}\,\hat{\mathbf{z}}_c$,

[0121] where $\hat{\mathbf{z}}_c$ is the enhanced topological aggregation feature of the c-th cluster and C is the total number of clusters. Through this cluster-to-point feature diffusion, the point-level representation preserves local geometric information while explicitly integrating global semantic constraints from the different topological parts of the human body, improving the modeling of sparse point clouds and complex human poses.
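A minimal sketch of the cluster-to-point diffusion, assuming the same Gaussian, cluster-normalized weighting as the soft clustering stage with shared scale parameters σ_c; the function name `diffuse` is hypothetical. Because the weights for each point sum to one, every enhanced point feature is a convex combination of the cluster features.

```python
import numpy as np

def diffuse(coords, centers, cluster_feats, sigma):
    """Spread cluster features back to points with Gaussian weights (assumed form)."""
    d2 = ((coords[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # (N, C)
    logits = -d2 / (2.0 * sigma[None, :] ** 2)
    logits -= logits.max(axis=1, keepdims=True)                          # numerical stability
    beta = np.exp(logits)
    beta /= beta.sum(axis=1, keepdims=True)                              # weights per point sum to 1
    return beta @ cluster_feats                                          # (N, D)

rng = np.random.default_rng(2)
pts = rng.normal(size=(10, 3))
ctr = rng.normal(size=(3, 3))
point_feats = diffuse(pts, ctr, np.ones((3, 4)), sigma=np.full(3, 0.5))
```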

[0122] Furthermore, in this embodiment, to fully integrate feature information from different levels and semantic sources, this application proposes a multi-branch feature adaptive fusion method based on global awareness gating, applied to a multi-level feature fusion module. This method adaptively learns the relative importance of the three features by globally statistically analyzing their overall distribution characteristics, and accordingly completes weighted feature fusion. Details:

[0123] In terms of point features, the method has three input branches: 1. Pointwise geometric features encoded by multi-scale dilated convolution; 2. Local semantic features containing human topology from the diffusion module; 3. Pose-encoded global context embedding features obtained through global pooling.

[0124] First, the global-aware gating weights are generated: global average pooling is applied to each of the three features to construct a channel-level global description vector, which is passed through a gating mapping consisting of a two-layer fully connected network and then normalized to produce the fusion weights of the three features:

[0125] ,

[0126] where the gating mapping is a lightweight function of two fully connected layers joined by a nonlinear activation, which characterizes the relative importance of the different feature branches for the current sample.

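[0127] Then, based on the fusion weights $\gamma_1$, $\gamma_2$ and $\gamma_3$, the three point-level features $\mathbf{F}_1$, $\mathbf{F}_2$ and $\mathbf{F}_3$ (the three input branches listed above, in order) are weighted and summed to obtain the fused point-level feature representation:

[0128] $\mathbf{F}_{\mathrm{fuse}} = \gamma_1 \mathbf{F}_1 + \gamma_2 \mathbf{F}_2 + \gamma_3 \mathbf{F}_3$,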

[0129] This fusion process achieves dynamic balance of multi-level features while maintaining the point-level spatial distribution. The fused point-level features are further enhanced through linear mapping, layer normalization, and nonlinear activation operations, thereby improving the stability of feature representation and nonlinear modeling capabilities.
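The global-aware gating fusion of [0124] to [0128] can be sketched as follows, assuming ReLU as the nonlinear activation between the two fully connected layers and a softmax as the normalization of the three fusion weights; neither choice is stated in the text, and the follow-up linear mapping, layer normalization and activation of [0129] are omitted. Weight names and shapes are hypothetical.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def gated_fusion(branches, W1, b1, W2, b2):
    """Fuse three (N, D) point-level feature maps with global-aware gating.

    branches: [geometric, topological, global-context] features, each (N, D).
    W1: (3*D, Hd), b1: (Hd,), W2: (Hd, 3), b2: (3,).
    """
    desc = np.concatenate([b.mean(axis=0) for b in branches])   # global average pooling, (3*D,)
    hidden = np.maximum(desc @ W1 + b1, 0.0)                     # first FC layer + ReLU
    weights = softmax(hidden @ W2 + b2)                          # three weights summing to 1
    fused = sum(w * b for w, b in zip(weights, branches))        # point-wise weighted sum
    return fused, weights

rng = np.random.default_rng(3)
N, D, Hd = 64, 32, 16
F = rng.normal(size=(N, D))
W1, b1 = rng.normal(size=(3 * D, Hd)) * 0.1, np.zeros(Hd)
W2, b2 = rng.normal(size=(Hd, 3)) * 0.1, np.zeros(3)
fused, weights = gated_fusion([F, F, F], W1, b1, W2, b2)
```

Feeding the same map into all three branches is a simple sanity check: since the weights sum to one, the fused output must equal the input.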

[0130] In this embodiment of the application, before inputting the human body topology point cloud block into the human skeleton estimation network for human skeleton estimation, the method further includes: training the original human skeleton estimation network based on the joint loss function to obtain an optimized human skeleton estimation network; wherein, the joint loss function includes a first loss term and a second loss term, the first loss term is used to constrain the human body joint coordinate error, and the second loss term is used to constrain the coordinate error of each part of the human body.

[0131] Specifically, a joint loss function is constructed from a first loss term and a second loss term, which impose dual constraints on the human joint coordinate errors and on the coordinate errors of the individual body parts. The first loss term (joint coordinate error constraint) accounts for the different importance of critical and non-critical joints: it assigns higher weights to important joints (the right shoulder, right wrist, left shoulder and left wrist, which move actively and are crucial for representing posture) and relatively lower weights to the remaining joints, and it constrains joint-level position accuracy through the Euclidean distance between the predicted and ground-truth joint coordinates, ensuring the estimation accuracy of the critical joints. The second loss term (body part coordinate error constraint) addresses the fact that absolute joint coordinates vary widely across the dataset, which makes them hard to constrain precisely. It divides the human body into five parts, namely the trunk, left upper limb, right upper limb, left lower limb and right lower limb, selects a reference root node for each part, and computes the relative position error between each joint of the part and that root node. Because motion amplitude and posture stability differ between parts, each part is also assigned a normalized weight, so that the topological structure of every body part is constrained precisely and local posture distortion is avoided. As shown in Figure 6, the joint loss function comprises the following:

[0132] First, to characterize the differences in the contributions of different human joints to pose estimation, this application introduces a joint importance weighting mechanism in the joint-level error calculation, assigning higher weights to key joints and normalizing all joint weights to ensure the stability of the overall error scale. The joint-weighted MPJPE is defined as follows:

[0133] ,

[0134] Here, higher weights are assigned to the important joints. One index denotes the frame, and F is the total number of joints in the human skeleton. The important joint group consists of the right shoulder, right wrist, left shoulder and left wrist, and all other joints are non-critical joints. The error is computed between the predicted and ground-truth positions of the critical joints and, separately, between the predicted and ground-truth positions of the non-critical joints.
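A sketch of the joint-weighted MPJPE, assuming one weight for the important joints and another for all other joints, normalized so that the per-joint weights sum to one, and an average over frames. The weight values 2.0 and 1.0 and the function name are illustrative assumptions, not values from the application.

```python
import numpy as np

def weighted_mpjpe(pred, gt, key_joints, w_key=2.0, w_other=1.0):
    """Joint-weighted MPJPE for (T, J, 3) arrays; joint weights normalised to sum to 1."""
    n_joints = pred.shape[1]
    w = np.full(n_joints, w_other, dtype=float)
    w[list(key_joints)] = w_key                   # higher weight for important joints
    w /= w.sum()
    err = np.linalg.norm(pred - gt, axis=-1)      # (T, J) Euclidean error per joint
    return float((err * w).sum(axis=1).mean())    # weighted over joints, mean over frames

# 2 frames, 3 joints; joint 0 is "important" and has error 1, the others have errors 2 and 3
gt = np.zeros((2, 3, 3))
pred = gt.copy()
pred[:, :, 0] = [1.0, 2.0, 3.0]
loss = weighted_mpjpe(pred, gt, key_joints=[0])
```

With weights [2, 1, 1] normalized to [0.5, 0.25, 0.25], each frame contributes 0.5·1 + 0.25·2 + 0.25·3 = 1.75.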

[0135] Secondly, due to the significant variation in joint positions within the dataset, it is difficult to accurately estimate absolute joint positions. Therefore, this application divides the human body into five parts: the main trunk, left upper limb, right upper limb, left lower limb, and right lower limb. For each part, a corresponding reference root node is selected to construct relative geometric constraints for the joints within that part. The relative geometric loss for a single human body part is defined as:

[0136] $\mathcal{L}^{(s)}_{\mathrm{rel}} = \dfrac{1}{\lvert \mathcal{J}_s \rvert} \sum_{j \in \mathcal{J}_s} \left\lVert \left(\hat{\mathbf{p}}_j - \hat{\mathbf{p}}_{r_s}\right) - \left(\mathbf{p}_j - \mathbf{p}_{r_s}\right) \right\rVert_1$,

[0137] where $\mathcal{J}_s$ is the set of joints of part s, $\hat{\mathbf{p}}_j$ and $\mathbf{p}_j$ are the predicted and ground-truth positions of the j-th joint of this part, and $r_s$ is the reference root joint of this part. The Manhattan (L1) distance is used here.

[0138] Considering the differences in postural stability and range of motion between body parts, this application introduces normalized part weights into the relative geometric loss of each part to balance the constraints on the overall structure against local dynamics. The multi-part weighted geometric consistency loss is defined as:

[0139] $\mathcal{L}_{\mathrm{geo}} = \sum_{s=1}^{5} \lambda_s \mathcal{L}^{(s)}_{\mathrm{rel}}, \quad \sum_{s=1}^{5} \lambda_s = 1$,

[0140] where $\lambda_s$ is the weighting coefficient of the s-th body part. Because the joints of the upper limbs change speed rapidly, the upper-limb weights can be set higher than those of the trunk and lower limbs.
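A sketch of the multi-part relative geometric loss, assuming each part is given as a root joint index plus a list of member joints, an L1 (Manhattan) error on root-relative offsets averaged over frames and joints, and part weights normalized to sum to one. The 11-joint layout below is hypothetical. Because only root-relative offsets enter the loss, a global translation of the prediction leaves it unchanged, which is the property the text relies on when absolute positions vary widely.

```python
import numpy as np

def part_relative_loss(pred, gt, parts, part_weights):
    """Weighted sum over parts of the mean L1 error of root-relative joint offsets.

    pred, gt: (T, J, 3); parts: list of (root_index, [joint indices]);
    part_weights: one weight per part, normalised here to sum to 1.
    """
    lam = np.asarray(part_weights, dtype=float)
    lam = lam / lam.sum()
    total = 0.0
    for weight, (root, joints) in zip(lam, parts):
        rel_pred = pred[..., joints, :] - pred[..., [root], :]   # offsets from the part root
        rel_gt = gt[..., joints, :] - gt[..., [root], :]
        l1 = np.abs(rel_pred - rel_gt).sum(axis=-1)              # Manhattan distance per joint
        total += weight * l1.mean()
    return float(total)

# Hypothetical layout: trunk, left/right upper limb, left/right lower limb
parts = [(0, [1, 2]), (3, [4]), (5, [6]), (7, [8]), (9, [10])]
weights = [1.0, 2.0, 2.0, 1.0, 1.0]                 # upper limbs weighted higher
gt = np.random.default_rng(5).normal(size=(4, 11, 3))
shifted = gt + np.array([0.3, -0.2, 0.1])           # global translation only
loss_shift = part_relative_loss(shifted, gt, parts, weights)
perturbed = shifted.copy()
perturbed[:, 2, 0] += 1.0                            # move one trunk joint by 1 along x
loss_pert = part_relative_loss(perturbed, gt, parts, weights)
```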

[0141] Finally, the joint-weighted regression error is jointly optimized with the relative geometric constraints of multiple parts, and the overall loss function is expressed as follows:

[0142] .

[0143] The human skeleton estimation network optimized by the joint loss function can accurately constrain the coordinate error of individual joints and ensure the consistency of the topological structure of various parts of the human body. This effectively improves the modeling ability and estimation robustness of complex human poses, and provides a reliable model foundation for accurate skeleton estimation after inputting human topological point cloud blocks.

[0144] Human skeleton estimation falls under the category of posture parameter regression tasks. During model training, a synchronous data acquisition system for millimeter-wave radar and visual sensors is first constructed. Through temporal alignment and spatial calibration, consistent acquisition of multi-source sensor data is achieved. Secondly, the joint coordinates of the human skeleton extracted from the visual sensor are used as supervisory label information. Simultaneously, a four-dimensional millimeter-wave radar multi-timescale point cloud is generated and divided into multiple human topological blocks based on human structural relationships. The corresponding structural and motion features are extracted and used as input to a multi-level feature adaptive fusion human skeleton estimation network.

[0145] During training, human skeleton joint coordinates are used as supervision signals for the pose regression task, guiding the network to learn the mapping relationship between radar point cloud features and human skeleton structure, thereby improving the accuracy and stability of human skeleton estimation. After training, the human skeleton estimation model is deployed in a practical application system, enabling real-time human skeleton estimation based on millimeter-wave radar point cloud data.

[0146] This application also provides a multi-level feature adaptive fusion human skeleton estimation network framework, as shown in Figure 7. The radar module first processes the echo signal to obtain normalized human body topology point cloud blocks (①②③ in Figure 7), which are then input into the multi-level feature adaptive fusion human skeleton estimation network framework, specifically into the local feature extraction and topology modeling module, the geometric feature extraction module and the global feature extraction module, respectively.

[0147] The local feature extraction and topology modeling module is the core innovation of the network. It operates in three steps: geometry-aware soft clustering → cluster-level spatial geometry-aware attention modeling → cluster-to-point feature diffusion based on spatial affinity. The geometry-aware soft clustering unit takes as input the point-level feature matrix (F), the 3D coordinate set (P) and the initial cluster labels of the topological point cloud blocks. Using the learnable parameter σ, it calculates the soft assignment weights between points and clusters, aggregates the cluster-level features and updates the cluster centers, avoiding the structural breaks caused by hard assignment. The cluster-level spatial geometry-aware attention unit first maps the Euclidean distances between clusters to geometric affinity weights through an MLP, then constructs a joint representation containing spatial orientation information, generates geometrically constrained attention weights, aggregates neighboring cluster features through multi-head attention, and outputs enhanced cluster-level topological features, strengthening the spatial association between different parts of the human body. The cluster-to-point feature diffusion unit diffuses the enhanced cluster-level features back to the point level: it calculates diffusion weights from the spatial affinity between each point and the cluster centers and fuses the weighted multi-cluster semantic information into each point, generating point-level features with enhanced topological semantics that preserve point-level detail while incorporating global topological constraints. To extract fine-grained local geometric features, multi-scale dilated convolution is applied to the original 3D coordinates of the topological point cloud blocks, extracting local geometric features at different scales and outputting point-level geometric features. This compensates for point-level details that may be lost during cluster-level learning and complements the topology-enhanced point-level features from the cluster-level path. Geometric feature extraction first integrates the topological semantic features output by the cluster-level path with the local geometric features output by the point-level path, unifying their feature dimensions through linear mapping and layer normalization. Global feature extraction takes these integrated local features as input: a pose encoding module first maps them to global pose semantic vectors (e.g., the overall pitch and tilt trend of the human body), which are then concatenated and pooled to obtain the global contextual features.

[0148] The adaptive gating fusion module of multi-level features first performs global average pooling on the three types of features to construct a channel-level global description vector; then it generates fusion weights through a two-layer fully connected network, and after normalization, it sums the three types of features according to the weights to obtain the point cloud fusion features.

[0149] The loss function module constrains the output skeleton in two dimensions, joint accuracy and body structure, so that the output skeleton satisfies both joint position accuracy and human topological consistency, meeting the accuracy and robustness requirements for skeleton estimation in home monitoring and similar scenarios. The residual regression module feeds the fused features into fully connected layers with a residual structure, transforms the high-dimensional features into the three-dimensional coordinates of the human skeleton joints through nonlinear mapping, and outputs the final human skeleton estimation result.
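The residual regression head can be sketched as a stack of residual fully connected blocks followed by a linear output layer reshaped to (J, 3). This assumes the fused point features have already been pooled into one vector; the block form `x + relu(x @ W + b)` and the joint count used in any example are hypothetical.

```python
import numpy as np


def residual_regression(x, blocks, W_out, b_out, n_joints):
    """Map a pooled fused feature vector to (n_joints, 3) joint coordinates.

    x: (D,) feature vector; blocks: list of (W, b) pairs, W: (D, D), b: (D,);
    W_out: (D, 3 * n_joints), b_out: (3 * n_joints,).
    Each block computes x + relu(x @ W + b) (a residual fully connected layer).
    """
    for W, b in blocks:
        x = x + np.maximum(x @ W + b, 0.0)
    return (x @ W_out + b_out).reshape(n_joints, 3)
```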

[0150] As described above, the embodiments of this application first perform target detection on the collected radar echo data and generate, from the target detection results, a target four-dimensional point cloud containing distance, velocity, and angle information. Then, based on the similarity of the velocity directions of the points in the target four-dimensional point cloud, the point cloud is divided into multiple human topology point cloud blocks. Finally, the human topology point cloud blocks are input into a human skeleton estimation network to obtain the human skeleton estimation result. This application constructs human topology point cloud blocks from the similarity of point cloud velocity directions, proposes an implicit human topology prior modeling method for millimeter-wave radar that extracts structured point cloud block features from sparse point clouds, and designs a skeleton estimation framework with adaptive fusion of multi-level features. By dynamically modeling the topology of the point cloud blocks, it achieves full-scale, multi-level fusion from fine-grained local joints to the overall pose topology, significantly improving the stability and robustness of pose prediction. Furthermore, compared with human skeleton estimation methods based on other sensors, the proposed method is privacy-friendly and unaffected by lighting and weather conditions, so it can operate stably around the clock and is suitable for locations with high security and privacy requirements. In addition, the method constructs a human topological prior from sparse millimeter-wave radar point clouds and extracts structure-aware point cloud block features, enhancing the model's ability to model the spatial relationships of key human body parts. Moreover, whereas existing methods often rely on datasets of limited scale and behavior types, this method adapts to diverse, natural daily human behaviors in non-intrusive monitoring scenarios, achieving more generalizable human skeleton estimation and effectively improving the accuracy and robustness of human pose prediction.

[0151] It should be understood that the sequence numbers of the steps in this embodiment do not imply an execution order. The execution order of each step should be determined by its function and internal logic and should not constitute any limitation on the implementation of the embodiments of this application.

[0152] In summary, the detailed process of the human skeleton estimation method in the embodiments of this application is shown in Figure 8. Specifically:

[0153] Step 801: Perform fast Fourier transform on the collected single-frame radar echo data to obtain short-timescale point cloud information.
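One common way to realize Step 801 for an FMCW radar is a two-dimensional FFT over one frame: a range FFT across the ADC samples of each chirp, then a Doppler FFT across chirps. The sketch below assumes complex baseband samples from a single receive channel arranged as (chirps, samples) and Hanning windows; the text does not specify the data layout or windowing.

```python
import numpy as np


def range_doppler_map(frame):
    """Range-Doppler magnitude map of one FMCW frame (one receive channel).

    frame: complex array (n_chirps, n_samples) of ADC samples.
    Returns (n_chirps, n_samples // 2): Doppler bins (zero velocity centred) x range bins.
    """
    n_chirps, n_samples = frame.shape
    win_r = np.hanning(n_samples)
    win_d = np.hanning(n_chirps)[:, None]
    rng_fft = np.fft.fft(frame * win_r, axis=1)[:, : n_samples // 2]          # range FFT
    dop_fft = np.fft.fftshift(np.fft.fft(rng_fft * win_d, axis=0), axes=0)   # Doppler FFT
    return np.abs(dop_fft)
```

For example, a synthetic tone at range bin 10 and Doppler bin 3 in a 32-chirp frame peaks at row 16 + 3 = 19 after `fftshift`.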

[0154] Step 802: Perform non-uniform Fourier transform processing on the collected multi-frame radar echo data to obtain long-timescale point cloud information.
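A plausible reading of Step 802 is that samples of the same range bin collected over many frames form a slow-time sequence whose timestamps need not be evenly spaced, so a non-uniform DFT evaluates its Doppler spectrum directly. This direct O(F·K) sketch assumes that interpretation; a practical system would more likely use a fast NUFFT library.

```python
import numpy as np


def nudft(samples, times, freqs):
    """Direct non-uniform DFT: X(f) = sum_k x_k * exp(-2j*pi*f*t_k).

    samples: (K,) complex values of one range bin taken from K frames;
    times: (K,) acquisition timestamps in seconds (need not be evenly spaced);
    freqs: (F,) Doppler frequencies in Hz at which to evaluate the spectrum.
    """
    kernel = np.exp(-2j * np.pi * np.outer(freqs, times))  # (F, K) Fourier kernel
    return kernel @ samples
```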

[0155] Step 803: Perform target detection on the short-timescale and long-timescale point cloud information based on the constant false alarm rate (CFAR) algorithm to obtain target detection results, which include the effective target points.
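The text does not say which CFAR variant is used. As an illustration, cell-averaging CFAR (CA-CFAR) on a 2-D power map estimates the local noise from training cells around each cell under test, excluding a guard band, and scales it by the standard factor α = N(P_fa^(-1/N) - 1) for exponentially distributed noise power. Guard and training sizes below are arbitrary placeholders.

```python
import numpy as np


def ca_cfar_2d(power, guard=(1, 1), train=(4, 4), pfa=1e-3):
    """Cell-averaging CFAR on a 2-D power map (e.g., Doppler x range).

    Returns a boolean detection mask; border cells without a full window are skipped.
    """
    gd, gr = guard
    td, tr = train
    hd, hr = gd + td, gr + tr                      # half window sizes
    n_train = (2 * hd + 1) * (2 * hr + 1) - (2 * gd + 1) * (2 * gr + 1)
    alpha = n_train * (pfa ** (-1.0 / n_train) - 1.0)
    mask = np.zeros(power.shape, dtype=bool)
    rows, cols = power.shape
    for i in range(hd, rows - hd):
        for j in range(hr, cols - hr):
            window = power[i - hd:i + hd + 1, j - hr:j + hr + 1]
            inner = power[i - gd:i + gd + 1, j - gr:j + gr + 1]   # cell under test + guard
            noise = (window.sum() - inner.sum()) / n_train         # mean of training cells
            mask[i, j] = power[i, j] > alpha * noise
    return mask
```

Cells flagged in the mask correspond to the effective target points passed to Step 804.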

[0156] Step 804: Extract the range-Doppler index of the effective target points; wherein, the range-Doppler index includes the radial distance information and radial velocity information of the effective target points;
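The range-Doppler index is a pair of FFT bin numbers; converting it to radial distance and radial velocity uses the standard FMCW resolutions c/(2B) and λ/(2·N·T_c). The sketch assumes the range FFT spans the full sweep bandwidth and the Doppler axis is zero-velocity-centered (as after `fftshift`); the parameter names are hypothetical.

```python
C_LIGHT = 3e8  # speed of light, m/s


def index_to_range_velocity(range_bin, doppler_bin, bandwidth_hz, wavelength_m,
                            chirp_period_s, n_chirps):
    """Convert (range, Doppler) FFT bin indices to radial range (m) and velocity (m/s)."""
    range_res = C_LIGHT / (2.0 * bandwidth_hz)                  # metres per range bin
    vel_res = wavelength_m / (2.0 * n_chirps * chirp_period_s)  # m/s per Doppler bin
    return range_bin * range_res, (doppler_bin - n_chirps // 2) * vel_res
```

For example, with B = 4 GHz, λ = 4 mm, T_c = 100 µs and 64 chirps, bin pair (40, 35) maps to 1.5 m and 0.9375 m/s.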

[0157] Step 805: Perform beamforming based on the short-timescale point cloud information and long-timescale point cloud information corresponding to the range-Doppler index, and extract the angle index of the effective target points;
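Conventional (Bartlett) beamforming is one simple way to obtain the angle index: steer a uniform linear array over candidate angles and take the angle with maximum output power. This sketch covers azimuth only and assumes half-wavelength spacing; elevation would be handled analogously along a vertical array axis. The beamformer type is an assumption, since the text only says "beamforming".

```python
import numpy as np


def bartlett_spectrum(snapshot, angles_deg, spacing_wl=0.5):
    """Conventional (Bartlett) beamforming spectrum for a uniform linear array.

    snapshot: (M,) complex values of one range-Doppler cell across M virtual channels.
    angles_deg: candidate angles; spacing_wl: element spacing in wavelengths.
    """
    m = np.arange(snapshot.shape[0])
    theta = np.deg2rad(np.asarray(angles_deg))
    steering = np.exp(2j * np.pi * spacing_wl * np.outer(np.sin(theta), m))  # (A, M)
    return np.abs(steering.conj() @ snapshot) ** 2 / snapshot.shape[0]


def estimate_angle(snapshot, angles_deg):
    """Angle index = position of the beamforming spectrum peak."""
    return int(np.argmax(bartlett_spectrum(snapshot, angles_deg)))
```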

[0158] Step 806: Establish a spatial coordinate system based on the radar installation method, and obtain the four-dimensional coordinates of the effective target points based on the range-Doppler index and the angle index, generating a target four-dimensional point cloud containing range information, velocity information and angle information;
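Step 806 can be sketched as a spherical-to-Cartesian conversion followed by a correction for how the radar is mounted. The frame convention (y along boresight, x right, z up), the mounting height and the downward tilt are all illustrative assumptions standing in for "the radar installation method"; each output row is (x, y, z, v).

```python
import numpy as np


def to_cartesian_4d(r, v, az_deg, el_deg, mount_height_m=0.0, tilt_deg=0.0):
    """Build 4-D points (x, y, z, v) from range, radial velocity, azimuth, elevation."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    x = r * np.cos(el) * np.sin(az)
    y = r * np.cos(el) * np.cos(az)
    z = r * np.sin(el)
    t = np.deg2rad(tilt_deg)                     # radar pitched down by tilt_deg
    y_w = y * np.cos(t) + z * np.sin(t)          # rotate about the x-axis into a level frame
    z_w = -y * np.sin(t) + z * np.cos(t) + mount_height_m
    return np.stack([x, y_w, z_w, np.asarray(v, dtype=float)], axis=-1)
```

For example, a point 2 m along boresight of a radar mounted 3 m high and tilted straight down lands at z = 1 m.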

[0159] Step 807: Normalize the radial velocity information of the effective target points to obtain the normalized velocity vector;
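The text does not define how a scalar radial velocity becomes a "velocity vector". One assumed construction is to place the radial velocity along each point's line-of-sight unit vector and then unit-normalize it, which keeps the line-of-sight direction and the sign of the motion; points with near-zero velocity are mapped to the zero vector here.

```python
import numpy as np


def normalized_velocity_vectors(xyz, v_radial, eps=1e-9):
    """Unit-normalise per-point velocity vectors (assumed construction).

    xyz: (N, 3) point positions in the radar frame; v_radial: (N,) radial velocities.
    """
    los = xyz / (np.linalg.norm(xyz, axis=1, keepdims=True) + eps)  # line-of-sight unit vectors
    vel = v_radial[:, None] * los                                    # radial velocity vectors
    norm = np.linalg.norm(vel, axis=1, keepdims=True)
    return np.where(norm > eps, vel / np.maximum(norm, eps), 0.0)
```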

[0160] Step 808: Construct a distance metric based on the cosine similarity of each normalized velocity vector, and perform clustering processing on the effective target points according to the distance metric to obtain the clustering processing result;

[0161] Step 809: Based on the clustering results, divide the effective target points to obtain multiple human body topology point cloud blocks;
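Steps 808 and 809 can be sketched with the cosine distance d = 1 - cos(u_i, u_j) between normalized velocity vectors and a density-based clusterer. The clustering algorithm is not named in the text; plain DBSCAN on the precomputed distance matrix is used here as a stand-in, with hypothetical `eps` and `min_pts` values, and noise points are dropped when forming blocks.

```python
import numpy as np


def cosine_distance_matrix(u):
    """d_ij = 1 - cos(u_i, u_j) for unit vectors u of shape (N, 3)."""
    return 1.0 - np.clip(u @ u.T, -1.0, 1.0)


def dbscan_precomputed(dist, eps, min_pts):
    """Plain DBSCAN on a precomputed distance matrix; label -1 marks noise."""
    n = dist.shape[0]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbours[i]) < min_pts:
            continue                                  # not a core point (may become a border point)
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster                   # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbours[j]) >= min_pts:
                    queue.extend(neighbours[j])       # expand only from core points
        cluster += 1
    return labels


def partition_blocks(points, labels):
    """Group points by cluster label into topology point cloud blocks (noise dropped)."""
    return [points[labels == k] for k in np.unique(labels) if k >= 0]
```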

[0162] Step 810: Input the human body topology point cloud blocks into the human skeleton estimation network to perform human skeleton estimation and obtain the human skeleton estimation results.

[0163] For the detailed process of each of steps 801 to 810, refer to the relevant descriptions above; it is not repeated here.

[0164] Referring to Figure 9, Figure 9 shows a radar-based human skeleton estimation system provided by an embodiment of this application. This system can be used to implement the human skeleton estimation method of the embodiments of this application. The radar-based human skeleton estimation system mainly includes:

[0165] The point cloud generation module 901 is used to perform target detection on the collected radar echo data and generate a four-dimensional point cloud of the target containing distance information, velocity information and angle information based on the target detection results.

[0166] The point cloud partitioning module 902 is used to divide the target four-dimensional point cloud into multiple human topological point cloud blocks based on the similarity of the velocity directions of each point in the target four-dimensional point cloud.

[0167] The skeleton estimation module 903 is used to input human body topology point cloud blocks into the human skeleton estimation network to perform human skeleton estimation and obtain human skeleton estimation results.

[0168] In some embodiments, when performing target detection on the collected radar echo data and generating, from the target detection results, a target four-dimensional point cloud containing distance, velocity, and angle information, the point cloud generation module 901 is used to: perform fast Fourier transform processing on the collected single-frame radar echo data to obtain short-timescale point cloud information; perform non-uniform Fourier transform processing on the collected multi-frame radar echo data to obtain long-timescale point cloud information; and perform target detection based on the short-timescale and long-timescale point cloud information and generate the target four-dimensional point cloud from the target detection results.

[0169] Furthermore, in some embodiments, when performing target detection based on the short-timescale and long-timescale point cloud information and generating the target four-dimensional point cloud containing range, velocity, and angle information from the target detection results, the point cloud generation module 901 specifically: performs target detection on the short-timescale and long-timescale point cloud information based on the constant false alarm rate (CFAR) algorithm to obtain target detection results, wherein the target detection results include valid target points; extracts the range-Doppler index of the valid target points, wherein the range-Doppler index includes the radial range and radial velocity information of the valid target points; performs beamforming based on the short-timescale and long-timescale point cloud information corresponding to the range-Doppler index to extract the angle index of the valid target points, wherein the angle index includes the azimuth and elevation angle information of the valid target points; and establishes a spatial coordinate system based on the radar installation method and obtains the four-dimensional coordinates of the valid target points from the range-Doppler index and the angle index, thereby generating the target four-dimensional point cloud containing range, velocity, and angle information.

[0170] In some embodiments, when dividing the target four-dimensional point cloud into multiple human topological point cloud blocks based on the similarity of the velocity directions of the points in the target four-dimensional point cloud, the point cloud partitioning module 902 is used to: perform unit normalization on the radial velocity information of the effective target points to obtain normalized velocity vectors; construct a distance metric based on the cosine similarity between the normalized velocity vectors and cluster the effective target points according to this metric to obtain a clustering result; and partition the effective target points based on the clustering result to obtain multiple human topological point cloud blocks.

[0171] In some embodiments, the human skeleton estimation network in the human skeleton estimation system includes a point cloud block topology modeling module, a multi-level feature fusion module, and a skeleton regression module connected in sequence. The point cloud block topology modeling module is used to dynamically model the topological structure of the input human topology point cloud blocks and output point-level features with enhanced topological semantics; the multi-level feature fusion module is used to adaptively fuse the point-level features with enhanced topological semantics, the original point cloud geometric features, and the global context features to output point cloud fusion features; and the skeleton regression module is used to map the point cloud fusion features to human skeleton joint coordinates to obtain the human skeleton estimation result.

[0172] In some embodiments, the point cloud block topology modeling module in the human skeleton estimation system includes: a geometry-aware soft clustering unit, used to assign the points in a human topology point cloud block to clusters by soft allocation and aggregate them to obtain cluster-level features; a cluster-level spatial geometry-aware attention unit, used to enhance the cluster-level features through an attention mechanism based on the spatial geometric relationships between clusters; and a feature diffusion unit, used to diffuse the enhanced cluster-level features back to the point level based on the spatial proximity between points and clusters, generating point-level features with enhanced topological semantics.
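The cluster-level spatial geometry-aware attention unit can be pictured as multi-head attention over clusters whose logits receive an additive geometric bias: an MLP on inter-cluster Euclidean distance (the geometric affinity) plus a projection of relative direction (the spatial orientation term). The additive-bias design, the residual output and every weight name in `params` are hypothetical choices for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def geometry_aware_attention(feats, centers, params, n_heads):
    """Multi-head attention over clusters with a geometric bias (assumed design).

    feats: (K, C) cluster features; centers: (K, 3) cluster centres.
    params (hypothetical): Wq, Wk, Wv, Wo (C, C); mlp_w1 (1, H), mlp_w2 (H, n_heads)
    for the distance MLP; w_dir (3, n_heads) for relative-direction terms.
    """
    K, C = feats.shape
    assert C % n_heads == 0, "channels must split evenly across heads"
    dh = C // n_heads
    rel = centers[None, :, :] - centers[:, None, :]                      # (K, K, 3) offsets
    dist = np.linalg.norm(rel, axis=-1, keepdims=True)                   # (K, K, 1) distances
    direction = rel / (dist + 1e-9)                                      # unit orientation vectors
    hidden = np.maximum(dist @ params["mlp_w1"], 0.0)                    # distance MLP, layer 1
    geo_bias = hidden @ params["mlp_w2"] + direction @ params["w_dir"]   # (K, K, n_heads)

    def split(x):                                                        # (K, C) -> (heads, K, dh)
        return x.reshape(K, n_heads, dh).transpose(1, 0, 2)

    q, k, v = (split(feats @ params[n]) for n in ("Wq", "Wk", "Wv"))
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(dh)                      # (heads, K, K)
    attn = softmax(logits + geo_bias.transpose(2, 0, 1), axis=-1)        # geometry-constrained weights
    out = (attn @ v).transpose(1, 0, 2).reshape(K, C) @ params["Wo"]
    return feats + out, attn                                             # residual enhancement
```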

[0173] In some embodiments, before inputting the human topology point cloud blocks into the human skeleton estimation network for human skeleton estimation, the skeleton estimation module 903 is further used to train the original human skeleton estimation network with a joint loss function to obtain an optimized human skeleton estimation network, wherein the joint loss function includes a first loss term that constrains the coordinate error of the human joints and a second loss term that constrains the coordinate error of each part of the human body.
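A minimal sketch of such a joint loss, assuming the first term is the mean per-joint Euclidean error and the second term is the mean error between predicted and ground-truth body-part centroids. The joint-to-part grouping `PARTS` and the weight `lam` are hypothetical; the text does not define the parts or the weighting.

```python
import numpy as np

# Hypothetical grouping of joint indices into body parts (not given in the text).
PARTS = {"torso": [0, 1, 2], "left_arm": [3, 4], "right_arm": [5, 6]}


def joint_loss(pred, gt, parts=PARTS, lam=0.5):
    """Joint loss = L_joint + lam * L_part for (J, 3) predicted/true joints."""
    l_joint = np.linalg.norm(pred - gt, axis=1).mean()           # per-joint coordinate error
    part_err = [np.linalg.norm(pred[idx].mean(axis=0) - gt[idx].mean(axis=0))
                for idx in parts.values()]                      # per-part centroid error
    l_part = float(np.mean(part_err))
    return l_joint + lam * l_part, l_joint, l_part
```

Shifting every predicted joint by 1 m gives L_joint = L_part = 1, so the total is 1.5 with `lam = 0.5`.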

[0174] In detail, each module in the radar-based human skeleton estimation system provided in this application operates in the same way as the human skeleton estimation method shown in Figure 1 and described above, and produces the same technical effects, so it is not elaborated here.

[0175] Referring to Figure 10, Figure 10 is a block diagram of a radar provided in an embodiment of this application.

[0176] As shown in Figure 10, an embodiment of this application also provides a radar that can be used to implement the human skeleton estimation method in the foregoing embodiments. The radar includes a memory 1001, at least one processor 1002, a signal generator 1003, and a signal receiver 1004. The memory 1001 is used to store at least one program, and when the at least one program is executed by the at least one processor 1002, the at least one processor 1002 performs the human skeleton estimation method provided in the embodiments of this application.

[0177] Referring to Figure 11, Figure 11 is a block diagram of a computer-readable storage medium provided in an embodiment of this application.

[0178] As shown in Figure 11, an embodiment of this application also provides a computer-readable storage medium 1100 storing executable instructions 1110. When the executable instructions 1110 are executed, the human skeleton estimation method provided in the embodiments of this application is performed.

[0179] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art.

[0180] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., Digital Video Disk, DVD), or a semiconductor medium (e.g., Solid State Disk).

[0181] It should be noted that the embodiments in this application are described progressively, each focusing on its differences from the others; for the same or similar parts, the embodiments may be referred to one another. Because the product embodiments are similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.

[0182] It should also be noted that, in this application, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0183] The above description of the disclosed embodiments enables those skilled in the art to implement or use the content of this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined in this application may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A radar-based human skeleton estimation method, characterized by comprising: performing target detection on the collected radar echo data, and generating, based on the target detection results, a target four-dimensional point cloud containing distance, velocity and angle information; dividing the target four-dimensional point cloud into multiple human body topology point cloud blocks based on the similarity of the velocity directions of the points in the target four-dimensional point cloud; and inputting the human body topology point cloud blocks into a human skeleton estimation network to perform human skeleton estimation and obtain a human skeleton estimation result; wherein the human skeleton estimation network includes a point cloud block topology modeling module, a multi-level feature fusion module, and a skeleton regression module connected in sequence; the point cloud block topology modeling module is used to perform dynamic topology modeling on the input human body topology point cloud blocks and output point-level features with enhanced topological semantics; the multi-level feature fusion module is used to adaptively fuse the point-level features with enhanced topological semantics, the original point cloud geometric features, and the global context features to output point cloud fusion features; and the skeleton regression module is used to map the point cloud fusion features into human skeleton joint coordinates to obtain the human skeleton estimation result.

2. The human skeleton estimation method according to claim 1, characterized in that performing target detection on the collected radar echo data and generating, based on the target detection results, a target four-dimensional point cloud containing range, velocity, and angle information includes: performing fast Fourier transform on the collected single-frame radar echo data to obtain short-timescale point cloud information; performing non-uniform Fourier transform processing on the collected multi-frame radar echo data to obtain long-timescale point cloud information; and performing target detection based on the short-timescale point cloud information and the long-timescale point cloud information, and generating, based on the target detection results, a target four-dimensional point cloud containing distance information, velocity information and angle information.

3. The human skeleton estimation method according to claim 2, characterized in that performing target detection based on the short-timescale point cloud information and the long-timescale point cloud information, and generating, based on the target detection results, a target four-dimensional point cloud containing distance, velocity, and angle information, includes: performing target detection on the short-timescale point cloud information and the long-timescale point cloud information based on the constant false alarm rate algorithm to obtain target detection results, wherein the target detection results include valid target points; extracting the range-Doppler index of the valid target points, wherein the range-Doppler index includes the radial distance information and radial velocity information of the valid target points; performing beamforming based on the short-timescale point cloud information and the long-timescale point cloud information corresponding to the range-Doppler index to extract the angle index of the valid target points, wherein the angle index includes the azimuth and elevation angle information of the valid target points; and establishing a spatial coordinate system based on the radar installation method, and obtaining the four-dimensional coordinates of the valid target points based on the range-Doppler index and the angle index, thereby generating the target four-dimensional point cloud containing range information, velocity information and angle information.

4. The human skeleton estimation method according to claim 3, characterized in that dividing the target four-dimensional point cloud into multiple human body topological point cloud blocks based on the similarity of the velocity directions of the points in the target four-dimensional point cloud includes: normalizing the radial velocity information of the valid target points to obtain normalized velocity vectors; constructing a distance metric based on the cosine similarity of the normalized velocity vectors, and clustering the valid target points according to the distance metric to obtain a clustering result; and dividing the valid target points based on the clustering result to obtain multiple human body topological point cloud blocks.

5. The human skeleton estimation method according to claim 1, characterized in that the point cloud block topology modeling module includes: a geometry-aware soft clustering unit, used to assign the points in the human body topological point cloud block to clusters by soft allocation and aggregate them to obtain cluster-level features; a cluster-level spatial geometry-aware attention unit, used to enhance the cluster-level features through an attention mechanism based on the spatial geometric relationships between clusters; and a feature diffusion unit, used to diffuse the enhanced cluster-level features back to the point level based on the spatial proximity between points and clusters, thereby generating the point-level features with enhanced topological semantics.

6. The human skeleton estimation method according to claim 1, characterized in that, before the human body topology point cloud blocks are input into the human skeleton estimation network for human skeleton estimation, the method further includes: training the original human skeleton estimation network based on a joint loss function to obtain the optimized human skeleton estimation network, wherein the joint loss function includes a first loss term used to constrain the coordinate error of the human joints and a second loss term used to constrain the coordinate error of each part of the human body.

7. A radar-based human skeleton estimation system, characterized by comprising: a point cloud generation module, used to perform target detection on the collected radar echo data and generate, based on the target detection results, a target four-dimensional point cloud containing distance, velocity and angle information; a point cloud partitioning module, used to divide the target four-dimensional point cloud into multiple human body topology point cloud blocks based on the similarity of the velocity directions of the points in the target four-dimensional point cloud; and a skeleton estimation module, used to input the human body topology point cloud blocks into a human skeleton estimation network to perform human skeleton estimation and obtain a human skeleton estimation result; wherein the human skeleton estimation network includes a point cloud block topology modeling module, a multi-level feature fusion module, and a skeleton regression module connected in sequence; the point cloud block topology modeling module is used to perform dynamic topology modeling on the input human body topology point cloud blocks and output point-level features with enhanced topological semantics; the multi-level feature fusion module is used to adaptively fuse the point-level features with enhanced topological semantics, the original point cloud geometric features, and the global context features to output point cloud fusion features; and the skeleton regression module is used to map the point cloud fusion features into human skeleton joint coordinates to obtain the human skeleton estimation result.

8. A radar, characterized by comprising a memory and a processor, wherein the processor is used to execute a computer program stored in the memory, and when the processor executes the computer program, it implements the steps of the human skeleton estimation method according to any one of claims 1 to 6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the human skeleton estimation method according to any one of claims 1 to 6.