A light-weight pose estimation method based on frequency domain attention mechanism

A lightweight pose estimation method based on frequency domain attention mechanism solves the problem of drastic performance drop in human pose estimation under low resolution conditions, and achieves efficient human pose estimation on mobile devices with higher network accuracy and less computational resource consumption.

CN119540996B · Active · Publication Date: 2026-06-26 · NANJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NANJING UNIV OF POSTS & TELECOMM
Filing Date
2024-11-13
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing human pose estimation methods suffer from a sharp performance drop at low resolution, failing to meet the deployment requirements of mobile devices, and consume enormous computational resources.

Method used

A lightweight pose estimation method based on frequency domain attention mechanism is adopted. By grouping and compressing feature data and restoring frequency matching degree, a lightweight resolution human pose estimation network is constructed, including a frequency domain attention module, a channel attention module, and a spatial attention module. Low-pass and high-pass filtering modules are used to extract low- and high-frequency features, thereby reducing the number of network parameters.

Benefits of technology

While reducing network parameters, it mitigates the information loss caused by feature compression, improves network accuracy, and is suitable for mobile devices.

✦ Generated by Eureka AI based on patent content.


Abstract

The application belongs to the technical field of human pose estimation, and discloses a lightweight pose estimation method based on a frequency domain attention mechanism, which first compresses the input feature information through grouped convolution, and then restores the compressed feature information through three attention modules. Firstly, the frequency attention module extracts different frequency components by using a filtering module, and selects important frequency components by softmax normalization key value selection; secondly, the channel and spatial attention modules better capture global information and accurately locate key features. The calculation complexity of the frequency domain attention mechanism module proposed in the application is linearly related to the image resolution, and the calculation complexity is smaller, so that the problem of rapid expansion of the calculation amount with the increase of the resolution can be solved, and meanwhile, the module can reduce the information loss caused by the feature compression of the grouped convolution, so as to further improve the human pose estimation accuracy in the lightweight scene.
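The linear-complexity statement can be illustrated with a standard operation count. This is a sketch under assumptions: $n = h \times w$ is the number of spatial positions and $d$ is the per-group channel width (both symbols are introduced here, not taken from the patent), and the module is assumed to multiply the softmax-normalized keys with the values before applying the queries, as Steps 2.3.2 and 2.3.3 of the description suggest. The two expressions below are different attention formulations, not equal quantities; only their costs are compared.

```latex
\underbrace{\operatorname{softmax}\!\bigl(QK^{\top}\bigr)\,V}_{\text{cost } O(n^{2}d)}
\qquad\text{versus}\qquad
\underbrace{Q\,\bigl(\operatorname{softmax}(K)^{\top}V\bigr)}_{\text{cost } O(nd^{2})},
\qquad Q, K, V \in \mathbb{R}^{n \times d},\quad n = h \times w .
```

With $d$ fixed by the network design, the second form grows linearly in $n$, and therefore linearly in the number of pixels of the feature map.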

Description

Technical Field

[0001] This invention belongs to the field of human pose estimation technology, specifically relating to a lightweight pose estimation method based on a frequency domain attention mechanism.

Background Technology

[0002] Human pose estimation refers to the process of reconstructing human pose by detecting and locating key points on the human body. This task is one of the major challenges in the field of computer vision, playing a crucial role in many computer vision applications, such as action recognition, intelligent video surveillance, and human-computer interaction. Currently, various human pose estimation methods have been proposed and have achieved good results under high-resolution data conditions. However, as image resolution decreases, the loss of image information leads to a sharp drop in the performance of existing models, making them unable to meet the growing demands of production and applications. Pose estimation algorithms extract multi-scale features and design complex decoder heads to establish relationships between these features. These improvements come at the cost of a huge number of parameters and high computational cost.

[0003] Currently, top-down methods have fully met the accuracy requirements for industrial applications. However, since human pose estimation is a pixel-level dense prediction task, it requires maintaining a high spatial dimension when extracting features through neural networks. As a result, deep neural networks require huge computational resources. The stringent computational resource requirements of these large networks make it difficult to deploy them on increasingly common mobile devices.

Summary of the Invention

[0004] To address the aforementioned technical problems, this invention provides a lightweight pose estimation method based on a frequency domain attention mechanism. This method proposes a feature data grouping and compression approach, and then uses the matching degree of different frequencies to restore the feature information, thereby constructing a lightweight resolution human pose estimation method, which solves the technical problem of the huge computational resources required by deep neural networks.

[0005] To achieve the above objectives, the present invention is implemented through the following technical solution:

[0006] This invention is a lightweight pose estimation method based on a frequency domain attention mechanism. The method is implemented with a lightweight feature extraction network $N_x$ based on the frequency domain attention mechanism. The network $N_x$ includes convolutional layers, grouped positional encoding, a lightweight human pose estimation network module $M_x$, and a fully connected layer. The method includes the following steps:

[0007] Step 1: Select the MSCOCO human pose estimation dataset as the training dataset, and obtain the input feature map by cropping the image from the human detection box in the labeled data;

[0008] Step 2: Construct a lightweight human pose estimation network module $M_x$ based on a frequency domain attention mechanism. The module $M_x$ includes a frequency domain attention module, a channel attention module, and a spatial attention module.

[0009] The frequency domain attention module includes four dynamic low-pass filter modules, four dynamic high-pass filter modules, and a frequency matching kernel module;

[0010] Each of the dynamic low-pass filter modules includes an average pooling layer, a bilinear interpolation layer, and a ReLU layer. The pooling sizes of the four dynamic low-pass filter modules are 1×1, 2×2, 3×3, and 6×6, respectively.

[0011] Each of the dynamic high-pass filter modules includes a grouped deep convolutional layer, and the four grouped deep convolutional layers are respectively composed of convolutional layers with kernels of 1×1×1, 3×3×3, 5×5×5 and 7×7×7;

[0012] The frequency matching kernel module includes a linear layer and a frequency domain distributed computation layer;

[0013] Step 3: Based on the lightweight human pose estimation network module $M_x$ constructed in Step 2, construct the lightweight feature extraction network $N_x$ based on the frequency domain attention mechanism, and generate visual features from the input feature map obtained in Step 1 through convolutional encoding.

[0014] Step 4: Use the constructed training dataset to train the lightweight human pose estimation network module $M_x$ based on the frequency domain attention mechanism;

[0015] Step 5: Use the trained lightweight human pose estimation network based on frequency domain attention mechanism to estimate the pose of the human image and obtain the corresponding human pose estimation results.

[0016] A further improvement of the present invention is that, in Step 2, constructing the lightweight human pose estimation network module $M_x$ based on the frequency domain attention mechanism includes frequency domain attention module extraction, channel attention module extraction, and spatial attention module extraction, specifically including the following steps:

[0017] Step 2.1, Frequency Domain Attention Module Extraction: The input feature map obtained in Step 1 is passed through the convolutional layer of the frequency matching kernel module to obtain the input feature vector x. The input feature vector x is then input into the linear layer of the frequency matching kernel module to obtain the query tensor Q, key tensor K, and value tensor V. By calculating the attention matrix of the query tensor Q, key tensor K, and value tensor V, the frequency domain attention feature tensor FF is obtained. Simultaneously, the low-frequency feature LF is extracted through the dynamic low-pass filtering module, and the high-frequency feature HF is extracted through the dynamic high-pass filtering module. The low-frequency feature LF, the high-frequency feature HF, and the frequency domain attention feature tensor FF are added together to obtain the output $M_f$ of the frequency domain attention module;

[0018] Step 2.2, Channel Attention Module Extraction: The features output in Step 2.1 are compressed along the spatial dimension using global average pooling and global max pooling, generating global features that describe each channel. These global features are then fed into a shared multilayer perceptron (MLP) to learn the attention weight of each channel. The module outputs the global feature $M_c$;

[0019] Step 2.3, Spatial Attention Module Extraction: Global average pooling and global max pooling are applied to the global feature $M_c$ output in Step 2.2 to generate two 2D feature maps. These two 2D feature maps are concatenated and processed through a convolutional layer to generate spatial attention weights. The module outputs the spatial attention weight feature $M_s$;

[0020] Step 2.4: Construct the lightweight human pose estimation network module $M_x$ based on the frequency domain attention mechanism:

[0021] $$M_x = M_f + M_c \times M_f + M_s \times M_f$$

[0022] A further improvement of the present invention is that step 2.1, extracting the low-frequency feature LF through the dynamic low-pass filter module, specifically includes the following steps:

[0023] Step 2.1.1: Obtain the height and width of the input feature map;

[0024] Step 2.1.2: Divide the input feature map into m parts according to the number of channels, m≥4, input each segmented feature map into the corresponding filter, and use bilinear interpolation to upsample the output feature map to the original size;

[0025] Step 2.1.3: Concatenate the m upsampled feature maps along the channel dimension, and activate them through the ReLU layer of the dynamic low-pass filter module. The final output is the low-frequency feature LF after low-pass filtering, expressed as follows:

[0026] $$LF = \mathrm{bilinear}\bigl(\mathrm{AvgPooling}_{k \times k}(V)\bigr)$$

[0027] Where V is the value tensor and k is the cutoff frequency, and the cutoff frequency is different for each channel.
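Steps 2.1.1 to 2.1.3 can be sketched in plain Python on nested lists. This is a minimal illustration, not the patented implementation: it assumes that a "pooling size" of k×k means adaptive average pooling down to a k×k grid, that bilinear upsampling aligns the corner samples, and that the channel count divides evenly into the four groups. The function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def adaptive_avg_pool(fmap, k):
    """Average-pool an h x w map (list of rows) down to a k x k grid (assumed meaning of 'pooling size')."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(k):
        r0, r1 = (i * h) // k, ceil_div((i + 1) * h, k)
        row = []
        for j in range(k):
            c0, c1 = (j * w) // k, ceil_div((j + 1) * w, k)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled

def bilinear_upsample(grid, h, w):
    """Bilinearly resize a small grid back to h x w (corner-aligned sampling, an assumption)."""
    gh, gw = len(grid), len(grid[0])
    out = []
    for r in range(h):
        y = r * (gh - 1) / (h - 1) if h > 1 else 0.0
        y0 = int(y)
        y1, fy = min(y0 + 1, gh - 1), y - y0
        row = []
        for c in range(w):
            x = c * (gw - 1) / (w - 1) if w > 1 else 0.0
            x0 = int(x)
            x1, fx = min(x0 + 1, gw - 1), x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

def low_pass(fmap, k):
    """LF = ReLU(bilinear(AvgPooling_kxk(V))) for one channel."""
    h, w = len(fmap), len(fmap[0])                       # Step 2.1.1
    up = bilinear_upsample(adaptive_avg_pool(fmap, k), h, w)
    return [[max(0.0, v) for v in row] for row in up]    # ReLU

def dynamic_low_pass(channels, sizes=(1, 2, 3, 6)):
    """Split channels into len(sizes) groups, filter each group with its own k, concatenate."""
    group = len(channels) // len(sizes)                  # assumes an even split
    out = []
    for g, k in enumerate(sizes):                        # Step 2.1.2
        for fmap in channels[g * group:(g + 1) * group]:
            out.append(low_pass(fmap, k))
    return out                                           # Step 2.1.3 (list order = channel order)
```

A smaller k averages over larger regions and so keeps only coarser, lower-frequency structure; k = 1 reduces a channel to its mean value.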

[0028] A further improvement of the present invention is that, in step 2.1, the high-frequency feature HF is extracted through a dynamic high-pass filter module, specifically including the following steps:

[0029] Step 2.2.1: Divide the value tensor V into n groups and use the grouped deep convolutional layer in the dynamic high-pass filter module to simulate the tensor features of the cutoff frequency in different high-pass filters;

[0030] Step 2.2.2: Concatenate the tensor features obtained in Step 2.2.1 to obtain the high-frequency feature (HF):

[0031] $$HF = \mathrm{depthwise\_conv}_{k \times k}(V)$$

[0032] A further improvement of the present invention is that, in Step 2.1, the frequency matching kernel module $A_{i,j}$ is represented as:

[0033]

[0034] Where $k_i$ represents the i-th frequency component of the key tensor K, and $v_j$ represents the j-th frequency component of the value tensor V;

[0035] Obtaining the frequency domain attention feature tensor FF through the frequency matching kernel module $A_{i,j}$ involves the following steps:

[0036] Step 2.3.1: Divide the input feature vector x into multiple groups $x_i$, and map each group to a query tensor $Q_i$, a key tensor $K_i$, and a value tensor $V_i$;

[0037] Step 2.3.2: Apply a softmax operation to the key tensor $K_i$ to obtain an intermediate distribution vector, then multiply this vector by the value tensor $V_i$ to obtain an intermediate product;

[0038] Step 2.3.3: Use the einsum function to multiply the query tensor $Q_i$ by the intermediate product from Step 2.3.2, yielding the final attention matrix $A_i$;

[0039] Step 2.3.4: Concatenate the attention features obtained from each group to obtain the frequency domain attention feature tensor FF.
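Steps 2.3.1 to 2.3.4 can be sketched in plain Python as follows. This is an illustration under assumptions, not the patented kernel: the softmax is assumed to run over the token (spatial position) axis of each key column, the tensors are assumed to be already projected into per-group Q, K, V matrices of shape n × d, and explicit loops stand in for the einsum call. The function names are hypothetical.

```python
import math

def softmax_over_tokens(K):
    """Normalize each feature column of K (n tokens x d features) over the token axis (assumed axis)."""
    n, d = len(K), len(K[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [K[t][j] for t in range(n)]
        peak = max(col)                        # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        for t in range(n):
            out[t][j] = exps[t] / total
    return out

def group_attention(Q, K, V):
    """Step 2.3.2: context = softmax(K)^T V (d x dv). Step 2.3.3: output = Q context (n x dv)."""
    n, d, dv = len(Q), len(Q[0]), len(V[0])
    P = softmax_over_tokens(K)
    context = [[sum(P[t][i] * V[t][j] for t in range(n)) for j in range(dv)]
               for i in range(d)]
    return [[sum(Q[t][i] * context[i][j] for i in range(d)) for j in range(dv)]
            for t in range(n)]

def frequency_attention(groups):
    """Step 2.3.4: run each (Q_i, K_i, V_i) group and concatenate along the feature axis."""
    outs = [group_attention(Q, K, V) for Q, K, V in groups]
    n = len(outs[0])
    return [[v for o in outs for v in o[t]] for t in range(n)]
```

Because the d × dv context matrix is built once and reused for every token, the work grows linearly with the number of tokens n rather than quadratically.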

[0040] A further improvement of the present invention is that step 4 specifically includes the following steps:

[0041] Step 4.1: Construct a human pose estimator, encode the visual features extracted in step 3, and obtain H heatmaps of human joints, where H represents the number of predefined human joint categories in the dataset.

[0042] Step 4.2: Calculate the loss using the mean square error loss function on the heatmap of H human joints obtained in Step 4.1 and the Gaussian heatmap constructed based on the annotation information, and optimize the network.

[0043] The mean squared error loss function $L_{MSE}$ is specifically as follows:

[0044]

[0045] Where H represents the number of predefined human joint categories in the dataset, and the two remaining terms of the formula represent, respectively, the true heatmap and the predicted heatmap corresponding to the h-th joint.
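For orientation, a commonly used form of a heatmap mean squared error loss is sketched below. The symbols $G_h$ (true heatmap) and $\hat{G}_h$ (predicted heatmap) and the $1/H$ normalization are assumptions introduced for illustration; they are not recovered from the patent formula and the patent's own expression may differ.

```latex
L_{MSE} = \frac{1}{H} \sum_{h=1}^{H} \bigl\lVert \hat{G}_h - G_h \bigr\rVert_2^2
```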

[0046] The beneficial effects of this invention are as follows: by exploring and constructing the correlation between human images with features of different frequency components, this invention proposes a method that groups and compresses feature data and then uses the matching degree of different frequencies to restore feature information, thereby constructing a lightweight, high-resolution human pose estimation method.

[0047] This invention can significantly reduce the number of network parameters while mitigating information loss caused by compression features, and has higher network accuracy compared to traditional lightweight networks such as MobileNet.

[0048] This invention introduces a lightweight human pose estimation network module with frequency domain attention. It performs multi-segment compression of features through high-pass and low-pass filtering modules with different cutoff frequencies. Compared with traditional group compression methods, this module can retain more useful semantics of the corresponding frequencies. The grouped feature attention mechanism can significantly reduce the computational overhead introduced by multi-head attention in the Transformer network.
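The parameter savings from grouping can be made concrete with the standard weight count of a 2D convolution. The sketch below is generic arithmetic, not a count of the patented network; the channel widths in the usage note are hypothetical.

```python
def conv_params(c_in, c_out, k, groups=1, bias=False):
    """Number of learnable parameters in a k x k 2D convolution with the given grouping."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    weights = (c_in // groups) * c_out * k * k   # each output channel sees c_in / groups inputs
    return weights + (c_out if bias else 0)
```

For a hypothetical 64-to-64-channel 3×3 layer, `conv_params(64, 64, 3)` gives 36864 weights, `conv_params(64, 64, 3, groups=4)` gives 9216, and the fully depthwise case `conv_params(64, 64, 3, groups=64)` gives 576, so the weight count falls in proportion to the number of groups.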

[0049] This invention solves the problem of information loss caused by feature compression. Compared with methods such as HRFormer, this module only requires a small number of parameters to reconstruct the details of data features, while achieving better model performance and accuracy.

Attached Figure Description

[0050] Figure 1 is a flowchart of the present invention.

[0051] Figure 2 is a flowchart of the frequency domain attention module of the present invention.

[0052] Figure 3 is a flowchart of the lightweight human pose estimation network module based on the frequency domain attention mechanism of the present invention.

[0053] Figure 4 is a schematic diagram of the lightweight feature extraction network based on the frequency domain attention mechanism of the present invention.

Detailed Implementation

[0054] The embodiments of the present invention will be disclosed below with reference to the drawings. For clarity, many practical details will be described in the following description. However, it should be understood that these practical details are not intended to limit the invention. That is, in some embodiments of the invention, these practical details are not essential.

[0055] As shown in Figure 1, this invention is a lightweight human pose estimation method based on a frequency domain attention mechanism. It improves upon the basic human pose estimation process, minimizing the original backbone network while preserving useful information to the greatest extent possible. This enables human pose estimation in scenarios such as mobile devices. The method is implemented with a lightweight feature extraction network $N_x$ based on the frequency domain attention mechanism. The network $N_x$ includes convolutional layers, grouped positional encoding, a lightweight human pose estimation network module $M_x$, and a fully connected layer. The method includes the following steps:

[0056] Step 1: Select the MSCOCO human pose estimation dataset as the training dataset, and obtain the input feature map by cropping the image from the human detection box in the labeled data;

[0057] Step 2: Construct a lightweight human pose estimation network module $M_x$ based on a frequency domain attention mechanism. The module $M_x$ includes a frequency domain attention module, a channel attention module, and a spatial attention module.

[0058] The frequency domain attention module includes four dynamic low-pass filter modules, four dynamic high-pass filter modules, and a frequency matching kernel module. Each dynamic low-pass filter module includes an average pooling layer, a bilinear interpolation layer, and a ReLU layer. The pooling sizes of the four dynamic low-pass filter modules are 1×1, 2×2, 3×3, and 6×6, respectively. Each dynamic high-pass filter module includes a grouped deep convolutional layer. The four grouped deep convolutional layers are composed of convolutional layers with kernels of 1×1×1, 3×3×3, 5×5×5, and 7×7×7, respectively. The frequency matching kernel module includes a linear layer and a frequency domain distributed computation layer.

[0059] The dynamic low-pass filtering module performs average pooling and bilinear interpolation of different sizes on m grouped features to extract low-frequency semantic information. This yields m low-frequency components. The dynamic high-pass filtering module divides the key tensor K and value tensor V into n groups. For each group, a convolutional layer with a different kernel is used to simulate the cutoff frequencies in different high-pass filters. Matrix multiplication of the obtained high-frequency features reduces high-frequency noise in the human body. This yields n high-frequency components. The frequency domain attention module uses a fixed-size matching kernel to represent the correspondence between different frequency components. For the feature groups extracted by the convolutional layer, the key tensor K and value tensor V of the frequency components are calculated, and the softmax operation is used to normalize the keys between frequency components. Important frequency components are selected by querying the frequency matching kernel, reducing feature information loss caused by grouping calculations.

[0060] In Step 2, constructing the lightweight human pose estimation network module $M_x$ based on the frequency domain attention mechanism includes frequency domain attention module extraction, channel attention module extraction, and spatial attention module extraction, specifically including the following steps:

[0061] Step 2.1, Frequency Domain Attention Module Extraction: The input feature map obtained in Step 1 is passed through the convolutional layer of the frequency matching kernel module to obtain the input feature vector x. The input feature vector x is then input into the linear layer of the frequency matching kernel module to obtain the query tensor Q, key tensor K, and value tensor V. By calculating the attention matrix of the query tensor Q, key tensor K, and value tensor V, the frequency domain attention feature tensor FF is obtained. Simultaneously, the low-frequency feature LF is extracted through the dynamic low-pass filtering module, and the high-frequency feature HF is extracted through the dynamic high-pass filtering module. The low-frequency feature LF, the high-frequency feature HF, and the frequency domain attention feature tensor FF are added together to obtain the output $M_f$ of the frequency domain attention module.

[0062] As shown in Figure 2, extracting the low-frequency feature LF through the dynamic low-pass filter module specifically includes the following steps:

[0063] Step 2.1.1: Obtain the height and width of the input feature map;

[0064] Step 2.1.2: Divide the input feature map into 4 parts according to the number of channels, input each part into the corresponding filter, and use bilinear interpolation to upsample the output feature map to the original size;

[0065] Step 2.1.3: Concatenate the m upsampled feature maps along the channel dimension, and activate them through the ReLU layer of the dynamic low-pass filter module. The final output is the low-frequency feature LF after low-pass filtering, expressed as follows:

[0066] $$LF = \mathrm{bilinear}\bigl(\mathrm{AvgPooling}_{k \times k}(V)\bigr)$$

[0067] Where V is the value tensor and k is the cutoff frequency, and the cutoff frequency is different for each channel.

[0068] The extraction of high-frequency features (HF) through a dynamic high-pass filter module specifically includes the following steps:

[0069] Step 2.2.1: Divide the value tensor V into n groups and use the grouped deep convolutional layer in the dynamic high-pass filter module to simulate the tensor features of the cutoff frequency in different high-pass filters;

[0070] Step 2.2.2: Concatenate the tensor features obtained in Step 2.2.1 to obtain the high-frequency feature (HF):

[0071] $$HF = \mathrm{depthwise\_conv}_{k \times k}(V)$$
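Steps 2.2.1 and 2.2.2 can be sketched in plain Python. The sketch reads the "grouped deep convolutional layer" as a depthwise convolution (one k×k kernel applied to each channel independently), which matches the `depthwise_conv` name in the formula but is an interpretation. In the network the kernels are learned; here they are passed in as plain lists, zero padding keeps the spatial size, and the function names are hypothetical.

```python
def depthwise_conv2d(fmap, kernel):
    """Apply one odd-sized k x k kernel to one channel, with zero padding so the size is unchanged."""
    h, w = len(fmap), len(fmap[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < h and 0 <= cc < w:   # positions outside the map count as zero
                        acc += kernel[i][j] * fmap[rr][cc]
            out[r][c] = acc
    return out

def dynamic_high_pass(channels, kernels):
    """Split channels into len(kernels) groups, filter each group with its kernel, concatenate."""
    group = len(channels) // len(kernels)             # assumes an even split
    out = []
    for g, kernel in enumerate(kernels):              # Step 2.2.1
        for fmap in channels[g * group:(g + 1) * group]:
            out.append(depthwise_conv2d(fmap, kernel))
    return out                                        # Step 2.2.2 (list order = channel order)
```

As a usage example, a zero-sum kernel such as the 3×3 Laplacian `[[0, -1, 0], [-1, 4, -1], [0, -1, 0]]` returns 0 in the interior of a constant map, which is the behavior expected of a high-pass filter. Whether the learned kernels converge to such filters is not stated in the source.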

[0072] The frequency matching kernel module $A_{i,j}$ is represented as:

[0073]

[0074] Where $k_i$ represents the i-th frequency component of the key tensor K, and $v_j$ represents the j-th frequency component of the value tensor V;

[0075] Obtaining the frequency domain attention feature tensor FF through the frequency matching kernel module $A_{i,j}$ involves the following steps:

[0076] Step 2.3.1: Divide the input feature vector x into multiple groups $x_i$, and map each group to a query tensor $Q_i$, a key tensor $K_i$, and a value tensor $V_i$;

[0077] Step 2.3.2: Apply a softmax operation to the key tensor $K_i$ to obtain an intermediate distribution vector, then multiply this vector by the value tensor $V_i$ to obtain an intermediate product;

[0078] Step 2.3.3: Use the einsum function to multiply the query tensor $Q_i$ by the intermediate product from Step 2.3.2, yielding the final attention matrix $A_i$;

[0079] Step 2.3.4: Concatenate the attention features obtained from each group to obtain the frequency domain attention feature tensor FF.

[0080] Step 2.2, Channel Attention Module Extraction: The features output in Step 2.1 are compressed along the spatial dimension using global average pooling and global max pooling, generating global features that describe each channel. These global features are then fed into a shared multilayer perceptron (MLP) to learn the attention weight of each channel. The module outputs the global feature $M_c$, which increases the weight of important channel components;

[0081] Step 2.3, Spatial Attention Module Extraction: Global average pooling and global max pooling are applied to the global feature $M_c$ output in Step 2.2 to generate two 2D feature maps. These two 2D feature maps are concatenated and processed through a convolutional layer to generate spatial attention weights. The module outputs the spatial attention weight feature $M_s$;

[0082] Step 2.4: Construct the lightweight human pose estimation network module $M_x$ based on the frequency domain attention mechanism:

[0083] $$M_x = M_f + M_c \times M_f + M_s \times M_f$$
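The combination of Steps 2.2 to 2.4 can be sketched in plain Python on a C × H × W nested list. Several details are assumptions rather than statements from the source: the channel weights pass through a sigmoid after the shared MLP, the two 2D maps are obtained by pooling the channel-weighted features across the channel axis, and the convolution that mixes the two maps is simplified to a 1×1 weighted sum. The MLP weights `w1`, `w2` and all function names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_weights(feat, w1, w2):
    """Step 2.2: per-channel weights. w1 is R x C and w2 is C x R (shared MLP, no biases)."""
    def mlp(vec):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]   # global average pooling
    peak = [max(map(max, ch)) for ch in feat]                           # global max pooling
    return [sigmoid(a + b) for a, b in zip(mlp(avg), mlp(peak))]

def spatial_weights(feat, w_avg=1.0, w_max=1.0, bias=0.0):
    """Step 2.3: pool across channels, then mix the two maps with a 1x1 weighted sum (simplification)."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    out = []
    for r in range(h):
        row = []
        for col in range(w):
            vals = [feat[ch][r][col] for ch in range(c)]
            row.append(sigmoid(w_avg * sum(vals) / c + w_max * max(vals) + bias))
        out.append(row)
    return out

def combine(m_f, w1, w2):
    """Step 2.4: M_x = M_f + M_c * M_f + M_s * M_f, with M_c per channel and M_s per pixel."""
    m_c = channel_weights(m_f, w1, w2)
    weighted = [[[m_c[ch] * v for v in row] for row in m_f[ch]] for ch in range(len(m_f))]
    m_s = spatial_weights(weighted)
    return [[[v * (1.0 + m_c[ch] + m_s[r][col]) for col, v in enumerate(row)]
             for r, row in enumerate(m_f[ch])]
            for ch in range(len(m_f))]
```

The residual term $M_f$ keeps the frequency-attention features intact, while the two products rescale them per channel and per position.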

[0084] Step 3: Based on the lightweight human pose estimation network module $M_x$ constructed in Step 2, construct the lightweight feature extraction network $N_x$ based on the frequency domain attention mechanism, and generate visual features from the input feature map obtained in Step 1 through convolutional encoding, as shown in Figure 3.

[0085] Step 4: Use the constructed training dataset to train the lightweight human pose estimation network module $M_x$ based on the frequency domain attention mechanism. Specifically, it includes the following steps:

[0086] Step 4.1: Construct a human pose estimator by encoding the visual features extracted in Step 3 to obtain H heatmaps of human joints, where H represents the number of predefined human joint categories in the dataset. Specifically, as shown in Figure 3, the feature map of the image is first obtained through a convolutional module; the feature information is then grouped and added to the corresponding positional encoding to obtain keypoint tokens containing positional information. The lightweight human pose estimation network module $M_x$ based on the frequency domain attention mechanism is then applied in place of the multi-head self-attention (MHSA) module of the Transformer network. Finally, a fully connected MLP layer produces the human joint heatmaps required for human pose estimation.

[0087] Step 4.2: Calculate the loss using the mean square error loss function on the heatmap of H human joints obtained in Step 4.1 and the Gaussian heatmap constructed based on the annotation information, and optimize the network.

[0088] The mean squared error loss function $L_{MSE}$ is specifically as follows:

[0089]

[0090] Where H represents the number of predefined human joint categories in the dataset, and the two remaining terms of the formula represent, respectively, the true heatmap and the predicted heatmap corresponding to the h-th joint.
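Step 4.2 can be sketched in plain Python: build a Gaussian target heatmap around an annotated joint and compare it with a prediction by mean squared error. The standard deviation `sigma=2.0`, the averaging over every pixel of every joint, and the function names are assumptions for illustration; the patent does not state these values in the text above.

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Target heatmap with an unnormalized Gaussian peak of 1.0 at pixel (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]

def mse_loss(pred, target):
    """Mean squared error over all pixels of all joint heatmaps (lists of 2D maps)."""
    total, count = 0.0, 0
    for p_map, t_map in zip(pred, target):
        for p_row, t_row in zip(p_map, t_map):
            for p, t in zip(p_row, t_row):
                total += (p - t) ** 2
                count += 1
    return total / count
```

A prediction identical to the target gives a loss of 0.0, and any deviation raises it, which is the signal used to optimize the network.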

[0091] Step 5: Use the trained lightweight human pose estimation network based on frequency domain attention mechanism to estimate the pose of the human image and obtain the corresponding human pose estimation results.
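The source does not describe how joint coordinates are read from the predicted heatmaps in Step 5. One common approach, shown here only as an assumed illustration, is to take the location of the maximum of each heatmap; the function name is hypothetical and the coordinates are in heatmap pixels, not original image pixels.

```python
def decode_keypoints(heatmaps):
    """Return one (x, y, score) tuple per joint heatmap, using the argmax of each map."""
    keypoints = []
    for hm in heatmaps:
        score, x, y = max((v, c, r) for r, row in enumerate(hm) for c, v in enumerate(row))
        keypoints.append((x, y, score))
    return keypoints
```

Mapping the result back to the original image would additionally require the scale and offset of the cropped detection box from Step 1.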

[0092] To verify this invention, it was compared with the simcc_mobilenetv2 and pose_shufflenetv2 methods. The results are shown in Table 1 below:

[0093] Table 1

[0094]

[0095] As shown in Table 1 above, the network model described in this invention has only 3.118M parameters, which is similar to those of other lightweight models. Compared with other similar lightweight models, the improved model shows better performance in terms of accuracy.

[0096] The above description is merely an embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principle of the present invention should be included within the scope of the claims of the present invention.

Claims

1. A lightweight pose estimation method based on a frequency domain attention mechanism, characterized in that: the lightweight pose estimation method is implemented with a lightweight feature extraction network $N_x$ based on a frequency domain attention mechanism; the network $N_x$ includes convolutional layers, grouped positional encoding, a lightweight human pose estimation network module $M_x$, and a fully connected layer; the lightweight pose estimation method specifically includes the following steps:
Step 1: Select the MSCOCO human pose estimation dataset as the training dataset, and obtain the input feature map by cropping the image from the human detection box in the labeled data;
Step 2: Construct the lightweight human pose estimation network module $M_x$ based on the frequency domain attention mechanism; the module $M_x$ includes a frequency domain attention module, a channel attention module, and a spatial attention module; the frequency domain attention module includes four dynamic low-pass filter modules, four dynamic high-pass filter modules, and a frequency matching kernel module; each of the dynamic low-pass filter modules includes an average pooling layer, a bilinear interpolation layer, and a ReLU layer, and the pooling sizes of the four dynamic low-pass filter modules are 1×1, 2×2, 3×3, and 6×6, respectively; each of the dynamic high-pass filter modules includes a grouped deep convolutional layer, and the four grouped deep convolutional layers are respectively composed of convolutional layers with kernels of 1×1×1, 3×3×3, 5×5×5, and 7×7×7; the frequency matching kernel module includes a linear layer and a frequency domain distributed computation layer;
Step 3: Based on the lightweight human pose estimation network module $M_x$ constructed in Step 2, construct the lightweight feature extraction network $N_x$ based on the frequency domain attention mechanism, and generate visual features from the input feature map obtained in Step 1 through convolutional encoding;
Step 4: Encode the visual features extracted in Step 3 to obtain H heatmaps of human joints, where H represents the number of predefined human joint categories in the dataset; compare the obtained human joint heatmaps with the Gaussian heatmaps constructed based on the annotation information, calculate the loss using the mean squared error loss function, and train the lightweight feature extraction network $N_x$ based on the frequency domain attention mechanism;
Step 5: Use the trained lightweight feature extraction network $N_x$ based on the frequency domain attention mechanism to perform pose estimation on human images and obtain the corresponding human pose estimation results;
wherein Step 2 specifically includes the following steps:
Step 2.1, Frequency Domain Attention Module Extraction: The input feature map obtained in Step 1 is passed through the convolutional layer of the frequency matching kernel module to obtain the input feature vector x; the input feature vector x is input into the linear layer of the frequency matching kernel module to obtain the query tensor Q, key tensor K, and value tensor V; by calculating the attention matrix of the query tensor Q, key tensor K, and value tensor V, the frequency domain attention feature tensor FF is obtained; simultaneously, the low-frequency feature LF is extracted through the dynamic low-pass filter module, and the high-frequency feature HF is extracted through the dynamic high-pass filter module; the low-frequency feature LF, the high-frequency feature HF, and the frequency domain attention feature tensor FF are added together to obtain the output $M_f$ of the frequency domain attention module;
Step 2.2, Channel Attention Module Extraction: The features output in Step 2.1 are compressed along the spatial dimension using global average pooling and global max pooling, generating global features that describe each channel; these global features are fed into a shared multilayer perceptron (MLP) to learn the attention weight of each channel; the module outputs the global feature $M_c$;
Step 2.3, Spatial Attention Module Extraction: Global average pooling and global max pooling are applied to the global feature $M_c$ output in Step 2.2 to generate two 2D feature maps; these two 2D feature maps are concatenated and processed through a convolutional layer to generate spatial attention weights; the module outputs the spatial attention weight feature $M_s$;
Step 2.4: Construct the lightweight human pose estimation network module $M_x$ based on the frequency domain attention mechanism: $$M_x = M_f + M_c \times M_f + M_s \times M_f$$

2. The lightweight pose estimation method based on a frequency domain attention mechanism according to claim 1, characterized in that: in Step 2.1, extracting the low-frequency feature LF through the dynamic low-pass filter module specifically includes the following steps:
Step 2.1.1: Obtain the height and width of the input feature map;
Step 2.1.2: Divide the input feature map into m parts according to the number of channels, m≥4, input each segmented feature map into the corresponding filter, and use bilinear interpolation to upsample the output feature map to the original size;
Step 2.1.3: Concatenate the m upsampled feature maps along the channel dimension, and activate them through the ReLU layer of the dynamic low-pass filter module, finally outputting the low-frequency feature LF after low-pass filtering, expressed as follows: $$LF = \mathrm{bilinear}\bigl(\mathrm{AvgPooling}_{k \times k}(V)\bigr)$$ where V is the value tensor and k is the cutoff frequency, and the cutoff frequency is different for each channel.

3. The lightweight pose estimation method based on a frequency domain attention mechanism according to claim 1, characterized in that: in Step 2.1, extracting the high-frequency feature HF through the dynamic high-pass filter module specifically includes the following steps:
Step 2.2.1: Divide the value tensor V into n groups, and use the grouped deep convolutional layer in the dynamic high-pass filter module to simulate the tensor features of the cutoff frequencies in different high-pass filters;
Step 2.2.2: Concatenate the tensor features obtained in Step 2.2.1 to obtain the high-frequency feature HF: $$HF = \mathrm{depthwise\_conv}_{k \times k}(V)$$

4. The lightweight pose estimation method based on a frequency domain attention mechanism according to claim 1, characterized in that: in Step 2.1, the frequency matching kernel module $A_{i,j}$ is represented by a formula, not reproduced in this text, in which $k_i$ represents the i-th frequency component of the key tensor K and $v_j$ represents the j-th frequency component of the value tensor V; obtaining the frequency domain attention feature tensor FF through the frequency matching kernel module $A_{i,j}$ specifically includes the following steps:
Step 2.3.1: Divide the input feature vector x into multiple groups $x_i$, and map each group to a query tensor $Q_i$, a key tensor $K_i$, and a value tensor $V_i$;
Step 2.3.2: Apply a softmax operation to the key tensor $K_i$ to obtain an intermediate distribution vector, then multiply this vector by the value tensor $V_i$ to obtain an intermediate product;
Step 2.3.3: Use the einsum function to multiply the query tensor $Q_i$ by the intermediate product from Step 2.3.2, yielding the final attention matrix $A_i$;
Step 2.3.4: Concatenate the attention features obtained from each group to obtain the frequency domain attention feature tensor FF.

5. The lightweight pose estimation method based on a frequency domain attention mechanism according to claim 1, characterized in that: Step 4 specifically includes the following steps:
Step 4.1: Construct a human pose estimator by encoding the visual features extracted in Step 3 to obtain H heatmaps of human joints, where H represents the number of predefined human joint categories in the dataset;
Step 4.2: Calculate the loss using the mean squared error loss function on the H human joint heatmaps obtained in Step 4.1 and the Gaussian heatmaps constructed based on the annotation information, and optimize the network; the mean squared error loss function $L_{MSE}$ is defined by a formula, not reproduced in this text, in which H represents the number of predefined human joint categories in the dataset and the two remaining terms represent, respectively, the true heatmap and the predicted heatmap corresponding to the h-th joint.