Human pose estimation method and system based on dynamic perception sparse attention
By employing a dynamic perception sparse attention method, the most valuable regions for human pose estimation are selected, addressing the accuracy and complexity problems of occluded-pose processing and achieving efficient human pose estimation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI UNIV
- Filing Date
- 2024-04-26
- Publication Date
- 2026-06-23
AI Technical Summary
Existing multi-person pose estimation algorithms offer only minor accuracy improvements when dealing with occluded poses and have high computational complexity, making it difficult to capture long-range dependencies between features efficiently.
A dynamic perception sparse attention approach is adopted. The query, key, and value tensors are obtained through region partitioning and linear projection. The most valuable regions are selected using primary and secondary routing algorithms, attention is calculated, and pose information is extracted with a multilayer perceptron to generate a keypoint heatmap.
It improves the accuracy of human pose estimation, reduces computational complexity, enhances the ability to handle occluded poses, and improves the computational efficiency of the algorithm.
Figure CN118522068B_ABST
Description
Technical Field
[0001] This invention relates to human pose estimation technology, and more particularly to a multi-person pose estimation method and system based on dynamic perception sparse attention and an inverse generation decision network.
Background Technology
[0002] Currently available multi-person pose estimation algorithms can be broadly divided into two types according to their technical approach: CNN-based algorithms and Transformer-based algorithms. While CNN-based algorithms have achieved significant results in human pose estimation, they lack global, long-range, high-order spatial interactions and therefore struggle to capture long-range dependencies among input features. It also remains unclear which spatial dependencies the network captures for keypoint localization.
[0003] In recent years, some studies have tried to improve the handling of occluded poses by expanding the receptive field, adding spatial information interaction, and modeling the correlation of poses between people, but the gains have been minimal. Following the great success of the Transformer in tasks such as natural language processing, object detection, and image classification, researchers have begun to introduce it into multi-person pose estimation.
[0004] Current Transformer-based methods mainly rely on self-attention for high-order spatial interactions, which effectively captures long-range dependencies between features. Although this alleviates the keypoint occlusion problem to some extent, the improvement in pose estimation accuracy is still minimal. Furthermore, the quadratic complexity of self-attention leads to high computational cost and memory consumption.
[0005] Therefore, occlusion remains one of the most challenging problems in the field of human pose estimation, and further breakthroughs are urgently needed.
Summary of the Invention
[0006] Therefore, the main objective of this invention is to provide a human pose estimation method and system based on dynamic perception sparse attention, so as to further enhance the ability to process occluded poses, improve the overall accuracy of human pose estimation, and reduce computational complexity.
[0007] To achieve the above objectives, according to one aspect of the present invention, a method for estimating human pose based on dynamic perception sparse attention is provided, comprising the following steps:
[0008] Step S100: Preprocess the raw image data to obtain feature maps of the people in the image;
[0009] Step S200: Based on a preset block size, divide the feature map into regions and apply linear projections to obtain query, key, and value tensors, then derive the affinity graph of the relevant regions in the feature map through the primary routing algorithm;
[0010] Step S300: Select the most valuable regions in the affinity graph of the relevant regions, perform the attention calculation to obtain the corresponding weight matrix, and input it into the multilayer perceptron to extract pose information;
[0011] Step S400: Input the pose information into the encoder to refine it, and generate a keypoint heatmap to detect the human pose.
[0012] In a possible preferred embodiment, step S200 of the human pose estimation method based on dynamic perception sparse attention further includes: padding feature maps whose size is not divisible by the preset block size.
[0013] In a possible preferred embodiment, preprocessing the original image data in step S100 includes: annotating the keypoint coordinates of the people in the original image to generate ground-truth labels; vectorizing the original image and normalizing the RGB channels by mean and standard deviation; selecting person image patches from the original image according to a preset number; and combining the person image patches with the corresponding ground-truth labels to form a feature map.
[0014] In a possible preferred embodiment, step S200, which involves region partitioning and linear projection processing of the feature map, includes:
[0015] Step S210: the feature map $X \in \mathbb{R}^{H \times W \times C}$ is reshaped into $X^{r} \in \mathbb{R}^{S^{2} \times \frac{HW}{S^{2}} \times C}$, that is, it is divided into S×S non-overlapping regions such that each region contains $\frac{HW}{S^{2}}$ feature vectors, where H, W, and S are the height, width, and region partition size, respectively, and C is the channel size;
[0016] Step S220: the query tensor Q, key tensor K, and value tensor V are derived from $X^{r}$ by linear projection:
[0017] $Q = X^{r} W^{q}, \quad K = X^{r} W^{k}, \quad V = X^{r} W^{v}$
[0018] where $W^{q}$, $W^{k}$, and $W^{v}$ are the projection weights for the query, key, and value, respectively.
[0019] In a possible preferred embodiment, step S200, which involves deriving the affinity graph of the relevant regions in the feature map using the primary routing algorithm, includes:
[0020] Step S230: Q and K are averaged within each of the S×S regions to derive the region-level query $Q^{r}$ and key $K^{r}$;
[0021] Step S240: the adjacency matrix $A^{r}$ of the region affinity graph is derived by matrix multiplication of $Q^{r}$ and the transpose of $K^{r}$:
[0022] $A^{r} = Q^{r} (K^{r})^{T}$
[0023] where $A^{r} \in \mathbb{R}^{S^{2} \times S^{2}}$.
[0024] In a possible preferred embodiment, selecting the most valuable regions in the affinity graph of the relevant regions in step S300 includes:
[0025] Step S310: based on the adjacency matrix $A^{r}$, the routing index matrix $I^{r_1}$ is derived by applying a row-wise top-k operation:
[0026] $I^{r_1} = \mathrm{topkIndex}(A^{r})$
[0027] so as to obtain the m most valuable regions.
[0028] In a possible preferred embodiment, the human pose estimation method based on dynamic perception sparse attention further includes the following steps:
[0029] Step S320: taking the most valuable regions obtained in step S310 as centers, a custom center-diffusion correlation convolution is applied to calculate the scores of the relevant regions, and the most valuable regions are selected a second time;
[0030] Step S330: the most valuable regions obtained in steps S310 and S320 are aggregated.
[0031] In a possible preferred embodiment, the human pose estimation method based on dynamic perception sparse attention further includes:
[0032] Step S500: performing noise occlusion processing, according to the occlusion index, on the feature map corresponding to the human pose detection result to generate an occluded feature map.
[0033] To achieve the above objectives, corresponding to the above method, according to another aspect of the present invention, a human pose estimation system based on dynamic perception sparse attention is also provided, comprising:
[0034] The training set processing unit is used to preprocess the raw image data and obtain feature maps of people in the image.
[0035] The main network unit is used to divide the feature map into regions and apply linear projections according to the preset block size, obtain query, key, and value tensors, and derive the affinity graph of the relevant regions in the feature map through the primary routing algorithm. Using the primary and secondary routing algorithms, the most valuable regions in the affinity graph of the relevant regions are selected and fused. Attention is then calculated to obtain the corresponding weight matrix, which is input into the multilayer perceptron to extract pose information; the pose information is input into the encoder for refinement, and a keypoint heatmap is generated from it to detect the human pose.
[0036] The inverse generation decision network unit is used to process the feature map corresponding to the human pose detection result by noise occlusion according to the occlusion index, and generate a feature map to train the main network.
[0037] To achieve the above objectives, in accordance with the above methods, according to another aspect of the present invention, a computer-readable storage medium is also provided, the computer-readable storage medium storing a computer program, wherein when the computer program is executed, it implements the steps of the human pose estimation method based on dynamic perception sparse attention as described in any of the preceding claims.
[0038] The human pose estimation method and system based on dynamic perception sparse attention of this invention can accurately locate the most valuable key-value pairs at the coarse-grained region level to participate in the final attention calculation while retaining the high-order spatial interaction capabilities of self-attention. This eliminates irrelevant regions, thereby removing their negative impact on accuracy, thus improving the accuracy of human pose estimation and significantly reducing the computational complexity of the algorithm.
[0039] Furthermore, in the corresponding example, the algorithm of this invention uses a main detection network based on dynamic perception sparse attention for pose learning and detection, and uses an inverse generation decision network to dynamically generate difficult images with different occlusion indices from the pose estimation results of the main network and add them to training. This helps improve the representation and generalization ability of the main detection network on occluded samples, thereby further improving the accuracy of human pose estimation.
Description of the Drawings
[0040] The accompanying drawings, which form part of this application, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
[0041] Figure 1 is a schematic diagram of the steps of the human pose estimation method based on dynamic perception sparse attention of the present invention;
[0042] Figure 2 is a schematic diagram of the human pose estimation method based on dynamic perception sparse attention of the present invention;
[0043] Figure 3 is a schematic diagram of the principle of dynamic perception sparse attention in an example of the method of the present invention;
[0044] Figure 4 is a schematic diagram of the routing-related regions of the two-level routing algorithm in an example of the method of the present invention;
[0045] Figure 5 is a schematic diagram of the generation of occluded feature images by the inverse generation decision network in an example of the method of the present invention;
[0046] Figure 6 is a schematic diagram of the human pose estimation system based on dynamic perception sparse attention of the present invention.
Detailed Implementation
[0047] To enable those skilled in the art to better understand the technical solutions of the present invention, the specific technical solutions of the present invention will be clearly and completely described below in conjunction with embodiments, so as to help those skilled in the art further understand the present invention. Obviously, the embodiments described in this application are merely some embodiments of the present invention, and not all embodiments. It should be noted that, for those skilled in the art, the embodiments and features in the embodiments of this application can be combined with each other without departing from the concept of the present invention and without conflict. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the disclosure and protection scope of the present invention.
[0048] Furthermore, the terms "first," "second," "S100," "S200," etc., used in the specification, claims, and drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such features can be interchanged where appropriate so that embodiments of the invention described herein can be implemented in orders other than those described herein. At the same time, the stages described in each step are not necessarily to be implemented in the same step; it should be understood that the implementation order of the contents of each step stage can be adjusted and interchanged without violating the inventive concept, so that embodiments of the invention described herein can be implemented in orders other than those described herein. Additionally, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. Unless otherwise expressly specified and limited, the terms "set," "arrange," "install," "connect," and "link" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; a mechanical connection or an electrical connection; a direct connection or an indirect connection through an intermediate medium; or a connection within two elements. Those skilled in the art can understand the specific meaning of the above terms in this case based on the specific circumstances and in conjunction with existing technology.
[0049] To enhance the ability to handle occluded poses, improve the overall accuracy of human pose estimation, and reduce computational complexity, as shown in Figures 1 to 5, this invention provides a human pose estimation method based on dynamic perception sparse attention, the steps of which include:
[0050] Step S100: Preprocess the raw image data to obtain feature maps of the people in the image.
[0051] Specifically, as shown in Figure 2, the preprocessing of the raw image data in step S100 includes:
[0052] Step S110: the keypoint coordinates of all people in the original image are annotated in sequence to generate the corresponding ground-truth label files;
[0053] Step S120: the original image is vectorized, and the RGB channels are normalized by mean and standard deviation;
[0054] Step S130: person image patches are selected from the original image according to a preset number. For example, using a random person selection strategy, person image patches in the training dataset images are selected according to a preset maximum number of person image patches n, in order to reduce model complexity and avoid introducing too many unnecessary cross-person correlations. Specifically, if the number of people in the image is greater than n, one person image patch is randomly selected and the n-1 image patches nearest to it are selected around it; if the number of people in the image is less than or equal to n, all person image patches are selected.
[0055] Through the above data preprocessing steps, the person image patches in the original image are combined with the corresponding ground-truth labels to form a feature map.
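The random person selection strategy of step S130 can be illustrated with a minimal sketch. It assumes that each person patch is summarized by a 2-D center coordinate and that "nearest" means Euclidean distance between centers; the function name `select_person_patches` and its arguments are hypothetical and do not come from the patent.

```python
import numpy as np

def select_person_patches(centers, n, rng=None):
    """Return the indices of at most n person patches.

    centers: (P, 2) array of patch center coordinates (assumed representation).
    n:       preset maximum number of person patches.
    """
    rng = np.random.default_rng() if rng is None else rng
    centers = np.asarray(centers, dtype=float)
    num_people = centers.shape[0]
    if num_people <= n:
        # Few enough people: keep every patch.
        return np.arange(num_people)
    # Pick one person at random, then keep the n-1 patches nearest to it.
    anchor = rng.integers(num_people)
    dist = np.linalg.norm(centers - centers[anchor], axis=1)
    dist[anchor] = -1.0  # force the anchor itself to sort first
    return np.argsort(dist, kind="stable")[:n]
```

With n = 2 and five people standing along a line, the sketch returns the randomly chosen person together with that person's single nearest neighbor.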
[0056] Step S200 involves dividing the feature map into regions and performing linear projection processing based on the preset block size S, obtaining query, key, and value tensors, and then using a primary routing algorithm to derive the affinity graph of the relevant regions in the feature map.
[0057] Step S300 selects the most valuable regions in the affinity graph of the relevant regions, performs the attention calculation to obtain the corresponding weight matrix, and then inputs it into the multilayer perceptron to extract pose information.
[0058] In this invention, steps S200 to S300 actually propose a concept of dynamically perceiving sparse attention to extract posture information. The purpose is to accurately locate the most valuable key-value pairs at the coarse-grained region level while retaining the self-attention high-order spatial interaction capability, so as to participate in the final attention calculation, thereby eliminating irrelevant regions, that is, removing the negative impact of irrelevant regions on accuracy.
[0059] Specifically, before performing region partitioning and linear projection on the feature map, in an optional example, feature maps that are not divisible by the preset block size S can be padded. For example, given a two-dimensional input feature map $X \in \mathbb{R}^{H \times W \times C}$, it is first necessary to determine whether its height and width are divisible by S, where S denotes the height and width of the region partition. If the input feature map does not meet this condition, it needs to be padded. The padding size is calculated as follows:
[0060] $P_{h} = (S - H \bmod S) \bmod S, \quad P_{w} = (S - W \bmod S) \bmod S$
[0061] where $P_{h}$ and $P_{w}$ represent the required padding size for the height and width of the input feature map, respectively.
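As a hedged illustration of the padding step, the sketch below computes the smallest padding that makes the height and width divisible by S and applies it with zeros on the bottom and right edges. The zero fill value, the choice of edges, and the name `pad_to_multiple` are assumptions; the source text does not specify them.

```python
import numpy as np

def pad_to_multiple(x, s):
    """Pad an (H, W, C) feature map so that H and W are divisible by s."""
    h, w = x.shape[:2]
    pad_h = (s - h % s) % s  # 0 when h is already divisible by s
    pad_w = (s - w % s) % s  # 0 when w is already divisible by s
    # Assumption: zero padding appended at the bottom and right edges.
    x_padded = np.pad(x, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")
    return x_padded, (pad_h, pad_w)
```

For a 10×13 feature map and S = 4, this gives a padding of 2 rows and 3 columns, so the padded map is 12×16.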
[0062] Furthermore, in this example, the example steps of performing region segmentation and linear projection processing on the feature map in step S200 include:
[0063] Step S210: if the height and width of the input feature map already satisfy the region partitioning condition, the feature map $X \in \mathbb{R}^{H \times W \times C}$ is reshaped into $X^{r} \in \mathbb{R}^{S^{2} \times \frac{HW}{S^{2}} \times C}$, that is, it is divided into S×S non-overlapping regions such that each region contains $\frac{HW}{S^{2}}$ feature vectors, where H, W, and S are the height, width, and region partition size, respectively, and C is the channel size.
[0064] Step S220: the query tensor Q, key tensor K, and value tensor V are derived from $X^{r}$ by linear projection:
[0065] $Q = X^{r} W^{q}, \quad K = X^{r} W^{k}, \quad V = X^{r} W^{v}$
[0066] where $W^{q}$, $W^{k}$, and $W^{v}$ are the projection weights for the query, key, and value, respectively.
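Steps S210 and S220 can be sketched as follows, assuming the feature map is split into an S×S grid of equally sized regions and that the projection weights are plain C×C matrices without bias. The function names and the weight shapes are illustrative assumptions.

```python
import numpy as np

def partition_regions(x, s):
    """Reshape (H, W, C) into (S*S, H*W/S^2, C): one row of vectors per region."""
    h, w, c = x.shape
    rh, rw = h // s, w // s            # height and width of a single region
    x = x.reshape(s, rh, s, rw, c)     # split rows and columns into grid/offset
    x = x.transpose(0, 2, 1, 3, 4)     # (grid_row, grid_col, rh, rw, C)
    return x.reshape(s * s, rh * rw, c)

def project_qkv(x_r, w_q, w_k, w_v):
    """Apply the three linear projections to the region-partitioned features."""
    return x_r @ w_q, x_r @ w_k, x_r @ w_v
```

For a 4×4 single-channel map with S = 2, the first region collects the top-left 2×2 block and the last region the bottom-right 2×2 block.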
[0067] Furthermore, in this example, the example steps in step S200 for deriving the affinity graph of the relevant region in the feature map using the primary routing algorithm include:
[0068] Step S230 constructs an affinity graph of the relevant regions using the primary routing algorithm. Figure 3 illustrates the process of locating the most relevant regions to attend to. While this example performs primary and secondary region routing sequentially to find the most valuable regions, in other alternative implementations the primary region routing can also be performed alone. Therefore, those skilled in the art can choose the implementation according to the actual situation.
[0069] In this example, the primary routing algorithm finds the most valuable relevant regions by constructing an affinity graph.
[0070] For example, Q and K are averaged within each of the S×S regions to derive the region-level query $Q^{r}$ and key $K^{r}$.
[0071] Step S240: as shown in equation (3), the adjacency matrix $A^{r}$ of the region affinity graph is derived by matrix multiplication of $Q^{r}$ and the transpose of $K^{r}$:
[0072] $A^{r} = Q^{r} (K^{r})^{T} \quad (3)$
[0073] where $A^{r} \in \mathbb{R}^{S^{2} \times S^{2}}$. Each entry of $A^{r}$ measures the degree of semantic association between two regions; the larger the value, the greater the correlation.
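A minimal sketch of steps S230 and S240, assuming Q and K are already stored region by region with shape (S², L, C), where L is the number of feature vectors per region. The name `region_affinity` is hypothetical.

```python
import numpy as np

def region_affinity(q, k):
    """Region-to-region affinity from per-region averages of Q and K."""
    q_r = q.mean(axis=1)   # (S^2, C) region-level queries
    k_r = k.mean(axis=1)   # (S^2, C) region-level keys
    return q_r @ k_r.T     # (S^2, S^2) adjacency matrix of the affinity graph
```

Entry (i, j) of the result is the dot product between the mean query of region i and the mean key of region j.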
[0074] Furthermore, in this example, selecting the most valuable regions in the affinity graph of the relevant regions in step S300 includes:
[0075] Step S310 uses the primary routing algorithm to make an initial selection of the most valuable regions. Based on the adjacency matrix $A^{r}$, a routing index matrix $I^{r_1}$ is derived by applying a row-wise top-k operation; each row stores the indices of the m most relevant regions that should participate in the calculation for that region:
[0076] $I^{r_1} = \mathrm{topkIndex}(A^{r})$
[0077] Therefore, the i-th row of $I^{r_1}$ contains the m indices of the regions most relevant to the i-th region. At this point, the primary routing of the relevant regions is complete: after the initial selection, the m most relevant regions have been identified as the most valuable regions. The secondary region routing then begins, to further extract the most relevant key-value pair regions.
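The row-wise top-k selection of step S310 can be sketched with a descending sort. The name `topk_index` is an illustrative stand-in for the top-k operation mentioned above, and ties are broken by region index as an assumption.

```python
import numpy as np

def topk_index(a_r, m):
    """Return, for each row of a_r, the column indices of its m largest entries."""
    order = np.argsort(-a_r, axis=1, kind="stable")  # descending per row
    return order[:, :m]

affinity = np.array([[0.1, 0.9, 0.5],
                     [0.7, 0.2, 0.3],
                     [0.0, 0.4, 0.8]])
routes = topk_index(affinity, 2)
```

Here the first row of `routes` is `[1, 2]`, meaning that region 0 attends to regions 1 and 2.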
[0078] Step S320 uses the secondary routing algorithm to calculate the scores of the relevant regions. For example, taking the most valuable regions obtained in step S310 as centers, a custom center-diffusion correlation convolution is used to calculate the scores of the relevant regions and select the most valuable regions a second time.
[0079] Specifically, the secondary routing algorithm fills the gaps left by the primary routing algorithm, addressing incorrect and missed routing choices. It is implemented using a custom center-diffusion correlation convolution (CCDC-Conv) and a region-level correlation score statistics matrix.
[0080] As shown in Figure 4, the black-boxed area represents the relevant regions selected by the primary routing algorithm. These regions are marked, and relevance is diffused outward from them: the closer a region is to a marked region, the higher its relevance score, and the score decreases with increasing distance. For example, the decay multiple can be set to 0.5.
[0081] Specifically, an auxiliary matrix initialized to all zeros is first created to accumulate the relevance score of each region. Then, the scatter function and the primary routing index matrix $I^{r_1}$ are used to mark the selected relevant regions in it. However, spreading and accumulating relevance scores centered on the marked regions is not easy, and direct counting is impractical. This objective can instead be achieved through convolution.
[0082] Specifically, to calculate the relevance score of the first ring around a marked region, a convolution with a 3×3 kernel can be designed whose center weight is zero and whose remaining weights are 1; convolving it with the auxiliary matrix yields the relevance scores of the regions surrounding the marked region. As shown in Figure 4(b), the weights of the custom convolution are the relevance scores assigned to the respective regions. Correspondingly, the relevance scores of the outer second and third rings can be obtained with custom kernels of size 5×5 and 7×7, respectively, with their ring weights reduced according to the decay multiple and their central weights set to zero.
[0083] Notably, the custom convolution performs effective convolution only on the target region. The central area of each custom kernel is set to zero to avoid redundant calculation when computing the relevance scores of the second and third rings. After the custom convolution is applied to the auxiliary matrix, the statistics yield the region-level correlation score matrix $A^{c}$ shown in Figure 4(d). Regions with higher relevance scores are considered more relevant.
[0084] Then, a second routing index matrix $I^{r_2}$ is derived by applying a row-wise top-k operation to the aforementioned correlation score matrix $A^{c}$; each row stores the indices of the n most relevant regions that should participate in the calculation for that region:
[0085] $I^{r_2} = \mathrm{topkIndex}(A^{c})$
[0086] As shown in Figure 4(d), the union of the two region-to-region routing index matrices $I^{r_1}$ and $I^{r_2}$ ensures that almost all of the most relevant regions that should be involved are selected.
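The center-diffusion scoring can be sketched under several assumptions: the three ring kernels are merged into one 7×7 kernel, ring r carries weight 0.5^(r-1) (that is, 1, 0.5, 0.25), regions outside the S×S grid count as zero, and regions already chosen by the primary routing are not masked out of the score matrix. None of these details are confirmed by the source text, and the function names are hypothetical.

```python
import numpy as np

def diffusion_kernel(num_rings=3, decay=0.5):
    """Kernel whose r-th ring has weight decay**(r-1) and whose center is 0."""
    offsets = np.arange(-num_rings, num_rings + 1)
    ring = np.maximum(np.abs(offsets)[:, None], np.abs(offsets)[None, :])
    return np.where(ring > 0, decay ** (ring - 1.0), 0.0)

def secondary_scores(route_idx, s, num_rings=3, decay=0.5):
    """route_idx: (S^2, m) primary routing indices -> (S^2, S^2) score matrix."""
    n_regions = s * s
    kernel = diffusion_kernel(num_rings, decay)
    scores = np.zeros((n_regions, n_regions))
    for i in range(n_regions):
        aux = np.zeros(n_regions)          # all-zero auxiliary matrix
        aux[route_idx[i]] = 1.0            # scatter: mark the routed regions
        padded = np.pad(aux.reshape(s, s), num_rings)
        out = np.zeros((s, s))
        # Direct 2-D correlation; the kernel is symmetric, so this equals
        # convolution with the same kernel.
        for dy in range(kernel.shape[0]):
            for dx in range(kernel.shape[1]):
                out += kernel[dy, dx] * padded[dy:dy + s, dx:dx + s]
        scores[i] = out.ravel()
    return scores
```

A second routing index matrix could then be obtained by taking the n largest entries in each row of the returned score matrix, in the same way as for the primary routing.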
[0087] Step S330 aggregates the most valuable regions obtained in steps S310 to S320 respectively.
[0088] Specifically, because the most relevant regions identified by the two routing algorithms may be scattered across the feature map, the Gather and Concat functions are used to collect the relevant regions according to the region indices from the two routing algorithms before the subsequent fine-grained attention calculation. For example:
[0089] $K^{g_1} = \mathrm{gather}(K, I^{r_1}), \quad V^{g_1} = \mathrm{gather}(V, I^{r_1})$
[0090] $K^{g_2} = \mathrm{gather}(K, I^{r_2}), \quad V^{g_2} = \mathrm{gather}(V, I^{r_2})$
[0091] $K^{g} = \mathrm{concat}(K^{g_1}, K^{g_2})$
[0092] $V^{g} = \mathrm{concat}(V^{g_1}, V^{g_2})$
[0093] where $K^{g}$ and $V^{g}$ are all the relevant key-value pairs finally collected.
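The aggregation of step S330 can be sketched with NumPy integer indexing standing in for the Gather function. The sketch assumes K and V have shape (S², L, C) and that regions selected by both routing stages are simply kept twice rather than deduplicated; `gather_kv` is a hypothetical name.

```python
import numpy as np

def gather_kv(k, v, idx_primary, idx_secondary):
    """Collect the keys and values of all routed regions for every query region."""
    idx = np.concatenate([idx_primary, idx_secondary], axis=1)  # (S^2, m+n)
    k_g = k[idx]                       # (S^2, m+n, L, C) via integer indexing
    v_g = v[idx]
    r, t, l, c = k_g.shape
    # Flatten the routed regions into one token axis per query region.
    return k_g.reshape(r, t * l, c), v_g.reshape(r, t * l, c)
```

Each query region then sees (m + n)·L key-value pairs instead of all S²·L pairs, which is where the reduction in attention cost comes from.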
[0094] Furthermore, an example step of attention calculation includes: step S340 applying fine-grained attention to the collected most relevant and valuable regions to obtain the corresponding weight matrix. This is then input into a multilayer perceptron for further pose information extraction.
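The source text does not give the attention formula, so the following sketch assumes standard scaled dot-product attention applied independently within each query region over the gathered key-value pairs. The subsequent multilayer perceptron is omitted, and the function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_attention(q, k_g, v_g):
    """q: (S^2, L, C); k_g, v_g: (S^2, T, C) -> output (S^2, L, C), weights (S^2, L, T)."""
    c = q.shape[-1]
    logits = q @ k_g.transpose(0, 2, 1) / np.sqrt(c)
    weights = softmax(logits, axis=-1)         # each row sums to 1
    return weights @ v_g, weights
```

The second return value plays the role of the weight matrix mentioned in step S340; each of its rows is a probability distribution over the T gathered tokens.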
[0095] Step S400 inputs the pose information into the encoder to refine it and generates a keypoint heatmap to detect the human pose.
[0096] Specifically, the features of the output pose information are flattened and input into cascaded Transformer encoders. By modeling the correlation of actions between people, the extracted poses are further refined. Then, from the rich pose feature information extracted by the entire network, a keypoint heatmap is generated, thereby detecting the corresponding human pose.
[0097] On the other hand, in order to further improve the accuracy of the main network in human pose estimation, in an optional implementation, the steps further include:
[0098] Step S500 performs noise occlusion processing, according to the occlusion index, on the feature map corresponding to the human pose detection result to generate an occluded feature map.
[0099] Specifically, this step builds an inverse generation decision network that dynamically generates feature maps of occluded samples, with which the main network is trained iteratively. As shown in Figure 5, the example steps include:
[0100] Step S510: an appropriate occlusion index is first selected for the generation network based on the previously obtained pose estimation results, and noise occlusion processing is performed on the corresponding original input image to generate an occluded image.
[0101] Step S520: the inverse generation decision network is then trained by continuously judging whether the occluded image produced by the generation network meets the specified occlusion index. In this way, the generation network is trained to produce an occluded image with the corresponding degree of occlusion from the specified occlusion index and the original image.
[0102] Step S530: the decision network determines whether the occlusion index of the generated image is consistent with the target index. If they are consistent, the generated image is output; otherwise, the generation network is trained with the decision network loss and the training image is regenerated.
[0103] Step S540 repeats step S520, using the generated occluded images to train the main network's learning and generalization ability on occluded images until the algorithm's performance saturates. This further improves the accuracy of human pose estimation.
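The source text does not define the occlusion index or the architectures of the generation and decision networks. Purely to illustrate the generate-then-check idea of steps S510 to S530, the sketch below replaces both networks with hand-written functions and assumes the occlusion index is the fraction of the image area covered by uniform noise. All names and the tolerance value are hypothetical.

```python
import numpy as np

def occlude_with_noise(image, occlusion_index, rng):
    """Stand-in generator: cover roughly `occlusion_index` of the area with noise.

    image is assumed to be a float array of shape (H, W) or (H, W, C).
    """
    h, w = image.shape[:2]
    side_h = int(round(h * np.sqrt(occlusion_index)))
    side_w = int(round(w * np.sqrt(occlusion_index)))
    top = rng.integers(0, h - side_h + 1)
    left = rng.integers(0, w - side_w + 1)
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + side_h, left:left + side_w] = True
    out = image.copy()
    out[mask] = rng.random((int(mask.sum()),) + image.shape[2:])
    return out, mask

def meets_index(mask, target, tol=0.05):
    """Stand-in decision step: is the occluded fraction close to the target?"""
    return abs(mask.mean() - target) <= tol
```

In a training loop, an image that fails the check would be regenerated, while an accepted image would be added to the training data of the main network.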
[0104] On the other hand, corresponding to the above method, as shown in Figure 6, the present invention also provides a human pose estimation system based on dynamic perception sparse attention, which includes:
[0105] The training set processing unit is used to preprocess the raw image data and obtain feature maps of people in the image.
[0106] The main network unit is used to divide the feature map into regions and apply linear projections according to the preset block size S, obtain the query tensor Q, key tensor K, and value tensor V, and derive the affinity graph of the relevant regions in the feature map through the primary routing algorithm. Using the primary and secondary routing algorithms, the most valuable regions in the affinity graph of the relevant regions are selected and fused. Attention is then calculated to obtain the corresponding weight matrix, which is input into the multilayer perceptron to extract the pose information; the pose information is input into the encoder for refinement, and a keypoint heatmap is generated from it to detect the human pose.
[0107] The inverse generation decision network unit is used to process the feature map corresponding to the human pose detection result by noise occlusion according to the occlusion index, and generate a feature map to train the main network.
[0108] On the other hand, corresponding to the above method, the present invention also provides a computer-readable storage medium storing a computer program, wherein when the computer program is executed, it implements the steps of the human pose estimation method based on dynamic perception sparse attention as described in any of the above examples.
[0109] In summary, the human pose estimation method and system based on dynamic perception sparse attention of this invention can accurately locate the most valuable key-value pairs at the coarse-grained region level to participate in the final attention calculation while retaining the high-order spatial interaction capability of self-attention. This eliminates irrelevant regions, thereby removing their negative impact on accuracy and improving the accuracy of human pose estimation. At the same time, it can significantly reduce the computational complexity of the algorithm. Furthermore, the algorithm of this invention uses a master detection network based on dynamic perception sparse attention for pose learning and detection, and uses an inverse generation decision network to dynamically generate difficult images with different occlusion indices based on the pose estimation results of the master network and add them to the training. This can help improve the representation and generalization ability of the master detection network for occluded samples, thereby further improving the overall accuracy of human pose estimation.
[0110] The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, thereby enabling those skilled in the art to better understand and utilize the invention. The present invention is limited only by the claims and their full scope and equivalents. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the invention should be included within the protection scope of the invention.
[0111] Those skilled in the art will understand that, besides implementing the system, apparatus, unit, and its modules provided by this invention in purely computer-readable program code, the same program can be implemented in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers by logically programming the method steps. Therefore, the system, apparatus, and its modules provided by this invention can be considered a hardware component, and the modules included therein for implementing various programs can also be considered structures within the hardware component; alternatively, modules for implementing various functions can be considered both software programs implementing the method and structures within the hardware component.
[0112] Furthermore, all or part of the steps in the methods of the above embodiments can be implemented by a program instructing related hardware. This program is stored in a storage medium and includes several instructions to cause a microcontroller, chip, or processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
[0113] Furthermore, the various implementations of the present invention can be combined arbitrarily. As long as such combinations do not depart from the spirit of the present invention, they should also be regarded as part of the content disclosed herein.
Claims
1. A human pose estimation method based on dynamic perception sparse attention, comprising the following steps:
Step S100: preprocessing the raw image data to obtain feature maps of the people in the image;
Step S200: dividing the feature map into regions according to a preset block size and performing linear projection to obtain the query, key, and value tensors, and then deriving the affinity graph of the relevant regions in the feature map via a primary routing algorithm, comprising:
Step S210: dividing the feature map $X \in \mathbb{R}^{H \times W \times C}$ into S×S non-overlapping regions, such that each region contains $HW/S^2$ feature vectors, by reshaping it into $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$, where H and W are the height and width of the feature map, S is the region partition size, and C is the channel size;
Step S220: deriving the query tensor Q, key tensor K, and value tensor V by linear projections: $Q = X^r W^q$, $K = X^r W^k$, $V = X^r W^v$, where $W^q$, $W^k$, and $W^v$ are the projection weights for the query, key, and value, respectively;
Step S230: averaging Q and K within each of the S×S regions to derive the region-level query and key, $Q^r, K^r \in \mathbb{R}^{S^2 \times C}$;
Step S240: deriving the adjacency matrix $A^r$ of the region affinity graph by matrix multiplication of $Q^r$ and the transpose of $K^r$: $A^r = Q^r (K^r)^{\top}$, where $A^r \in \mathbb{R}^{S^2 \times S^2}$;
Step S300: selecting the most valuable regions in the affinity graph of the relevant regions, performing attention calculation to obtain the corresponding weight matrix, and inputting it into the multilayer perceptron to extract pose information, comprising:
Step S310: applying a top-k operation to the adjacency matrix $A^r$ to derive the routing index matrix $I^r$: $I^r = \mathrm{topK}(A^r)$, so as to identify the m most valuable regions;
Step S320: taking each most valuable region obtained in step S310 as the center, applying a custom center-diffusion correlation convolution, and computing the scores of the related regions so as to select the most valuable regions a second time; the custom center-diffusion correlation convolution comprises: computing the correlation score of the first ring of regions surrounding the labeled region by designing a convolution with a 3×3 kernel whose weights are 1 and whose central weight is set to zero, and convolving it with ... to obtain the correlation score of the regions surrounding the labeled region, the weights of the custom convolution being the correlation score weights for the corresponding range of regions; and obtaining the correlation scores of the second and third outer rings in the same way with custom kernels of size 5×5 and 7×7, respectively, with weights of [missing values] and with the center weights set to zero;
Step S330: aggregating the most valuable regions obtained in steps S310 and S320;
Step S400: inputting the pose information into the encoder to refine it, and generating a keypoint heatmap to detect the human pose.
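The ring-shaped kernels of step S320 can be sketched as follows. This is a hedged illustration, not the claimed implementation: the helper names `ring_kernel` and `ring_score` are hypothetical, the ring weight is left as a parameter because the weights for the 5×5 and 7×7 kernels are missing from the text, the whole interior of the larger kernels is assumed to be zeroed so that each kernel covers exactly one ring, and the input is assumed to be a per-query score map laid out on the S×S region grid.

```python
import numpy as np

def ring_kernel(size, weight=1.0):
    """size x size kernel that keeps only the outermost ring (interior set to zero)."""
    kernel = np.full((size, size), weight, dtype=float)
    kernel[1:-1, 1:-1] = 0.0   # for size 3 this zeroes just the central cell
    return kernel

def ring_score(score_map, center, size, weight=1.0):
    """Weighted sum of the scores on the ring of the given size around `center`.

    score_map: (S, S) array of region scores (assumed layout: one score per region).
    center: (row, col) of a region selected by the primary routing.
    Regions that fall outside the grid contribute zero (zero padding).
    """
    r = size // 2
    padded = np.pad(score_map, r)            # constant zero padding on every side
    i, j = center
    patch = padded[i:i + size, j:j + size]   # window centered on (i, j)
    return float((patch * ring_kernel(size, weight)).sum())

scores = np.ones((5, 5))
print(ring_score(scores, (2, 2), 3))  # 8.0: the eight neighbours of the central region
print(ring_score(scores, (2, 2), 5))  # 16.0: the sixteen regions of the second ring
```

Scoring each ring separately makes it possible to weight nearer rings differently from farther ones before the second selection, which is one plausible reading of the center-diffusion idea.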
2. The human pose estimation method based on dynamic perception sparse attention according to claim 1, wherein step S200 further includes: padding feature maps whose dimensions are not evenly divisible by the preset block size.
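A minimal sketch of such padding is given below. The claim does not say how the padding is applied, so zero padding on the bottom and right edges, as well as the helper name `pad_to_multiple`, are assumptions.

```python
import numpy as np

def pad_to_multiple(x, block):
    """Zero-pad an (H, W, C) feature map so that H and W become multiples of `block`."""
    H, W = x.shape[:2]
    pad_h = (-H) % block   # rows needed to reach the next multiple (0 if already aligned)
    pad_w = (-W) % block
    return np.pad(x, ((0, pad_h), (0, pad_w), (0, 0)))

x = np.ones((7, 10, 3))
print(pad_to_multiple(x, 4).shape)  # (8, 12, 3)
```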
3. The human pose estimation method based on dynamic perception sparse attention according to claim 1, wherein the step of preprocessing the original image data in step S100 includes: generating ground-truth labels by annotating the coordinates of the keypoints of the people in the original image; vectorizing the original image and normalizing its RGB channels by mean and standard deviation; cropping person image patches from the original image according to a preset number; and combining each person image patch with its corresponding ground-truth label to form a feature map.
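The per-channel normalization in this claim can be sketched as below. The claim gives no statistics, so the default values here are the commonly used ImageNet channel means and standard deviations, chosen purely as an assumption; the function name `normalize_rgb` is likewise hypothetical.

```python
import numpy as np

# Assumption: ImageNet channel statistics; the claim does not specify any values.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize_rgb(image_u8, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale an (H, W, 3) uint8 RGB image to [0, 1], then standardize each channel."""
    x = image_u8.astype(np.float64) / 255.0
    return (x - mean) / std   # mean and std broadcast over the channel axis

image = np.zeros((4, 4, 3), dtype=np.uint8)
print(normalize_rgb(image).shape)  # (4, 4, 3)
```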
4. The human pose estimation method based on dynamic perception sparse attention according to claim 1, further comprising: Step S500: performing noise occlusion processing, according to the occlusion index, on the feature map corresponding to the human pose detection result to generate a feature map.
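One possible reading of step S500 is sketched below. The claim does not define the occlusion index or the noise model, so this sketch assumes an index in [0, 1] that sets the side length of a rectangular patch as a fraction of the map size, and fills that patch with Gaussian noise; the function name `occlude_with_noise` and the `center` argument are hypothetical.

```python
import numpy as np

def occlude_with_noise(feature_map, center, occlusion_index, rng):
    """Replace a patch around `center` with Gaussian noise (assumed noise model).

    feature_map: (H, W, C) array.
    occlusion_index: assumed to lie in [0, 1]; 0 leaves the map unchanged.
    """
    if not 0.0 <= occlusion_index <= 1.0:
        raise ValueError("occlusion_index must lie in [0, 1]")
    H, W = feature_map.shape[:2]
    ph, pw = int(round(occlusion_index * H)), int(round(occlusion_index * W))
    # Keep the patch inside the map by clamping its top-left corner.
    top = int(np.clip(center[0] - ph // 2, 0, H - ph))
    left = int(np.clip(center[1] - pw // 2, 0, W - pw))
    out = feature_map.astype(np.float64)   # astype returns a copy; the input is untouched
    out[top:top + ph, left:left + pw] = rng.standard_normal((ph, pw) + out.shape[2:])
    return out

rng = np.random.default_rng(0)
hard_sample = occlude_with_noise(np.ones((8, 8, 2)), center=(4, 4), occlusion_index=0.5, rng=rng)
print(hard_sample.shape)  # (8, 8, 2)
```

Raising the occlusion index enlarges the noisy patch, which gives a simple way to generate progressively harder training samples around a detected pose.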
5. A human pose estimation system based on dynamic perception sparse attention, used to perform the method as described in any one of claims 1 to 4, comprising:
a training set processing unit, configured to preprocess the raw image data and obtain feature maps of the people in the image;
a main network unit, configured to divide the feature map into regions according to the preset block size and perform linear projection to obtain the query, key, and value tensors; to derive the affinity graph of the relevant regions in the feature map through the primary routing algorithm; to select and fuse the most valuable regions in the affinity graph through the primary and secondary routing algorithms; to perform attention calculation to obtain the corresponding weight matrix and input it into the multilayer perceptron to extract pose information; and to input the pose information into the encoder to refine it and, based on the result, generate a keypoint heatmap to detect the human pose;
an inverse generation decision network unit, configured to apply noise occlusion processing, according to the occlusion index, to the feature map corresponding to the human pose detection result, and to generate a feature map for training the main network.
6. A computer-readable storage medium storing a computer program, wherein when executed, the computer program implements the steps of the human pose estimation method based on dynamic perception sparse attention as described in any one of claims 1 to 4.