A posture recognition system and method based on spatial cross convolution

By using an improved lightweight human skeleton extraction network, combined with the MobileNetV3 network featuring spatial cross-convolution and attention mechanisms, the problems of excessive parameters and slow inference speed in existing pose estimation algorithms on embedded devices are solved, enabling fast and accurate sitting posture recognition on edge devices.

CN115601789B · Active · Publication Date: 2026-06-30 · LOCTEK ERGONOMIC TECH CORP

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
LOCTEK ERGONOMIC TECH CORP
Filing Date
2022-10-24
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing pose estimation algorithms have a large number of parameters, making them unsuitable for embedded devices. Furthermore, LightweightOpenPose has too many parameters in the prediction phase, which reduces inference speed.

Method used

A lightweight human skeleton extraction network based on spatial cross-convolution is adopted, and the MobileNetV3 network with attention mechanism is combined for feature extraction. Spatial cross-convolutional layers are used to replace some standard convolutional layers to construct an improved lightweight human skeleton extraction network.

Benefits of technology

While reducing the number of model parameters, it significantly improves the model inference speed, enabling fast posture recognition on edge devices with minimal decrease in accuracy.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides a sitting posture recognition system and method based on spatial cross-convolution, relating to the field of deep learning technology. The method includes: extracting features from a human image using a feature extraction network to obtain a first feature map; extracting the human skeleton from the first feature map using an improved lightweight human skeleton extraction network to obtain a human skeleton map; the improved lightweight human skeleton extraction network includes an initialization network and at least one correction network; the initialization network includes a first initialization branch and a second initialization branch, both formed by stacking multiple standard convolutional layers and multiple spatial cross-convolutional layers; the correction network includes a first correction branch and a second correction branch, both formed by stacking multiple convolutional blocks, with some convolutional blocks using spatial cross-convolutional layers to partially replace standard convolutional layers; and performing sitting posture recognition based on the human skeleton map to obtain the sitting posture recognition result of the human image. The beneficial effect is that it can significantly improve the model inference speed with only a small decrease in model accuracy.

Description

Technical Field

[0001] This invention relates to the field of deep learning technology, and in particular to a sitting posture recognition system and method based on spatial cross convolution.

Background Technology

[0002] Currently, posture recognition algorithms widely employ pose estimation to extract human skeletal features before recognizing posture. However, existing pose estimation algorithms have a large number of parameters, making them unsuitable for deployment on embedded devices. LightweightOpenPose is a lightweight human pose estimation algorithm that can perform fast inference on a CPU, but its extensive use of traditional convolutional operations in the prediction phase leads to redundant parameters, reducing inference speed. Therefore, there is an urgent need for a posture recognition technology that can be deployed on embedded devices while also offering high inference speed.

Summary of the Invention

[0003] To address the problems existing in the prior art, this invention provides a posture recognition system based on spatial cross-convolution, comprising:

[0004] The feature extraction module is used to extract features from the input human image using a pre-built feature extraction network to obtain the corresponding first feature map;

[0005] A human skeleton extraction module, connected to the feature extraction module, is used to extract the human skeleton from the first feature map using an improved lightweight human skeleton extraction network to obtain the human skeleton map contained in the first feature map.

[0006] The improved lightweight human skeleton extraction network includes an initialization network and at least one correction network connected to the initialization network;

[0007] The initialization network includes a first initialization branch and a second initialization branch, both of which are formed by stacking multiple standard convolutional layers and multiple spatial cross convolutional layers. They are respectively used to locate key points and combine key points in the first feature map to obtain an initial key point heatmap and an initial part affinity field heatmap.

[0008] The correction network includes a first correction branch and a second correction branch, both of which are formed by stacking multiple convolutional blocks. In some of the convolutional blocks, the spatial cross convolutional layer is used to replace the standard convolutional layer. These branches are used to locate and combine key points in the second feature map formed by superimposing the first feature map, the initial key point heatmap, and the initial part affinity field heatmap, respectively, to obtain the corrected key point heatmap and the corrected part affinity field heatmap, thereby constructing the human skeleton map.

[0009] The sitting posture recognition module is connected to the human skeleton extraction module and is used to perform sitting posture recognition based on the human skeleton diagram to obtain the sitting posture recognition result of the human image.

[0010] Preferably, the feature extraction network is a MobileNetV3 network with an added attention mechanism.

[0011] Preferably, the first initialization branch and the second initialization branch each include two spatial cross convolutional layers and three standard convolutional layers connected in sequence.

[0012] Preferably, the first correction branch and the second correction branch include a first convolutional block, a second convolutional block, a third convolutional block, a fourth convolutional block, a fifth convolutional block, and two standard convolutional layers connected in sequence;

[0013] The first convolutional block, the third convolutional block, and the fifth convolutional block each comprise three standard convolutional layers connected in sequence.

[0014] The second convolutional block and the fourth convolutional block each include three convolutional layers connected in sequence, wherein the first and third convolutional layers are the standard convolutional layers, and the second convolutional layer is the spatial cross convolutional layer.

[0015] Preferably, the spatial cross convolutional layer comprises:

[0016] An adaptive positional encoding module is used to perform positional encoding on the input feature map to obtain an encoded feature map, wherein each pixel in the encoded feature map is marked with the position information of the pixel in the input feature map;

[0017] A spatial separation and recombination module, connected to the adaptive position encoding module, is used to recombine the pixels of each channel in the encoded feature map to obtain a recombined feature map, wherein the recombined feature map contains feature information of all channels;

[0018] A depthwise separable convolution module, connected to the spatial separation and recombination module, is used to sequentially perform channel-wise convolution and point-wise convolution processing on the recombined feature map.

[0019] Preferably, the adaptive position encoding module includes:

[0020] The position encoding unit is used to feed the input feature map into a 3*3 group convolution to generate a position mapping feature map;

[0021] The feature fusion unit, connected to the position encoding unit, is used to fuse the input feature map with the position mapping feature map to obtain the encoded feature map.

[0022] Preferably, in the spatial separation and recombination module, the pixels of each channel in the encoded feature map are recombined using the following formula:

[0023] $$F_{r} = T\big(T(F_{e},\, 2,\, 3),\, 1,\, 3\big)$$

[0024] where $F_{r}$ denotes the recombined feature map, $F_{e}$ denotes the encoded feature map, $T$ denotes the matrix transpose function, and 1, 2, 3 denote the first, second, and third dimensions of the corresponding feature map, respectively.

[0025] This invention also provides a sitting posture recognition method based on spatial cross-convolution, applied to the aforementioned sitting posture recognition system, the sitting posture recognition method comprising:

[0026] Step S1: Extract features from the input human image using a pre-constructed feature extraction network to obtain the corresponding first feature map;

[0027] Step S2: The first feature map is fed into an improved lightweight human skeleton extraction network to obtain the human skeleton map contained in the first feature map.

[0028] The improved lightweight human skeleton extraction network includes an initialization network and at least one correction network connected to the initialization network;

[0029] The initialization network includes a first initialization branch and a second initialization branch, both of which are formed by stacking multiple standard convolutional layers and multiple spatial cross convolutional layers. They are respectively used to locate key points and combine key points in the first feature map to obtain an initial key point heatmap and an initial part affinity field heatmap.

[0030] The correction network includes a first correction branch and a second correction branch, both of which are formed by stacking multiple convolutional blocks. In some of the convolutional blocks, the spatial cross convolutional layer is used to replace the standard convolutional layer. These branches are used to locate and combine key points in the second feature map formed by superimposing the first feature map, the initial key point heatmap, and the initial part affinity field heatmap, respectively, to obtain the corrected key point heatmap and the corrected part affinity field heatmap, thereby constructing the human skeleton map.

[0031] Step S3: Perform posture recognition based on the human skeleton diagram to obtain the posture recognition result of the human image.

[0032] Preferably, the feature extraction network is a MobileNetV3 network with an added attention mechanism.

[0033] Preferably, the spatial cross convolutional layer comprises:

[0034] An adaptive positional encoding module is used to perform positional encoding on the input feature map to obtain an encoded feature map, wherein each pixel in the encoded feature map is marked with the position information of the pixel in the input feature map;

[0035] A spatial separation and recombination module, connected to the adaptive position encoding module, is used to recombine the pixels of each channel in the encoded feature map to obtain a recombined feature map, wherein the recombined feature map contains feature information of all channels;

[0036] A depthwise separable convolution module, connected to the spatial separation and recombination module, is used to sequentially perform channel-wise convolution and point-wise convolution processing on the recombined feature map.

[0037] The above technical solution has the following advantages or beneficial effects: building on the lightweight human pose estimation network framework of LightweightOpenPose, it uses a MobileNetV3 network with an added attention mechanism for image feature extraction, which obtains feature information with higher importance weights. At the same time, it uses spatial cross convolutional layers in place of traditional standard convolutional layers, which captures rich global feature information from the image and reduces the number of model parameters. It can therefore significantly improve model inference speed with minimal decrease in model accuracy, and can achieve fast sitting posture recognition on edge devices.

Description of the Drawings

[0038] Figure 1 is a schematic diagram of a sitting posture recognition system based on spatial cross-convolution in a preferred embodiment of the present invention.

[0039] Figure 2 is a schematic diagram of the network structure of the feature extraction network and the improved lightweight human skeleton extraction network in a preferred embodiment of the present invention.

[0040] Figure 3 is a schematic diagram of the structure of a spatial cross convolutional layer in a preferred embodiment of the present invention.

[0041] Figure 4 is a schematic diagram of the spatial cross-separation and recombination process in a preferred embodiment of the present invention.

[0042] Figure 5 is a flowchart of a sitting posture recognition method based on spatial cross-convolution in a preferred embodiment of the present invention.

Detailed Implementation

[0043] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. The present invention is not limited to this embodiment; other embodiments that conform to the spirit of the present invention may also fall within the scope of the present invention.

[0044] In a preferred embodiment of the present invention, in view of the above-mentioned problems in the prior art, a sitting posture recognition system based on spatial cross-convolution is provided. As shown in Figure 1 and Figure 2, it includes:

[0045] Feature extraction module 1 is used to extract features from the input human image using a pre-built feature extraction network to obtain the corresponding first feature map;

[0046] Human skeleton extraction module 2, connected to feature extraction module 1, is used to extract the human skeleton from the first feature map using an improved lightweight human skeleton extraction network to obtain the human skeleton map contained in the first feature map.

[0047] The improved lightweight human skeleton extraction network includes an initialization network 100 and at least one correction network 200 connecting the initialization network 100;

[0048] The initialization network 100 includes a first initialization branch 101 and a second initialization branch 102, both of which are formed by stacking multiple standard convolutional layers and multiple spatial cross convolutional layers. They are respectively used to locate key points and combine key points in the first feature map to obtain the initial key point heat map and the initial part affinity field heat map.

[0049] The correction network 200 includes a first correction branch 201 and a second correction branch 202, both of which are formed by stacking multiple convolutional blocks. In some convolutional blocks, spatial cross convolutional layers are used to replace standard convolutional layers. These are used to locate and combine key points in the second feature map formed by superimposing the first feature map, the initial key point heatmap, and the initial part affinity field heatmap, respectively, so as to obtain the corrected key point heatmap and the corrected part affinity field heatmap, thereby constructing a human skeleton map.

[0050] The sitting posture recognition module 3 is connected to the human skeleton extraction module 2 and is used to perform sitting posture recognition based on the human skeleton diagram to obtain the sitting posture recognition result of the human image.

[0051] Specifically, in this embodiment, the improved lightweight human skeleton extraction network is based on the lightweight human pose estimation network framework of LightweightOpenPose. The original feature extraction network in LightweightOpenPose is improved by adding an attention mechanism to the MobileNetV3 network, enabling it to obtain feature information with higher importance weights. Considering that standard convolutions can efficiently learn the overall features of all channels of the input feature map, but have a large number of parameters and high computational cost, consuming significant resources when deployed on edge devices and greatly reducing the inference speed of the network model, this embodiment uses spatial cross-convolutional layers to replace some of the traditional convolutions (i.e., standard convolutional layers) in the prediction stage of the LightweightOpenPose network. Spatial cross-convolutional layers can obtain rich global feature information from the image and reduce the number of model parameters, ultimately significantly improving the model's inference speed with only a small decrease in model accuracy.

[0052] More specifically, the first initialization branch 101 and the second initialization branch 102 each include two spatial cross convolutional layers C1 and three standard convolutional layers C2 connected in sequence.

[0053] Specifically, in this embodiment, the spatial cross convolutional layer C1 adopts a 3*3 convolution. Among the three standard convolutional layers C2 connected to the spatial cross convolutional layer C1, the first standard convolutional layer adopts a 3*3 convolution, the second standard convolutional layer connected to the first standard convolutional layer adopts a 3*3 convolution, and the third standard convolutional layer connected to the second standard convolutional layer adopts a 1*1 convolution.

[0054] In a preferred embodiment of the present invention, the first correction branch 201 and the second correction branch 202 include a first convolutional block L1, a second convolutional block L2, a third convolutional block L3, a fourth convolutional block L4, a fifth convolutional block L5 and two standard convolutional layers C2 connected in sequence.

[0055] The first convolutional block L1, the third convolutional block L3, and the fifth convolutional block L5 consist of three standard convolutional layers C2 connected in sequence;

[0056] The second convolutional block L2 and the fourth convolutional block L4 each consist of three convolutional layers connected in sequence, wherein the first and third convolutional layers are standard convolutional layers C2, and the second convolutional layer is a spatial cross convolutional layer C1.

[0057] Specifically, in this embodiment, among the three standard convolutional layers C2 sequentially connected in each of the first convolutional block L1, the third convolutional block L3, and the fifth convolutional block L5, the first standard convolutional layer uses a 1x1 convolution, the second uses a 3x3 convolution, and the third uses a 3x3 convolution. Among the three convolutional layers sequentially connected in each of the second convolutional block L2 and the fourth convolutional block L4, the first is a 1x1 standard convolutional layer, the second is a 3x3 spatial cross convolutional layer, and the third is a 3x3 standard convolutional layer. The two standard convolutional layers C2 connected to the output of the fifth convolutional block L5 are both 1x1 convolutions.
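The layer ordering described in [0052] to [0057] can be written down as a plain layer specification. The following is a minimal sketch for illustration only: the tuple format, the constant names, and the function names are assumptions and do not appear in this document, and channel widths are omitted because the text does not state them.

```python
# Layer kinds; the names are illustrative placeholders.
SCC = "spatial_cross"   # spatial cross convolutional layer C1
STD = "standard"        # standard convolutional layer C2


def initialization_branch():
    """Two 3x3 spatial cross layers, then standard 3x3, 3x3 and 1x1 layers."""
    return [(SCC, 3), (SCC, 3), (STD, 3), (STD, 3), (STD, 1)]


def correction_branch():
    """Blocks L1..L5 followed by two standard 1x1 layers."""
    plain = [(STD, 1), (STD, 3), (STD, 3)]   # blocks L1, L3, L5
    mixed = [(STD, 1), (SCC, 3), (STD, 3)]   # blocks L2, L4
    blocks = [plain, mixed, plain, mixed, plain]
    layers = [layer for block in blocks for layer in block]
    return layers + [(STD, 1), (STD, 1)]


if __name__ == "__main__":
    for kind, kernel in correction_branch():
        print(f"{kind:14s} {kernel}x{kernel}")
```

Under this reading, each correction branch has 17 convolutional layers, of which only two are spatial cross convolutional layers, so most of the branch is left unchanged.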

[0058] In a preferred embodiment of the present invention, as shown in Figure 3, the spatial cross convolutional layer C1 includes:

[0059] The adaptive position encoding module C11 is used to perform position encoding on the input feature map to obtain the encoded feature map. Each pixel in the encoded feature map is marked with the position information of the pixel in the input feature map.

[0060] The spatial separation and recombination module C12 is connected to the adaptive position encoding module C11. It is used to recombine the pixels of each channel in the encoded feature map to obtain a recombined feature map. The recombined feature map contains the feature information of all channels.

[0061] The depthwise separable convolution module C13 is connected to the spatial separation and recombination module C12 and is used to perform channel-wise convolution and point-wise convolution on the recombined feature map in sequence.

[0062] Specifically, while standard convolutions can efficiently learn the overall features of all channels in the input feature map, their large number of parameters and computational demands result in significant resource consumption during deployment on edge devices, drastically reducing the inference speed of the network model. To improve this speed, standard convolutions need to be improved. While the depthwise separable convolution proposed in MobileNetV1 effectively reduces convolution parameters, its depth-wise (DW) convolution only learns features from a single channel, ignoring information from other channels. Therefore, this technical solution, based on depthwise separable convolutions, employs spatial cross-convolutional layers to separate and reorganize spatial pixels across all channels. Furthermore, it incorporates information from other channels within a single channel, enabling each convolutional kernel to learn global channel information during DW convolution.

[0063] More specifically, since spatial pixels on all channels need to be separated and recombined, after spatial cross-separation and recombination, the pixels in the feature map will leave their original positions, causing spatial disorder, which is detrimental to algorithm learning. Therefore, before spatial separation and recombination, the position information of each pixel needs to be marked. In this embodiment, the input feature map is position-encoded by an adaptive position encoding module C11. In a preferred embodiment of the present invention, the adaptive position encoding module C11 includes:

[0064] The position encoding unit C111 is used to feed the input feature map into a 3*3 group convolution to generate a position mapping feature map;

[0065] The feature fusion unit C112 is connected to the position encoding unit C111 and is used to fuse the input feature map with the position mapping feature map to obtain the encoded feature map.

[0066] Specifically, in this embodiment, the position encoding can be performed using the following formula:

[0067] ;

[0068] where the terms of the formula denote, respectively, the positional encoding function (a 3x3 group convolution), the input feature map, and the encoded feature map.

[0069] More specifically, before spatial crossing of the input feature map, a single-layer 3*3 group convolution is used to produce a feature mapping of the same size as the input feature map, that is, the above-mentioned position mapping feature map, which represents the positional information of the original input feature map. The encoded positional information is then fused with the original input feature map, so that each pixel of the feature map retains its original positional information.
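The position encoding step can be illustrated with a small pure-Python sketch. It rests on two assumptions that the text does not confirm: the group convolution uses one group per channel, and "fusing" means element-wise addition. The function names and the all-ones kernel are hypothetical; in the actual network the kernel weights would be learned.

```python
def group_conv3x3(fmap, kernels):
    """3x3 convolution with one group per channel, stride 1, zero padding 1.

    fmap is a C x H x W nested list; kernels holds one 3x3 kernel per channel.
    """
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        y, x = i + di, j + dj
                        if 0 <= y < h and 0 <= x < w:  # zero padding outside the map
                            acc += kernels[ch][di + 1][dj + 1] * fmap[ch][y][x]
                out[ch][i][j] = acc
    return out


def adaptive_position_encode(fmap, kernels):
    """Fuse the input with its position mapping (fusion assumed to be addition)."""
    pos = group_conv3x3(fmap, kernels)
    return [[[a + b for a, b in zip(row_f, row_p)]
             for row_f, row_p in zip(plane_f, plane_p)]
            for plane_f, plane_p in zip(fmap, pos)]


if __name__ == "__main__":
    fmap = [[[1.0, 2.0], [3.0, 4.0]]]        # one channel, 2 x 2
    ones = [[[1.0] * 3 for _ in range(3)]]   # hypothetical kernel, not learned weights
    print(adaptive_position_encode(fmap, ones))
```

The position mapping has the same size as the input, as [0069] requires, so the fused result can be passed straight to the separation and recombination step.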

[0070] After an encoded feature map carrying the positional information of each pixel is obtained through positional encoding, spatial separation and recombination can be performed on it. Since each kernel of a depthwise convolution only operates on its corresponding intra-group channel and ignores the features of other channels, this embodiment uses a spatial crossover method to spatially separate and recombine all channels of the input feature map, thus combining the feature information of all channels. The spatial crossover operation sequentially extracts one pixel from each channel and reassembles them in order. The resulting recombined feature map is the same size as the input encoded feature map. Taking an encoded feature map with two channels of 4 pixels each as an example, as shown in Figure 4, the first pixel of the first channel, the first pixel of the second channel, the second pixel of the first channel, and the second pixel of the second channel are extracted in sequence to obtain the recombined result of the first channel, and so on. The above explains the spatial cross-separation and recombination process in principle. During execution, the pixels of each channel in the encoded feature map can be recombined using the following formula:

[0071] $$F_{r} = T\big(T(F_{e},\, 2,\, 3),\, 1,\, 3\big)$$

[0072] where $F_{r}$ denotes the recombined feature map, $F_{e}$ denotes the encoded feature map, $T$ denotes the matrix transpose function, and 1, 2, 3 denote the first, second, and third dimensions of the corresponding feature map, respectively.

[0073] Specifically, in this embodiment, based on the above formula, the second and third dimensions of the encoded feature map are transposed to obtain a transposed feature map, and the first and third dimensions of that transposed feature map are then transposed to obtain the recombined feature map.
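The two transposes can be checked against the two-channel example from [0070] with a short pure-Python sketch. One step here is an assumption: after the transposes, the values are read back in row-major order into the original C x H x W shape. The formula does not state this, but it appears to be implied by the statement that the recombined feature map has the same size as the encoded one. All function names are illustrative.

```python
def transpose_2_3(f):
    """Swap the 2nd and 3rd dimensions: (A, B, C) -> (A, C, B)."""
    return [[list(col) for col in zip(*plane)] for plane in f]


def transpose_1_3(f):
    """Swap the 1st and 3rd dimensions: (A, B, C) -> (C, B, A)."""
    a, b, c = len(f), len(f[0]), len(f[0][0])
    return [[[f[i][j][k] for i in range(a)] for j in range(b)] for k in range(c)]


def spatial_cross_recombine(encoded):
    """Recombine a C x H x W feature map given as nested lists."""
    c, h, w = len(encoded), len(encoded[0]), len(encoded[0][0])
    permuted = transpose_1_3(transpose_2_3(encoded))   # now H x W x C
    flat = [v for row in permuted for pixel in row for v in pixel]
    # Assumed step: read the values back in row-major order as C x H x W.
    it = iter(flat)
    return [[[next(it) for _ in range(w)] for _ in range(h)] for _ in range(c)]


if __name__ == "__main__":
    # Two channels with four pixels each, numbered 1..4 and 5..8.
    encoded = [[[1, 2], [3, 4]],
               [[5, 6], [7, 8]]]
    print(spatial_cross_recombine(encoded))
```

With this input the first output channel holds 1, 5, 2, 6, that is, the first pixel of the first channel, the first pixel of the second channel, the second pixel of the first channel, and the second pixel of the second channel, which matches the order described for Figure 4.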

[0074] After the recombined feature map is obtained, it is fed into a depthwise separable convolution to complete the spatial cross-convolution. Depthwise separable convolution, proposed by Andrew Howard et al. in MobileNetV1, is an operation used to replace standard convolution. Compared to standard convolution, it reduces computation by a factor of about 8 with minimal decrease in accuracy, and is widely used in computer vision. It mainly consists of two parts: DW convolution (channel-wise convolution) and point-wise (PW) convolution. DW convolution is a group convolution whose number of groups equals the number of input channels, so it outputs a feature map with the same number of channels as its input. PW convolution is a set of 1×1 standard convolutions, mainly used to combine feature information from all channels and adjust the number of output channels.

[0075] As a preferred implementation, assuming the input feature map size is F=[128,256,256], the convolution kernel size is 3*3, the output channels are 128, the stride is 1, and the zero padding is 1, the parameter count and computational cost of standard convolution, depthwise separable convolution, and spatial cross convolution are calculated respectively, and the results are shown in Table 1 below:

[0076] Table 1 Comparison of the number of parameters and computational cost for three types of convolution.

[0077]

[0078] The comparison revealed that the standard convolution has the largest number of parameters and computational cost, approximately eight times that of the depthwise separable convolution. In contrast, the spatial cross convolution only adds a position information mapping layer compared to the depthwise separable convolution, so the difference in the number of parameters and computational cost is not significant.
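The rough magnitudes in this comparison can be reproduced with a back-of-the-envelope calculation for the configuration in [0075]. This is a sketch under stated assumptions, not a reconstruction of Table 1: biases are ignored, computational cost is counted as multiply-accumulate operations, and the position-encoding group convolution is assumed to use one group per channel, which the text does not specify. The function names are illustrative.

```python
def conv_params(k, c_in, c_out, groups=1):
    """Weight count of a k x k convolution (biases ignored)."""
    return k * k * (c_in // groups) * c_out


def macs(params, h_out, w_out):
    """Multiply-accumulate count: every weight is used once per output position."""
    return params * h_out * w_out


C, H, W = 128, 256, 256   # input feature map F = [128, 256, 256]
K, C_OUT = 3, 128         # 3*3 kernel, 128 output channels; stride 1, padding 1 keep H x W

standard = conv_params(K, C, C_OUT)
depthwise = conv_params(K, C, C, groups=C)
pointwise = conv_params(1, C, C_OUT)
separable = depthwise + pointwise
# Assumption: the position-encoding group convolution has one group per channel.
position_encoding = conv_params(K, C, C, groups=C)
spatial_cross = position_encoding + separable

if __name__ == "__main__":
    for name, p in [("standard", standard),
                    ("depthwise separable", separable),
                    ("spatial cross", spatial_cross)]:
        print(f"{name}: params={p}, MACs={macs(p, H, W)}")
    print(f"standard / separable = {standard / separable:.2f}")
```

Under these assumptions the standard convolution has 147,456 weights against 17,536 for the depthwise separable convolution, a ratio of about 8.4, and the extra position-encoding layer adds only 1,152 weights.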

[0079] As another preferred implementation, the network model of this technical solution (including a feature extraction network and an improved lightweight human skeleton extraction network) can be trained on the COCO training set. Taking 280 training epochs as an example, the training can be divided into three parts: in the first part, the prediction stage (refinement-stage) is set to 1 (i.e., one correction network is configured), the pre-trained parameters of the MobileNetV3 network are loaded, the model is trained for 80 epochs, and the model and optimizer parameters are saved; in the second part, the parameters saved in the first part are reloaded, training continues for 100 epochs, and the model and optimizer parameters are saved; in the third part, the refinement-stage is set to 3 (i.e., three sequentially connected correction networks are configured), the parameters saved in the second part are loaded, and training continues for 100 epochs. The final results are shown in Table 2 below:

[0080] Table 2 shows the performance of this technical solution on the COCO validation set:

[0081]

[0082] As can be seen from Table 2 above, by improving the feature extraction network and the lightweight human skeleton extraction network, the number of model parameters can be reduced, and the model inference speed can be significantly improved with minimal decrease in model accuracy.

[0083] As another preferred implementation, the inference part of the network model (including a feature extraction network and an improved lightweight human skeleton extraction network) of this technical solution preferably uses the Microsoft ONNX Runtime inference framework. This framework is a cross-platform machine learning model accelerator with flexible interfaces for integrating hardware-specific libraries, and it can be optimized for hardware on different platforms, such as GPUs, CPUs, and FPGAs, to accelerate inference. The model's performance was tested on test data after inference acceleration. Two 720P videos from YouTube were used as test data, each containing more than 20 poses. The final video test results show that the network model of this technical solution infers a 720*1280 image in only about 160ms, while LightweightOpenPose takes about 660ms, making it roughly 4 times faster. In terms of model parameters, this technical solution also reduces the number of parameters by about 22% compared to LightweightOpenPose. Specific results are shown in Table 3 below.

[0084] Table 3 Comparison of inference speed on 720P video.

[0085]

[0086] As can be seen from Table 3 above, this technical solution reduces the number of model parameters while possessing excellent FPS (inference speed).
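The per-frame latencies quoted in [0083] can be converted into frame rates and a speedup factor with simple arithmetic. The snippet below is only a worked calculation on those two approximate figures (about 160ms and about 660ms per frame); it does not reproduce Table 3, and the function name is illustrative.

```python
def fps(latency_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


proposed_ms, baseline_ms = 160, 660   # approximate per-frame latencies from [0083]
speedup = baseline_ms / proposed_ms

if __name__ == "__main__":
    print(f"proposed: {fps(proposed_ms):.2f} FPS")
    print(f"baseline: {fps(baseline_ms):.2f} FPS")
    print(f"speedup:  {speedup:.3f}x")
```

These figures correspond to about 6.25 frames per second against about 1.5, a speedup of a little over 4 times.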

[0087] This invention also provides a sitting posture recognition method based on spatial cross-convolution, which can be applied to the aforementioned sitting posture recognition system. As shown in Figure 5, the sitting posture recognition method includes:

[0088] Step S1: Extract features from the input human image using a pre-constructed feature extraction network to obtain the corresponding first feature map;

[0089] Step S2: The first feature map is fed into the improved lightweight human skeleton extraction network for processing to obtain the human skeleton map contained in the first feature map.

[0090] The improved lightweight human skeleton extraction network includes an initialization network and at least one correction network connected to the initialization network;

[0091] The initialization network includes a first initialization branch and a second initialization branch, both of which are formed by stacking multiple standard convolutional layers and multiple spatial cross convolutional layers. They are used to locate key points and combine key points in the first feature map to obtain the initial key point heatmap and the initial part affinity field heatmap, respectively.

[0092] The correction network includes a first correction branch and a second correction branch, both of which are formed by stacking multiple convolutional blocks. In some convolutional blocks, spatial cross-convolutional layers are used to replace standard convolutional layers. These are used to locate and combine key points in the second feature map formed by superimposing the first feature map, the initial key point heatmap, and the initial part affinity field heatmap, respectively, so as to obtain the corrected key point heatmap and the corrected part affinity field heatmap, and then construct the human skeleton map.

[0093] Step S3: Perform posture recognition based on the human skeleton diagram to obtain the posture recognition result of the human image.

[0094] In a preferred embodiment of the present invention, the feature extraction network is a MobileNetV3 network with an added attention mechanism.

[0095] In a preferred embodiment of the present invention, the spatial cross-convolutional layer includes:

[0096] The adaptive positional encoding module is used to perform positional encoding on the input feature map to obtain the encoded feature map. Each pixel in the encoded feature map is marked with the pixel's position information in the input feature map.

[0097] The spatial separation and recombination module, connected to the adaptive position encoding module, is used to recombine the pixels of each channel in the encoded feature map to obtain a recombined feature map, which contains feature information of all channels.

[0098] The depthwise separable convolution module, connected to the spatial separation and recombination module, is used to perform channel-wise convolution and point-wise convolution on the recombined feature map in sequence.
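The three modules above can be sketched in plain Python on nested lists. This is an illustrative sketch under stated assumptions, not the patented implementation: the position encoding is modeled as a per-channel 3x3 convolution (one possible form of group convolution) whose output is fused with the input by addition; the recombination is modeled as swapping the first two dimensions of the feature map, because the exact permutation is not recoverable from this text; and all function and parameter names are hypothetical.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # indexed [dim1][dim2][dim3]


def depthwise_conv3x3(x: Tensor3, kernels: Tensor3) -> Tensor3:
    """Channel-wise 3x3 convolution with zero padding, one kernel per channel."""
    if len(x) != len(kernels):
        raise ValueError("need exactly one 3x3 kernel per channel")
    out = []
    for plane, k in zip(x, kernels):
        h, w = len(plane), len(plane[0])
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += k[di + 1][dj + 1] * plane[ii][jj]
                res[i][j] = acc
        out.append(res)
    return out


def pointwise_conv(x: Tensor3, weights: List[List[float]]) -> Tensor3:
    """1x1 convolution: weights[o][c] mixes input channel c into output channel o."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wo[c] * x[c][i][j] for c in range(len(x))) for j in range(w)]
             for i in range(h)] for wo in weights]


def add(a: Tensor3, b: Tensor3) -> Tensor3:
    """Element-wise sum, used here as the (assumed) feature fusion."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(pa, pb)]
            for pa, pb in zip(a, b)]


def swap_first_two_dims(x: Tensor3) -> Tensor3:
    """Transpose dims (1, 2, 3) -> (2, 1, 3); this permutation is an assumption."""
    return [[x[c][i] for c in range(len(x))] for i in range(len(x[0]))]


def spatial_cross_conv(x: Tensor3, pos_kernels: Tensor3, dw_kernels: Tensor3,
                       pw_weights: List[List[float]]) -> Tensor3:
    encoded = add(x, depthwise_conv3x3(x, pos_kernels))  # adaptive position encoding
    recombined = swap_first_two_dims(encoded)            # separation and recombination
    channelwise = depthwise_conv3x3(recombined, dw_kernels)
    return pointwise_conv(channelwise, pw_weights)       # point-wise convolution
```

After the swap, the channel-wise convolution slides over a plane that spans the original channel axis, so each output pixel draws on several original channels. With the assumed permutation, `dw_kernels` must supply one kernel per row of the input map, and each row of `pw_weights` must have one weight per row of the input map.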

[0099] The above description is merely a preferred embodiment of the present invention and does not limit the implementation and protection scope of the present invention. Those skilled in the art should realize that any equivalent substitutions and obvious changes made using the content of this specification and illustrations should be included within the protection scope of the present invention.

Claims

1. A posture recognition system based on spatial cross-convolution, characterized in that it comprises: a feature extraction module, used to extract features from the input human image using a pre-built feature extraction network to obtain the corresponding first feature map, wherein the feature extraction network is a MobileNetV3 network with an added attention mechanism; a human skeleton extraction module, connected to the feature extraction module and used to extract the human skeleton from the first feature map using an improved lightweight human skeleton extraction network to obtain the human skeleton map contained in the first feature map, wherein the improved lightweight human skeleton extraction network includes an initialization network and at least one correction network connected to the initialization network; the initialization network includes a first initialization branch and a second initialization branch, both formed by stacking multiple standard convolutional layers and multiple spatial cross-convolutional layers, and used respectively to locate key points and to combine key points in the first feature map to obtain an initial key point heatmap and an initial part affinity field heatmap; the correction network includes a first correction branch and a second correction branch, both formed by stacking multiple convolutional blocks, with the spatial cross-convolutional layer replacing the standard convolutional layer in some of the convolutional blocks, and used respectively to locate key points and to combine key points in a second feature map formed by superimposing the first feature map, the initial key point heatmap, and the initial part affinity field heatmap, so as to obtain a corrected key point heatmap and a corrected part affinity field heatmap and thereby construct the human skeleton map; and a sitting posture recognition module, connected to the human skeleton extraction module and used to perform sitting posture recognition based on the human skeleton map to obtain the sitting posture recognition result of the human image.

2. The posture recognition system according to claim 1, characterized in that the first initialization branch and the second initialization branch each include two spatial cross-convolutional layers and three standard convolutional layers connected in sequence.

3. The posture recognition system according to claim 1, characterized in that, The first correction branch and the second correction branch include a first convolutional block, a second convolutional block, a third convolutional block, a fourth convolutional block, a fifth convolutional block, and two standard convolutional layers connected in sequence; The first convolutional block, the third convolutional block, and the fifth convolutional block each comprise three standard convolutional layers connected in sequence. The second convolutional block and the fourth convolutional block each include three convolutional layers connected in sequence, wherein the first and third convolutional layers are the standard convolutional layers, and the second convolutional layer is the spatial cross convolutional layer.

4. The sitting posture recognition system according to claim 1, 2, or 3, characterized in that the spatial cross-convolutional layer includes: an adaptive position encoding module, used to perform positional encoding on the input feature map to obtain an encoded feature map, wherein each pixel in the encoded feature map is marked with the position information of the pixel in the input feature map; a spatial separation and recombination module, connected to the adaptive position encoding module and used to recombine the pixels of each channel in the encoded feature map to obtain a recombined feature map, wherein the recombined feature map contains feature information of all channels; and a depthwise separable convolution module, connected to the spatial separation and recombination module and used to sequentially perform channel-wise convolution and point-wise convolution on the recombined feature map.

5. The posture recognition system according to claim 4, characterized in that, The adaptive position encoding module includes: The position encoding unit is used to feed the input feature map into a 3*3 group convolution to generate a position mapping feature map; The feature fusion unit, connected to the position encoding unit, is used to fuse the input feature map with the position mapping feature map to obtain the encoded feature map.

6. The posture recognition system according to claim 4, characterized in that, in the spatial separation and recombination module, the pixels of each channel in the encoded feature map are recombined using the following formula: [formula not reproduced in the source text]; wherein the symbols of the formula denote, in order, the recombined feature map, the encoded feature map, and the matrix transpose function, and 1, 2, 3 denote the first, second, and third dimensions of the corresponding feature map, respectively.

7. A posture recognition method based on spatial cross-convolution, characterized in that, The posture recognition method, applied to the posture recognition system as described in any one of claims 1-6, comprises: Step S1: Extract features from the input human image using a pre-constructed feature extraction network to obtain the corresponding first feature map; Step S2: The first feature map is fed into an improved lightweight human skeleton extraction network to obtain the human skeleton map contained in the first feature map. The improved lightweight human skeleton extraction network includes an initialization network and at least one correction network connected to the initialization network; The initialization network includes a first initialization branch and a second initialization branch, both of which are formed by stacking multiple standard convolutional layers and multiple spatial cross convolutional layers. They are respectively used to locate key points and combine key points in the first feature map to obtain an initial key point heatmap and an initial part affinity field heatmap. The correction network includes a first correction branch and a second correction branch, both of which are formed by stacking multiple convolutional blocks. In some of the convolutional blocks, the spatial cross convolutional layer is used to replace the standard convolutional layer. These branches are used to locate and combine key points in the second feature map formed by superimposing the first feature map, the initial key point heatmap, and the initial part affinity field heatmap, respectively, to obtain the corrected key point heatmap and the corrected part affinity field heatmap, thereby constructing the human skeleton map. Step S3: Perform posture recognition based on the human skeleton diagram to obtain the posture recognition result of the human image.

8. The sitting posture recognition method according to claim 7, characterized in that, The feature extraction network is a MobileNetV3 network with an added attention mechanism.

9. The sitting posture recognition method according to claim 7, characterized in that the spatial cross-convolutional layer includes: an adaptive position encoding module, used to perform positional encoding on the input feature map to obtain an encoded feature map, wherein each pixel in the encoded feature map is marked with the position information of the pixel in the input feature map; a spatial separation and recombination module, connected to the adaptive position encoding module and used to recombine the pixels of each channel in the encoded feature map to obtain a recombined feature map, wherein the recombined feature map contains feature information of all channels; and a depthwise separable convolution module, connected to the spatial separation and recombination module and used to sequentially perform channel-wise convolution and point-wise convolution on the recombined feature map.