Gesture recognition method and model training method and device based on neural network model

By using a neural network model that integrates gesture semantic features and key point features, this method solves the accuracy problem of gesture recognition in complex scenarios in existing technologies, and achieves high-accuracy recognition even under occlusion and blurring conditions.

CN122244937A · Pending · Publication Date: 2026-06-19 · GRAVITYXR ELECTRONICS & TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: GRAVITYXR ELECTRONICS & TECH CO LTD
Filing Date: 2024-12-17
Publication Date: 2026-06-19

Patent Timeline

17 Dec 2024: Application
19 Jun 2026: Publication (CN122244937A)

IPC: G06V40/20; G06V10/764; G06V10/80; G06V10/82; G06V10/44

AI Tagging

Application Domain

Character and pattern recognition


AI Technical Summary

Technical Problem

Existing technologies have low accuracy in gesture recognition, especially when the semantic features of the gesture are not obvious, the gesture is partially occluded, or the image is blurry, and cannot be applied to complex scenarios.

Method used

A gesture recognition method based on a neural network model is adopted, which fuses pose semantic features with key point features. Through a feature extraction module, a semantic feature representation module, a key point feature representation module, and a classifier, the gesture type is recognized from the pose semantic features and the Gaussian distribution data of the key points.

Benefits of technology

It improves the accuracy and robustness of gesture recognition, and can still accurately identify gesture types even when some key points in the image are occluded or gesture features are not obvious.



Abstract

This application provides a gesture recognition method, model training method, and apparatus based on a neural network model. The gesture recognition method includes: using a feature extraction module to extract features from the gesture image to be recognized to obtain initial features; using a semantic feature representation module to obtain pose semantic features based on the initial features; using a keypoint feature representation module to obtain the coordinates and probability values of multiple keypoints in the gesture image based on the initial features, and determining the Gaussian distribution data of each keypoint based on the coordinates and probabilities of each keypoint; and using a classifier to classify the gesture image into at least one predefined gesture type based on the fusion result of the pose semantic features and the Gaussian distribution data of multiple keypoints. By comprehensively considering the pose semantic features and keypoint information for gesture recognition, the recognition accuracy is improved, and it can be applied to gesture recognition in scenarios such as image blurring and occluded gestures.


Description

Technical Field

[0001] This application relates to the field of image recognition technology, and in particular to a gesture recognition method, model training method and apparatus based on a neural network model.

Background Technology

[0002] Gesture recognition is a crucial technology in applications such as virtual reality and augmented reality. It enables interaction with the virtual environment, such as grasping and moving objects.

[0003] In related technologies, gesture recognition often utilizes convolutional neural networks to extract pose semantic features from images and classifies them based on these features, outputting the type with the highest confidence as the gesture type. However, when the pose semantic features of a user's gesture are not obvious, such as a pinch gesture, or when the image contains a lot of noise or the gesture is partially occluded, the accuracy of the aforementioned method is low, or it may even fail to recognize the gesture.

[0004] Therefore, there is an urgent need to provide a gesture recognition solution with high accuracy and high robustness.

Summary of the Invention

[0005] This application provides a gesture recognition method, model training method, and apparatus based on a neural network model. It fuses pose semantic features with key point features and uses the fused features to recognize gesture types, thereby improving recognition accuracy. Even when the pose semantic features of the gesture are not obvious, the gesture is partially occluded, or the image is blurred, it can still accurately identify the gesture type in the image, demonstrating strong robustness.

[0006] In a first aspect, this application provides a gesture recognition method based on a neural network model, wherein the neural network model includes a feature extraction module, a semantic feature representation module, a key point feature representation module, and a classifier, and the method includes:

[0007] The feature extraction module is used to extract features from the gesture image to be recognized to obtain initial features;

[0008] Using the semantic feature representation module, pose semantic features are obtained based on the initial features;

[0009] Using the key point feature representation module, based on the initial features, the coordinates and probability values of multiple key points in the gesture image are obtained, and based on the coordinates and probabilities of each key point, the Gaussian distribution data of each key point is determined;

[0010] Using the classifier, the gesture image is classified into at least one predefined gesture type based on the fusion result of the pose semantic features and the Gaussian distribution data of the multiple key points.

[0011] Optionally, the method further includes:

[0012] Calculating the product of the pose semantic features and the Gaussian distribution data of each key point to obtain the Gaussian heat distribution data of each key point;

[0013] The classifier, based on the fusion result of the pose semantic features and the Gaussian distribution data of the multiple key points, classifies the gesture image into at least one predefined gesture type, including:

[0014] Using the classifier, the gesture image is classified into at least one predefined gesture type based on the Gaussian heat distribution data of each key point.

[0015] Optionally, the semantic feature representation module includes a first transformation unit and a feature processing unit; using the semantic feature representation module, based on the initial features, pose semantic features are obtained, including:

[0016] Using the first transformation unit, the initial features are converted into a pose matrix;

[0017] Using the feature processing unit, the pose semantic features are determined based on the correlation between features at different positions in the pose matrix.

[0018] Optionally, the step of using the classifier to classify the gesture image into at least one predefined gesture type based on the Gaussian heat distribution data of each of the key points includes:

[0019] Using the classifier, based on the concatenation result of the pose matrix and the Gaussian heat distribution data of the multiple key points, the gesture image is classified into at least one predefined gesture type.

[0020] Optionally, the method further includes:

[0021] If the multiple key points include a key point with a probability value lower than a preset value, the pose matrix is concatenated with the Gaussian heat distribution data of each of the multiple key points to obtain a concatenated feature matrix;

[0022] The concatenated feature matrix is input into the classifier so that the classifier classifies the gesture image into at least one predefined gesture type based on the concatenated feature matrix.

[0023] Optionally, the method further includes:

[0024] If the probability values of the multiple key points are all greater than or equal to a preset value, then the Gaussian heat distribution data of each key point is input into the classifier, so that the classifier classifies the gesture image into at least one predefined gesture type based on the Gaussian heat distribution data of each key point.
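The two optional branches in [0020] to [0024] amount to a routing rule on the key point probability values. The sketch below is only an illustration: the function name `select_classifier_input`, the threshold of 0.5, and the choice to concatenate along the channel axis are all assumptions, since the text specifies only "a preset value" and does not name the concatenation axis.

```python
def select_classifier_input(pose_matrix, heat_maps, probs, threshold=0.5):
    """Choose what is fed to the classifier.

    pose_matrix: list of channels (each channel is an H x W grid).
    heat_maps:   list of Gaussian heat maps, one per key point.
    probs:       probability value of each key point.
    threshold:   the "preset value" (0.5 is an assumed placeholder).
    """
    if any(p < threshold for p in probs):
        # At least one low-confidence (possibly occluded) key point:
        # add the pose matrix so the classifier can lean on pose cues.
        return pose_matrix + heat_maps  # concatenate along the channel axis
    # All key points are confident: the heat maps alone are used.
    return heat_maps


pose = [[[0.1]], [[0.2]]]   # two 1 x 1 pose channels
heat = [[[0.3]]]            # one 1 x 1 heat map
occluded_input = select_classifier_input(pose, heat, probs=[0.9, 0.2])
clear_input = select_classifier_input(pose, heat, probs=[0.9, 0.8])
```

Under this reading the two branches produce inputs with different channel counts; the text does not say how the classifier accommodates that.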

[0025] Optionally, the neural network model further includes a key point calculation unit; the method further includes:

[0026] Using the key point calculation unit, the coordinates and probability values of the multiple key points are re-determined based on the input of the classifier.

[0027] Optionally, the method further includes:

[0028] Based on the re-determined coordinates of the multiple key points, the target item operated on by the user's hand is determined;

[0029] Based on the gesture type, control instructions for the target item are generated.

[0030] In a second aspect, this application provides a model training method for training the neural network model provided in the first aspect of this application, the method comprising:

[0031] Acquire multiple gesture image samples and their gesture labels;

[0032] Based on the multiple gesture image samples and their gesture labels, a neural network model is trained until the training termination condition is met.

[0033] Optionally, the neural network model includes a backbone network, intermediate layers, and a detection head; the backbone network includes a feature extraction module, the intermediate layers include a semantic feature representation module and a key point feature representation module, and the detection head includes a classifier; the key point feature representation module includes a second transformation unit and a second key point calculation unit, the initial features are transformed by the second transformation unit and then input into the second key point calculation unit to obtain the coordinates and probability values of multiple key points in the gesture image;

[0034] During training, the detection head also includes a first keypoint calculation unit, which is used to recalculate the coordinates and probability values of the plurality of keypoints based on the input of the classifier;

[0035] The loss function of the neural network model includes a loss term for the gesture label and the gesture type predicted by the neural network model, and a loss term for the coordinates and probability values of the plurality of key points output by the first key point calculation unit and the coordinates and probability values of the plurality of key points output by the second key point calculation unit.
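The two-term loss in [0035] can be sketched as follows. The text fixes only the structure (a gesture classification term plus a term comparing the outputs of the first and second key point calculation units); the use of cross-entropy, mean squared error, and the `weight` factor are assumptions made for illustration.

```python
import math


def classification_loss(class_probs, label_index, eps=1e-12):
    """Cross-entropy between predicted class probabilities and the label (assumed form)."""
    return -math.log(max(class_probs[label_index], eps))


def keypoint_consistency_loss(kps_head, kps_neck):
    """Mean squared difference between two lists of (x, y, prob) tuples (assumed form).

    kps_head: output of the first key point calculation unit (detection head).
    kps_neck: output of the second key point calculation unit (intermediate layer).
    """
    if len(kps_head) != len(kps_neck) or not kps_head:
        raise ValueError("expected two non-empty key point lists of equal length")
    diffs = [(a - b) ** 2
             for ka, kb in zip(kps_head, kps_neck)
             for a, b in zip(ka, kb)]
    return sum(diffs) / len(diffs)


def total_loss(class_probs, label_index, kps_head, kps_neck, weight=1.0):
    """Sum of the two loss terms; `weight` is a hypothetical balancing factor."""
    return (classification_loss(class_probs, label_index)
            + weight * keypoint_consistency_loss(kps_head, kps_neck))


kps = [(1.0, 2.0, 0.9)]
loss_same = total_loss([0.5, 0.5], 0, kps, kps)
loss_diff = total_loss([0.5, 0.5], 0, [(2.0, 2.0, 0.9)], kps)
```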

[0036] In a third aspect, this application provides a gesture recognition device based on a neural network model, wherein the neural network model includes a feature extraction module, a semantic feature representation module, a key point feature representation module, and a classifier, and the device includes:

[0037] An initial feature extraction module is used to extract features from the gesture image to be recognized, thereby obtaining initial features from the gesture image.

[0038] The pose semantic acquisition module is used to obtain pose semantic features based on the initial features using the semantic feature representation module;

[0039] The key point extraction module is used to obtain the coordinates and probability values of multiple key points in the gesture image based on the initial features using the key point feature representation module, and to determine the Gaussian distribution data of each key point based on the coordinates and probability of each key point.

[0040] The gesture recognition module is used to classify the gesture image into at least one predefined gesture type using the classifier based on the fusion result of the pose semantic features and the Gaussian distribution data of the multiple key points.

[0041] In a fourth aspect, this application provides a model training apparatus for training the neural network model provided in the first aspect of this application, the model training apparatus comprising:

[0042] The sample acquisition module is used to acquire multiple gesture image samples and their gesture labels;

[0043] The model training module is used to train a neural network model based on the multiple gesture image samples and their gesture labels until the training termination condition is met.

[0044] In a fifth aspect, this application provides an electronic device, including: a processor and a memory, wherein code is stored in the memory, and the processor executes the code stored in the memory to perform the method provided in the first or second aspect of this application.

[0045] In a sixth aspect, this application provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the method provided in the first or second aspect of this application.

[0046] In a seventh aspect, this application provides a computer program product, including a computer program that, when executed by a processor, implements the method provided in the first or second aspect of this application.

[0047] The gesture recognition method, model training method, and apparatus based on a neural network model provided in this application have a feature extraction module with two branches: a semantic feature representation module and a key point feature representation module. These modules are used to extract the semantic features of the pose and the features of the key points, respectively, to obtain pose semantic features that represent the pose semantics and Gaussian distribution data that describes the spatial distribution of the key points. The classifier in the model head uses the result of fusing the pose semantic features and the Gaussian distribution data of the key points to predict the gesture type. This fully considers the features of both key points and pose, improving the accuracy of gesture type prediction. It can accurately identify the gesture type through pose features even when some key points in the image are occluded, and can accurately identify the gesture type through key point features even when the gesture features in the image are not obvious, thus improving the robustness of gesture recognition.

Description of the Drawings

[0048] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0049] Figure 1A is a schematic diagram of a gesture type recognition process provided in related technologies;

[0050] Figure 1B is a schematic diagram of another gesture type recognition process provided in related technologies;

[0051] Figure 2 is a schematic diagram of the structure of a neural network model provided in an embodiment of this application;

[0052] Figure 3 is a flowchart of a gesture recognition method based on a neural network model provided in an embodiment of this application;

[0053] Figure 4 is a schematic diagram of another neural network model provided in an embodiment of this application;

[0054] Figure 5 is a schematic diagram of the structure of another neural network model provided in an embodiment of this application;

[0055] Figure 6 is a flowchart of another gesture recognition method based on a neural network model provided in an embodiment of this application;

[0056] Figure 7 is a schematic diagram of a human-computer interaction process provided in an embodiment of this application;

[0057] Figure 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

[0058] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.

Detailed Implementation

[0059] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0060] Natural User Interface (NUI) allows users to interact with computers through voice, gestures, eye movements, and other means, enabling users to interact in a more natural way, reducing learning costs, and increasing immersion. Gesture recognition is an important category within NUI.

[0061] In the field of human-computer interaction, gestures are defined as a collection of information with specific meanings formed by different combinations of the palms and fingers.

[0062] Gesture recognition aims to use mathematical algorithms, such as visual recognition algorithms, to identify the specific type of a user's gesture, such as a fist, a pinch, or a thumbs-up (like). Gestures are diverse and complex, and the same gesture often appears differently for different users. Gesture recognition typically relies on models with strong learning capabilities trained on a large number of training samples.

[0063] Figure 1A is a schematic diagram of a gesture type recognition process provided in related technologies. As shown in Figure 1A, in this gesture recognition scheme, a convolutional neural network layer is typically used to extract features from the input gesture image, and then a detection head for classification converts the extracted features into category information to obtain the gesture type, thereby achieving gesture recognition.

[0064] This approach relies on the semantic features of gestures in images, requiring high image clarity and making it unsuitable for scenarios where gesture semantic features are not obvious, key features are occluded, or the image is blurry. For example, the pinch gesture cannot be recognized when the fingertip is obscured.

[0065] Figure 1B is a schematic diagram of another gesture type recognition process provided in related technologies. As shown in Figure 1B, in this gesture recognition scheme, the coordinates of multiple key points of the input gesture image are extracted by the key point detection module, and the gesture type is obtained by the post-processing module based on the spatial relationship of the detected key points.

[0066] Gesture recognition methods based on key point detection suffer from several drawbacks. When the calculation error of a certain key point is large, the error will accumulate in the calculation of other key points, affecting the accuracy of gesture recognition. This method also fails to accurately identify the gesture type when the key point is occluded.

[0067] In summary, the accuracy of the gesture recognition strategies provided by the two aforementioned schemes needs improvement and they cannot be applied to complex scenarios, such as those with occlusion, blurred images, or unclear gesture features.

[0068] Based on this, this application provides a gesture recognition method based on a neural network model. The neural network model includes a feature extraction module, a semantic feature representation module, a key point feature representation module, and a classifier. The semantic feature representation module and the key point feature representation module are used to extract pose semantic features and key point features, respectively. The key point features are specifically Gaussian distribution data determined based on the coordinates and probabilities of the key points. The classifier is used to identify the gesture type based on the fused features obtained after fusing pose semantic features and key point features. By comprehensively considering both pose and key point features for gesture recognition, this method utilizes features from both semantic and spatial dimensions, improving the accuracy of gesture recognition. Even when some key points in the image are occluded, the gesture type can still be accurately identified through pose features. Furthermore, even when gesture features are not obvious in the image, the gesture type can still be accurately identified through key point features, thus improving the robustness of gesture recognition.

[0069] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems are described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will now be described with reference to the accompanying drawings.

[0070] Figure 2 is a schematic diagram of the structure of a neural network model provided in an embodiment of this application. As shown in Figure 2, the neural network model includes a feature extraction module, a semantic feature representation module, a key point feature representation module, and a classifier.

[0071] In some embodiments, this neural network model may also be referred to as a gesture recognition model.

[0072] The system comprises the following modules: a feature extraction module for extracting features from the input gesture image to obtain initial features M0; initial features M0 are then processed by a semantic feature representation module and a keypoint feature representation module; the semantic feature representation module is used for pose estimation, and the keypoint feature representation module is used for keypoint calculation; the semantic feature representation module converts the initial features into pose semantic features MA; the keypoint feature representation module obtains the coordinates and probability values of multiple keypoints in the gesture image based on the initial features, and determines the Gaussian distribution data MG of the keypoints based on the coordinates and probability values; and a classifier identifies the gesture type of the gesture image based on the fusion result of the Gaussian distribution data MG of multiple keypoints and the pose semantic features MA, that is, classifies the gesture image into at least one predefined gesture type.

[0073] The Gaussian distribution data MG for key points is the data corresponding to a Gaussian distribution plot with the coordinates of the key points as the mean and the probability of the key points as the variance.
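As a minimal sketch of the definition in [0073], the function below renders one key point as a two-dimensional Gaussian map, taking the text literally: the coordinates are the mean and the probability value is the variance. The function name `gaussian_map`, the isotropic unnormalized form (peak value 1), and the `eps` guard against zero variance are assumptions that the text does not specify.

```python
import math


def gaussian_map(x, y, prob, height, width, eps=1e-6):
    """Render a key point as a height x width Gaussian map.

    Mean = (x, y) in pixel coordinates (x is the column, y is the row).
    Variance = prob, following the text literally; eps avoids division by zero.
    """
    var = max(prob, eps)
    return [
        [math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2.0 * var))
         for col in range(width)]
        for row in range(height)
    ]


# One key point at column 2, row 1 with probability 0.9 on a 4 x 5 grid.
mg = gaussian_map(x=2, y=1, prob=0.9, height=4, width=5)
```

The map peaks at the predicted coordinates and decays with squared distance from them.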

[0074] Figure 3 is a flowchart of a gesture recognition method based on a neural network model provided in an embodiment of this application. The method can be executed by an electronic device with corresponding data processing capabilities, such as a gesture recognition chip or an XR (Extended Reality) device, where XR devices include VR (Virtual Reality) devices, AR (Augmented Reality) devices, and MR (Mixed Reality) devices. The gesture recognition method is based on the neural network model provided in the embodiment shown in Figure 2, specifically on the pre-trained neural network model.

[0075] As shown in Figure 3, the gesture recognition method based on a neural network model includes the following steps:

[0076] Step S301: Use the feature extraction module to extract features from the gesture image to be recognized to obtain initial features.

[0077] The gesture image to be recognized is an image containing the user's hand, which can be an image obtained by segmenting the hand from the original image collected during human-computer interaction.

[0078] The feature extraction module is the backbone of the neural network model, and can be any type of module used to extract image features, such as a convolutional neural network.

[0079] Specifically, the feature extraction module is used to downsample the gesture image, for example, by 32 times, to obtain initial features with high semantic value and small size.

[0080] The initial features can include four parameters: B (Batch), C (Channel), H (Height), and W (Width), which can be represented as M0(B,C,H,W).
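The shape bookkeeping in [0079] and [0080] can be checked with a few lines of arithmetic. The helper name `initial_feature_shape`, the 224 x 224 input size, and the 256 output channels are illustrative assumptions; only the 32-fold downsampling and the (B, C, H, W) layout come from the text.

```python
def initial_feature_shape(batch, channels, image_h, image_w, stride=32):
    """Return the (B, C, H, W) shape of M0 for a backbone with the given total stride."""
    if image_h % stride or image_w % stride:
        raise ValueError("image size should be a multiple of the stride in this sketch")
    return (batch, channels, image_h // stride, image_w // stride)


# A single 224 x 224 hand crop with an assumed 256 output channels.
m0_shape = initial_feature_shape(batch=1, channels=256, image_h=224, image_w=224)
```

With these assumed values, M0 has shape (1, 256, 7, 7): a small spatial grid with many channels, which matches the "high semantic value and small size" description.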

[0081] Step S302: Using the semantic feature representation module, pose semantic features are obtained based on the initial features.

[0082] Step S303: Using the key point feature representation module, based on the initial features, obtain the coordinates and probability values of multiple key points in the gesture image, and determine the Gaussian distribution data of the key points based on the coordinates and probabilities of each key point.

[0083] After the gesture image is processed by the feature extraction module to obtain initial features, these initial features are transmitted to the semantic feature representation module and the key point feature representation module, respectively.

[0084] The semantic feature representation module and the keypoint feature representation module constitute the neck or intermediate layer of the gesture recognition model. The semantic feature representation module, also known as the pose estimation branch, is used to convert initial features into pose semantic features. The keypoint feature representation module, also known as the keypoint calculation branch, is used to calculate keypoints based on the initial features, extracting the coordinates and probability values of multiple keypoints in the gesture image. Then, through a Gaussian distribution plotting step, it draws a Gaussian distribution map representing the spatial distribution of the keypoints based on the coordinates and probability values. The Gaussian distribution data is the data corresponding to this Gaussian distribution map.

[0085] The semantic feature representation module may include a first transformation unit and a feature processing unit. The first transformation unit is used to transform the initial features and map them to the space corresponding to the pose semantic features to obtain a pose matrix. The feature processing unit is used to obtain pose semantic features with richer dimensions based on the correlation between features at different positions in the pose matrix.

[0086] For example, the first transformation unit can be implemented based on depthwise separable convolution to perform minor targeted processing on the initial features. The feature processing unit can perform feature processing based on an attention mechanism to obtain pose semantic features.
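One reason a depthwise separable convolution suits "minor" processing is its small parameter count compared with a standard convolution. The comparison below uses the well-known counts (bias terms ignored); the 3 x 3 kernel and the 256 input and output channels are assumed values, not taken from the text.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


standard = standard_conv_params(256, 256, 3)
separable = depthwise_separable_params(256, 256, 3)
```

For these assumed sizes the standard layer has 589,824 weights and the separable one has 67,840, roughly a ninefold reduction.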

[0087] The keypoint feature representation module may include a keypoint calculation unit. The initial features are processed by this keypoint calculation unit, which outputs the coordinates and probability values of multiple keypoints. Then, using the coordinates and probability values, a Gaussian distribution map of the keypoints is generated, thus obtaining the Gaussian distribution data of the keypoints.

[0088] The coordinates of keypoints can be two-dimensional or three-dimensional. The probability value of a keypoint describes the probability that the keypoint lies at the predicted coordinates; occluded keypoints have a relatively low probability value.

[0089] The total number of key points for the hand is 21. The key points output by the key point calculation unit are usually a subset of these 21 key points, for example 5, 7, or 10 of them.

[0090] When the coordinates of the key points are two-dimensional coordinates such as [x,y], each key point corresponds to a Gaussian distribution map. Therefore, the key point feature representation module can output a maximum of 21 Gaussian distribution maps corresponding to Gaussian distribution data.

[0091] When the coordinates of the key point are three-dimensional coordinates such as [x,y,z], each pair of the key point's coordinates (xy, xz, and yz) can be combined with the key point's probability value to generate a Gaussian distribution map for that pair, thus obtaining three Gaussian distribution maps per key point.
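The three-dimensional case can be sketched by rendering one map per coordinate pair. The helper names and the assumption that all three coordinates are expressed on the same square grid are illustrative; the text states only that the pairs xy, xz, and yz each yield a map. As in the text, the probability value is used as the variance.

```python
import math
from itertools import combinations


def gaussian_map(u, v, prob, size, eps=1e-6):
    """size x size Gaussian map with mean (u, v); u is the column, v is the row."""
    var = max(prob, eps)
    return [
        [math.exp(-((col - u) ** 2 + (row - v) ** 2) / (2.0 * var))
         for col in range(size)]
        for row in range(size)
    ]


def gaussian_maps_3d(point, prob, size):
    """Return {"xy": map, "xz": map, "yz": map} for a key point (x, y, z)."""
    axes = dict(zip("xyz", point))
    return {
        a + b: gaussian_map(axes[a], axes[b], prob, size)
        for a, b in combinations("xyz", 2)
    }


maps = gaussian_maps_3d(point=(1, 2, 3), prob=0.8, size=5)
```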

[0092] Step S304: Using a classifier, based on the fusion result of the gesture semantic features and the Gaussian distribution data of the multiple key points, classify the gesture image into at least one predefined gesture type.

[0093] The predefined gesture types include a variety of gesture types, such as pinch gesture, thumbs-up gesture, fist gesture, etc.

[0094] The classifier is the detection head of the neural network model, and it can adopt any structure, such as a fully connected layer followed by a softmax layer.
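A fully connected layer followed by softmax, the example structure named in [0094], can be written in a few lines. The weights, biases, and label names below are made-up placeholders for illustration; a trained model would supply its own.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, biases, labels):
    """Fully connected layer followed by softmax; returns (best label, probabilities)."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


label, probs = classify(
    features=[1.0, 0.0],                           # flattened fused features (toy size)
    weights=[[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]],  # one row per gesture type
    biases=[0.0, 0.0, 0.0],
    labels=["fist", "pinch", "like"],
)
```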

[0095] After obtaining the pose semantic features and Gaussian distribution data of each key point, the pose semantic features are fused with the Gaussian distribution data, such as by multiplication, superposition, or concatenation. The fusion result is then input into a classifier, and the gesture type of the gesture image input into the neural network model is obtained through the classifier's inference.
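The three fusion options named in [0095] differ mainly in what they do to the tensor shape. The sketch below operates on single-channel H x W grids; reading "superposition" as element-wise addition and "concatenation" as stacking along a channel axis are interpretations, not statements from the text.

```python
def fuse(pose, gauss, mode):
    """Fuse two H x W grids (lists of lists) of the same size."""
    if mode == "mul":      # element-wise multiplication
        return [[p * g for p, g in zip(pr, gr)] for pr, gr in zip(pose, gauss)]
    if mode == "add":      # superposition, read here as element-wise addition
        return [[p + g for p, g in zip(pr, gr)] for pr, gr in zip(pose, gauss)]
    if mode == "concat":   # stack as two channels, shape (2, H, W)
        return [pose, gauss]
    raise ValueError(f"unknown fusion mode: {mode}")


pose = [[1.0, 2.0], [3.0, 4.0]]
gauss = [[0.5, 0.0], [1.0, 0.25]]
fused_mul = fuse(pose, gauss, "mul")
fused_add = fuse(pose, gauss, "add")
fused_cat = fuse(pose, gauss, "concat")
```

Multiplication and addition keep the H x W shape, while concatenation doubles the channel count and so changes the classifier's input size.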

[0096] The gesture recognition method based on a neural network model provided in this embodiment has a feature extraction module connected to two branches: a semantic feature representation module and a key point feature representation module. These modules are used to extract the semantic features of the pose and the features of the key points, respectively, to obtain pose semantic features that characterize the pose semantics and Gaussian distribution data that describes the spatial distribution of the key points. The classifier in the model head uses the result of fusing the pose semantic features and the Gaussian distribution data of the key points to predict the gesture type. This fully considers the features of both key points and pose, improving the accuracy of gesture type prediction. Even when some key points in the image are occluded, the gesture type can still be accurately identified through pose features, and even when the gesture features are not obvious in the image, the gesture type can be accurately identified through key point features, thus improving the robustness of gesture recognition.

[0097] Optionally, the method further includes:

[0098] The Gaussian heat distribution data of each key point is obtained by multiplying the pose semantic features with the Gaussian distribution data of each key point.

[0099] Accordingly, a classifier is used to classify the gesture image into at least one predefined gesture type based on the fusion result of the pose semantic features and the Gaussian distribution data of the multiple key points, including:

[0100] Using the classifier, the gesture image is classified into at least one predefined gesture type based on the Gaussian heat distribution data of each key point.

[0101] With multiplication, the Gaussian distribution data and the pose semantic features are fused element-wise. Multiplication enhances features that are strong in both inputs and weakens features on which the two differ greatly, so recognition focuses on the common features and recognition accuracy improves.

[0102] Figure 4 is a schematic diagram of another neural network model provided in an embodiment of this application. With reference to Figure 2 and Figure 4, in this embodiment the semantic feature representation module includes a first transformation unit and a SAM (Spatial Attention Module), and the keypoint feature representation module includes a keypoint calculation unit. The first transformation unit makes minor adjustments to the input initial feature M0, mapping it to the corresponding feature space to obtain the pose matrix M1. The SAM is an example of a feature processing unit. It uses a spatial attention mechanism to calculate the similarity or correlation between each position in the pose matrix M1 and the other positions, which yields attention weights. The input pose matrix M1 is then weighted by these attention weights to obtain the pose semantic feature MA. In the keypoint feature representation module, the keypoint calculation unit calculates, from the input initial feature M0, the coordinates [x, y] of multiple keypoints in the gesture image and their probability values prob. From the coordinates and probability values, it generates a Gaussian distribution map for each keypoint, resulting in the Gaussian distribution data MG. During fusion, the Gaussian distribution data MG is multiplied element-wise (mul) with the pose semantic feature MA to obtain the Gaussian thermal distribution data MH. The Gaussian thermal distribution data MH is then fed into the classifier, which calculates probability values for multiple categories. After an activation layer such as a softmax layer, the category with the highest probability value is output as the gesture type of the gesture image.

[0103] To further improve recognition accuracy and avoid inaccurate recognition when the calculation error of key points is large, the Gaussian thermal distribution data MH can be concatenated with the pose matrix M1 before being input into the classifier.

[0104] In some embodiments, the detection head of the neural network model may include a keypoint calculation unit in addition to the classifier. This keypoint calculation unit receives the same input as the classifier and is used to recalculate the coordinates and probabilities of multiple keypoints in the gesture image. The output of the neural network model may include, in addition to the gesture type output by the classifier, the coordinates of multiple keypoints output by the keypoint calculation unit, or the probability values of the coordinates of multiple keypoints output by the keypoint calculation unit.

[0105] Figure 5 is a schematic diagram of another neural network model provided in an embodiment of this application. With reference to Figure 4 and Figure 5, in this embodiment the key point feature representation module includes a second transformation unit and a second key point calculation unit, and the detection head includes a first key point calculation unit in addition to the classifier.

[0106] The initial feature M0 is transformed by the second transformation unit and then input into the second keypoint calculation unit to obtain the coordinates [x, y] and probability values (prob) of multiple keypoints. Based on the coordinates and probability values, a Gaussian distribution map of each keypoint is generated, resulting in Gaussian distribution data MG. The Gaussian distribution data MG is then multiplied element-wise (mul) with the pose semantic feature MA to obtain Gaussian thermal distribution data MH. The Gaussian thermal distribution data MH is concatenated with the pose matrix M1 and then input into the classifier and into the first keypoint calculation unit in the detection head. In Figure 5, the "+" symbol represents concatenation. The first keypoint calculation unit recalculates the keypoints from the concatenated feature matrix and outputs their new coordinates [X, Y]. It can also output the probability value Prob of the new coordinates.

[0107] The first and second keypoint calculation units can have the same structure, both being deep learning models such as OpenPose or HandNet. Their input dimensions differ: the input to the second keypoint calculation unit is the initial features after the second transformation unit, while the input to the first keypoint calculation unit is the matrix obtained by concatenating the Gaussian thermal distribution data MH with the pose matrix M1.

[0108] The second transformation unit can have the same structure as the first transformation unit. For example, the second transformation unit can be a depthwise separable convolution.

[0109] Because a key point calculation unit, namely the first key point calculation unit, is added to the detection head, the neural network model can output the coordinates of key points extracted from the gesture image in addition to the gesture type.

[0110] During the training phase, the parameters of the first keypoint calculation unit can be adjusted by the deviation between the outputs of the two designed keypoint calculation units.

[0111] Figure 6 is a flowchart of another gesture recognition method based on a neural network model provided in an embodiment of this application. Building on the embodiment shown in Figure 3, this embodiment further refines steps S302 and S303 and adds steps for recalculating key points.

[0112] The gesture recognition method provided in this embodiment can be implemented with the neural network model provided in the embodiment shown in Figure 5. As shown in Figure 6, the gesture recognition method provided in this embodiment may specifically include the following steps:

[0113] Step S601: Use the feature extraction module to extract features from the gesture image to be recognized to obtain initial features.

[0114] After segmenting the acquired raw image to obtain the gesture image, the gesture image is input into the feature extraction module of the neural network model. The feature extraction module extracts features from the gesture image to obtain the initial features.

[0115] To reduce complexity, the feature extraction module can use a lightweight backbone network, such as ResNet18, MobileNetV2, or ShuffleNetV2.

[0116] Step S602: Using the first transformation unit, the initial features are converted into a pose matrix.

[0117] The first transformation unit is used to map the initial features to the space where the pose semantic features are located, and can be implemented through a convolutional layer.

[0118] The first transformation unit can consist of depthwise separable convolution (DSC) layers, for example three such layers. A depthwise separable convolution can freely change the number of output channels and fuses features across channels, while remaining computationally efficient.
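A depthwise separable convolution splits a convolution into a per-channel spatial step and a 1x1 channel-mixing step. The following is a minimal pure-Python sketch of one such layer, not the layer used in this application: the function names, the "valid" padding, and the toy weights are assumptions chosen for illustration. It shows how the pointwise step changes the channel count (here from 2 to 3) and fuses channels.

```python
def depthwise_conv(x, kernels):
    """x: [C][H][W], kernels: [C][k][k]. One kernel per channel, 'valid' padding.

    As in deep learning frameworks, this is a cross-correlation (no kernel flip).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        height = len(channel) - k + 1
        width = len(channel[0]) - k + 1
        out.append([[sum(channel[i + a][j + b] * kernel[a][b]
                         for a in range(k) for b in range(k))
                     for j in range(width)]
                    for i in range(height)])
    return out


def pointwise_conv(x, weights):
    """x: [C_in][H][W], weights: [C_out][C_in]. 1x1 convolution that mixes channels."""
    height, width = len(x[0]), len(x[0][0])
    return [[[sum(w * x[c][i][j] for c, w in enumerate(w_row))
              for j in range(width)]
             for i in range(height)]
            for w_row in weights]


def depthwise_separable_conv(x, kernels, weights):
    """Depthwise step followed by pointwise step."""
    return pointwise_conv(depthwise_conv(x, kernels), weights)


# Toy input: 2 channels of size 3x3, filled with 1s and 2s respectively
x = [[[1] * 3 for _ in range(3)], [[2] * 3 for _ in range(3)]]
kernels = [[[1] * 3 for _ in range(3)] for _ in range(2)]   # two 3x3 kernels of ones
weights = [[1, 0], [0, 1], [1, 1]]                          # 2 input -> 3 output channels
y = depthwise_separable_conv(x, kernels, weights)
```

The third output channel is the sum of the two depthwise results, which is the channel fusion the text refers to.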

[0119] Step S603: Using the feature processing unit, determine the pose semantic features based on the correlation between features at different positions in the pose matrix.

[0120] The feature processing unit can be SAM, which calculates the correlation between each position in the pose matrix and other positions through a spatial attention mechanism to obtain attention weights. The input pose matrix is then weighted by the attention weights to obtain pose semantic features.
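The weighting described above can be sketched as follows. This is only an illustration under stated assumptions: the pose matrix is treated as N positions with C channel values each, the correlation is taken to be a dot product, and the attention weights are normalized with a softmax. None of these choices is specified in the text, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(pose_matrix):
    """pose_matrix: list of N positions, each a list of C channel values.

    For every position, correlate it with all positions (dot product, an
    assumption), turn the correlations into attention weights, and return
    the attention-weighted combination of the input positions.
    """
    output = []
    for query in pose_matrix:
        scores = [sum(a * b for a, b in zip(query, key)) for key in pose_matrix]
        weights = softmax(scores)
        output.append([sum(w * key[c] for w, key in zip(weights, pose_matrix))
                       for c in range(len(query))])
    return output


m1 = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ma = spatial_attention(m1)   # same shape as m1
```

Positions that correlate strongly with one another pull each other's features together, while weakly correlated positions contribute less to the weighted result.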

[0121] The feature processing unit can also obtain pose semantic features by transposing the pose matrix, then pooling it through an average pooling layer (avgpool) and transposing it again.

[0122] Step S604: Using the key point feature representation module, based on the initial features, obtain the coordinates and probability values of multiple key points in the gesture image, and determine the Gaussian distribution data of the key points based on the coordinates and probabilities of each key point.

[0123] The key point feature representation module may include a second transformation unit and a key point calculation unit, namely the second key point calculation unit. The second transformation unit may have the same structure as the first transformation unit. For example, the second transformation unit may be a depthwise separable convolution.

[0124] The initial features, after passing through the second transformation unit, are input into the second keypoint calculation unit to obtain the coordinates of multiple keypoints and their probability values at those coordinates. Then, using the keypoint coordinates as the mean and the probability values as the variance, a Gaussian distribution map of the keypoints is plotted, yielding the Gaussian distribution data of the keypoints. The Gaussian distribution map can be represented as:

[0125] (The formula is not reproduced in this text.)

[0126] Where u and v are the coordinates of the keypoints, and prob is the probability value of the keypoint.

[0127] The higher the probability value, the narrower the Gaussian distribution map, and the more concentrated the key points are in the image. This results in higher confidence of the key points and a greater contribution to gesture recognition. Conversely, the lower the probability value, the wider the Gaussian distribution map, and the more widely the key points are distributed in the image. This results in lower confidence of the key points and a smaller contribution to gesture recognition.
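The behavior described in the preceding paragraph can be illustrated with a minimal sketch. Because the formula is not available here, the sketch assumes an isotropic two-dimensional Gaussian centered on the key point (u, v) whose standard deviation is inversely proportional to prob, so that a higher probability value gives a narrower map. The relation `sigma = base_sigma / prob`, the parameter `base_sigma`, and the function name are assumptions, not the formula of this application.

```python
import math


def gaussian_map(u, v, prob, height, width, base_sigma=2.0):
    """Return a height x width map peaked at column u, row v.

    Assumption: sigma = base_sigma / prob, so confident key points
    (high prob) produce narrow peaks and uncertain ones produce wide peaks.
    """
    sigma = base_sigma / max(prob, 1e-6)   # guard against division by zero
    return [[math.exp(-((x - u) ** 2 + (y - v) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


confident = gaussian_map(u=4, v=4, prob=0.9, height=9, width=9)
uncertain = gaussian_map(u=4, v=4, prob=0.3, height=9, width=9)
```

Both maps peak at the key point, but two pixels away from it the confident map has already fallen off more than the uncertain one.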

[0128] Step S605: Calculate the product of the pose semantic features and the Gaussian distribution data of each key point to obtain the Gaussian thermal distribution data of each key point.

[0129] After obtaining the Gaussian distribution data of the pose semantic features and the key points through the two branches of the semantic feature representation module and the key point feature representation module, the features output by the two branches need to be fused. One optional fusion method is multiplication, which is to multiply the pose semantic features and the Gaussian distribution data of the key points element by element to obtain the Gaussian thermal distribution data.

[0130] Step S606: Using a classifier, based on the concatenation result of the pose matrix and the Gaussian thermal distribution data of the multiple key points, classify the gesture image into at least one predefined gesture type.

[0131] Before inputting the Gaussian thermal distribution data into the classifier, the Gaussian thermal distribution data of multiple key points can be concatenated with the pose matrix output by the first transformation unit of the semantic feature representation module to obtain a concatenated feature matrix. The concatenated feature matrix is then input into the classifier for gesture type recognition. By concatenating the pose features, the weight of pose-related features during gesture recognition is increased, avoiding the inability to accurately identify gesture types when the calculation error at key points is large.
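Step S606 can be sketched as concatenation followed by a classifier. The sketch below assumes a single linear layer with a softmax, which the text does not prescribe. The weights, biases, and gesture labels are made-up illustrative values, and the function names are hypothetical.

```python
import math


def flatten(matrix):
    """Flatten a list of rows into a single feature list."""
    return [value for row in matrix for value in row]


def classify(heat_data, pose_matrix, weights, biases, labels):
    """Concatenate MH with M1, apply a linear layer and softmax, pick the best label."""
    features = flatten(heat_data) + flatten(pose_matrix)   # concatenation
    logits = [sum(w * f for w, f in zip(w_row, features)) + b
              for w_row, b in zip(weights, biases)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


heat = [[0.9, 0.1]]                 # toy Gaussian thermal distribution data MH
pose = [[0.2, 0.4]]                 # toy pose matrix M1
weights = [[1.0, 0.0, 0.0, 0.0],    # hypothetical weights, one row per gesture type
           [0.0, 1.0, 0.0, 0.0]]
biases = [0.0, 0.0]
label, probs = classify(heat, pose, weights, biases, ["pinch", "fist"])
```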

[0132] Step S607: Using the key point calculation unit, based on the input of the classifier, redetermine the coordinates and probability values of the multiple key points.

[0133] The key point calculation unit, for example the first key point calculation unit in Figure 5, is located together with the classifier in the detection head of the neural network model. Its input is the same as that of the classifier: a matrix obtained by concatenating the Gaussian thermal distribution data of multiple key points with the pose matrix. Using the fused features, the key point calculation unit recalculates the coordinates and probability values of multiple key points in the gesture image.

[0134] In the application phase, in addition to determining the user's operation intention based on the gesture type output by the neural network model, the coordinates and probability values of multiple key points output by the neural network model can also be used to determine the user's operation intention, thereby improving the accuracy of user operation intention recognition and thus improving the accuracy of device response.

[0135] In some applications, hand simulation images can be generated based on the gesture type output by the neural network model, as well as the coordinates and probability values of multiple key points. These hand simulation images can be used as training samples for other gesture recognition models to increase the number of training samples, or they can be used as a basis for evaluating gesture recognition models.

[0136] Furthermore, based on the gesture type output by the neural network model for the gesture image, as well as the coordinates and probability values of multiple key points, the user's operation intention corresponding to the gesture image can be determined. This operation intention can include the operation object and operation type; based on the user's operation intention, an operation response can be performed.

[0137] Furthermore, the gesture recognition method also includes:

[0138] Based on the redetermined coordinates of the multiple key points, the target item for the user's hand operation is determined; based on the gesture type, control instructions for the target item are generated.

[0139] The coordinates of the key points are three-dimensional coordinates. Specifically, multiple gesture images captured by a multi-view camera are input into the neural network model; the initial features of these gesture images, after being processed by the second transformation unit, are combined with the intrinsic and extrinsic parameters of the multi-view camera and input into the second key point calculation unit. The second key point calculation unit determines the three-dimensional coordinates of multiple key points. Furthermore, through subsequent steps, the coordinates of the key points output by the first key point calculation unit are also three-dimensional coordinates.

[0140] Based on the three-dimensional coordinates of multiple key points output by the first key point calculation unit, the position of the user's hand in three-dimensional space can be determined. This allows for the identification of objects in that three-dimensional space that overlap with the user's hand, which are then designated as the target object for the user's hand operation, such as an object being held by the user. The three-dimensional space can be a virtual space or a space that combines virtual and real-world elements.

[0141] Specifically, the target item being manipulated by the user's hand can be determined based on the coordinates of n key points with high probability values output by the neural network model. Here, n is a positive integer, such as 1, 3, 5, or other values.

[0142] For example, the coordinates of multiple key points can be sorted in descending order of probability value, and the target item for the user's hand operation can be determined based on the first n coordinates in the sorting result.
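The selection described above can be sketched as follows. The sketch assumes that each scene object is represented by an axis-aligned bounding box and that an object counts as overlapping the hand when one of the n most confident key points lies inside its box. The data layout, names, and sample values are hypothetical.

```python
def top_n_keypoints(keypoints, n):
    """keypoints: list of (name, (x, y, z), prob). Return the n most confident."""
    return sorted(keypoints, key=lambda kp: kp[2], reverse=True)[:n]


def find_target(keypoints, objects, n=3):
    """objects: {name: ((xmin, ymin, zmin), (xmax, ymax, zmax))}.

    Return the first object whose box contains one of the top-n key points,
    checking key points in descending order of probability value.
    """
    for _, coords, _ in top_n_keypoints(keypoints, n):
        for name, (low, high) in objects.items():
            if all(lo <= c <= hi for c, lo, hi in zip(coords, low, high)):
                return name
    return None


keypoints = [
    ("wrist", (0.00, 0.00, 0.60), 0.40),
    ("thumb_tip", (0.10, 0.20, 0.50), 0.95),
    ("index_tip", (0.12, 0.22, 0.50), 0.90),
]
objects = {"wooden_block": ((0.05, 0.15, 0.45), (0.15, 0.25, 0.55))}
best_two = [kp[0] for kp in top_n_keypoints(keypoints, 2)]
target = find_target(keypoints, objects, n=2)
```

Restricting the test to the most confident key points keeps a poorly localized key point, such as the wrist in this toy data, from selecting the wrong object.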

[0143] A pre-established mapping between gesture type, item, and control command can be created. After determining the user's gesture type and the target item being manipulated, the control command for the target item can be determined by looking up the mapping, and the system can respond based on the control command, such as adjusting the position or state of the target item.
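A lookup of this kind can be sketched with a dictionary keyed by gesture type and item. The entries and command names below are invented for illustration and are not taken from this application.

```python
# Hypothetical mapping: (gesture type, item) -> control command
COMMAND_TABLE = {
    ("pinch", "wooden_block"): "grab",
    ("open_palm", "wooden_block"): "release",
}


def control_command(gesture_type, target_item):
    """Return the control command for the pair, or None if no mapping exists."""
    return COMMAND_TABLE.get((gesture_type, target_item))


command = control_command("pinch", "wooden_block")
unknown = control_command("fist", "wooden_block")
```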

[0144] For example, Figure 7 is a schematic diagram of a human-computer interaction process provided in an embodiment of this application. Figure 7 takes as an example a scenario in which a user interacts with a head-mounted device through gestures. As shown in Figure 7, during the interaction the head-mounted device's camera captures images of the user's hands to obtain a raw image. Hand recognition and segmentation are then performed on the raw image to obtain a gesture image. This gesture image is input into a gesture recognition model deployed in the gesture recognition chip within the head-mounted device. The gesture recognition model outputs the gesture type (e.g., pinch) and the coordinates of multiple key points extracted from the gesture image, such as the coordinates of the thumb tip (x1, y1, z1) and the index finger tip (x2, y2, z2) in Figure 7. Based on the gesture type output by the gesture recognition model and the coordinates of the multiple key points, the gesture response module in the head-mounted device determines which object in the current scene the user intends to interact with, such as the wooden block in Figure 7. If the user's gesture type is "pinch," the user's intention is determined to be grabbing the wooden block, so the wooden block is moved from its original position into a state in which it is held by the user, as shown in the middle-right part of Figure 7. The gesture recognition model in Figure 7 can be the neural network model provided in the embodiment shown in Figure 5.

[0145] By incorporating the coordinates of key points, the object that the user wants to operate can be accurately identified, improving the accuracy of recognizing the user's operation intention and reducing the probability of false responses.

[0146] In this embodiment, the Gaussian distribution data and the pose semantic features are fused and the result is then concatenated with the pose matrix, which increases the weight of pose-dimension features in the classifier's prediction. This allows the gesture type in an image to be identified accurately even when key point calculation errors are large, further improving the accuracy and robustness of gesture type recognition. Adding a key point calculation unit to the detection head gives the neural network model the ability to output more accurate key point coordinates. Providing data in more dimensions gives the application stage a fuller data foundation and widens the range of applications.

[0147] In some embodiments, after obtaining Gaussian thermal distribution data, the gesture recognition method further includes:

[0148] If the plurality of key points includes key points with probability values lower than a preset value, then the pose matrix is concatenated with the Gaussian heat distribution data of each key point to obtain a concatenated feature matrix. The concatenated feature matrix is then input into the classifier so that the classifier classifies the gesture image into at least one predefined gesture type based on the concatenated feature matrix. If the probability values of the plurality of key points are all greater than or equal to the preset value, then the Gaussian heat distribution data of each key point is input into the classifier so that the classifier classifies the gesture image into at least one predefined gesture type based on the Gaussian heat distribution data of each key point.

[0149] The preset value can be a default value or a configurable parameter that can be set according to requirements.

[0150] When the probability values of the key points obtained through the key point feature representation module (for example, by the second key point calculation unit), that is, the probability values of the key points at the calculated coordinates, are all low (below the preset value), this indicates that the key point calculation error is large and the key point dimension features are unreliable. In the feature fusion stage, the result of fusing the Gaussian distribution data of the key points with the pose semantic features (such as the aforementioned Gaussian thermal distribution data) needs to be concatenated with the pose matrix output by the intermediate layer of the semantic feature representation module. The concatenated result is then input into the detection head, for example into the classifier, or into both the classifier and the first key point calculation unit. This increases the weight of the pose dimension features when the key point calculation error is large, thereby improving the model's recognition accuracy.

[0151] When the probability values of the key points obtained through the key point feature representation module (for example, by the second key point calculation unit) are all high (i.e., not lower than the preset value), this indicates that the key point calculation error is small and the key point dimension features are reliable. In the feature fusion stage, it is then not necessary to concatenate the pose matrix output by the intermediate layer of the semantic feature representation module. Instead, the result of fusing the Gaussian distribution data of the key points with the pose semantic features, such as the aforementioned Gaussian thermal distribution data, can be input directly into the detection head. This reduces the computational load while still ensuring accurate recognition results.

[0152] Whether the Gaussian thermal distribution data is concatenated with the pose matrix can be controlled by a gate unit. This gate unit controls whether the pose matrix participates in the subsequent concatenation operation.
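The gating rule of this embodiment can be sketched as a simple conditional. The sketch uses flat feature lists and a hypothetical default preset value of 0.5. It follows the condition stated for this embodiment, namely that the pose matrix is concatenated if any key point has a probability value below the preset value.

```python
def gate_classifier_input(heat_data, pose_matrix, probs, preset=0.5):
    """Return the classifier input selected by the gate.

    heat_data and pose_matrix are flat feature lists; probs holds the
    probability value of each key point. preset is a hypothetical default.
    """
    if any(p < preset for p in probs):
        # At least one unreliable key point: let the pose matrix participate
        return heat_data + pose_matrix
    # All key points reliable: skip concatenation to reduce computation
    return heat_data


mh = [0.7, 0.1, 0.3]
m1 = [0.5, 0.2]
with_pose = gate_classifier_input(mh, m1, probs=[0.9, 0.3])
without_pose = gate_classifier_input(mh, m1, probs=[0.9, 0.8])
```

Because the two branches yield inputs of different length, a real classifier would need to accept both sizes, for example by using separate input layers. That detail is outside this sketch.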

[0153] This application embodiment also provides a model training method for training the neural network model provided in the foregoing embodiments. The model training method includes:

[0154] Acquire multiple gesture image samples and their gesture labels; train a neural network model based on the multiple gesture image samples and their gesture labels until the training termination condition is met.

[0155] Gesture image samples can be images of users' hands collected during historical human-computer interactions, or segmented results of historically collected gesture images. These multiple gesture image samples should cover gesture images under various gestures to improve the generalization ability of the trained model.

[0156] The gesture labels of gesture image samples are the ground truth values of the gesture image samples, which can be manually annotated.

[0157] The gesture image samples and their gesture labels can be divided into multiple batches. After each batch of gesture image samples is input into the neural network model, the model predicts the gesture type of each input sample, and a loss value is calculated from the predicted gesture types and the corresponding gesture labels. The parameters of the neural network model are adjusted through backpropagation of the loss value until the calculated loss value falls below a set threshold, or until the number of training rounds, the training time, or a similar quantity reaches its upper limit. At that point the training termination condition is met and the trained neural network model is output.
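The batch loop and its two termination conditions can be sketched as below. The function `train`, its default threshold and epoch limit, and the toy one-parameter "model" are hypothetical stand-ins for a real network and its backpropagation step.

```python
def train(batches, train_step, loss_threshold=0.05, max_epochs=100):
    """Run train_step on every batch until the mean loss of an epoch drops
    below loss_threshold or max_epochs is reached."""
    epoch, mean_loss = 0, float("inf")
    for epoch in range(1, max_epochs + 1):
        losses = [train_step(batch) for batch in batches]
        mean_loss = sum(losses) / len(losses)
        if mean_loss < loss_threshold:
            break   # termination condition: loss below the set threshold
    return epoch, mean_loss


# Toy stand-in for a model: a single parameter fitted to each batch's mean
state = {"w": 0.0}


def toy_step(batch, lr=0.1):
    target = sum(batch) / len(batch)
    error = state["w"] - target
    state["w"] -= lr * 2.0 * error   # gradient step on the loss error**2
    return error ** 2


epochs, final_loss = train([[3.0, 3.0], [3.0]], toy_step)
```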

[0158] Optionally, the neural network model includes a backbone network, an intermediate (neck) layer, and a detection head; the backbone network includes the feature extraction module, the intermediate layer includes the semantic feature representation module and the keypoint feature representation module, and the detection head includes the classifier; the keypoint feature representation module includes a second transformation unit and a second keypoint calculation unit, and the initial features are transformed by the second transformation unit and then input to the second keypoint calculation unit to obtain the coordinates and probability values of multiple keypoints in the gesture image; during training, the detection head also includes a first keypoint calculation unit, used to recalculate the coordinates and probability values of the multiple keypoints based on the input of the classifier; the loss function of the neural network model includes a loss term between the gesture label and the gesture type predicted by the neural network model, and a loss term between the coordinates and probability values of the multiple keypoints output by the first keypoint calculation unit and those output by the second keypoint calculation unit.
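A loss function with the two terms described above can be sketched as follows. The text does not name the form of either term, so the sketch assumes cross-entropy for the gesture term and mean squared error for the keypoint term, combined with a hypothetical weighting factor `lam`.

```python
import math


def cross_entropy(probs, label_index):
    """Gesture term: negative log-probability of the labeled gesture type."""
    return -math.log(max(probs[label_index], 1e-12))


def keypoint_consistency(first_out, second_out):
    """Keypoint term: mean squared deviation between the outputs of the first
    and second keypoint calculation units. Each output is a list of
    (x, y, prob) tuples in the same keypoint order."""
    diffs = [(a - b) ** 2
             for kp1, kp2 in zip(first_out, second_out)
             for a, b in zip(kp1, kp2)]
    return sum(diffs) / len(diffs)


def total_loss(probs, label_index, first_out, second_out, lam=1.0):
    """Weighted sum of the two terms; lam is a hypothetical weight."""
    return (cross_entropy(probs, label_index)
            + lam * keypoint_consistency(first_out, second_out))


first = [(10.0, 20.0, 0.9), (30.0, 40.0, 0.8)]
second = [(10.0, 22.0, 0.9), (30.0, 40.0, 0.6)]
loss = total_loss([0.5, 0.5], 0, first, second)
```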

[0159] Corresponding to the gesture recognition method provided in the foregoing embodiments, this application also provides a gesture recognition device based on a neural network model. The neural network model includes a feature extraction module, a semantic feature representation module, a key point feature representation module, and a classifier. The device includes:

[0160] An initial feature extraction module is used to extract features from the gesture image to be recognized using the feature extraction module, thereby obtaining initial features from the gesture image; a pose semantic acquisition module is used to obtain pose semantic features based on the initial features using the semantic feature representation module; a key point extraction module is used to obtain the coordinates and probability values of multiple key points in the gesture image based on the initial features using the key point feature representation module, and to determine the Gaussian distribution data of each key point based on the coordinates and probabilities of each key point; a gesture recognition module is used to classify the gesture image into at least one predefined gesture type using the classifier based on the fusion result of the pose semantic features and the Gaussian distribution data of the multiple key points.

[0161] Optionally, the gesture recognition device also includes a multiplication module for:

[0162] The Gaussian thermal distribution data of each key point is obtained by multiplying the posture semantic features with the Gaussian distribution data of each key point.

[0163] Correspondingly, the gesture recognition module is specifically used for:

[0164] The gesture image is classified into at least one predefined gesture type by the classifier based on the Gaussian heat distribution data of each key point.

[0165] Optionally, the semantic feature representation module includes a first transformation unit and a feature processing unit; the pose semantic acquisition module is specifically used for:

[0166] The initial features are converted into a pose matrix by the first transformation unit; the pose semantic features are determined by the feature processing unit based on the correlation between features at different positions in the pose matrix.

[0167] Optionally, the gesture recognition module is specifically used for:

[0168] The gesture type of the gesture image is determined by the classifier based on the concatenation result of the pose matrix and the Gaussian thermal distribution data of the multiple key points.

[0169] Optionally, the gesture recognition device also includes a concatenation control module for:

[0170] If the plurality of key points includes key points with probability values lower than a preset value, then the pose matrix is concatenated with the Gaussian heat distribution data of each key point to obtain a concatenated feature matrix; the concatenated feature matrix is input into the classifier so that the classifier classifies the gesture image into at least one predefined gesture type based on the concatenated feature matrix; if the probability values of the plurality of key points are all greater than or equal to the preset value, then the Gaussian heat distribution data of each key point is input into the classifier so that the classifier classifies the gesture image into at least one predefined gesture type based on the Gaussian heat distribution data of each key point.

[0171] Optionally, the detection head of the neural network model further includes a key point calculation unit; the gesture recognition device further includes a key point recalculation module, used for:

[0172] The coordinates and probability values of the multiple key points are re-determined based on the input of the classifier via the key point calculation unit.

[0173] Optionally, the gesture recognition device further includes a control command generation module for:

[0174] Based on the redetermined coordinates of the multiple key points, the target item for the user's hand operation is determined; based on the gesture type, control instructions for the target item are generated.

[0175] The gesture recognition device based on a neural network model provided in this application can be used to execute the technical solution of the gesture recognition method based on a neural network model provided in any of the above embodiments of this application. The implementation principle and technical effect are similar, and will not be repeated here.

[0176] Corresponding to the model training method provided in the foregoing embodiments, this application also provides a model training apparatus for training the neural network model provided in the first aspect of this application. The model training apparatus includes:

[0177] The sample acquisition module is used to acquire multiple gesture image samples and their gesture labels; the model training module is used to train a neural network model based on the multiple gesture image samples and their gesture labels until the training termination condition is met.

[0178] Optionally, the neural network model includes a backbone network, intermediate layers, and a detection head; the backbone network includes the feature extraction module, the intermediate layers include the semantic feature representation module and the keypoint feature representation module, and the detection head includes the classifier; the keypoint feature representation module includes a second transformation unit and a second keypoint calculation unit, the initial features are transformed by the second transformation unit and then input to the second keypoint calculation unit to obtain the coordinates and probability values of multiple keypoints in the gesture image; during training, the detection head also includes a first keypoint calculation unit, used to recalculate the coordinates and probability values of the multiple keypoints based on the input of the classifier; the loss function of the neural network model includes a loss term for the gesture label and the gesture type predicted by the neural network model, and a loss term for the coordinates and probability values of the multiple keypoints output by the first keypoint calculation unit and the coordinates and probability values of the multiple keypoints output by the second keypoint calculation unit.

[0179] The model training apparatus provided in this application embodiment can be used to execute the technical solution of the model training method provided in any of the above embodiments of this application. Its implementation principle and technical effect are similar, and will not be described again in this embodiment.

[0180] Figure 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 8, the electronic device provided in this embodiment may include: at least one processor 801; and a memory 802 communicatively connected to the at least one processor; wherein the memory 802 stores instructions that can be executed by the at least one processor 801, and the instructions are executed by the at least one processor 801 to cause the electronic device to perform the method as described in any of the above embodiments.

[0181] Optionally, the memory 802 can be either standalone or integrated with the processor 801.

[0182] The implementation principle and technical effects of the electronic device provided in this embodiment can be found in the foregoing embodiments, and will not be repeated here.

[0183] This application also provides a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are executed by a processor, the methods provided in any of the foregoing embodiments can be implemented.

[0184] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the method provided in any of the foregoing embodiments.

[0185] In the several embodiments provided in this application, it should be understood that the disclosed devices and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For instance, the division of modules is only a logical functional division, and other divisions are possible in actual implementation: multiple modules may be combined or integrated into another system, or some features may be ignored or not executed.

[0186] The integrated modules implemented as software functional modules described above can be stored in a computer-readable storage medium. These software functional modules, stored in a storage medium, include several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some steps of the methods described in the various embodiments of this application.

[0187] It should be understood that the aforementioned processor can be a Central Processing Unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), etc. A general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in this application can be executed directly by a hardware processor, or by a combination of hardware and software modules within the processor. The memory may include high-speed memory and may also include non-volatile memory, such as at least one disk storage device; it may also be a USB flash drive, external hard drive, read-only memory, magnetic disk, or optical disc, etc.

[0188] The aforementioned storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory, electrically erasable programmable read-only memory, erasable programmable read-only memory, programmable read-only memory, read-only memory, magnetic storage, flash memory, magnetic disk, or optical disk. The storage medium can be any available medium that can be accessed by a general-purpose or special-purpose computer.

[0189] An exemplary storage medium is coupled to a processor, enabling the processor to read information from and write information to the storage medium. Alternatively, the storage medium can be an integral part of the processor. The processor and storage medium can reside within an application-specific integrated circuit (ASIC). Alternatively, the processor and storage medium can exist as discrete components in an electronic device.

[0190] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0191] The sequence numbers of the embodiments in this application are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0192] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods provided in the various embodiments of this application.

[0193] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this application are indicated by the following claims.

[0194] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims

1. A gesture recognition method based on a neural network model, characterized in that the neural network model includes a feature extraction module, a semantic feature representation module, a key point feature representation module, and a classifier; the method includes: using the feature extraction module, extracting features from the gesture image to be recognized to obtain initial features; using the semantic feature representation module, obtaining pose semantic features based on the initial features; using the key point feature representation module, obtaining the coordinates and probability values of multiple key points in the gesture image based on the initial features, and determining the Gaussian distribution data of each key point based on the coordinates and probability value of that key point; and using the classifier, classifying the gesture image into at least one predefined gesture type based on the fusion result of the pose semantic features and the Gaussian distribution data of the multiple key points.

2. The method according to claim 1, characterized in that the method further includes: calculating the product of the pose semantic features and the Gaussian distribution data of each key point to obtain the Gaussian heat distribution data of each key point; and the step of using the classifier to classify the gesture image into at least one predefined gesture type based on the fusion result of the pose semantic features and the Gaussian distribution data of the multiple key points includes: using the classifier, classifying the gesture image into at least one predefined gesture type based on the Gaussian heat distribution data of each key point.

3. The method according to claim 2, characterized in that the semantic feature representation module includes a first conversion unit and a feature processing unit; the step of using the semantic feature representation module to obtain pose semantic features based on the initial features includes: using the first conversion unit, converting the initial features into a pose matrix; and using the feature processing unit, determining the pose semantic features based on the correlation between features at different positions in the pose matrix.

4. The method according to claim 3, characterized in that the step of using the classifier to classify the gesture image into at least one predefined gesture type based on the Gaussian heat distribution data of each key point includes: using the classifier, classifying the gesture image into at least one predefined gesture type based on the concatenation result of the pose matrix and the Gaussian heat distribution data of the multiple key points.

5. The method according to claim 3, characterized in that the method further includes: if the multiple key points include a key point with a probability value lower than a preset value, concatenating the pose matrix with the Gaussian heat distribution data of each of the multiple key points to obtain a concatenated feature matrix; and inputting the concatenated feature matrix into the classifier, so that the classifier classifies the gesture image into at least one predefined gesture type based on the concatenated feature matrix.

6. The method according to claim 3, characterized in that the method further includes: if the probability values of the multiple key points are all greater than or equal to a preset value, inputting the Gaussian heat distribution data of each key point into the classifier, so that the classifier classifies the gesture image into at least one predefined gesture type based on the Gaussian heat distribution data of each key point.

7. The method according to any one of claims 1-6, characterized in that the neural network model further includes a key point calculation unit; the method further includes: using the key point calculation unit, re-determining the coordinates and probability values of the multiple key points based on the input of the classifier.

8. The method according to claim 7, characterized in that the method further includes: determining, based on the re-determined coordinates of the multiple key points, the target item operated on by the user's hand; and generating, based on the gesture type, control instructions for the target item.

9. A model training method, characterized in that the method is used for training the neural network model according to any one of claims 1-8, and the method includes: acquiring multiple gesture image samples and their gesture labels; and training the neural network model based on the multiple gesture image samples and their gesture labels until a training termination condition is met.

10. The method according to claim 9, characterized in that the neural network model includes a backbone network, intermediate layers, and a detection head; the backbone network includes the feature extraction module, the intermediate layers include the semantic feature representation module and the key point feature representation module, and the detection head includes the classifier; the key point feature representation module includes a second conversion unit and a second key point calculation unit, and the initial features are converted by the second conversion unit and then input into the second key point calculation unit to obtain the coordinates and probability values of multiple key points in the gesture image; during training, the detection head also includes a first key point calculation unit, which is used to recalculate the coordinates and probability values of the multiple key points based on the input of the classifier; and the loss function of the neural network model includes a loss term between the gesture label and the gesture type predicted by the neural network model, and a loss term between the coordinates and probability values of the multiple key points output by the first key point calculation unit and those output by the second key point calculation unit.

11. A gesture recognition device based on a neural network model, characterized in that the neural network model includes a feature extraction module, a semantic feature representation module, a key point feature representation module, and a classifier; the device includes: an initial feature extraction module, used to extract features from the gesture image to be recognized using the feature extraction module to obtain initial features; a pose semantic acquisition module, used to obtain pose semantic features based on the initial features using the semantic feature representation module; a key point extraction module, used to obtain the coordinates and probability values of multiple key points in the gesture image based on the initial features using the key point feature representation module, and to determine the Gaussian distribution data of each key point based on the coordinates and probability value of that key point; and a gesture recognition module, used to classify the gesture image into at least one predefined gesture type using the classifier based on the fusion result of the pose semantic features and the Gaussian distribution data of the multiple key points.

12. A model training device, characterized in that the device is used for training the neural network model according to any one of claims 1-8, and the device includes: a sample acquisition module, used to acquire multiple gesture image samples and their gesture labels; and a model training module, used to train the neural network model based on the multiple gesture image samples and their gesture labels until a training termination condition is met.

13. An electronic device, characterized in that it includes: a processor and a memory, wherein code is stored in the memory, and the processor executes the code stored in the memory to perform the method as described in any one of claims 1-10.

14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the method as described in any one of claims 1-10.

15. A computer program product, characterized in that it includes a computer program that, when executed by a processor, implements the method as described in any one of claims 1-10.
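
The fusion steps recited in claims 1, 2, 5 and 6 can be illustrated by the sketch below. It is not part of the claims and is not the applicant's implementation. The claims do not give the form of the Gaussian distribution data or of the product, so the following are assumptions made for this example: each key point yields a 2D Gaussian map centered on its coordinates and scaled by its probability value, the product is element-wise, concatenation is modeled as stacking maps in a list, and the pose matrix and pose semantic features share the same height and width. The names `gaussian_map`, `heat_map`, `classifier_input`, `sigma`, and the default `preset_value` of 0.5 are hypothetical.

```python
import math


def gaussian_map(x, y, p, height, width, sigma=1.0):
    """Assumed Gaussian distribution data: peak value p at column x, row y."""
    return [
        [p * math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2.0 * sigma ** 2))
         for col in range(width)]
        for row in range(height)
    ]


def heat_map(pose_features, gauss):
    """Element-wise product of pose semantic features and one Gaussian map."""
    return [[f * g for f, g in zip(f_row, g_row)]
            for f_row, g_row in zip(pose_features, gauss)]


def classifier_input(pose_matrix, pose_features, keypoints,
                     preset_value=0.5, sigma=1.0):
    """Build the classifier input from (x, y, p) key points.

    If any key point has a probability value below preset_value, the pose
    matrix is stacked in front of the heat maps (claim 5); otherwise only
    the heat maps are returned (claim 6).
    """
    height, width = len(pose_features), len(pose_features[0])
    heats = [
        heat_map(pose_features, gaussian_map(x, y, p, height, width, sigma))
        for x, y, p in keypoints
    ]
    if any(p < preset_value for _, _, p in keypoints):
        return [pose_matrix] + heats
    return heats
```

Under these assumptions, a key point with a low probability value contributes a weak heat map, and the added pose matrix gives the classifier the global pose information that the weak map lacks.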