Gesture recognition method and apparatus, electronic device, and storage medium

By performing hand detection and recognition on video images and updating the gesture data list, the accuracy problem of gesture recognition in occluded situations is solved, improving the accuracy of gesture recognition and the user experience.

CN117133017B · Active · Publication Date: 2026-06-26 · CHENGDU BOE SMART TECH CO LTD +2


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHENGDU BOE SMART TECH CO LTD
Filing Date: 2023-08-28
Publication Date: 2026-06-26

Application Information

Patent Timeline

28 Aug 2023: Application

26 Jun 2026: Publication

CN117133017B

IPC: G06V40/10; G06V10/25; G06V10/26; G06V10/764

CPC: G06V40/113; G06V10/25; G06V10/267; G06V10/764

AI Tagging

Technology Topics

Video image; Gesture classification


AI Technical Summary

Technical Problem

Existing technology cannot correctly detect gestures when another object obstructs the view between the user's hand and the camera lens, which lowers gesture recognition accuracy and degrades the user experience.

Method used

Hand detection and recognition are performed on video images to obtain a gesture classification, and the historical gesture data list (gesture position, gesture classification, and gesture control right) is updated accordingly, so that the control right is tracked across frames and misidentification is avoided.

Benefits of technology

It improves the accuracy of gesture recognition, enhances the user experience, and ensures accurate gesture recognition even when the gesture is obscured.


Abstract

The present disclosure provides a gesture recognition method, device, electronic equipment and storage medium, the method comprising: detecting a current video image to obtain at least one hand image containing a hand; then, recognizing each hand image to obtain a gesture classification; thereafter, updating a historical gesture data list according to the gesture classification to obtain a target gesture data list, the target gesture data list comprising a gesture position, a gesture classification and a gesture control right. The present scheme updates the historical gesture data list by gesture classification to track the gesture control right, avoids the problem of misrecognition or inaccurate control right caused by occlusion in the recognition process, and is beneficial to improve the accuracy of gesture recognition.


Description

Technical Field

[0001] This disclosure relates to the field of image processing technology, and in particular to a gesture recognition method, apparatus, electronic device, and storage medium.

Background Technology

[0002] Gesture recognition technology offers excellent human-computer interaction and improves ease of use, and has received widespread attention in the industry. However, when other objects obstruct the view between the user's hand and the camera lens, the user's hand gestures cannot be accurately detected, thus reducing the user experience.

Summary of the Invention

[0003] This disclosure provides a gesture recognition method, apparatus, electronic device, and storage medium to solve the aforementioned technical problems.

[0004] According to a first aspect of this disclosure, a gesture recognition method is provided, the method comprising:

[0005] Detect the current video image to obtain at least one hand image containing a hand;

[0006] Each hand image is identified to obtain a gesture classification;

[0007] The historical gesture data list is updated based on the gesture classification to obtain the target gesture data list, which includes the gesture position, gesture classification, and gesture control right.
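The gesture data list described above can be pictured as a small record per hand. The following is only a sketch under assumptions: the names `GestureEntry` and `GestureDataList`, the `(x, y, width, height)` box layout, and the boolean control flag are illustrative and are not specified by the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed box layout: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class GestureEntry:
    box: Box                   # gesture position (hand detection box)
    label: Optional[str]       # gesture classification, None if unknown
    has_control: bool = False  # whether this hand holds the control right


@dataclass
class GestureDataList:
    entries: List[GestureEntry] = field(default_factory=list)

    def controller(self) -> Optional[GestureEntry]:
        """Return the entry that holds the control right, if any."""
        for entry in self.entries:
            if entry.has_control:
                return entry
        return None


history = GestureDataList([
    GestureEntry((10, 20, 80, 90), "ok", has_control=True),
    GestureEntry((200, 40, 30, 30), "fist"),
])
print(history.controller().label)
```

Keeping the control flag on the entry itself makes it straightforward to carry the control right from one frame to the next while the position and classification are refreshed.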

[0008] Optionally, the current video image is detected to obtain at least one hand image containing a hand, including:

[0009] Hand detection is performed on the current video image to obtain a first hand detection result; the first hand detection result includes a first detection result in which a hand is detected or a second detection result in which a hand is not detected;

[0010] In response to determining that the first hand detection result is the second detection result, the current video image is divided into a preset number of image sub-blocks, and hand gesture detection is performed on each image sub-block to obtain the second hand detection result;

[0011] Redundancy filtering is performed on the first hand detection result and the second hand detection result to obtain the hand detection box;

[0012] Cropping the area containing the hand detection box in the current video image yields at least one hand image containing the gesture.

[0013] Optionally, the current video image is detected to obtain at least one hand image containing a hand, including:

[0014] A hand recognition model is obtained, wherein the input data of the hand recognition model is a video image, and the output data is the edge coordinate data of the hand region in the video image;

[0015] The current video image is input into the hand recognition model to obtain the edge coordinate data of the hand region in the current video image;

[0016] The current video image is cropped based on the edge coordinate data of the hand region to obtain at least one hand image containing the hand.

[0017] Optionally, the hand recognition model is trained through the following steps:

[0018] Obtain a set of hand image training samples, the set of hand image training samples includes multiple hand image training samples, each hand image training sample includes at least one gesture;

[0019] Each hand image training sample is sequentially input into the hand recognition model to obtain the hand recognition result;

[0020] Obtain the loss value between the hand recognition result and the labeled data of the hand image training samples;

[0021] In response to the loss value being less than or equal to a preset loss value threshold, the training of the hand recognition model is stopped, and the hand recognition model is obtained.
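The stopping rule in the training steps above (stop once the loss is at or below a preset threshold) can be illustrated on a toy problem. This is a hedged sketch only: it fits a one-parameter linear model with squared error rather than a hand recognition network, and the function name, learning rate, and epoch cap are assumptions made for the example.

```python
from typing import List, Tuple


def train_until_threshold(samples: List[Tuple[float, float]],
                          loss_threshold: float = 1e-4,
                          learning_rate: float = 0.05,
                          max_epochs: int = 1000) -> Tuple[float, float, int]:
    """Fit y = weight * x and stop once the mean squared error
    over an epoch is less than or equal to loss_threshold."""
    weight = 0.0
    loss = float("inf")
    for epoch in range(1, max_epochs + 1):
        loss = 0.0
        for x, y in samples:          # samples are fed in sequentially
            error = weight * x - y    # prediction minus labeled value
            loss += error * error
            weight -= learning_rate * 2 * error * x  # gradient step
        loss /= len(samples)
        if loss <= loss_threshold:    # preset loss value threshold reached
            return weight, loss, epoch
    return weight, loss, max_epochs


weight, loss, epochs = train_until_threshold([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
print(round(weight, 3), epochs)
```

The epoch cap is a safety net added for the sketch; the disclosure itself only names the loss threshold as the stopping condition.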

[0022] Optionally, the hand image training sample set includes a publicly available subset of image training samples and / or a customized subset of image training samples. Obtaining the customized subset of image training samples from the hand image training sample set includes:

[0023] Generate a customized image acquisition interface, which includes an image acquisition frame, a hand type label frame, a shooting control, and a save control;

[0024] In response to the detection that the shooting control is triggered, the captured customized image is displayed in the image acquisition frame;

[0025] Obtain the annotation data entered in the hand type annotation box;

[0026] In response to the detection that the save control is triggered, an initial sample of a customized image is generated, containing the customized image and the annotation data.

[0027] The initial customized image samples are subjected to preset processing to obtain multiple customized image samples, which are used as a subset of the customized image training samples.

[0028] The preset processing includes at least one of the following: size transformation, angle rotation, mosaic processing, and filtering processing.

[0029] Optionally, the historical gesture data list is updated according to the gesture classification to obtain the target gesture data list, including:

[0030] The hand images are sorted according to the size of the hand detection boxes, and the hand detection boxes and their gesture classifications in the historical gesture data list are updated.

[0031] In response to the control right being empty, the control right in the historical gesture data list is assigned to the hand image with the largest hand detection box, and the control right is locked to obtain the target gesture data list.
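A minimal sketch of this update, assuming that "size" means box area and that the control right is a boolean flag on each hand (both are assumptions, as are the names `Hand` and `assign_control`):

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hand:
    box: Tuple[int, int, int, int]  # (x, y, width, height)
    label: str                      # gesture classification
    has_control: bool = False       # control right flag


def box_area(hand: Hand) -> int:
    return hand.box[2] * hand.box[3]


def assign_control(hands: List[Hand]) -> List[Hand]:
    """Sort hands from largest to smallest box; if no hand holds the
    control right, give it to the hand with the largest box."""
    ordered = sorted(hands, key=box_area, reverse=True)
    if ordered and not any(hand.has_control for hand in ordered):
        ordered[0].has_control = True
    return ordered


hands = [Hand((0, 0, 10, 10), "fist"), Hand((50, 50, 40, 30), "ok")]
result = assign_control(hands)
print([(hand.label, hand.has_control) for hand in result])
```

Because the flag is only set when no hand already holds it, a later and larger hand entering the frame does not take over, which is one way to read "locking" the control right.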

[0032] Optionally, the historical gesture data list is updated according to the gesture classification to obtain the target gesture data list, including:

[0033] In response to the fact that the control right is empty and no gesture category is detected, or in response to the fact that the control right is not empty and the duration for which the gesture with control right is not detected exceeds a preset duration, the video image is re-detected to obtain a third hand detection result.

[0034] The historical gesture data list is updated based on the third hand detection result.

[0035] Optionally, updating the historical gesture data list based on the third hand detection result includes:

[0036] In response to the third hand detection result being empty, the control right of the hand detection box in the historical gesture data list is released.

[0037] Optionally, updating the historical gesture data list based on the third hand detection result includes:

[0038] In response to the third hand detection result being non-empty, gesture classification is performed on the hand image, and the gesture classification of the hand detection box in the historical gesture data list is updated.
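The re-detection conditions and the two outcomes of the third hand detection result can be sketched as follows. This is an illustration under assumptions: `ControlState`, the use of seconds for the preset duration, and taking the first re-detected label as the tracked hand's classification are all hypothetical simplifications.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class ControlState:
    control_held: bool     # whether some hand holds the control right
    label: Optional[str]   # last gesture classification of that hand


def needs_redetection(state: ControlState, gesture_detected: bool,
                      seconds_unseen: float, timeout: float) -> bool:
    """Re-detect when no control right is held and nothing was detected,
    or when the controlling gesture has been unseen longer than timeout."""
    if not state.control_held:
        return not gesture_detected
    return seconds_unseen > timeout


def apply_third_result(state: ControlState,
                       third_labels: List[str]) -> ControlState:
    """Empty result: release the control right. Non-empty result:
    refresh the stored classification (first label, by assumption)."""
    if not third_labels:
        return replace(state, control_held=False)
    return replace(state, label=third_labels[0])


state = ControlState(control_held=True, label="ok")
print(needs_redetection(state, True, 3.0, 2.0))
print(apply_third_result(state, []))
```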

[0039] Optionally, the historical gesture data list is updated according to the gesture classification to obtain the target gesture data list, including:

[0040] The hand images are sorted according to the size of the hand detection boxes, and the hand detection boxes and their gesture classifications in the historical gesture data list are updated.

[0041] In response to the gesture classification of the hand detection box that holds the control right in the current video image differing from its gesture classification for the previous video frame in the historical gesture data list, the control right of that hand detection box is released.
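As a rough illustration of this release rule, the sketch below follows a single hand that starts out holding the control right and drops it at the first frame whose classification differs from the previous frame. The function name and the list-of-labels input are assumptions made for the example.

```python
from typing import List


def track_control(labels_per_frame: List[str]) -> List[bool]:
    """Return, per frame, whether the hand still holds the control right.
    The hand is assumed to hold it at the first frame."""
    if not labels_per_frame:
        return []
    has_control = True
    previous = labels_per_frame[0]
    flags = []
    for label in labels_per_frame:
        if has_control and label != previous:
            has_control = False  # classification changed: release
        flags.append(has_control)
        previous = label
    return flags


print(track_control(["ok", "ok", "fist", "fist"]))
# prints [True, True, False, False]
```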

[0042] According to a second aspect of this disclosure, a gesture recognition device is provided, the device comprising:

[0043] The hand image acquisition module is used to detect the current video image and obtain at least one hand image containing a hand;

[0044] The gesture classification acquisition module is used to identify each hand image and obtain a gesture classification.

[0045] The gesture list acquisition module is used to update the historical gesture data list according to the gesture classification to obtain the target gesture data list, which includes the gesture position, gesture classification, and gesture control right.

[0046] Optionally, the hand image acquisition module includes:

[0047] The first result detection module is used to perform hand detection on the current video image and obtain a first hand detection result; the first hand detection result includes a first detection result in which a hand is detected or a second detection result in which a hand is not detected;

[0048] The second result detection module is used to, in response to determining that the first hand detection result is the second detection result, divide the current video image into a preset number of image sub-blocks, and perform hand gesture detection on each image sub-block to obtain the second hand detection result;

[0049] The hand detection box acquisition module is used to perform redundancy filtering on the first hand detection result and the second hand detection result to obtain the hand detection box;

[0050] The hand image acquisition module is used to crop the area where the hand detection box is located in the current video image to obtain at least one hand image containing a gesture.

[0051] Optionally, the hand image acquisition module includes:

[0052] The recognition model acquisition submodule is used to acquire a hand recognition model. The input data of the hand recognition model is a video image, and the output data is the edge coordinate data of the hand region in the video image.

[0053] The edge coordinate acquisition submodule is used to input the current video image into the hand recognition model to obtain the edge coordinate data of the hand region in the current video image;

[0054] The hand image acquisition module is used to crop the current video image based on the edge coordinate data of the hand region to obtain at least one hand image containing the hand.

[0055] Optionally, the device further includes a model training module for training the hand recognition model, the model training module comprising:

[0056] The sample set acquisition submodule is used to acquire a hand image training sample set, which includes multiple hand image training samples, and each hand image training sample includes at least one gesture.

[0057] The recognition result acquisition submodule is used to sequentially input each hand image training sample into the hand recognition model to obtain the hand recognition result;

[0058] The loss value acquisition submodule is used to acquire the loss value between the hand recognition result and the labeled data of the hand image training samples;

[0059] The recognition model acquisition submodule is used to stop hand recognition model training in response to the loss value being less than or equal to a preset loss value threshold, thereby obtaining the hand recognition model.

[0060] Optionally, the hand image training sample set includes a publicly available image training sample subset and / or a customized image training sample subset, and the sample set acquisition submodule includes:

[0061] The image acquisition interface generation unit is used to generate a customized image acquisition interface, which includes an image acquisition frame, a hand type annotation frame, a shooting control, and a save control.

[0062] A customized image acquisition unit is used to display the captured customized image within the image acquisition frame in response to detecting that the shooting control is triggered;

[0063] The annotation data acquisition unit is used to acquire the annotation data input within the hand type annotation box;

[0064] An initial sample generation unit is configured to generate an initial sample of a customized image containing the customized image and the annotation data in response to detecting that the save control is triggered.

[0065] The sample subset acquisition unit is used to perform preset processing on the initial sample of the customized image to obtain multiple customized image samples, which are used as the training sample subset of the customized image.

[0066] The preset processing includes at least one of the following: size transformation, angle rotation, mosaic processing, and filtering processing.

[0067] Optionally, the gesture list acquisition module includes:

[0068] The history list update submodule is used to sort each hand image according to the size of the hand detection box, and update each hand detection box and its gesture category in the history gesture data list;

[0069] The gesture list acquisition submodule is used to, in response to the gesture category being non-empty, assign the control right in the historical gesture data list to the hand image with the largest hand detection box and lock the control right, thereby obtaining the target gesture data list.

[0070] Optionally, the gesture list acquisition module includes:

[0071] The third result acquisition submodule is used to re-detect the video image in response to the fact that the control right is empty and no gesture category is detected, or in response to the fact that the control right is not empty and the gesture with control right has not been detected for a duration exceeding a preset duration, to obtain the third hand detection result.

[0072] The gesture list acquisition submodule is used to update the historical gesture data list based on the third hand detection result.

[0073] Optionally, the gesture list acquisition submodule includes:

[0074] The control release submodule is used to release the control right of the hand detection box in the historical gesture data list in response to the third hand detection result being empty.

[0075] Optionally, the gesture list acquisition submodule includes:

[0076] The control update submodule is used to perform gesture classification on the hand image and update the gesture classification of the hand detection box in the historical gesture data list in response to the third hand detection result being non-empty.

[0077] Optionally, the gesture list acquisition module includes:

[0078] The gesture classification update submodule is used to sort each hand image according to the size of the hand detection box and update each hand detection box and its gesture classification in the historical gesture data list.

[0079] The control release submodule is used to release the control right of the hand detection box in response to the gesture classification of the hand detection box that holds the control right in the current video image differing from its gesture classification for the previous video frame in the historical gesture data list.

[0080] According to a third aspect of this disclosure, an electronic device is provided, comprising:

[0081] Processor and memory;

[0082] The memory is used to store computer programs that can be executed by the processor;

[0083] The processor is configured to execute a computer program in the memory to implement the method as described in any of the first aspects.

[0084] According to a fourth aspect of this disclosure, a non-transitory computer-readable storage medium is provided, which, when an executable computer program in the storage medium is executed by a processor, enables the implementation of the method as described in any of the first aspects.

[0085] The technical solutions provided by the embodiments of this disclosure may include the following beneficial effects:

[0086] The solution provided in this embodiment detects the current video image to obtain at least one hand image containing a hand, identifies each hand image to obtain a gesture classification, and then updates the historical gesture data list according to the gesture classification to obtain a target gesture data list, which includes the gesture position, gesture classification, and gesture control right. By updating the historical gesture data list through gesture classification, the solution tracks the gesture control right and avoids misidentification or an inaccurate control right caused by occlusion during recognition, thus improving the accuracy of gesture recognition and enhancing the user experience.

[0087] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.

Attached Figure Description

[0088] Figure 1 is a flowchart of a gesture recognition method according to an embodiment of the present disclosure.

[0089] Figure 2 is a flowchart of acquiring a hand image according to an embodiment of the present disclosure.

[0090] Figure 3 is a flowchart of obtaining a customized subset of image training samples according to an embodiment of the present disclosure.

[0091] Figure 4 is a schematic diagram of an image acquisition interface according to an embodiment of the present disclosure.

[0092] Figure 5 is a schematic diagram of inputting annotation data according to an embodiment of the present disclosure.

[0093] Figure 6 is a flowchart of a control right update according to an embodiment of the present disclosure.

[0094] Figure 7 is a schematic diagram of a control gesture according to an embodiment of the present disclosure.

[0095] Figure 8 is a flowchart of a gesture recognition method according to an embodiment of the present disclosure.

[0096] Figure 9 is a block diagram of a gesture recognition device according to an embodiment of the present disclosure.

Detailed Implementation

[0097] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses consistent with some aspects of this disclosure as detailed in the appended claims.

[0098] To address the aforementioned technical problems, this disclosure provides a gesture recognition method, apparatus, electronic device, and storage medium. The gesture recognition method is applicable to electronic devices with interactive functions, including but not limited to mobile phones, computers, tablets, e-readers, 3D displays, all-in-one conference machines, and electronic whiteboards. Referring to Figure 1, the gesture recognition method provided in this embodiment includes steps 11 to 13.

[0099] In step 11, the current video image is detected to obtain at least one hand image containing a hand.

[0100] In one embodiment, the electronic device can detect the current video image to obtain at least one hand image containing a hand; referring to Figure 2, this includes steps 21 to 24.

[0101] In step 21, hand detection is performed on the current video image to obtain a first hand detection result; the first hand detection result includes a first detection result in which a hand is detected or a second detection result in which a hand is not detected.

[0102] In this step, the electronic device stores a hand detection model for detecting hands in the input image. This hand detection model can be implemented using a fast inference network model, such as the RFB-Net network model. Technicians can choose the appropriate model, and the corresponding solution falls within the protection scope of this disclosure.

[0103] In this step, the hand detection model mentioned above is a pre-trained network model, and its training process may include:

[0104] (1) Construct a training sample set.

[0105] A set of images containing human hands is selected, with each image showing a different hand pose (e.g., front view, top view, side view, etc.) to form an initial sample set. Then, enhancement processing can be applied to each image in the initial sample set. For example, at least two images can be randomly selected and stitched together to create a stitched image, which is then added to the initial sample set, thus increasing the number of images in the initial sample set. Alternatively, the size of each image in the initial sample set can be randomly reduced or enlarged to obtain scaled images of different sizes, which are then added to the initial sample set, further increasing the number of images. Another option is to resize each image in the initial sample set to obtain images of the same size, thus obtaining a training sample set.
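The enhancement options described above (stitching, random scaling, and resizing to a common size) can be sketched on tiny grayscale images stored as nested lists. This is illustrative only: the helper names, nearest-neighbour resampling, and side-by-side stitching are assumptions, since the disclosure does not fix how stitching or scaling is carried out.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values


def resize_nearest(image: Image, out_w: int, out_h: int) -> Image:
    """Resize with nearest-neighbour sampling (used for scaling and
    for bringing all samples to the same size)."""
    in_h, in_w = len(image), len(image[0])
    return [[image[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]


def stitch_horizontal(left: Image, right: Image) -> Image:
    """Place two images side by side, matching the right image's
    height to the left one first."""
    right_resized = resize_nearest(right, len(right[0]), len(left))
    return [l_row + r_row for l_row, r_row in zip(left, right_resized)]


a = [[1, 2], [3, 4]]
b = [[9]]
stitched = stitch_horizontal(a, b)
enlarged = resize_nearest(a, 4, 4)
print(stitched)
print(enlarged[0], enlarged[3])
```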

[0106] (2) Select a hand detection model.

[0107] To construct a hand detection model, a lightweight network model with fast inference speed, such as the RFB-Net network model, should be selected. In one example, the initial hand detection model can be trained using transfer learning to improve training efficiency.

[0108] (3) Train the hand detection model.

[0109] The training sample set is divided into a training subset and a validation subset. Images from the training subset are sequentially input into the initial hand detection model for training. Images from the validation subset are then input into the trained hand detection model, and the recognition accuracy is calculated. The hand detection model is considered complete when the recognition accuracy exceeds a preset threshold (e.g., 98%, adjustable). Alternatively, the hand detection model is considered complete when the number of images input into the training subset exceeds a preset threshold (e.g., tens of thousands, adjustable).
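The split into training and validation subsets and the two completion criteria can be sketched as below. The 20% validation fraction, the fixed shuffle seed, and the 50,000-image default are hypothetical values chosen for the example; the disclosure only gives 98% and "tens of thousands" as adjustable examples.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[int, int]  # (input, label); stand-ins for image and annotation


def split_samples(samples: List[Sample], validation_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle and split into (training subset, validation subset)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def accuracy(predict: Callable[[int], int], validation: List[Sample]) -> float:
    correct = sum(1 for x, y in validation if predict(x) == y)
    return correct / len(validation)


def training_complete(acc: float, images_seen: int,
                      acc_threshold: float = 0.98,
                      seen_threshold: int = 50_000) -> bool:
    """Either criterion is enough to consider training complete."""
    return acc > acc_threshold or images_seen > seen_threshold


samples = [(i, i % 2) for i in range(10)]
train, val = split_samples(samples)
print(len(train), len(val), accuracy(lambda x: x % 2, val))
```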

[0110] In this step, when the electronic device acquires the video stream, it can sequentially process each image as if it were the current video image. Subsequent embodiments will only describe the processing of a single current video image. The electronic device can input the current video image into a hand detection model for hand detection, obtaining a hand detection result, which will be referred to hereafter as the first hand detection result for distinction. This first hand detection result includes either a first detection result indicating that a hand was detected or a second detection result indicating that a hand was not detected.

[0111] Understandably, the first detection result may include the location of the hand in the current video image, the location of the hand detection box in the hand area, etc. The content of the detection result can be set according to the specific scenario, and is not limited here.

[0112] In step 22, in response to determining that the first hand detection result is the second detection result, the current video image is divided into a preset number of image sub-blocks, and hand gesture detection is performed on each image sub-block to obtain the second hand detection result.

[0113] In this step, the first hand detection result may be the second detection result, i.e., no hand is detected in the current video image. In this case, in response to determining that the first hand detection result is the second detection result, the electronic device segments the current video image into a preset number of image sub-blocks. The preset number can include, but is not limited to, 4, 9, 16, 25, etc., and can be set according to the specific scenario. Then, each image sub-block is enlarged to the size of the current video image and input into the hand detection model for hand detection, obtaining a hand detection result, which will be referred to as the second hand detection result for distinction. The second hand detection result can include either the first detection result where a hand was detected or the second detection result where no hand was detected.

[0114] In this step, the current video image is segmented into image sub-blocks and then enlarged for processing. This can help find smaller hands that were not detected in the current video image, thereby improving the accuracy of hand detection.
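A minimal sketch of the split-and-enlarge step, assuming a square grid, image dimensions divisible by the grid size (leftover pixels are dropped otherwise), and nearest-neighbour enlargement. The function names are illustrative, and the detector itself is out of scope here.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image as rows of pixel values


def split_into_tiles(image: Image, grid: int) -> List[Tuple[int, int, Image]]:
    """Split the image into grid x grid sub-blocks and return
    (x_offset, y_offset, sub_block) for each one."""
    height, width = len(image), len(image[0])
    tile_h, tile_w = height // grid, width // grid
    tiles = []
    for row in range(grid):
        for col in range(grid):
            y0, x0 = row * tile_h, col * tile_w
            tile = [r[x0:x0 + tile_w] for r in image[y0:y0 + tile_h]]
            tiles.append((x0, y0, tile))
    return tiles


def upscale_nearest(tile: Image, out_w: int, out_h: int) -> Image:
    """Enlarge a sub-block back to the full frame size."""
    in_h, in_w = len(tile), len(tile[0])
    return [[tile[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]


image = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = split_into_tiles(image, 2)   # 4 sub-blocks
enlarged = upscale_nearest(tiles[0][2], 4, 4)
print(tiles[3])
print(enlarged)
```

Returning the offsets alongside each sub-block keeps the information needed later to map any detection back into the coordinates of the full frame.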

[0115] It should be noted that the above embodiment describes segmenting the current video image once. If the second hand detection result is still the second detection result, i.e., no hand is detected within any image sub-block, each image sub-block can be further segmented into multiple image sub-blocks, with the same number of sub-blocks as in the first segmentation. Considering that detection on the unsegmented image followed by one round of segmentation and re-detection usually meets the requirements, in one example the number of sub-blocks can decrease with each further segmentation. For example, the current video image is first segmented into 9 image sub-blocks; the second time, each image sub-block is segmented into 4 image sub-blocks; and the third time, each image sub-block from the second segmentation is segmented into 2 image sub-blocks. After each segmentation, the size of the segmented image sub-blocks is adjusted and hand detection is performed to obtain the second hand detection result. That is to say, in one example, the electronic device can perform at least one round of segmentation and hand detection on the current video image to improve the accuracy of hand detection.
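The 9, then 4, then 2 schedule from the example can be sketched as a loop that keeps subdividing until some sub-block yields a detection. Everything here is an assumption for illustration: regions are `(x, y, width, height)` tuples, the detector is passed in as a callable that stands for "enlarge the region and run hand detection", and `make_fake_detector` is a stand-in that only "sees" a hand once the region is narrow enough.

```python
from typing import Callable, List, Sequence, Tuple

Region = Tuple[int, int, int, int]  # (x, y, width, height)


def split_region(region: Region, rows: int, cols: int) -> List[Region]:
    x, y, w, h = region
    tile_w, tile_h = w // cols, h // rows
    return [(x + c * tile_w, y + r * tile_h, tile_w, tile_h)
            for r in range(rows) for c in range(cols)]


def detect_with_segmentation(
        frame: Region,
        detect: Callable[[Region], List[Region]],
        schedule: Sequence[Tuple[int, int]] = ((3, 3), (2, 2), (1, 2)),
) -> List[Region]:
    """Detect on the whole frame; while nothing is found, subdivide
    every current region by the next (rows, cols) entry and retry."""
    regions = [frame]
    boxes = detect(frame)
    for rows, cols in schedule:
        if boxes:
            break
        regions = [sub for region in regions
                   for sub in split_region(region, rows, cols)]
        boxes = [box for region in regions for box in detect(region)]
    return boxes


def make_fake_detector(hand: Region, max_region_width: int):
    """Stand-in detector: finds the hand only if it lies fully inside
    the region and the region is at most max_region_width wide."""
    def detect(region: Region) -> List[Region]:
        x, y, w, h = region
        hx, hy, hw, hh = hand
        inside = x <= hx and y <= hy and hx + hw <= x + w and hy + hh <= y + h
        return [hand] if inside and w <= max_region_width else []
    return detect


hand = (100, 50, 20, 20)
frame = (0, 0, 900, 600)
print(detect_with_segmentation(frame, make_fake_detector(hand, 150)))
```

In this run the hand is missed on the full frame and on the 9 first-level sub-blocks (300 wide), then found on a second-level sub-block (150 wide).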

[0116] In step 23, the first hand detection result and the second hand detection result are subjected to redundancy filtering to obtain a hand detection box.

[0117] Considering that the first and second hand detection results can include multiple first detection results (i.e., multiple hands detected in the current video image), and these multiple hands may overlap, redundancy filtering is required. In one example, the electronic device can use a non-maximum suppression (NMS) method to filter the hand detection boxes in the first and second hand detection results, retaining the largest hand detection box among the overlapping hand detection boxes, thus obtaining the filtered hand detection boxes.
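A hedged sketch of this filtering, following the text in keeping the largest box among overlapping ones. Note that common NMS implementations rank boxes by detection confidence instead; ranking by area here, the intersection-over-union overlap test, and the 0.5 threshold are assumptions for the example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def area(box: Box) -> int:
    return box[2] * box[3]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0


def filter_redundant(boxes: List[Box], iou_threshold: float = 0.5) -> List[Box]:
    """Visit boxes from largest to smallest and keep a box only if it
    does not overlap an already kept box beyond the threshold."""
    kept: List[Box] = []
    for box in sorted(boxes, key=area, reverse=True):
        if all(iou(box, other) <= iou_threshold for other in kept):
            kept.append(box)
    return kept


boxes = [(1, 1, 9, 9), (0, 0, 10, 10), (50, 50, 5, 5)]
print(filter_redundant(boxes))
# prints [(0, 0, 10, 10), (50, 50, 5, 5)]
```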

[0118] It should be noted that, considering that the second hand detection result is the detection result of the hand in the image sub-block, the position of the hand detection box can be restored to the current video image according to the positional relationship between the current video image and the image sub-block, thereby ensuring the accuracy of the detection result.
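Restoring a box found in an enlarged sub-block to the coordinates of the current video image amounts to undoing the enlargement and then adding the sub-block's offset. A sketch under the assumption that boxes are `(x, y, width, height)` and that the sub-block was scaled independently along each axis (the function and parameter names are illustrative):

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def tile_box_to_frame(box: Box,
                      tile_offset: Tuple[int, int],
                      tile_size: Tuple[int, int],
                      enlarged_size: Tuple[int, int]) -> Box:
    """Map a box detected in an enlarged sub-block back to the frame."""
    x, y, w, h = box
    off_x, off_y = tile_offset
    scale_x = tile_size[0] / enlarged_size[0]  # undo horizontal enlargement
    scale_y = tile_size[1] / enlarged_size[1]  # undo vertical enlargement
    return (round(off_x + x * scale_x), round(off_y + y * scale_y),
            round(w * scale_x), round(h * scale_y))


# Centre sub-block of a 960x720 frame split 3x3, enlarged back to 960x720.
mapped = tile_box_to_frame((300, 150, 90, 60),
                           tile_offset=(320, 240),
                           tile_size=(320, 240),
                           enlarged_size=(960, 720))
print(mapped)
# prints (420, 290, 30, 20)
```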

[0119] In step 24, the area containing the hand detection box in the current video image is cropped to obtain at least one hand image containing a gesture.

[0120] In this step, after obtaining the hand detection bounding box of the hand in the current video image, the electronic device can crop the area where each hand detection bounding box is located in the current video image to obtain at least one hand image containing the gesture.
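Cropping by detection box can be sketched as simple slicing. The example assumes a grayscale image stored as nested lists and `(x, y, width, height)` boxes; clamping the box to the image bounds is an added safeguard, not something the disclosure specifies.

```python
from typing import List, Tuple

Image = List[List[int]]
Box = Tuple[int, int, int, int]  # (x, y, width, height)


def crop(image: Image, box: Box) -> Image:
    """Return the part of the image covered by the box, clamped to
    the image bounds."""
    x, y, w, h = box
    height, width = len(image), len(image[0])
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(width, x + w), min(height, y + h)
    return [row[x0:x1] for row in image[y0:y1]]


image = [[r * 4 + c for c in range(4)] for r in range(4)]
hand_images = [crop(image, box) for box in [(1, 1, 2, 2), (3, 3, 5, 5)]]
print(hand_images)
# prints [[[5, 6], [9, 10]], [[15]]]
```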

[0121] In another embodiment, an electronic device can detect a current video image to obtain at least one hand image containing a hand, including: acquiring a hand recognition model, wherein the input data of the hand recognition model is a video image, and the output data is edge coordinate data of a hand region in the video image; inputting the current video image into the hand recognition model to obtain the edge coordinate data of the hand region in the current video image; and cropping the current video image according to the edge coordinate data of the hand region to obtain at least one hand image containing a hand.

[0122] In this embodiment, the above-mentioned hand recognition model is trained through the following steps: obtaining a set of hand image training samples, the set of hand image training samples including multiple hand image training samples, each hand image training sample including at least one gesture; sequentially inputting each hand image training sample into the hand recognition model to obtain a hand recognition result; obtaining the loss value between the hand recognition result and the labeled data of the hand image training samples; and stopping the training of the hand recognition model in response to the loss value being less than or equal to a preset loss value threshold, thereby obtaining the hand recognition model.

[0123] In one example, the hand image training sample set includes a publicly available subset of image training samples and / or a customized subset of image training samples. Taking the customized subset as an example, the electronic device obtains the customized subset of image training samples through steps 31 to 35, as shown in Figure 3.

[0124] In step 31, a customized image acquisition interface is generated, which includes an image acquisition frame, a hand type annotation box, a shooting control, and a save control.

[0125] See Figure 4. The customized image acquisition interface 41 includes an image acquisition frame 42, a hand type annotation box 44, a shooting control 43, and a save control 45. The camera of the electronic device can capture a preview image of the user and display it within the image acquisition frame 42. In some possible examples, prompts such as "move closer," "move away," or "adjust hand direction" can be displayed inside or outside the image acquisition frame to obtain different preview images. When the preview image meets the shooting requirements, such as when the hand clarity exceeds a preset clarity threshold, the user can trigger the shooting control 43.

[0126] In step 32, in response to the detection that the shooting control is triggered, the captured customized image is displayed in the image acquisition frame.

[0127] In step 33, the annotation data entered in the hand type annotation box is obtained.

[0128] See Figure 5. The annotation data entered in the hand type annotation box is "OK gesture".

[0129] In step 34, in response to detecting that the save control is triggered, an initial customized image sample containing the customized image and the annotation data is generated.

[0130] In step 35, the initial customized image samples are subjected to preset processing to obtain multiple customized image samples, which serve as a subset of the customized image training samples. The preset processing includes at least one of the following: size transformation, angle rotation, mosaic processing, and filtering processing.

[0131] In this step, the electronic device can apply the preset processing to each initial customized image sample. For example, each initial customized image can be resized to obtain a images of different sizes; each of these a images can then be subjected to different degrees of mosaic processing to obtain a*a images of different blur levels; each of the a*a images can then be filtered to obtain a*a*a images with different filtering levels; finally, each of the a*a*a images can be rotated to obtain a*a*a*a images with different angles. Thus, counting the original image and the output of every stage, a maximum of (1+a+a*a+a*a*a+a*a*a*a) images can be obtained from each initial sample as a subset of customized image training samples. Of course, the number of samples in the subset of customized image training samples obtained by the preset processing can be selected according to the specific scenario, and the corresponding scheme falls within the protection scope of this disclosure.
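The sample count in this step is a short geometric sum. The helper below is a sketch for checking the arithmetic; the name `max_augmented_count` and the `stages` parameter are illustrative only.

```python
def max_augmented_count(a: int, stages: int = 4) -> int:
    """Maximum images obtained from one initial sample.

    Each of the `stages` processing stages multiplies the previous stage's
    output by `a`; the original image and every intermediate result are kept,
    giving 1 + a + a**2 + ... + a**stages.
    """
    return sum(a ** k for k in range(stages + 1))
```

With a = 3 and the four stages named above (resize, mosaic, filter, rotate), one initial sample yields at most 1 + 3 + 9 + 27 + 81 = 121 images.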

[0132] It should be noted that the aforementioned customized image training sample subset can be obtained during the initial training of the gesture recognition model, or it can be used to add training samples when the user adds more gesture types. The gesture recognition model can be further trained using the newly added customized image training sample subset, thereby ensuring that the gesture recognition model can recognize the various gestures required by the user and improving the applicability of the gesture recognition model.

[0133] The following example demonstrates how to obtain a gesture recognition model for a specific scenario:

[0134] I. Preprocessing of the Training Sample Set

[0135] 1. Obtain a training sample set of hand images.

[0136] 2. Resize every training sample in the hand image training sample set to a uniform size of 1280×720 pixels.

[0137] 3. The Perona-Malik equation is used to denoise the training samples. The Perona-Malik equation is:

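[0138] $$\frac{\partial I}{\partial t} = \operatorname{div}\big(g(\lVert \nabla I \rVert)\,\nabla I\big) \qquad (1)$$

[0139] In equation (1), I(x, y, t) is the image intensity at diffusion time t, ‖∇I‖ is the magnitude of the gradient, g is the edge-stopping function, and x and y are the horizontal and vertical coordinates of the training sample, respectively.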
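The Perona-Malik denoising of step 3 can be sketched with the usual explicit four-neighbor discretization. The exponential edge-stopping function, the parameter names `kappa` and `dt`, and their default values are assumptions; the disclosure does not state which edge-stopping function or step size is used.

```python
import math


def perona_malik(image, iterations=10, kappa=20.0, dt=0.2):
    """Anisotropic diffusion on a grayscale image given as a list of rows.

    Uses g(d) = exp(-(d / kappa)^2) as the edge-stopping function (an assumed
    choice). dt <= 0.25 keeps the explicit scheme stable.
    """
    h, w = len(image), len(image[0])
    img = [[float(v) for v in row] for row in image]

    def g(d):
        return math.exp(-((d / kappa) ** 2))

    for _ in range(iterations):
        nxt = [row[:] for row in img]
        for y in range(h):
            for x in range(w):
                c = img[y][x]
                flux = 0.0
                # Sum the weighted differences to the 4 neighbors inside the image.
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        d = img[ny][nx] - c
                        flux += g(abs(d)) * d
                nxt[y][x] = c + dt * flux
        img = nxt
    return img
```

Small differences (noise) diffuse almost freely because g is close to 1, while differences much larger than `kappa` are treated as edges and largely preserved.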

[0140] 4. Perform Poisson matting on multiple training samples containing hands in the hand image training sample set to obtain the weight alpha;

[0141] (41) Perform global image matting processing, as shown in equations (2) and (3).

[0142]

[0143]

[0144] In equation (2), I represents the hand region in the training sample, α represents the weight, F is the training sample image, B is the background image, and Δ is the gradient operator.

[0145] (42) Perform local image matting processing, as shown in equation (4).

[0146]

[0147] In equation (4), I represents the final matting result, and α represents the weight. The local region weights are obtained through iterative optimization, and Gaussian filtering is applied to optimize the edge information, as shown in Equation (5).

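[0148] $$G(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{x^{2}+y^{2}}{2\sigma^{2}}\right) \qquad (5)$$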

[0149] In equation (5), x and y are the horizontal and vertical coordinates, and σ is the standard deviation.
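A discrete Gaussian filter kernel built from x, y and σ might be generated as follows. This is a sketch: the function name and the choice to normalize by the kernel sum (rather than by the continuous constant 1/(2πσ²)) are assumptions, made so that filtering does not change overall brightness.

```python
import math


def gaussian_kernel(size: int, sigma: float):
    """Return a size x size Gaussian kernel whose entries sum to 1."""
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center pixel")
    half = size // 2
    kernel = [
        [math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]
```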

[0150] (43) Improve the weight value in the global matting based on the weight of the local matting to obtain the weight alpha of the final image.

[0151] 5. Normalize the training samples as shown in equation (6).

[0152] Input=(Input-mean) / std; (6)

[0153] In equation (6), Input represents the image data, mean represents the per-channel average value, which is [0.5, 0.5, 0.5], and std represents the per-channel standard deviation. In one example, std is [0.229, 0.224, 0.225].
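Equation (6) applied to a single RGB pixel could look like the sketch below. Scaling the 0-255 pixel values to [0, 1] before subtracting the mean is an assumption (it is the common convention for mean and std values of this magnitude) and is not stated in the text.

```python
def normalize_pixel(rgb, mean=(0.5, 0.5, 0.5), std=(0.229, 0.224, 0.225)):
    """Apply Input = (Input - mean) / std per channel to one RGB pixel.

    The pixel is first scaled from 0-255 to 0-1 (assumed convention).
    """
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))
```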

[0154] II. Model Training

[0155] 1. Select a lightweight hand recognition model (such as MobileNetV3, FBNetV3, etc.) to process the training samples.

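[0156] 2. Use h-swish as the activation function, as shown below:

[0157] $$\text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x+3)}{6} \qquad (7)$$

[0158] In equation (7), x is the output of the previous layer, and ReLU6 is the ReLU activation function clipped at 6, i.e., ReLU6(z) = min(max(z, 0), 6).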
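For reference, the activation can be written in a few lines. The sketch assumes the standard hard-swish definition used with MobileNetV3, x · ReLU6(x + 3) / 6.

```python
def relu6(z: float) -> float:
    """ReLU clipped to the range [0, 6]."""
    return min(max(z, 0.0), 6.0)


def h_swish(x: float) -> float:
    """Hard-swish activation: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0
```

The function is zero for x ≤ -3, equals x for x ≥ 3, and interpolates smoothly in between without needing an exponential.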

[0159] 3. Output the processing result, which is alpha.

[0160] 4. Calculate the mean squared error loss by combining the weights output by the model with the weights alpha initially obtained through image processing, as shown in Equation (8).

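[0161] $$L = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{\alpha}_i - \alpha_i\right)^{2} \qquad (8)$$

[0162] In equation (8), $\hat{\alpha}_i$ is the i-th weight output by the model, $\alpha_i$ is the corresponding weight obtained through image processing, and N is the number of weight values compared. The gradient of each weight in the model is computed from the calculated loss.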

[0163] 5. Update model weights using gradients.

[0164] 6. Output the optimal PTH network model.
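Steps 4 and 5 of the training procedure (mean squared error between the predicted alpha and the alpha from image processing, followed by a gradient-based update) can be sketched as follows. The flat list representation and the function names are illustrative assumptions; `mse_grad` is the gradient with respect to the predictions, which a framework would then back-propagate into the model weights.

```python
def mse_loss(pred, target):
    """Mean squared error between predicted and reference alpha values."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mse_grad(pred, target):
    """Gradient of the mean squared error with respect to each prediction."""
    n = len(pred)
    return [2.0 * (p - t) / n for p, t in zip(pred, target)]
```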

[0165] III. Model Conversion

[0166] 1. Convert the PTH model to an ONNX model;

[0167] 2. Convert the ONNX model to an OpenVINO model;

[0168] 3. Quantize the OpenVINO model.

[0169] IV. Model Deployment Process:

[0170] 1. Resize the input image or frame, for example, to 1920x1080 pixels;

[0171] 2. Denoise the image using the Perona-Malik equation;

[0172] 3. Normalize the input image or image frame;

[0173] 4. Input the normalized data into the OpenVINO model to obtain the weight alpha;

[0174] 5. Background blending is performed using weight alpha, and the blending method is as follows;

[0175] I=alpha×F+(1-alpha)×B; (9)

[0176] In equation (9), I represents the final matting result, i.e., the hand image; alpha represents the weight; F is the input image (foreground); and B is the background image.
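Equation (9) is a per-pixel linear blend. A minimal sketch on single-channel images stored as lists of rows (an assumed representation) is:

```python
def blend(alpha, foreground, background):
    """Compute I = alpha * F + (1 - alpha) * B for every pixel.

    All three arguments are lists of rows with identical dimensions;
    alpha values are expected to lie in [0, 1].
    """
    return [
        [a * f + (1.0 - a) * b for a, f, b in zip(row_a, row_f, row_b)]
        for row_a, row_f, row_b in zip(alpha, foreground, background)
    ]
```

Where alpha is 1 the foreground pixel is kept unchanged, where alpha is 0 the background shows through, and intermediate values mix the two along the hand's edge.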

[0177] In step 12, each hand image is identified to obtain a gesture classification.

[0178] In this step, the electronic device stores a gesture classification model for detecting gesture classification in the input image. This gesture classification model can be implemented using a lightweight network model with fast classification speed, such as the MobileNetV2 network model. Technicians can choose the appropriate model, and the corresponding solution falls within the protection scope of this disclosure. By employing a lightweight gesture classification model in this step, gesture classification of the hands in the current video image can be ensured during video display, guaranteeing classification efficiency.

[0179] In this step, the gesture classification model mentioned above is a pre-trained network model, and its training process may include:

[0180] (1) Construct separate training sample sets.

[0181] Select several images containing human hands. The hand gestures in each image are classified and numbered differently. The gesture classification can include, but is not limited to, OK type, thumbs up type (first), scissors type, number 1 type, number 6 type, etc. The gesture type can be selected according to the specific scenario. At this time, an initial classification sample set can be obtained.

[0182] It should be noted that the above gesture classifications are related to the interaction scenario. For example, in a control scenario, you can choose the number 1 to represent the first step, the scissor gesture to represent the second step, the OK gesture to represent confirmation, and the number 6 to represent returning to the previous step, etc.

[0183] Then, the images in the initial classification sample set can be augmented to increase the number of images in the set. For example, each image can be randomly flipped horizontally and / or vertically and the flipped image added to the initial classification sample set, or random perturbations (such as noise or Gaussian blur) can be applied to each image and the perturbed image added to the initial classification sample set. Finally, each image in the initial classification sample set is resized and its pixel data normalized to obtain the gesture training sample set.

[0184] (2) Select a gesture classification model.

[0185] Build a gesture classification model, choosing a lightweight network model with fast classification speed, such as the MobileNetV2 network model. In one example, the initial gesture classification model can be trained using transfer learning to improve training efficiency.

[0186] (3) Train the gesture classification model.

[0187] The gesture training sample set is divided into a training subset and a validation subset. Images from the training subset are sequentially input into the initial gesture classification model for training. Images from the validation subset are then input into the trained gesture classification model, and the classification accuracy is calculated. Training of the gesture classification model is considered complete when the classification accuracy exceeds a preset threshold (e.g., 98%, adjustable). Alternatively, training is considered complete when the number of training-subset images input into the model exceeds a preset threshold (e.g., tens of thousands, adjustable).
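The two completion criteria can be expressed as a small check. This is a sketch only: the function names are hypothetical, and the default `max_images=50_000` is merely one illustrative value for "tens of thousands".

```python
def accuracy(predicted, expected):
    """Fraction of validation images whose predicted gesture matches the label."""
    if not expected:
        raise ValueError("validation subset must not be empty")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def training_complete(val_accuracy, images_seen,
                      accuracy_threshold=0.98, max_images=50_000):
    """True when either stopping criterion is met (both use 'exceeds')."""
    return val_accuracy > accuracy_threshold or images_seen > max_images
```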

[0188] In this step, the electronic device can input various hand images into the gesture classification model for classification and detection, obtaining the gesture classification corresponding to each hand image. In one example, gestures can be classified into 8 categories: the number 1 type (first), palm type (palm), figure eight type (eight), OK type (ok), scissors type (scissors), and thumbs-up type (good), plus other gesture categories.

[0189] In step 13, the historical gesture data list is updated according to the gesture classification to obtain the target gesture data list, which includes gesture location, gesture classification, and gesture control.

[0190] In this step, the electronic device can update the historical gesture data list based on the gesture classification of each hand image to obtain the target gesture data list. The target gesture data list includes the hand detection bounding box position, the gesture classification, and the gesture control authority. The gesture control authority indicates which gesture is to be executed in the current video image.

[0191] In one example, the format of the gesture data list is shown in Table 1.

[0192] Table 1. Gesture Data List

[0193]

[0194]

[0195] In one example, see Figure 6. The electronic device can sort the hand images according to the size of each hand detection box and update the hand detection boxes and their gesture categories in the historical gesture data list. For example, a new row can be inserted after the last row of Table 1 to list the position of each hand detection box and the gesture category. Then, the electronic device can determine whether the control recorded for the previous video image in the historical gesture data list is empty. When the control is empty, the electronic device can assign the control in the historical gesture data list to the hand image with the largest hand detection box and lock the control, thereby obtaining the target gesture data list.

[0196] In another example, see Figure 6. The electronic device can sort the hand images according to the size of the hand detection boxes and update the position and gesture classification of the hand detection boxes in the historical gesture data list. For example, it can create a row of data for the current video image in Table 1. Then, when the gesture classification of the hand detection box with control in the current video image differs from its gesture classification for the previous video image in the historical gesture data list (i.e., the same hand has a different gesture classification in the current video image and the previous video image), the electronic device can release the control of the hand detection box and set the control to null, for example, by setting the value of the control to 0.
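The lock and release behavior of the two examples above can be sketched as a small update function. The record layout is hypothetical, since Table 1 is not reproduced here: the `Hand` and `FrameRecord` classes, the `hand_id` used to recognize the same hand across frames, and the use of `None` for an empty control are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hand:
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)
    gesture: str
    hand_id: int                    # hypothetical identity of the hand across frames


@dataclass
class FrameRecord:
    hands: List[Hand]
    control_id: Optional[int]       # None means the control is empty
    control_gesture: Optional[str]


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def update_gesture_list(history: List[FrameRecord], hands: List[Hand]) -> FrameRecord:
    """Append a row for the current frame and update who holds control."""
    hands = sorted(hands, key=lambda h: area(h.box), reverse=True)
    prev = history[-1] if history else FrameRecord([], None, None)
    control_id, control_gesture = prev.control_id, prev.control_gesture
    if control_id is None:
        # Control is empty: lock it to the largest hand detection box.
        if hands:
            control_id, control_gesture = hands[0].hand_id, hands[0].gesture
    else:
        holder = next((h for h in hands if h.hand_id == control_id), None)
        # Same hand, different gesture than in the previous frame: release control.
        if holder is not None and holder.gesture != control_gesture:
            control_id, control_gesture = None, None
    record = FrameRecord(hands, control_id, control_gesture)
    history.append(record)
    return record
```

In this sketch a controlling hand that is simply missing from the current frame keeps its control; deciding when to give up on it is left to the separate re-detection path.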

[0197] In yet another example, see Figure 6. When the control of the previous video image in the historical gesture data list is empty and no gesture category is detected, or when the control of the previous video image in the historical gesture data list is not empty and the duration for which the gesture with control is not detected exceeds a preset duration (e.g., the duration corresponding to 3-5 frames of video images), the electronic device can perform re-detection processing on the video image to obtain a third hand detection result.

[0198] In this example, the electronic device performs re-detection processing on the video image, including:

[0199] (1) Initialize the search window.

[0200] (2) Color projection.

[0201] (3) Iterative optimization to find the extreme value of the probability distribution to locate the target (hand);

[0202] (31) Select the search box W using the color probability distribution map obtained through color projection;

[0203] (32) Calculate the zeroth-order moment: $M_{00}=\sum_{x}\sum_{y} I(x, y)$;

[0204] (33) Calculate the first-order moments: $M_{10}=\sum_{x}\sum_{y} x\,I(x, y)$; $M_{01}=\sum_{x}\sum_{y} y\,I(x, y)$;

[0205] (34) Calculate the centroid of the search window: $x_c = M_{10}/M_{00}$; $y_c = M_{01}/M_{00}$;

[0206] (35) Adjust the size of the search window and obtain the size and center position of the search window.

[0207] (4) In the next frame image, recalculate the size and center position of the search window by returning to step (2).
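Steps (31) to (34) amount to a mean-shift iteration on the color probability map. The sketch below assumes the map is a list of rows of non-negative values and that the window is `(x, y, width, height)`; it keeps the window size fixed, so the size adjustment of step (35) is deliberately left out, and the function names are illustrative.

```python
def window_centroid(prob, window):
    """Return (x_c, y_c) = (M10 / M00, M01 / M00) inside the window, or None if M00 is 0."""
    x0, y0, w, h = window
    m00 = m10 = m01 = 0.0
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            p = prob[y][x]
            m00 += p
            m10 += x * p
            m01 += y * p
    if m00 == 0:
        return None
    return m10 / m00, m01 / m00


def mean_shift(prob, window, max_iter=20):
    """Move a fixed-size window until it is centered on the local centroid."""
    height, width = len(prob), len(prob[0])
    x0, y0, w, h = window
    for _ in range(max_iter):
        c = window_centroid(prob, (x0, y0, w, h))
        if c is None:
            break  # nothing of the target color inside the window
        # Re-center the window on the centroid, clamped to the image borders.
        nx = min(max(int(round(c[0] - (w - 1) / 2)), 0), width - w)
        ny = min(max(int(round(c[1] - (h - 1) / 2)), 0), height - h)
        if (nx, ny) == (x0, y0):
            break  # converged
        x0, y0 = nx, ny
    return (x0, y0, w, h)
```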

[0208] In this example, the electronic device can update the historical gesture data list based on the third hand detection result, including:

[0209] When the third hand detection result is empty, meaning no hand is detected in the current video image, the electronic device can release control of the hand detection box in the historical gesture data list. In one example, if no hand is detected in the video image of the video stream for a long time (exceeding a preset duration), it indicates that the hand may be occluded. In this case, control of the gesture detection box can be released to ensure the effectiveness of control.

[0210] When the third hand detection result is not empty, that is, when a hand is detected in the current video image, gesture classification is performed on the hand image corresponding to the current video image, and the gesture classification of the hand detection box in the historical gesture data list is updated accordingly. The control in the historical gesture data list is then updated as well; for example, the control is assigned to the hand image with the largest hand detection box, thereby transferring control.

[0211] It should be noted that in this embodiment, the target gesture data list can be stored in local memory and / or the cloud. For example, the electronic device can store the target gesture data list in local memory and directly read the target gesture data list. Alternatively, the electronic device can upload the target gesture data list to the cloud, which can reduce the occupation of local memory and reduce the cost of the electronic device. Or, the electronic device can upload part of the target gesture data list to the cloud while storing the rest in local memory, and then combine the data from the target gesture data list read from the cloud with the locally stored data to obtain the final target gesture data list.

[0212] In one embodiment, after identifying the hand with gesture control, the electronic device can display the gesture in a preview interface. As shown in Figure 7, the gesture type, such as "OK", can be displayed around the hand that has control, while the gesture type of the smaller "1"-shaped gesture is not displayed, thereby indicating to the user which hand has control.

[0213] In one embodiment, after determining the hand that controls the gesture, when the gesture indicates that the currently displayed image should be cut out, the electronic device can use the above-described method of obtaining the hand recognition model to obtain the target object in the image. The difference is that the hand recognition model only needs to be updated to recognize the target object, and the sample images in the training sample set contain the target object. The training process is the same as the training process of the hand recognition model described above, and will not be described again here.

[0214] Thus, the solution provided in this embodiment can detect the current video image to obtain at least one hand image containing a hand; then, recognize each hand image to obtain a gesture classification; subsequently, update the historical gesture data list according to the gesture classification to obtain a target gesture data list, which includes gesture position, gesture classification, and gesture control. In this way, this solution updates the historical gesture data list through gesture classification to track gesture control, avoiding misidentification or inaccurate control due to occlusion during the recognition process, thereby improving the accuracy of gesture recognition and enhancing the user experience.

[0215] The following describes a gesture recognition method provided in this disclosure with reference to an embodiment. See Figure 8; the method includes:

[0216] The camera of an electronic device can capture or record a video stream and use each frame of the video stream as the current video image.

[0217] The electronic device can input the current video image into the hand detection model to obtain a first hand detection result. When the first hand detection result is the second detection result, i.e., no hand was detected, the electronic device can segment the current video image into a preset number of image sub-blocks; then, after adjusting the size of each image sub-block, it can input each sub-block into the hand detection model to obtain a second hand detection result.
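One way to segment the image into a preset number of sub-blocks is a regular grid, sketched below. The grid layout, the `rows` and `cols` parameters, and the function name are assumptions; the text only says that the number of sub-blocks is preset.

```python
def split_into_blocks(width, height, rows, cols):
    """Return sub-block rectangles (x_min, y_min, x_max, y_max) covering the image.

    Integer division is arranged so the blocks tile the image exactly,
    even when width or height is not divisible by cols or rows.
    """
    blocks = []
    for r in range(rows):
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            y0, y1 = r * height // rows, (r + 1) * height // rows
            blocks.append((x0, y0, x1, y1))
    return blocks
```

The top-left corner of each rectangle is also the offset needed later to map detections from a sub-block back into the full image.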

[0218] The electronic device can perform redundancy filtering on the first and second hand detection results to obtain a hand detection bounding box. Then, based on the hand detection bounding box, the image is cropped from the current video image to obtain the hand image.
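The text does not say how the redundancy filtering is carried out. A common choice for merging overlapping detections is non-maximum suppression based on intersection over union (IoU), sketched here as one possible reading; the confidence scores, the 0.5 threshold, and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_redundant(detections, iou_threshold=0.5):
    """Keep the highest-scoring box of every group of heavily overlapping boxes.

    `detections` is a list of (box, score) pairs, e.g. the first and second
    hand detection results concatenated in full-image coordinates.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, score))
    return kept
```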

[0219] The electronic device can sequentially input each hand image into the gesture classification model to obtain the gesture classification corresponding to each hand image. Considering that there is a one-to-one correspondence between hand images and hand detection boxes, this can also be called gesture classification based on hand detection boxes.

[0220] Electronic devices can acquire a list of historical gesture data as shown in Table 1 and update the list of historical gesture data according to the gesture classification of the hand detection box.

[0221] When the control in the historical gesture data list is empty and the current video image recognizes a gesture type, the hand detection boxes can be sorted by size, and the position and gesture type of the hand detection boxes in the historical gesture data list can be updated; and the control can be updated to the largest hand detection box.

[0222] When the control in the historical gesture data list is not empty and the current video image recognizes a gesture type that is different from the gesture type of the same gesture detection box in the previous frame video image, the hand detection boxes can be sorted according to size, the position and gesture type of the hand detection boxes in the historical gesture data list can be updated, and the control can be released.

[0223] When the control in the historical gesture data list is not empty and the current video image does not recognize the gesture type, or when the control in the historical gesture data list is empty and the current video image does not recognize the gesture type, the current video image can be re-detected to obtain the third hand detection result.

[0224] When the third hand detection result is empty, meaning no hand is detected in the current video image, the electronic device can release the control of the gesture detection box in the historical gesture data list. In other words, if no hand is detected in the video image of the video stream for a long time (more than the preset time), it means that the hand may be occluded. At this time, the control of the most recent gesture detection box can be released to ensure the effectiveness of the control.

[0225] When the third hand detection result is not empty, that is, when a hand is detected in the current video image, the hand image corresponding to the current video image is classified into gestures, and the gesture classification of the hand detection box in the historical gesture data list is updated according to the gesture classification. Then, the historical gesture data list is updated according to the gesture classification.

[0226] Based on the gesture recognition method provided in this disclosure, this disclosure also provides a gesture recognition device. See Figure 9; the device includes:

[0227] The hand image acquisition module 91 is used to detect the current video image and obtain at least one hand image containing a hand;

[0228] The gesture classification acquisition module 92 is used to identify each hand image and obtain a gesture classification.

[0229] The gesture list acquisition module 93 is used to update the historical gesture data list according to the gesture classification to obtain the target gesture data list, wherein the target gesture data list includes gesture position, gesture classification and gesture control.

[0230] In one embodiment, the hand image acquisition module includes:

[0231] The first result detection module is used to perform hand detection on the current video image and obtain a first hand detection result; the first hand detection result includes a first detection result in which a hand is detected or a second detection result in which a hand is not detected;

[0232] The second result detection module is used to, in response to determining that the first hand detection result is the second detection result, divide the current video image into a preset number of image sub-blocks, and perform hand gesture detection on each image sub-block to obtain the second hand detection result;

[0233] The hand detection box acquisition module is used to perform redundancy filtering on the first hand detection result and the second hand detection result to obtain the hand detection box;

[0234] The hand image acquisition module is used to crop the area where the hand detection box is located in the current video image to obtain at least one hand image containing a gesture.

[0235] In one embodiment, the hand image acquisition module includes:

[0236] The recognition model acquisition submodule is used to acquire a hand recognition model. The input data of the hand recognition model is a video image, and the output data is the edge coordinate data of the hand region in the video image.

[0237] The edge coordinate acquisition submodule is used to input the current video image into the hand recognition model to obtain the edge coordinate data of the hand region in the current video image;

[0238] The hand image acquisition module is used to crop the current video image based on the edge coordinate data of the hand region to obtain at least one hand image containing the hand.

[0239] In one embodiment, the device further includes a model training module for training the hand recognition model, the model training module comprising:

[0240] The sample set acquisition submodule is used to acquire a hand image training sample set, which includes multiple hand image training samples, and each hand image training sample includes at least one gesture.

[0241] The recognition result acquisition submodule is used to sequentially input each hand image training sample into the hand recognition model to obtain the hand recognition result;

[0242] The loss value acquisition submodule is used to acquire the loss value between the hand recognition result and the labeled data of the hand image training samples;

[0243] The recognition model acquisition submodule is used to stop hand recognition model training in response to the loss value being less than or equal to a preset loss value threshold, thereby obtaining the hand recognition model.

[0244] In one embodiment, the hand image training sample set includes a publicly available subset of image training samples and / or a customized subset of image training samples, and the sample set acquisition submodule includes:

[0245] The image acquisition interface generation unit is used to generate a customized image acquisition interface, which includes an image acquisition frame, a hand type annotation frame, a shooting control, and a save control.

[0246] A customized image acquisition unit is used to display the captured customized image within the image acquisition frame in response to detecting that the shooting control is triggered;

[0247] The annotation data acquisition unit is used to acquire the annotation data input within the hand type annotation box;

[0248] An initial sample generation unit is configured to generate an initial sample of a customized image containing the customized image and the annotation data in response to detecting that the save control is triggered.

[0249] The sample subset acquisition unit is used to perform preset processing on the initial sample of the customized image to obtain multiple customized image samples, which are used as the training sample subset of the customized image.

[0250] The preset processing includes at least one of the following: size transformation, angle rotation, mosaic processing, and filtering processing.

[0251] In one embodiment, the gesture list acquisition module includes:

[0252] The history list update submodule is used to sort each hand image according to the size of the hand detection box in response to the gesture category not being empty, and update each hand detection box and its gesture category in the history gesture data list.

[0253] The gesture list acquisition submodule is used to update the control of the historical gesture data list to the hand image with the largest hand detection box size and lock the control, thereby obtaining the target gesture data list.

[0254] In one embodiment, the gesture list acquisition module includes:

[0255] The third result acquisition submodule is used to re-detect the video image to obtain the third hand detection result in response to the fact that the duration for which the gesture classification is empty and the gesture with control is not detected exceeds a preset duration.

[0256] The gesture list acquisition submodule is used to update the historical gesture data list based on the third hand detection result.

[0257] In one embodiment, the gesture list acquisition submodule includes:

[0258] The control release submodule is used to release control of the historical gesture data list in response to the third hand detection result being empty.

[0259] In one embodiment, the gesture list acquisition submodule includes:

[0260] The control update submodule is used to update the gesture category of the hand detection box in the historical gesture data list in response to the third hand detection result being non-empty; and update the control to the largest hand detection box.

[0261] In one embodiment, the gesture list acquisition module includes:

[0262] The control release submodule is used to update the gesture classification of the hand detection box in the historical gesture data list and release the control of the hand detection box in response to the fact that the gesture classification of the hand detection box with control in the current video image is different from the gesture classification of the previous frame video image in the historical gesture data list.

[0263] It should be noted that the apparatus shown in this embodiment matches the content of the method embodiment, and the content of the above method embodiment can be referred to, which will not be repeated here.

[0264] In some possible embodiments, an electronic device is provided, comprising:

[0265] Processor and memory;

[0266] The memory is used to store computer programs that can be executed by the processor;

[0267] The processor is configured to execute a computer program in the memory to implement the method described above.

[0268] In some possible embodiments, a non-transitory computer-readable storage medium is provided that, when an executable computer program in the storage medium is executed by a processor, enables the implementation of the methods described above.

[0269] The terminology used in this disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. Unless otherwise defined, the technical or scientific terms used in this disclosure should be understood in their ordinary sense by one of ordinary skill in the art to which this disclosure pertains. The words “a” or “one” and similar terms used in this disclosure and the claims do not indicate a limitation of quantity, but rather indicate the presence of at least one. “A plurality” means at least two. The words “comprising” or “including” and similar terms mean that the element or object preceding “comprising” or “including” covers the element or object listed following “comprising” or “including” and its equivalents, and does not exclude other elements or objects. The words “connected” or “linked” and similar terms are not limited to physical or mechanical connections and can include electrical connections, whether direct or indirect. The singular forms “a,” “an,” and “the” used in this disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.

[0270] For the method embodiments, since they basically correspond to the apparatus embodiments, the relevant parts can be referred to in the description of the apparatus embodiments. The method embodiments and apparatus embodiments complement each other.

[0271] The above description is merely a preferred embodiment of this disclosure and is not intended to limit this disclosure. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A gesture recognition method, characterized in that the method includes: detecting a current video image to obtain at least one hand image containing a hand, wherein the at least one hand image is determined by performing hand detection on the current video image and performing hand detection on image sub-blocks after at least one segmentation of the current video image, and the number of image sub-blocks after the first segmentation decreases; identifying each hand image to obtain a gesture classification; and updating a historical gesture data list based on the gesture classification to obtain a target gesture data list, wherein the target gesture data list includes gesture location, gesture classification, and gesture control authority, and the gesture control authority represents the gesture that needs to be executed in the current video image; wherein updating the historical gesture data list based on the gesture classification to obtain the target gesture data list includes: sorting the hand images according to the size of their hand detection boxes, and updating the hand detection boxes and their gesture classifications in the historical gesture data list; and in response to the control authority being empty, updating the control authority in the historical gesture data list to the hand image with the largest hand detection box size and locking the control authority to obtain the target gesture data list; or, in response to the gesture classification of the hand detection box that holds the control authority in the current video image being different from its gesture classification for the previous frame of video image in the historical gesture data list, releasing the control authority of that hand detection box.
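
The list update in claim 1 can be pictured with a short sketch. It is an illustration under stated assumptions, not the claimed implementation: `GestureEntry`, `update_gesture_list`, and the `match_iou` threshold are hypothetical names, and associating the controlling hand between frames by box overlap (IoU) is an assumption, since the claim does not say how boxes are matched across frames. A controlling hand that is not matched in the current frame is simply carried over; the timeout handling of claim 6 is not modelled here.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class GestureEntry:
    """One row of the gesture data list (hypothetical layout)."""
    box: Box
    gesture: str
    has_control: bool = False


def box_area(box: Box) -> int:
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def update_gesture_list(history: List[GestureEntry],
                        detections: List[Tuple[Box, str]],
                        match_iou: float = 0.3) -> List[GestureEntry]:
    # Sort the current detections by box size, largest first.
    ordered = sorted(detections, key=lambda d: box_area(d[0]), reverse=True)
    target = [GestureEntry(box, gesture) for box, gesture in ordered]
    owner = next((e for e in history if e.has_control), None)

    if owner is None:
        # Control is empty: the largest hand takes it and it is locked.
        if target:
            target[0].has_control = True
        return target

    # Assumption: the controlling hand is the current box that overlaps
    # the previous controlling box the most.
    matched = max(target, key=lambda e: iou(e.box, owner.box), default=None)
    if matched is None or iou(matched.box, owner.box) < match_iou:
        target.append(replace(owner))   # not seen this frame: carry over
    elif matched.gesture == owner.gesture:
        matched.has_control = True      # same classification: keep the lock
    # Otherwise the classification changed, so control is released.
    return target
```

With two hands in view, the larger box takes control on the first frame and keeps it while its classification is unchanged; a changed classification leaves every entry without control.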

2. The method according to claim 1, characterized in that detecting the current video image to obtain at least one hand image containing a hand includes: performing hand detection on the current video image to obtain a first hand detection result, wherein the first hand detection result includes a first detection result in which a hand is detected or a second detection result in which a hand is not detected; in response to determining that the first hand detection result is the second detection result, dividing the current video image into a preset number of image sub-blocks, and performing hand gesture detection on each image sub-block to obtain a second hand detection result; performing redundancy filtering on the first hand detection result and the second hand detection result to obtain the hand detection box; and cropping the area where the hand detection box is located in the current video image to obtain at least one hand image containing a gesture.
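
The fallback flow of claim 2 can be sketched as below. This is a minimal illustration, not the claimed implementation: the detector is passed in as a hypothetical callable `detect(region)` returning `(box, score)` pairs in region-local coordinates, the 2 x 2 grid stands in for the "preset number" of sub-blocks, and greedy non-maximum suppression is assumed as the redundancy filter because the claim does not name one.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]          # (x1, y1, x2, y2)
Detection = Tuple[Box, float]            # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def split_into_tiles(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Divide the frame into rows * cols non-overlapping sub-blocks."""
    return [(c * width // cols, r * height // rows,
             (c + 1) * width // cols, (r + 1) * height // rows)
            for r in range(rows) for c in range(cols)]


def nms(dets: List[Detection], iou_thr: float) -> List[Detection]:
    """Greedy non-maximum suppression (assumed redundancy filter)."""
    kept: List[Detection] = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < iou_thr for k in kept):
            kept.append((box, score))
    return kept


def detect_hands(width: int, height: int,
                 detect: Callable[[Box], List[Detection]],
                 rows: int = 2, cols: int = 2,
                 iou_thr: float = 0.5) -> List[Detection]:
    first = detect((0, 0, width, height))        # first hand detection result
    second: List[Detection] = []
    if not first:                                # no hand found in the full frame
        for tile in split_into_tiles(width, height, rows, cols):
            ox, oy = tile[0], tile[1]
            for (x1, y1, x2, y2), score in detect(tile):
                # Shift tile-local boxes back into full-frame coordinates.
                second.append(((x1 + ox, y1 + oy, x2 + ox, y2 + oy), score))
    return nms(first + second, iou_thr)


# Toy detector: a small hand that is only found in regions up to 100 px wide.
HAND: Box = (10, 10, 30, 30)


def fake_detect(region: Box) -> List[Detection]:
    x1, y1, x2, y2 = region
    if x2 - x1 > 100:
        return []
    if x1 <= HAND[0] and HAND[2] <= x2 and y1 <= HAND[1] and HAND[3] <= y2:
        return [((HAND[0] - x1, HAND[1] - y1, HAND[2] - x1, HAND[3] - y1), 0.9)]
    return []
```

Calling `detect_hands(200, 200, fake_detect)` misses the hand on the full frame, finds it in the top-left sub-block, and returns its box in full-frame coordinates, ready for cropping.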

3. The method according to claim 1, characterized in that detecting the current video image to obtain at least one hand image containing a hand includes: obtaining a hand recognition model, wherein the input data of the hand recognition model is a video image and the output data is edge coordinate data of the hand region in the video image; inputting the current video image into the hand recognition model to obtain the edge coordinate data of the hand region in the current video image; and cropping the current video image based on the edge coordinate data of the hand region to obtain at least one hand image containing the hand.

4. The method according to claim 3, characterized in that the hand recognition model is trained through the following steps: obtaining a hand image training sample set, wherein the hand image training sample set includes multiple hand image training samples and each hand image training sample includes at least one gesture; sequentially inputting each hand image training sample into the hand recognition model to obtain a hand recognition result; obtaining a loss value between the hand recognition result and the labeled data of the hand image training sample; and in response to the loss value being less than or equal to a preset loss value threshold, stopping the training of the hand recognition model to obtain the hand recognition model.
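
The stopping rule in claim 4 can be shown on a toy problem. The sketch below is only an analogy under stated assumptions: a one-weight model `y = w * x` with squared-error loss stands in for the hand recognition model, the loss is averaged over one pass through the samples (the claim does not say how the loss value is aggregated), and `max_epochs` is an added safeguard that the claim does not mention.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input, label); stand-in for (image, annotation)


def train_until_threshold(samples: List[Sample],
                          loss_threshold: float,
                          lr: float = 0.05,
                          max_epochs: int = 1000) -> Tuple[float, float, int]:
    """Train a toy model y = w * x and stop once the loss is small enough.

    Returns (weight, mean loss of the last pass, number of passes run).
    """
    w = 0.0
    mean_loss = float("inf")
    for epoch in range(1, max_epochs + 1):
        total = 0.0
        for x, y in samples:              # feed the samples in sequence
            error = w * x - y             # model output minus labeled value
            total += error * error        # squared-error loss for this sample
            w -= lr * 2.0 * error * x     # one gradient-descent step
        mean_loss = total / len(samples)
        if mean_loss <= loss_threshold:   # stopping rule: loss <= threshold
            return w, mean_loss, epoch
    return w, mean_loss, max_epochs


weight, final_loss, epochs = train_until_threshold(
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)], loss_threshold=1e-3)
```

Training ends as soon as the averaged loss falls to the threshold or below, so a looser threshold stops earlier at the cost of a less accurate model.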

5. The method according to claim 4, characterized in that the hand image training sample set includes a publicly available image training sample subset and/or a customized image training sample subset, and obtaining the customized image training sample subset of the hand image training sample set includes: generating a customized image acquisition interface, which includes an image acquisition frame, a hand type annotation box, a shooting control, and a save control; in response to detecting that the shooting control is triggered, displaying the captured customized image in the image acquisition frame; obtaining the annotation data entered in the hand type annotation box; in response to detecting that the save control is triggered, generating an initial customized image sample containing the customized image and the annotation data; and performing preset processing on the initial customized image sample to obtain multiple customized image samples, which are used as the customized image training sample subset, wherein the preset processing includes at least one of the following: size transformation, angle rotation, mosaic processing, and filtering processing.
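
The preset processing in claim 5 turns one captured sample into several. A minimal sketch follows, assuming a grayscale image stored as a nested list and a class label as the annotation; the function names are hypothetical. Only three of the listed operations are shown (size transformation by nearest-neighbour scaling, a 90-degree rotation, and a 3 x 3 mean filter); mosaic processing is left out because the claim does not specify its form.

```python
from typing import List, Tuple

Image = List[List[int]]           # grayscale pixels, row-major
Sample = Tuple[Image, str]        # (image, hand type annotation)


def resize_nearest(img: Image, new_h: int, new_w: int) -> Image:
    """Size transformation by nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[r * h // new_h][c * w // new_w] for c in range(new_w)]
            for r in range(new_h)]


def rotate90(img: Image) -> Image:
    """Angle rotation: 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def mean_filter(img: Image) -> Image:
    """Filtering: 3 x 3 mean filter, window clipped at the image border."""
    h, w = len(img), len(img[0])
    out: Image = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [img[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            row.append(sum(window) // len(window))
        out.append(row)
    return out


def augment(sample: Sample) -> List[Sample]:
    """Derive several customized samples from one initial sample."""
    img, label = sample
    h, w = len(img), len(img[0])
    variants = [resize_nearest(img, 2 * h, 2 * w), rotate90(img), mean_filter(img)]
    # A class label is unaffected by these transforms, so it is copied as-is.
    return [(variant, label) for variant in variants]
```

If the annotation were a bounding box rather than a class label, each transform would also have to be applied to the box coordinates.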

6. The method according to claim 1, characterized in that updating the historical gesture data list based on the gesture classification to obtain the target gesture data list includes: in response to the control right being empty and no gesture category being detected, or in response to the control right being non-empty and the gesture holding the control right having gone undetected for longer than a preset duration, re-detecting the video image to obtain a third hand detection result; and updating the historical gesture data list based on the third hand detection result.

7. The method according to claim 6, characterized in that updating the historical gesture data list based on the third hand detection result includes: in response to the third hand detection result being empty, releasing the control right of the hand detection box in the historical gesture data list.

8. The method according to claim 6, characterized in that updating the historical gesture data list based on the third hand detection result includes: in response to the third hand detection result being non-empty, performing gesture classification on the hand image, and updating the gesture classification of the hand detection box in the historical gesture data list.
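
Claims 6 to 8 together describe when to re-detect and what to do with the result. The sketch below illustrates that decision logic under stated assumptions: `Entry`, `should_redetect`, and `apply_third_result` are hypothetical names, the classifier is passed in as a callable, and list entries are matched to re-detected boxes by exact box equality purely to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]


@dataclass
class Entry:
    """One row of the historical gesture data list (hypothetical layout)."""
    box: Box
    gesture: Optional[str]
    has_control: bool = False


def should_redetect(control_empty: bool,
                    gesture_detected: bool,
                    seconds_since_owner_seen: float,
                    preset_duration: float) -> bool:
    """Trigger condition of claim 6."""
    if control_empty:
        # Nobody holds control and nothing was recognised.
        return not gesture_detected
    # Someone holds control but has been missing for too long.
    return seconds_since_owner_seen > preset_duration


def apply_third_result(history: List[Entry],
                       third_result: List[Box],
                       classify: Callable[[Box], str]) -> List[Entry]:
    if not third_result:
        # Claim 7: nothing found on re-detection, so control is released.
        for entry in history:
            entry.has_control = False
        return history
    # Claim 8: classify each re-detected hand and refresh the list.
    by_box = {entry.box: entry for entry in history}
    for box in third_result:
        gesture = classify(box)
        if box in by_box:
            by_box[box].gesture = gesture
        else:
            history.append(Entry(box, gesture))
    return history
```

Keeping the trigger condition separate from the list update makes it easy to tune the preset duration without touching how control is released.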

9. A gesture recognition apparatus, characterized in that the apparatus includes: a hand image acquisition module, used to detect a current video image and obtain at least one hand image containing a hand, wherein the at least one hand image is determined by performing hand detection on the current video image and performing hand detection on image sub-blocks after at least one segmentation of the current video image, and the number of image sub-blocks after the first segmentation decreases; a gesture classification acquisition module, used to identify each hand image and obtain a gesture classification; and a gesture list acquisition module, used to update a historical gesture data list according to the gesture classification to obtain a target gesture data list, wherein the target gesture data list includes gesture position, gesture classification, and gesture control right, and the gesture control right represents the gesture that needs to be executed in the current video image; wherein the gesture list acquisition module includes: a history list update submodule, used to sort the hand images according to the size of their hand detection boxes and update each hand detection box and its gesture category in the historical gesture data list; and a gesture list acquisition submodule, used to, in response to the gesture category being non-empty, update the control right in the historical gesture data list to the hand image with the largest hand detection box size and lock the control right, thereby obtaining the target gesture data list; or, the gesture list acquisition module includes: a gesture classification update submodule, used to sort the hand images according to the size of their hand detection boxes and update each hand detection box and its gesture classification in the historical gesture data list; and a control release submodule, used to release the control right of the hand detection box in response to the gesture classification of the hand detection box holding the control right in the current video image being different from its gesture classification for the previous frame of video image in the historical gesture data list.

10. The apparatus according to claim 9, characterized in that the hand image acquisition module includes: a first result detection module, used to perform hand detection on the current video image and obtain a first hand detection result, wherein the first hand detection result includes a first detection result in which a hand is detected or a second detection result in which a hand is not detected; a second result detection module, used to, in response to determining that the first hand detection result is the second detection result, divide the current video image into a preset number of image sub-blocks and perform hand gesture detection on each image sub-block to obtain a second hand detection result; a hand detection box acquisition module, used to perform redundancy filtering on the first hand detection result and the second hand detection result to obtain the hand detection box; and a hand image acquisition module, used to crop the area where the hand detection box is located in the current video image to obtain at least one hand image containing a gesture.

11. The apparatus according to claim 9, characterized in that the hand image acquisition module includes: a recognition model acquisition submodule, used to acquire a hand recognition model, wherein the input data of the hand recognition model is a video image and the output data is edge coordinate data of the hand region in the video image; an edge coordinate acquisition submodule, used to input the current video image into the hand recognition model to obtain the edge coordinate data of the hand region in the current video image; and a hand image acquisition module, used to crop the current video image based on the edge coordinate data of the hand region to obtain at least one hand image containing the hand.

12. The apparatus according to claim 11, characterized in that the apparatus further includes a model training module for training the hand recognition model, the model training module including: a sample set acquisition submodule, used to acquire a hand image training sample set, wherein the hand image training sample set includes multiple hand image training samples and each hand image training sample includes at least one gesture; a recognition result acquisition submodule, used to sequentially input each hand image training sample into the hand recognition model to obtain a hand recognition result; a loss value acquisition submodule, used to acquire a loss value between the hand recognition result and the labeled data of the hand image training sample; and a recognition model acquisition submodule, used to stop the training of the hand recognition model in response to the loss value being less than or equal to a preset loss value threshold, thereby obtaining the hand recognition model.

13. The apparatus according to claim 12, characterized in that the hand image training sample set includes a publicly available image training sample subset and/or a customized image training sample subset, and the sample set acquisition submodule includes: an image acquisition interface generation unit, used to generate a customized image acquisition interface, which includes an image acquisition frame, a hand type annotation box, a shooting control, and a save control; a customized image acquisition unit, used to display the captured customized image in the image acquisition frame in response to detecting that the shooting control is triggered; an annotation data acquisition unit, used to acquire the annotation data entered in the hand type annotation box; an initial sample generation unit, used to generate an initial customized image sample containing the customized image and the annotation data in response to detecting that the save control is triggered; and a sample subset acquisition unit, used to perform preset processing on the initial customized image sample to obtain multiple customized image samples, which are used as the customized image training sample subset, wherein the preset processing includes at least one of the following: size transformation, angle rotation, mosaic processing, and filtering processing.

14. The apparatus according to claim 9, characterized in that the gesture list acquisition module includes: a third result acquisition submodule, used to re-detect the video image to obtain a third hand detection result, in response to the control right being empty and no gesture category being detected, or in response to the control right being non-empty and the gesture holding the control right having gone undetected for longer than a preset duration; and a gesture list acquisition submodule, used to update the historical gesture data list based on the third hand detection result.

15. The apparatus according to claim 14, characterized in that the gesture list acquisition submodule includes: a control release submodule, used to release the control right of the hand detection box in the historical gesture data list in response to the third hand detection result being empty.

16. The apparatus according to claim 14, characterized in that the gesture list acquisition submodule includes: a control update submodule, used to perform gesture classification on the hand image and update the gesture classification of the hand detection box in the historical gesture data list in response to the third hand detection result being non-empty.

17. An electronic device, characterized by including: a processor and a memory; wherein the memory is used to store a computer program executable by the processor; and the processor is configured to execute the computer program in the memory to implement the method according to any one of claims 1 to 8.

18. A non-transitory computer-readable storage medium, characterized in that, when an executable computer program in the storage medium is executed by a processor, the method according to any one of claims 1 to 8 is implemented.