Method, apparatus and electronic device for determining an interaction gesture
By performing target component detection and local region gesture recognition on video frames in multi-user human-computer interaction scenarios, and combining the multi-frame detection results, the problems of accuracy and real-time performance of gesture recognition are solved, and efficient interactive gesture determination is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ARASHI VISION INC
- Filing Date
- 2022-03-08
- Publication Date
- 2026-06-12
Figure CN114816045B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of human-computer interaction technology, and in particular to a method, apparatus and electronic device for determining interactive gestures.
Background Technology
[0002] With the continuous development and popularization of intelligent, electronic, and interconnected products, many increasingly intelligent human-computer interaction methods have emerged, such as gesture control, to meet people's pursuit of personalization and fashion. Gesture control is a new type of human-computer interaction technology that uses a camera as an input device and leverages computer vision / image processing technology to recognize human gestures and translate them into control commands for the device. Gesture interaction overcomes the drawback of traditional interaction methods (mouse, keyboard, touchscreen, etc.), in which users must physically touch the input device and their freedom of movement is therefore limited, and thus improves the flexibility of interaction.
[0003] In practical applications, the accuracy and real-time performance of gesture recognition are crucial for enabling gesture interaction. If the device cannot respond to user gestures in a timely manner, or frequently misrecognizes gestures and executes incorrect responses, gesture interaction is severely impaired and the user experience is degraded.
Summary of the Invention
[0004] The main technical problem addressed by the embodiments of this application is to provide a method, apparatus, and electronic device for determining interactive gestures, which can accurately and in real time determine interactive gestures in multi-user human-computer interaction scenarios.
[0005] To address the aforementioned technical problems, in a first aspect, embodiments of this application provide a method for determining interactive gestures, comprising:
[0006] acquiring a current video frame from a real-time video stream;
[0007] performing target component detection on the current video frame to obtain bounding boxes of N target components in the current video frame;
[0008] expanding the bounding boxes of the N target components to obtain N gesture detection boxes;
[0009] selecting, according to preset rules, M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes;
[0010] performing hand component localization and gesture recognition on the M target gesture detection boxes corresponding to the current video frame to obtain a gesture detection result of the current video frame; and
[0011] determining the interactive gesture based on the gesture detection result of the current video frame and the gesture detection results of multiple historical video frames.
[0012] In some embodiments, the aforementioned selection of M target gesture detection boxes corresponding to the current video frame from N gesture detection boxes according to preset rules includes:
[0013] By backtracking through M target gesture detection boxes corresponding to multiple historical video frames, the M target gesture detection boxes corresponding to the current video frame are determined from N gesture detection boxes in a polling manner.
[0014] In some embodiments, the aforementioned process of tracing back M target gesture detection boxes corresponding to multiple historical video frames and determining the M target gesture detection boxes corresponding to the current video frame from N gesture detection boxes in a polling manner includes:
[0015] By tracing back the identifiers of the M target gesture detection boxes corresponding to multiple historical video frames, and polling the identifiers of the N gesture detection boxes, the M target gesture detection boxes corresponding to the current video frame are determined.
[0016] In some embodiments, the aforementioned process of tracing back M target gesture detection boxes corresponding to multiple historical video frames and determining the M target gesture detection boxes corresponding to the current video frame from N gesture detection boxes in a polling manner includes:
[0017] By tracing back the most recent detection time of the M target gesture detection boxes corresponding to multiple historical video frames, the M gesture detection boxes among the N gesture detection boxes whose most recent detection time lies furthest in the past (i.e., the least recently detected boxes) are determined as the M target gesture detection boxes corresponding to the current video frame.
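As an illustration only, the least-recently-detected selection described above could be sketched as follows. The function names, the use of a frame index as the "detection time", and the treatment of never-detected boxes as oldest are assumptions made for this sketch and are not stated in this application.

```python
from typing import Dict, Iterable, List


def select_least_recently_detected(
    box_ids: Iterable[int],
    last_detected: Dict[int, int],
    m: int,
) -> List[int]:
    """Return up to m box ids whose most recent detection lies furthest in the past.

    Assumption: a box that has never been detected counts as oldest (frame -1).
    """
    # sorted() is stable, so boxes with equal detection times keep their input order.
    ranked = sorted(box_ids, key=lambda b: last_detected.get(b, -1))
    return ranked[:m]


def mark_detected(
    last_detected: Dict[int, int], chosen: Iterable[int], frame_index: int
) -> None:
    """Record that the chosen boxes were processed in the given frame."""
    for b in chosen:
        last_detected[b] = frame_index
```

With five boxes numbered 1 to 5 and M = 2, three consecutive calls (each followed by `mark_detected`) select boxes 1 and 2, then 3 and 4, then 5 and 1, so every box is covered within three frames.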
[0018] In some embodiments, the aforementioned selection of M target gesture detection boxes corresponding to the current video frame from N gesture detection boxes according to preset rules includes:
[0019] Take the k target gesture detection boxes that detected the gesture in the previous video frame of the current video frame as the k target gesture detection boxes corresponding to the current video frame, where k≤M;
[0020] The M target gesture detection boxes corresponding to multiple historical video frames are traced back, and the remaining M - k target gesture detection boxes corresponding to the current video frame are determined by polling from the N gesture detection boxes excluding the k target gesture detection boxes.
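The combination of retained boxes and polled boxes can be pictured with the following minimal sketch. It assumes that boxes carry integer identifiers and that polling resumes just after the identifier polled last; `select_with_priority`, `prev_gesture_ids`, and `last_polled_id` are hypothetical names introduced here, not terms from this application.

```python
from typing import List, Sequence, Tuple


def select_with_priority(
    box_ids: Sequence[int],
    prev_gesture_ids: Sequence[int],
    last_polled_id: int,
    m: int,
) -> Tuple[List[int], int]:
    """Keep the k boxes that showed a gesture in the previous frame, then poll M - k others."""
    # The k boxes (k <= M) in which a gesture was detected in the previous frame.
    kept = [b for b in prev_gesture_ids if b in box_ids][:m]
    # Candidates for polling: all remaining boxes, in identifier order.
    rest = sorted(b for b in box_ids if b not in kept)
    # Rotate the candidates so that polling resumes after the last polled identifier.
    start = next((i for i, b in enumerate(rest) if b > last_polled_id), 0)
    rotated = rest[start:] + rest[:start]
    polled = rotated[: m - len(kept)]
    new_last = polled[-1] if polled else last_polled_id
    return kept + polled, new_last
```

For example, with boxes 1 to 5, M = 2, and a gesture detected in box 3 in the previous frame, successive calls return boxes (3, 1), (3, 2), (3, 4), and so on, so box 3 is re-checked every frame while the others are still visited in turn.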
[0021] In some embodiments, the aforementioned processing of hand component localization and gesture recognition for the M target gesture detection boxes corresponding to the current video frame to obtain the gesture detection result of the current video frame includes:
[0022] Perform hand component detection on the regions of the M target gesture detection boxes corresponding to the current video frame to obtain M hand component bounding boxes;
[0023] Obtain the overlap ratio between each of the M hand component bounding boxes and the bounding box of the corresponding target component;
[0024] Gesture recognition processing is performed on the bounding boxes of hand components with an overlap ratio less than or equal to the first threshold to obtain the gesture detection results of the current video frame.
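The overlap filtering above can be sketched as follows. The definition of the overlap ratio is not given in this passage; the sketch assumes it is the intersection area divided by the area of the hand component bounding box (intersection over union would be another plausible choice). Boxes are assumed to use the center format (x, y, w, h), and the function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), where (x, y) is the box center


def overlap_ratio(hand: Box, target: Box) -> float:
    """Assumed definition: intersection area divided by the hand box area."""
    hx, hy, hw, hh = hand
    tx, ty, tw, th = target
    left = max(hx - hw / 2, tx - tw / 2)
    right = min(hx + hw / 2, tx + tw / 2)
    top = max(hy - hh / 2, ty - th / 2)
    bottom = min(hy + hh / 2, ty + th / 2)
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    hand_area = hw * hh
    return inter / hand_area if hand_area > 0 else 0.0


def keep_for_recognition(
    pairs: List[Tuple[Box, Box]], first_threshold: float
) -> List[Box]:
    """Keep hand boxes whose overlap with their target box is at most the threshold."""
    return [
        hand for hand, target in pairs
        if overlap_ratio(hand, target) <= first_threshold
    ]
```

Hand boxes that overlap the target component heavily are skipped, so gesture recognition is only run on the boxes that pass the filter.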
[0025] In some embodiments, determining the interaction gesture based on the gesture detection results of the current video frame and the gesture detection results of multiple historical video frames includes:
[0026] If the same trigger gesture appears at the same or a nearby position in the gesture detection result of the current video frame and in the gesture detection results of multiple historical video frames, the trigger gesture is determined to be an interactive gesture.
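A minimal sketch of this multi-frame confirmation is given below. The parameters `min_hits` (how many historical detections must agree) and `max_dist` (how close, in pixels, counts as "nearby") are assumptions introduced for illustration; this passage does not specify how they are chosen.

```python
import math
from typing import List, Tuple

Detection = Tuple[str, float, float]  # (gesture label, center x, center y)


def is_interactive_gesture(
    current: Detection,
    history: List[Detection],
    min_hits: int,
    max_dist: float,
) -> bool:
    """Confirm the current trigger gesture if enough historical detections agree with it."""
    label, cx, cy = current
    hits = sum(
        1
        for (g, x, y) in history
        if g == label and math.hypot(x - cx, y - cy) <= max_dist
    )
    return hits >= min_hits
```

Requiring agreement across several frames filters out single-frame misrecognitions, at the cost of a short confirmation delay.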
[0027] To address the aforementioned technical problems, in a second aspect, this application provides an interaction method, including:
[0028] The interactive gestures are determined using the method described in the first aspect.
[0029] Control the target device to execute the operation command corresponding to the interactive gesture.
[0030] To address the aforementioned technical problems, in a third aspect, embodiments of this application provide a device for determining interactive gestures, comprising:
[0031] The acquisition module is used to acquire the current video frame in the real-time video stream;
[0032] The target component detection module is used to detect target components in the current video frame and obtain the bounding boxes of N target components in the current video frame;
[0033] The expansion processing module is used to expand the bounding boxes of N target components to obtain N gesture detection boxes.
[0034] The selection module is used to select the M target gesture detection boxes corresponding to the current video frame from N gesture detection boxes according to preset rules.
[0035] The recognition module is used to locate the hand parts and perform gesture recognition processing on the M target gesture detection boxes corresponding to the current video frame to obtain the gesture detection result of the current video frame.
[0036] The determination module is used to determine the interaction gesture based on the gesture detection results of the current video frame and the gesture detection results of multiple historical video frames.
[0037] To address the aforementioned technical problems, in a fourth aspect, this application provides an electronic device, comprising:
[0038] At least one processor, and
[0039] A memory that is communicatively connected to at least one processor, wherein,
[0040] The memory stores instructions that can be executed by at least one processor, such that the instructions are executed by at least one processor to enable the at least one processor to perform the method in the first aspect.
[0041] To address the aforementioned technical problems, in a fifth aspect, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method in the first aspect.
[0042] The beneficial effects of this application's embodiments are as follows: Unlike the prior art, the method for determining interactive gestures provided in this application's embodiments acquires the current video frame in a real-time video stream, performs target component detection on the current video frame to obtain bounding boxes of N target components in the current video frame, and expands these bounding boxes to obtain N gesture detection boxes. Then, according to preset rules, M target gesture detection boxes corresponding to the current video frame are selected from the N gesture detection boxes, and hand component localization and gesture recognition processing are performed on these M target gesture detection boxes to obtain the gesture detection result of the current video frame. Finally, based on the gesture detection result of the current video frame and the gesture detection results of multiple historical video frames, the interactive gesture is determined.
[0043] This method first detects target components (head, face, or head and shoulders), then performs gesture detection and recognition in local regions near the target components (around the bounding boxes). Compared with full-image gesture detection and recognition, this reduces the required computing power and the detection time and improves real-time performance. Furthermore, within a local region near a target component the hand component is relatively large, so its features are more prominent, which improves gesture detection accuracy. Additionally, of the N gesture detection boxes (representing multiple users) in the current video frame, only M target gesture detection boxes are selected for hand component localization and gesture recognition, further reducing detection time and improving real-time performance. On this basis, interactive gestures are determined using the gesture detection result of the current video frame and the gesture detection results of multiple historical video frames; that is, across multiple consecutive video frames, the gestures of only some users are recognized in each frame. On the one hand, compared with detecting all users in every frame, this reduces the required computing power and time consumption and improves real-time performance. On the other hand, combining detection results from multiple frames improves the accuracy and stability of the determined interactive gestures. Therefore, the above method can determine interactive gestures accurately and in real time in multi-user human-computer interaction scenarios.
Attached Figure Description
[0044] One or more embodiments are illustrated by way of example in the corresponding accompanying drawings. These illustrations do not constitute a limitation on the embodiments. Elements with the same reference numerals in the drawings denote similar elements. Unless otherwise stated, the figures in the drawings are not drawn to scale.
[0045] Figure 1 is a schematic diagram illustrating an application scenario of human-computer interaction achieved through gestures in some embodiments of this application;
[0046] Figure 2 is a flowchart illustrating the method for determining interactive gestures in some embodiments of this application;
[0047] Figure 3 is a schematic diagram of gestures in some embodiments of this application;
[0048] Figure 4 is a schematic diagram of the bounding boxes on a video frame in some embodiments of this application;
[0049] Figure 5 is a schematic diagram illustrating the determination of the target gesture detection boxes in some embodiments of this application;
[0050] Figure 6 is a schematic diagram of the structure of a device for determining interactive gestures in some embodiments of this application;
[0051] Figure 7 is a schematic diagram of the structure of an electronic device in some embodiments of this application.
Detailed Implementation
[0052] The present application will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present application, but do not limit the present application in any way. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present application. These all fall within the protection scope of the present application.
[0053] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0054] It should be noted that, unless there is a conflict, the various features in the embodiments of this application can be combined with each other, all of which are within the protection scope of this application. Furthermore, although functional modules are divided in the device schematic diagram and a logical order is shown in the flowchart, in some cases, the steps shown or described can be executed in a different order than the module division in the device or the order in the flowchart. In addition, the terms "first," "second," and "third" used herein do not limit the data or execution order, but only distinguish identical or similar items with essentially the same function and effect.
[0055] Unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the scope of this application. The term "and / or" as used in this specification includes any and all combinations of one or more of the associated listed items.
[0056] Furthermore, the technical features involved in the various embodiments of this application described below can be combined with each other as long as they do not conflict with each other.
[0057] Referring to Figure 1, which illustrates an application scenario for human-computer interaction via gestures, the scenario includes an electronic device 10 and at least one user. The electronic device 10 includes at least one camera 11, and the at least one user is within the field of view of the camera 11. The camera 11 captures images or video streams. When a user wants the electronic device 10 to activate a specific function, the user can make a gesture.
[0058] The electronic device 10 can be a terminal device with computing and processing capabilities, such as a camera-equipped gimbal, a television, an electronic photo frame, a game box, an unmanned aerial vehicle, a smart car, a smart camera, or a smart video conferencing machine.
[0059] Controlling the electronic device 10 can mean controlling functional components within the electronic device 10, which can be hardware or software modules. In one example, the electronic device 10 can be, but is not limited to, a smart camera. Controlling the smart camera can include, but is not limited to, controlling one or more functional components within the smart camera, such as a lens focus control module, a scene mode switching module, or a flash control module.
[0060] Specifically, taking a smart camera as an example of the electronic device 10, suppose there are three users in front of the smart camera. The smart camera acquires a video stream, and each video frame in the video stream includes these three users. The users can control the smart camera to execute corresponding commands through gestures, realizing human-computer interaction that meets their shooting needs. For example, different gestures can be used to turn focusing on or off, to change the scene mode, or to turn target tracking on or off.
[0061] For example, when user A makes an "open palm" gesture, the smart camera recognizes the gesture through the video stream and executes the command to "start tracking user A's head." When user A makes a "thumbs up" gesture, the smart camera recognizes the gesture through the video stream and executes the command to "start focusing." Thus, users can intelligently control the smart camera with gestures without manually adjusting it, obtaining satisfactory group photos or videos. As another example, for a freely rotating gimbal-mounted smart camera, whoever makes the "start tracking" gesture first will be the one the camera begins tracking.
[0062] Understandably, when there is only one user in front of the smart camera, the user can control the smart camera to take satisfactory photos or videos by gesture control, without the need for an additional photographer. It is simple, convenient, and suitable for use in multiple scenarios.
[0063] Taking a smart video conferencing machine that supports remote conferencing as another example of the electronic device 10, suppose there are 5 employees in front of the smart video conferencing machine, holding a remote meeting with a client. Any of the 5 employees can raise a hand to speak. If employee B raises a hand to speak, the smart video conferencing machine switches the meeting screen to the speaker, employee B, so that the client can watch employee B speak.
[0064] Accuracy and real-time performance of gesture recognition are crucial for enabling gesture interaction. If a device cannot respond to user gestures promptly, or frequently misrecognizes gestures and executes incorrect responses, it will severely impact gesture interaction and degrade the user experience.
[0065] In the technical solutions known to the inventors of this application, a gesture recognition model is typically used to perform full-image gesture detection and recognition on the current video frame to determine the trigger gesture used to trigger the electronic device to execute the corresponding command. However, in practical use, this solution has some drawbacks. For example, gestures are not easy to recognize and track due to factors such as lighting conditions, image quality, and complex background interference, and misrecognition is difficult to avoid in single-frame detection. Full-image gesture detection and recognition requires significant computing power, and it is difficult to achieve real-time recognition results on some devices with low computing power, thus failing to meet the accuracy and speed requirements of device interaction. In addition, it is difficult to handle human-computer interaction scenarios with multiple users.
[0066] In view of this, some embodiments of this application provide a method for determining interactive gestures. The method involves acquiring the current video frame from a real-time video stream, performing target component detection on the current video frame to obtain bounding boxes for N target components, and expanding each of these bounding boxes to obtain N gesture detection boxes. Then, according to a preset rule, M target gesture detection boxes corresponding to the current video frame are selected from the N gesture detection boxes. Hand component localization and gesture recognition processing are then performed on these M target gesture detection boxes to obtain the gesture detection result for the current video frame. Finally, based on the gesture detection result of the current video frame and the gesture detection results from multiple historical video frames, the interactive gesture is determined.
[0067] This method first detects target components (head, face, or head and shoulders), then performs gesture detection and recognition in local regions near the target components (around the bounding boxes). Compared with full-image gesture detection and recognition, this reduces the required computing power and the detection time and improves real-time performance. Furthermore, within a local region near a target component the hand component is relatively large, so its features are more prominent, which improves gesture detection accuracy. Additionally, of the N gesture detection boxes (representing multiple users) in the current video frame, only M target gesture detection boxes are selected for hand component localization and gesture recognition, further reducing detection time and improving real-time performance. On this basis, interactive gestures are determined using the gesture detection result of the current video frame and the gesture detection results of multiple historical video frames; that is, across multiple consecutive video frames, the gestures of only some users are recognized in each frame. On the one hand, compared with detecting all users in every frame, this reduces the required computing power and time consumption and improves real-time performance. On the other hand, combining detection results from multiple frames improves the accuracy and stability of the determined interactive gestures. Therefore, the above method can determine interactive gestures accurately and in real time in multi-user human-computer interaction scenarios.
[0068] The following provides a detailed description of the method for determining interactive gestures in some embodiments of this application. Referring to Figure 2, the method S100 includes, but is not limited to, the following steps:
[0069] S10: Get the current video frame in the real-time video stream.
[0070] The video stream can be captured by a camera in the above application scenario embodiments. It is understood that the video stream includes multiple temporally consecutive video frames. The video frame corresponding to the current moment in the video stream is the current video frame. As time progresses, the current video frame continuously changes.
[0071] It is understandable that there can be one or more users in front of the camera. Therefore, the video frame may include the gestures of one user or the gestures of multiple users.
[0072] Understandably, gestures can be static gestures. Figure 3 shows some static gestures, such as the palm gesture, the OK gesture, the thumbs-up gesture, the index finger gesture, and the fist gesture. These gestures are merely illustrative examples; it is understood that any shape or movement made by a user with their hand can be considered a gesture. No limitations are imposed on gestures in this application.
[0073] Understandably, a pre-set gesture library is in place. If user A's gesture belongs to this library and is stable, then it becomes an interactive gesture and triggers the electronic device to issue a corresponding command. If user A's gesture does not belong to the library, it will not trigger the electronic device to issue a command, meaning no interaction is possible.
[0074] S20: Detect target components in the current video frame and obtain the bounding boxes of N target components in the current video frame.
[0075] Because the hand component (i.e., the hand) is small relative to the human body, it usually occupies only a small area of a video frame and is difficult to track and recognize. To detect and recognize gestures more quickly and effectively, target component detection is performed on the current video frame first, yielding bounding boxes for N target components. It can be understood that N corresponds to the number of users in the current video frame.
[0076] The target component is a part of the human body, such as the head, the face, or the head and shoulders. These target components have obvious features and are easy to detect and identify. By identifying the target components, the pixel range in which the hand component may appear can be initially determined.
[0077] The target component detection process in this step can be performed by a pre-trained target component detection neural network. The current video frame is input into the trained target component detection neural network to perform target detection, and the position information of the target component can be obtained. The position information of the target component is represented by a bounding box.
[0078] Taking the head as an example, the current video frame is input into a trained head detection neural network, which outputs the position (i.e., the bounding box) of each head in the current video frame. Figure 4 shows the detection results on the current video frame. In this frame, each person's head is enclosed by a bounding box, which is a rectangle represented by the parameters (x, y, w, h), where (x, y) is the center of the bounding box, w is its width, and h is its height. These parameters are expressed in the image coordinate system of the current video frame. The bounding box reflects the position of the target component.
[0079] Understandably, if there is one user in the current video frame, the bounding box of one target component will be detected; if there are multiple users, the bounding boxes of multiple target components will be detected. The example in Figure 4 uses a video frame containing three users.
[0080] S30: Expand the bounding boxes of N target components to obtain N gesture detection boxes.
[0081] Based on human anatomy, it's known that the hand component is near the target component, such as below or above it. Therefore, searching for and identifying the hand component near the target component in a video frame, and thus recognizing the gesture, effectively reduces the detection and recognition of invalid regions. Invalid regions can be understood as areas where the hand component is unlikely to appear, such as background areas far from the target component, or the torso area far from the target component. Hand gesture detection in the region near the target component (the valid region) requires less computational power and has a fast calculation speed, which helps improve the real-time performance of the detection.
[0082] To obtain the effective region where the hand component is most likely to appear, the bounding boxes of each target component are expanded to obtain corresponding expanded bounding boxes. It can be understood that the pixel region of the video frame enclosed by this expanded bounding box is the pixel region where the hand component is most likely to appear.
[0083] It is understood that the expanded bounding box is obtained by expanding the bounding box. The specific expansion method can be set by those skilled in the art. For example, the four boundaries of the bounding box can be proportionally enlarged and expanded with the center (x, y) of the bounding box as the center. In some embodiments, the expansion can be performed outwards from the center (x, y) of the bounding box, and the resulting expanded bounding box can be circular or trapezoidal, etc., with no specific restrictions on its shape.
[0084] In some embodiments, when the target component is a human body, the bounding box (x, y, w, h) encloses the human body. Since the hand component does not appear below the waist when a person makes a gesture, this information can be used to construct the expanded bounding box. The expanded bounding box can take the upper half (h / 2) of the bounding box as a reference and extend it to the left and right by a certain distance. For example, the expanded bounding box can have a width of 3w and a height of 0.5h.
[0085] In some embodiments, when the target component is the head and shoulders, the bounding box (x, y, w, h) encloses the head and shoulders. To make a gesture, the forearm must be raised above the elbow. Based on this characteristic, the entire bounding box can be used as a reference, with the height increased by 0.5h upward and the width increased by w on each side, to construct an expanded bounding box with a height of 1.5h and a width of 3w.
[0086] In some embodiments, when the target component is the head / face region, the bounding box (x, y, w, h) encloses the head / face. Standard gestures are located on either side of the face and do not overlap with the face. Therefore, an expanded bounding box with a height of 4h and a width of 5w can be constructed by expanding the bounding box upward by h, downward by 2h, and to the left and right by 2w each.
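The three example expansions above can be written down as a short sketch. It assumes the center format (x, y, w, h) described earlier and an image coordinate system in which y increases downward (the usual convention, though not stated here); the function name `expand_box` and the `part` labels are hypothetical. Clipping the result to the image borders is left out.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), where (x, y) is the box center


def expand_box(box: Box, part: str) -> Box:
    """Return the expanded (gesture detection) box for one target component box."""
    x, y, w, h = box
    top = y - h / 2  # assumes y grows downward, so "top" is the smaller y value
    if part == "body":
        # Upper half of the body box, widened to 3w: height 0.5h, width 3w.
        new_w, new_h, new_top = 3 * w, 0.5 * h, top
    elif part == "head_shoulders":
        # 0.5h added upward and w added on each side: height 1.5h, width 3w.
        new_w, new_h, new_top = 3 * w, 1.5 * h, top - 0.5 * h
    elif part == "head":
        # h added upward, 2h downward, 2w on each side: height 4h, width 5w.
        new_w, new_h, new_top = 5 * w, 4 * h, top - h
    else:
        raise ValueError(f"unknown target component type: {part}")
    return (x, new_top + new_h / 2, new_w, new_h)
```

For a head box centered at (100, 100) with w = 20 and h = 40, the expanded box is centered at (100, 120) with width 100 and height 160; the center moves down because more room is added below the head than above it.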
[0087] S40: Select the M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes according to the preset rules.
[0088] Understandably, when multiple users interact with an electronic device, typically only one user performs a gesture during any given interaction. For example, if users A, B, and C are in front of a smart camera, the first interaction might be performed by user A and the second by user C. That is, among the aforementioned N gesture detection boxes, one might contain a trigger gesture, while the others contain either no gesture or a gesture that is not a trigger gesture. If no gesture interaction is taking place, none of the N gesture detection boxes contains a trigger gesture.
[0089] Before gesture detection, the algorithm cannot determine which of the N gesture detection boxes can detect gestures and which cannot. If gesture detection is performed on all N gesture detection boxes, the computational workload is large, resulting in excessive time consumption, slow response, and impact on real-time performance. Therefore, by setting preset rules, M target gesture detection boxes corresponding to the current video frame are selected from the N gesture detection boxes according to the preset rules. Hand component localization and gesture recognition processing are then performed on the M target gesture detection boxes, which can reduce detection time and improve real-time performance.
[0090] It is understandable that every video frame in a real-time video stream is, at some moment, the "current video frame". Therefore, for each video frame in the video stream, steps S20 to S40 above are used to select M target gesture detection boxes, i.e., gesture detection is performed selectively. Because the camera captures video frames at a high frequency, such as 30 frames per second, the frame interval is short, and so is the time covered by multiple consecutive video frames; for example, 10 consecutive video frames at 30 frames per second cover about 0.33 seconds, well under 1 second. In real-world scenarios, each user's gesture typically remains unchanged within this time. Since the M target gesture detection boxes corresponding to each video frame are not completely identical, all N gesture detection boxes can be covered across different video frames, which is equivalent to detecting all N gesture detection boxes within these 10 video frames. This not only reduces computing power consumption and improves detection speed and real-time performance, but also ensures detection accuracy and avoids missed detections.
[0091] In some embodiments, the aforementioned step S40 specifically includes:
[0092] S41: Backtrack the M target gesture detection boxes corresponding to multiple historical video frames, and determine the M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes in a polling manner.
[0093] To cover all N gesture detection boxes across consecutive video frames, a polling approach is used to select the N gesture detection boxes in turn for hand component localization and gesture recognition processing. Specifically, across these multiple historical video frames and the current video frame, M target gesture detection boxes are detected in each frame, such that the M target gesture detection boxes detected in each frame are not completely identical and together cover the N gesture detection boxes. By polling the N gesture detection boxes across multiple historical video frames and the current video frame, this method not only reduces computational power consumption and improves detection speed and real-time performance, but also maintains detection accuracy and avoids missed detections.
[0094] In some embodiments, the aforementioned step S41 specifically includes:
[0095] S411: Backtrack the identifiers of the M target gesture detection boxes corresponding to multiple historical video frames, and determine the M target gesture detection boxes corresponding to the current video frame by polling the identifiers of the N gesture detection boxes.
[0096] Here, the N gesture detection boxes are numbered. For example, if N is 5, there are 5 gesture detection boxes, which are numbered 1#, 2#, 3#, 4#, and 5# respectively. These 5 gesture detection boxes are selected in a round-robin fashion, with M target gesture detection boxes selected for gesture detection in each frame, for example, M is 2. To illustrate the round-robin method, if gesture detection boxes 1# and 2# are selected in frame t-2, and gesture detection boxes 3# and 4# are selected in frame t-1, then gesture detection boxes 5# and 1# are selected in frame t (the current video frame).
[0097] By numbering each gesture detection box and polling the number identifiers of N gesture detection boxes in a loop, the N gesture detection boxes are polled in groups of M, so that N gesture detection boxes can be detected and covered in consecutive video frames.
[0098] Understandably, as the number of users increases or decreases, the number N of gesture detection boxes changes dynamically. M can be a pre-set parameter. If N is greater than M, then M target gesture detection boxes are selected. If N is less than or equal to M, then N target gesture detection boxes are selected. That is to say, when N is less than or equal to M, all gesture detection boxes in the video frame are detected.
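The identifier-based polling described above can be sketched as follows. This is a minimal illustration only: the function name `select_round_robin` and the use of a single integer cursor to summarize the selections made in historical video frames are assumptions, not part of this application.

```python
from typing import List, Tuple


def select_round_robin(box_ids: List[int], m: int, cursor: int) -> Tuple[List[int], int]:
    """Select up to m box identifiers starting at `cursor`, wrapping around.

    Returns the identifiers selected for the current frame and the cursor
    to use for the next frame.
    """
    n = len(box_ids)
    if n == 0:
        return [], 0
    if n <= m:
        # N <= M: all gesture detection boxes are detected in this frame.
        return list(box_ids), 0
    selected = [box_ids[(cursor + i) % n] for i in range(m)]
    return selected, (cursor + m) % n


# Example from the text: N = 5 (boxes 1# to 5#), M = 2.
ids = [1, 2, 3, 4, 5]
cursor = 0
frame_t_minus_2, cursor = select_round_robin(ids, 2, cursor)
frame_t_minus_1, cursor = select_round_robin(ids, 2, cursor)
frame_t, cursor = select_round_robin(ids, 2, cursor)
print(frame_t_minus_2, frame_t_minus_1, frame_t)  # [1, 2] [3, 4] [5, 1]
```

Because N can change as users enter or leave the scene, a practical implementation would likely also need to re-validate the cursor whenever the list of identifiers changes; that handling is omitted here.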
[0099] In some embodiments, the aforementioned step S41 specifically includes:
[0100] S412: Backtrack the most recent detection time of the M target gesture detection boxes corresponding to multiple historical video frames, and determine the M gesture detection boxes whose most recent detection time is earliest among the N gesture detection boxes as the M target gesture detection boxes corresponding to the current video frame.
[0101] Here, the "most recent detection time" of each gesture detection box is recorded and updated. When selecting M target gesture detection boxes, gesture detection boxes that have never been detected are selected first, followed by the remaining boxes in order from the earliest "most recent detection time" to the latest, until M boxes have been selected. For example, suppose there are 5 gesture detection boxes (N=5) and 2 polling slots (M=2), where gesture detection box 5# has not yet been detected, gesture detection boxes 3# and 4# were detected in frame t-1, and gesture detection boxes 1# and 2# were detected in frame t-2. Then in frame t (the current video frame), gesture detection box 5# is selected first, and one of gesture detection boxes 1# and 2# is selected at random, say 2#. Consequently, when frame t+1 is processed, gesture detection boxes 2# and 5# were last detected in frame t, gesture detection boxes 3# and 4# in frame t-1, and gesture detection box 1# in frame t-2. The gesture detection boxes with the earliest most recent detection time are therefore 1# and 3#, or 1# and 4#, so the target gesture detection boxes for frame t+1 are 1# and 3#, or 1# and 4#.
[0102] In this embodiment, by updating the "most recent detection time" of each gesture detection box and polling the N gesture detection boxes according to it, the gesture detection boxes are selected in order from the earliest "most recent detection time" to the latest. The selection is carried out in a rolling polling manner, so that the target gesture detection boxes of consecutive video frames can cover the N gesture detection boxes.
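A minimal sketch of selection by most recent detection time is given below. It assumes that the detection time is stored as a frame index, with `None` for a box that has never been detected, and it breaks ties by the lowest identifier rather than at random as in the example above; the names `select_least_recent` and `mark_detected` are hypothetical.

```python
from typing import Dict, List, Optional


def select_least_recent(last_detected: Dict[int, Optional[int]], m: int) -> List[int]:
    """Return up to m box identifiers: never-detected boxes first,
    then boxes ordered from the earliest last-detection frame to the latest."""
    def sort_key(box_id: int):
        frame = last_detected[box_id]
        # Never-detected boxes sort before all others; ties break on identifier.
        return (0, 0, box_id) if frame is None else (1, frame, box_id)

    return sorted(last_detected, key=sort_key)[:m]


def mark_detected(last_detected: Dict[int, Optional[int]],
                  selected: List[int], frame_index: int) -> None:
    """Record that the selected boxes were detected in frame_index."""
    for box_id in selected:
        last_detected[box_id] = frame_index


# Frame t = 10: boxes 1# and 2# last detected at t-2, 3# and 4# at t-1,
# and 5# never detected.
state = {1: 8, 2: 8, 3: 9, 4: 9, 5: None}
targets_t = select_least_recent(state, 2)
print(targets_t)  # [5, 1]  (deterministic tie-break picks 1# instead of 2#)
mark_detected(state, targets_t, 10)

targets_t_plus_1 = select_least_recent(state, 2)
print(targets_t_plus_1)  # [2, 3]
```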
[0103] In some embodiments, the aforementioned step S40 specifically includes:
[0104] S42: Take the k target gesture detection boxes that detected gestures in the previous video frame of the current video frame as the k target gesture detection boxes corresponding to the current video frame, where k≤M.
[0105] S43: Backtrack the M target gesture detection boxes corresponding to multiple historical video frames, and determine, in a polling manner, the M-k target gesture detection boxes corresponding to the current video frame from the gesture detection boxes other than the k target gesture detection boxes among the N gesture detection boxes.
[0106] It is understandable that cameras capture video frames at a high frequency, such as 30 frames per second, with short frame intervals. Therefore, the time covered by multiple consecutive video frames is also short; for example, 10 consecutive video frames cover less than 1 second. In real-world scenarios, the gestures of different users typically remain unchanged within this time frame, meaning that the user who triggers the gesture within a short period is the same user. It is therefore assumed that a region (gesture detection box) in which a gesture was detected in the previous video frame has a high probability of containing a gesture again, so it can be prioritized. Specifically, k of the M target gesture detection box slots are allocated to lock onto the gesture detection boxes that detected a gesture in the previous video frame. The remaining M-k slots are filled from the N gesture detection boxes, excluding the k target gesture detection boxes, using the aforementioned polling method. Here, the polling method can be the identifier-based polling in step S411 above, or the polling by most recent detection time in step S412 above.
[0107] If no gesture was detected in the previous video frame, there is no need to allocate the k locking slots. All M target gesture detection boxes can be selected by polling according to the identifiers as in step S411, or by polling according to the most recent detection time as in step S412.
[0108] If gestures are detected in more than k gesture detection boxes in the previous video frame, k of them can be selected at random, or the k with the highest detection confidence or the largest area can be selected, as the k target gesture detection boxes for the current video frame. The remaining M-k slots are filled from the N gesture detection boxes, excluding the k target gesture detection boxes, using the polling method described above.
[0109] As shown in Figure 5, with M=2 and k=1, if no gesture is detected in the first and second video frames, then the first, second, and third video frames each use polling to determine their M target gesture detection boxes. If a gesture is detected in gesture detection box 1# in the third video frame, then one target gesture detection box of the fourth video frame is locked to gesture detection box 1#, and polling continues to gesture detection box 2#, so gesture detection boxes 1# and 2# are used as the target gesture detection boxes for the fourth video frame. If a gesture is again detected in gesture detection box 1# in the fourth video frame, then one target gesture detection box of the fifth video frame is locked to gesture detection box 1#, and polling continues to gesture detection box 3#, so gesture detection boxes 1# and 3# are used as the target gesture detection boxes for the fifth video frame.
[0110] In this embodiment, the target gesture detection box of the current video frame is determined by the above-mentioned locking and polling method. This can not only prioritize the processing of gesture detection boxes with a high probability of detecting gestures, thereby improving the detection speed and real-time performance, but also cover other gesture detection boxes to avoid omissions and false detections, thus improving the detection accuracy.
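The combination of locking and polling can be sketched as follows, using identifier-based polling for the remaining M-k slots. This is an illustrative sketch under stated assumptions: N=5 is assumed for the walk-through (the text fixes only M=2 and k=1), when more than k boxes detected a gesture the first k in identifier order are locked (one of several options mentioned above), and the name `select_lock_and_poll` is hypothetical.

```python
from typing import List, Set, Tuple


def select_lock_and_poll(box_ids: List[int], m: int, k: int,
                         prev_hits: Set[int], cursor: int) -> Tuple[List[int], int]:
    """Lock up to k boxes that detected a gesture in the previous frame,
    then fill the remaining slots by polling the other boxes in order."""
    n = len(box_ids)
    if n <= m:
        return list(box_ids), cursor  # N <= M: detect every box
    # Locked slots: boxes with a gesture in the previous frame (at most k).
    selected = [b for b in box_ids if b in prev_hits][:k]
    steps = 0
    while len(selected) < m and steps < n:
        candidate = box_ids[cursor % n]
        cursor = (cursor + 1) % n
        steps += 1
        if candidate not in selected:  # skip boxes that are already locked
            selected.append(candidate)
    return selected, cursor


# Walk-through with N = 5, M = 2, k = 1: a gesture is detected in box 1#
# in the third and fourth frames, and nowhere in the first two frames.
ids, cursor = [1, 2, 3, 4, 5], 0
hits_per_frame = [set(), set(), {1}, {1}]  # detections in frames 1 to 4
prev_hits: Set[int] = set()
schedule = []
for frame_index in range(5):
    targets, cursor = select_lock_and_poll(ids, 2, 1, prev_hits, cursor)
    schedule.append(targets)
    prev_hits = hits_per_frame[frame_index] if frame_index < 4 else set()
print(schedule)  # [[1, 2], [3, 4], [5, 1], [1, 2], [1, 3]]
```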
[0111] S50: Perform hand component localization and gesture recognition processing on the M target gesture detection boxes corresponding to the current video frame to obtain the gesture detection result of the current video frame.
[0112] A pre-trained gesture detection model can be used to locate and classify gestures within the pixel regions of M target gesture detection boxes in the current video frame, yielding the gesture detection result for the current video frame. Here, the gesture detection result includes the position of the hand component within the M target gesture detection boxes and the corresponding gesture category. It is understood that the gesture detection model can be trained using a convolutional neural network. The application and use of convolutional neural networks are standard techniques in the algorithm field and will not be described in detail here.
[0113] In some embodiments, the aforementioned step S50 specifically includes:
[0114] S51: Perform hand component detection on the regions of the M target gesture detection boxes corresponding to the current video frame to obtain M hand component bounding boxes.
[0115] After obtaining M target gesture detection boxes for the current video frame, hand component detection is performed on the regions within these M target gesture detection boxes to obtain M hand component bounding boxes. Here, hand component detection involves identifying which pixels within the M target gesture detection boxes on the video frame are hand components; that is, detecting and locating each hand component within the M target gesture detection boxes, with the hand component bounding box enclosing the pixels of the hand component. The hand component bounding box can also be represented using center coordinates and width and height (x, y, w, h), meaning the hand component bounding box represents the position of the hand component.
[0116] S52: Obtain the overlap ratio between each of the M hand component bounding boxes and the bounding box of its corresponding target component.
[0117] It is understandable that the target component (such as the user's face) corresponds to the hand component, and therefore the bounding box of the target component also corresponds to the bounding box of the hand component. Accordingly, the overlap ratio between the bounding box of each hand component and the bounding box of the corresponding target component is obtained.
[0118] Since the hand component is similar in color to the face, it is difficult to obtain a correct gesture classification result for a hand component that overlaps the face. Therefore, a hand component whose overlap ratio exceeds the first threshold is discarded and no gesture recognition is performed on it, which avoids obtaining incorrect gestures.
[0119] S53: Perform gesture recognition processing on the hand component bounding boxes with an overlap ratio less than or equal to the first threshold to obtain the gesture detection result of the current video frame.
[0120] Hand component bounding boxes with an overlap ratio less than or equal to the first threshold are selected. Gesture recognition processing is then performed on the hand components within these bounding boxes to obtain the corresponding gestures. It can be understood that the hand component bounding boxes with an overlap ratio less than or equal to the first threshold, together with their corresponding gestures, constitute the gesture detection result for the current video frame.
[0121] In this embodiment, by selecting only hand component bounding boxes with an overlap ratio less than or equal to the first threshold for gesture recognition, the accuracy of the gesture detection results can be improved.
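The overlap filtering described above can be illustrated with the following sketch. The definition of the overlap ratio used here (intersection area divided by the area of the hand component bounding box) is an assumption made for illustration; other definitions, such as intersection over union, could equally be used. Boxes are given in the center format (x, y, w, h), and the function names and the threshold value 0.3 are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (center x, center y, width, height)


def to_corners(box: Box) -> Tuple[float, float, float, float]:
    """Convert a center-format box to (x1, y1, x2, y2) corner format."""
    x, y, w, h = box
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2


def overlap_ratio(hand_box: Box, target_box: Box) -> float:
    """Assumed definition: intersection area / hand box area."""
    hx1, hy1, hx2, hy2 = to_corners(hand_box)
    tx1, ty1, tx2, ty2 = to_corners(target_box)
    inter_w = max(0.0, min(hx2, tx2) - max(hx1, tx1))
    inter_h = max(0.0, min(hy2, ty2) - max(hy1, ty1))
    hand_area = hand_box[2] * hand_box[3]
    return (inter_w * inter_h) / hand_area if hand_area > 0 else 0.0


def keep_for_recognition(pairs: List[Tuple[Box, Box]], first_threshold: float) -> List[Box]:
    """Keep hand boxes whose overlap with their target box is <= first_threshold."""
    return [hand for hand, target in pairs
            if overlap_ratio(hand, target) <= first_threshold]


face = (100.0, 100.0, 40.0, 40.0)
hand_beside_face = (200.0, 100.0, 30.0, 30.0)  # no overlap with the face
hand_on_face = (110.0, 100.0, 20.0, 20.0)      # lies entirely inside the face box
kept = keep_for_recognition([(hand_beside_face, face), (hand_on_face, face)], 0.3)
print(kept)  # [(200.0, 100.0, 30.0, 30.0)]
```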
[0122] S60: Determine the interaction gesture based on the gesture detection results of the current video frame and the gesture detection results of multiple historical video frames.
[0123] Here, the interaction gesture is determined based on the gesture detection results of the current video frame and the gesture detection results of multiple historical video frames. This is equivalent to selectively recognizing the user's gestures in each frame of multiple consecutive historical video frames to determine the interaction gesture.
[0124] Every frame of a real-time video stream is, at the moment it is processed, the "current video frame". Therefore, for each of the aforementioned multiple historical video frames, the method described in S20 to S50 was likewise used to select M target gesture detection boxes for gesture detection. Because the camera captures video frames at a high frequency (e.g., 30 frames per second), the frame interval is short, and so is the time covered by the current video frame and the multiple historical video frames; for example, 10 consecutive video frames cover less than 1 second. In real-world scenarios, each user's gesture typically remains constant within this time frame. Since the M target gesture detection boxes corresponding to each video frame are not entirely identical, the N gesture detection boxes can all be covered across different video frames; that is, all N gesture detection boxes can be detected within these 10 video frames. This not only reduces computational power consumption and improves detection speed and real-time performance, but also maintains detection accuracy and avoids missed detections.
[0125] In other words, compared to performing comprehensive detection on each user in every frame, this reduces the amount of computation, reduces time consumption, and improves real-time performance. Furthermore, combining the detection results of multiple frames to determine the interaction gesture can improve the accuracy and stability of the interaction gesture.
[0126] In some embodiments, the aforementioned step S60 specifically includes:
[0127] S61: If the same trigger gesture appears near the same location in the gesture detection result of the current video frame and in the gesture detection results of multiple historical video frames, then the trigger gesture is determined to be an interactive gesture.
[0128] Here, multiple historical video frames can be 3 or 4 video frames preceding the current video frame, and the number of historical video frames is not limited. It is understood that each of the multiple historical video frames corresponds to a gesture detection result. If the same trigger gesture appears near the same location in the current video frame and multiple historical video frames, it indicates that the trigger gesture is a gesture the user wants to display for control, and it can be determined that the trigger gesture is an interactive gesture.
[0129] For example, if a trigger gesture is detected near the same location in three consecutive frames, such as the trigger gesture category being "OK" gesture, then the trigger gesture is considered to have been successfully identified and is determined to be an interactive gesture.
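The multi-frame confirmation can be sketched as follows. The sketch assumes that each per-frame gesture detection result is a list of (gesture category, hand center) pairs and that "near the same location" means the hand centers lie within a fixed pixel distance; the name `confirm_interaction` and the default values of three frames and 30 pixels are illustrative assumptions.

```python
from math import hypot
from typing import List, Optional, Tuple

Detection = Tuple[str, Tuple[float, float]]  # (gesture category, hand center)


def confirm_interaction(history: List[List[Detection]],
                        required_frames: int = 3,
                        max_distance: float = 30.0) -> Optional[str]:
    """Return a gesture category if it appears near the same location in each
    of the last `required_frames` per-frame results (oldest first), else None."""
    if len(history) < required_frames:
        return None
    recent = history[-required_frames:]
    earlier, current = recent[:-1], recent[-1]
    for label, (cx, cy) in current:
        # The same category must appear close to (cx, cy) in every earlier frame.
        if all(any(other == label and hypot(x - cx, y - cy) <= max_distance
                   for other, (x, y) in frame)
               for frame in earlier):
            return label
    return None


stable = [[("OK", (100.0, 100.0))], [("OK", (104.0, 98.0))], [("OK", (101.0, 103.0))]]
print(confirm_interaction(stable))  # OK

interrupted = [[("OK", (100.0, 100.0))], [], [("OK", (101.0, 103.0))]]
print(confirm_interaction(interrupted))  # None
```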
[0130] In summary, the method for determining interactive gestures provided in this application involves acquiring the current video frame in a real-time video stream, performing target component detection on the current video frame to obtain bounding boxes of N target components in the current video frame, and expanding these bounding boxes to obtain N gesture detection boxes. Then, according to preset rules, M target gesture detection boxes corresponding to the current video frame are selected from the N gesture detection boxes, and hand component localization and gesture recognition processing are performed on these M target gesture detection boxes to obtain the gesture detection result of the current video frame. Finally, based on the gesture detection result of the current video frame and the gesture detection results of multiple historical video frames, the interactive gesture is determined. In this method, target component (head, face, or head plus shoulders) detection is performed first, and then gesture detection and recognition are performed in the local region near the target component (around the bounding box). Compared with full-image gesture detection and recognition, this reduces the amount of computation, decreases detection time, and improves real-time performance. Furthermore, the feature granularity of the hand component is relatively large in the local region near the target component, making the hand component features more prominent and improving the accuracy of gesture detection. In addition, for N gesture detection boxes (representing multiple users) in the current video frame, selecting M target gesture detection boxes for hand component localization and gesture recognition can reduce detection time and improve real-time performance. Based on this, interactive gestures are determined according to the gesture detection results of the current video frame and the gesture detection results of multiple historical video frames. That is, gesture recognition is selectively performed on users in each frame across multiple consecutive video frames to determine interactive gestures. On the one hand, compared to comprehensive detection of each user in every frame, this reduces the amount of computation, decreases time consumption, and improves real-time performance. On the other hand, combining the detection results of multiple frames to determine interactive gestures improves the accuracy and stability of interactive gestures. Therefore, using the above method, interactive gestures can be determined accurately and in real time in multi-user human-computer interaction scenarios.
[0131] This application also provides an interaction method, which includes:
[0132] (1) The interactive gesture is determined by the method of determining the interactive gesture as described in any of the above embodiments.
[0133] (2) Control the target device to execute the operation command corresponding to the interactive gesture.
[0134] When a user wants to control a device to enable a certain function, they can make a gesture. The device identifies and determines the gesture using the method for determining interactive gestures in any of the above embodiments. This device can be referred to as the target device, and controlling the target device can be controlling a functional component within the device. This functional component can be a hardware or software module. In one example, the target device may include, but is not limited to, a smart camera. Controlling the smart camera may include, but is not limited to, controlling one or more functional components within the smart camera, such as a lens focus control module, a scene mode switching module, or a flash control module.
[0135] Understandably, interactive gestures correspond one-to-one with operation commands. For example, when user A makes a "palm open" gesture, the smart camera recognizes the gesture and executes the operation command "start tracking user A's head component". When user A makes a "thumbs up" gesture, the smart camera recognizes the gesture and executes the operation command "start focusing".
[0136] Therefore, by controlling the target device to execute the operation command corresponding to the interactive gesture, the user can perform intelligent interactive control in front of the target device without manually adjusting it.
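The one-to-one correspondence between interactive gestures and operation commands can be illustrated with a simple lookup table. The gesture and command strings below are taken from the examples above, but their identifiers, the table name `COMMANDS`, and the function name `command_for` are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical one-to-one mapping from interactive gesture to operation command.
COMMANDS: Dict[str, str] = {
    "palm_open": "start_tracking_head",
    "thumbs_up": "start_focusing",
}


def command_for(gesture: Optional[str]) -> Optional[str]:
    """Return the operation command for a determined interactive gesture,
    or None if no gesture was determined or the gesture has no command."""
    if gesture is None:
        return None
    return COMMANDS.get(gesture)


print(command_for("palm_open"))  # start_tracking_head
print(command_for(None))         # None
```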
[0137] The methods for determining interactive gestures and interaction methods in the embodiments of this application have been described above. In order to better implement the methods of this application, the apparatus provided in the embodiments of this application will be described next.
[0138] Referring to Figure 6, this application also provides a device for determining interactive gestures. The device 200 includes an acquisition module 210, a target component detection module 220, an expansion processing module 230, a selection module 240, a recognition module 250, and a determination module 260.
[0139] The acquisition module 210 is used to acquire the current video frame in the real-time video stream. The target component detection module 220 is used to detect target components in the current video frame and obtain the bounding boxes of N target components in the current video frame. The expansion processing module 230 is used to expand the bounding boxes of the N target components to obtain N gesture detection boxes. The selection module 240 is used to select M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes according to preset rules. The recognition module 250 is used to perform hand component localization and gesture recognition processing on the M target gesture detection boxes corresponding to the current video frame to obtain the gesture detection result of the current video frame. The determination module 260 is used to determine the interactive gesture based on the gesture detection result of the current video frame and the gesture detection results of multiple historical video frames.
[0140] The aforementioned device 200 acquires the current video frame from a real-time video stream, performs target component detection on the current video frame, obtains bounding boxes for N target components in the current video frame, and expands these bounding boxes to obtain N gesture detection boxes. Then, according to preset rules, it selects M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes, performs hand component localization and gesture recognition processing on these M target gesture detection boxes, and obtains the gesture detection result for the current video frame. Finally, based on the gesture detection result of the current video frame and the gesture detection results of multiple historical video frames, it determines the interactive gesture. In this device, target component (head, face, or head plus shoulders) detection is performed first, and then gesture detection and recognition are performed in the local area near the target component (around the bounding box). Compared with full-image gesture detection and recognition, this reduces the amount of computation, decreases detection time, and improves real-time performance. In addition, the feature granularity of the hand component is relatively large in the local area near the target component, making the hand component features more prominent and improving the accuracy of gesture detection. Furthermore, for N gesture detection boxes (representing multiple users) in the current video frame, selecting M target gesture detection boxes for hand component localization and gesture recognition processing can reduce detection time and improve real-time performance. Based on this, interactive gestures are determined according to the gesture detection results of the current video frame and the gesture detection results of multiple historical video frames. That is, gesture recognition is selectively performed on users in each frame across multiple consecutive video frames to determine interactive gestures. On the one hand, compared to performing comprehensive detection on each user in every frame, this reduces the amount of computation, decreases time consumption, and improves real-time performance. On the other hand, combining the detection results of multiple frames to determine interactive gestures improves the accuracy and stability of interactive gestures. Therefore, using the aforementioned device 200, interactive gestures can be determined accurately and in real time in multi-user human-computer interaction scenarios.
[0141] Please refer to Figure 7, which is a hardware structure diagram of an electronic device 10 provided in an embodiment of this application. Specifically, as shown in Figure 7, the electronic device 10 includes at least one processor 12 and a memory 13 that are communicatively connected (Figure 7 takes a bus connection and a single processor as an example).
[0142] The processor 12 is used to provide computing and control capabilities to control the electronic device 10 to perform corresponding tasks and to control the electronic device 10 to perform any of the methods for determining interactive gestures provided in the above embodiments.
[0143] It is understood that processor 12 can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
[0144] The memory 13, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions / modules corresponding to the method for determining interactive gestures or interaction methods in the embodiments of the present invention. The processor 12 can implement any of the methods for determining interactive gestures or interaction methods provided in the above embodiments by running the non-transitory software programs, instructions, and modules stored in the memory 13. Specifically, the memory 13 may include high-speed random access memory and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 13 may also include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0145] It is understood that, in some embodiments, the electronic device may be a smart device such as a smart camera, a mobile terminal, or a drone.
[0146] This application also provides a computer-readable storage medium storing a computer program, the computer program including program instructions, which, when executed by a computer, cause the computer to perform the aforementioned method for determining interactive gestures or interaction methods.
[0147] It should be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0148] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented using software and a general-purpose hardware platform, or of course, using hardware. Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.
[0149] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and not to limit them; under the concept of this application, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of this application as described above, which are not provided in detail for the sake of brevity; although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.
Claims
1. A method for determining interactive gestures, characterized in that it comprises: acquiring the current video frame in a real-time video stream; performing target component detection on the current video frame to obtain the bounding boxes of N target components in the current video frame; expanding the bounding boxes of the N target components respectively to obtain N gesture detection boxes; selecting, according to preset rules, M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes; performing hand component localization and gesture recognition processing on the M target gesture detection boxes corresponding to the current video frame to obtain the gesture detection result of the current video frame, where N and M are positive integers and M is not greater than N; and determining the interactive gesture based on the gesture detection result of the current video frame and the gesture detection results of multiple historical video frames; wherein the step of selecting M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes according to preset rules includes: backtracking the M target gesture detection boxes corresponding to multiple historical video frames, and determining the M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes in a polling manner.
2. The method according to claim 1, characterized in that, The process of retracing the M target gesture detection boxes corresponding to multiple historical video frames and determining the M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes in a polling manner includes: By tracing back the identifiers of the M target gesture detection boxes corresponding to multiple historical video frames, and polling the identifiers of the N gesture detection boxes, the M target gesture detection boxes corresponding to the current video frame are determined.
3. The method according to claim 1, characterized in that the step of backtracking the M target gesture detection boxes corresponding to multiple historical video frames and determining the M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes in a polling manner includes: backtracking the most recent detection time of the M target gesture detection boxes corresponding to multiple historical video frames, and determining the M gesture detection boxes whose most recent detection time is earliest among the N gesture detection boxes as the M target gesture detection boxes corresponding to the current video frame.
4. The method according to claim 1, characterized in that the step of selecting M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes according to preset rules includes: taking the k target gesture detection boxes that detected gestures in the previous video frame of the current video frame as k target gesture detection boxes corresponding to the current video frame, where k≤M; and backtracking the M target gesture detection boxes corresponding to multiple historical video frames, and determining, in a polling manner, the M-k target gesture detection boxes corresponding to the current video frame from the gesture detection boxes other than the k target gesture detection boxes among the N gesture detection boxes.
5. The method according to any one of claims 1-4, characterized in that the step of locating hand components and performing gesture recognition processing in the M target gesture detection boxes corresponding to the current video frame to obtain the gesture detection result of the current video frame includes: performing hand component detection on the regions of the M target gesture detection boxes corresponding to the current video frame to obtain M hand component bounding boxes; obtaining the overlap ratio between each of the M hand component bounding boxes and the bounding box of the corresponding target component; and performing gesture recognition processing on the hand component bounding boxes whose overlap ratio is less than or equal to a first threshold to obtain the gesture detection result of the current video frame.
6. The method according to any one of claims 1-4, characterized in that the step of determining the interactive gesture based on the gesture detection result of the current video frame and the gesture detection results of multiple historical video frames includes: if the same trigger gesture appears near the same location in both the gesture detection result of the current video frame and the gesture detection results of the traced-back multiple historical video frames, determining the trigger gesture to be the interactive gesture.
7. An interaction method, characterized in that it comprises: determining an interactive gesture using the method for determining interactive gestures according to any one of claims 1-6; and controlling a target device to execute the operation command corresponding to the interactive gesture.
8. A device for determining interactive gestures, characterized in that it comprises: an acquisition module, used to acquire the current video frame in a real-time video stream; a target component detection module, used to detect target components in the current video frame and obtain the bounding boxes of N target components in the current video frame; an expansion processing module, used to expand the bounding boxes of the N target components respectively to obtain N gesture detection boxes; a selection module, used to select M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes according to preset rules; a recognition module, used to locate hand components and perform gesture recognition processing in the M target gesture detection boxes corresponding to the current video frame to obtain the gesture detection result of the current video frame, where N and M are positive integers and M is not greater than N; and a determination module, used to determine the interactive gesture based on the gesture detection result of the current video frame and the gesture detection results of multiple historical video frames; wherein selecting the M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes according to preset rules includes: tracing back the M target gesture detection boxes corresponding to multiple historical video frames, and determining, in a polling manner, the M target gesture detection boxes corresponding to the current video frame from the N gesture detection boxes.
9. An electronic device, characterized in that it comprises: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions for causing a computer to perform the method according to any one of claims 1-6.
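
For illustration only, and not as part of the claims, the selection, overlap filtering, and multi-frame confirmation recited in claims 2-6 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names, the (x1, y1, x2, y2) box layout, the definition of the overlap ratio as intersection area divided by hand box area, and the parameters min_hits and max_dist are all hypothetical and do not appear in the claims.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Box = Tuple[float, float, float, float]          # assumed layout: (x1, y1, x2, y2)
Detection = Tuple[str, Tuple[float, float]]      # assumed: (gesture label, box centre)


def select_targets(box_ids: Sequence[int], last_checked: Dict[int, int],
                   active_ids: Set[int], m: int, frame_idx: int) -> List[int]:
    """Pick up to m gesture detection boxes for the current frame.

    Boxes that showed a gesture in the previous frame (active_ids) are kept
    first, as in claim 4; the remaining slots go to the boxes whose last
    check is oldest, which yields the polling behaviour of claims 2 and 3.
    """
    carried = [b for b in box_ids if b in active_ids][:m]
    rest = [b for b in box_ids if b not in carried]
    # Never-checked boxes get -1 so they are polled first; the sort is
    # stable, so ties keep the order of box_ids.
    rest.sort(key=lambda b: last_checked.get(b, -1))
    chosen = carried + rest[: m - len(carried)]
    for b in chosen:
        last_checked[b] = frame_idx
    return chosen


def overlap_ratio(hand: Box, target: Box) -> float:
    """Intersection area divided by hand box area (an assumed definition)."""
    w = min(hand[2], target[2]) - max(hand[0], target[0])
    h = min(hand[3], target[3]) - max(hand[1], target[1])
    hand_area = (hand[2] - hand[0]) * (hand[3] - hand[1])
    if w <= 0 or h <= 0 or hand_area <= 0:
        return 0.0
    return (w * h) / hand_area


def confirm_gesture(history: Sequence[Optional[Detection]],
                    current: Optional[Detection],
                    min_hits: int = 3, max_dist: float = 50.0) -> Optional[str]:
    """Return the gesture label if it recurs near the same place (claim 6).

    min_hits and max_dist are hypothetical tuning parameters.
    """
    if current is None:
        return None
    label, (cx, cy) = current
    hits = 1  # the current frame counts as one occurrence
    for past in history:
        if past is None:
            continue
        p_label, (px, py) = past
        if p_label == label and abs(px - cx) <= max_dist and abs(py - cy) <= max_dist:
            hits += 1
    return label if hits >= min_hits else None
```

With four boxes and m = 2, successive calls to select_targets with no active boxes alternate between boxes [0, 1] and [2, 3], so every candidate region is examined within two frames while only two regions are processed per frame. A hand box would be passed on to gesture recognition only when overlap_ratio against its target component box is at or below the first threshold of claim 5.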