Vision-based method and system for human motion interaction with a robot dog

By equipping the robot dog with a binocular vision module and an action recognition network, calculating the absolute distance and constructing a dynamic scale factor, the problem of poor interaction effect of the robot dog under monocular vision is solved, and high-precision user action recognition and interaction control are achieved.

CN122244939A · Pending · Publication Date: 2026-06-19 · SHENZHEN WANSHI RUYI INNOVATION TECHNOLOGY CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHENZHEN WANSHI RUYI INNOVATION TECHNOLOGY CO LTD
Filing Date: 2026-02-26
Publication Date: 2026-06-19

Application Information

IPC: G06V40/20; G06V40/10; G06V20/40; G06V10/80; G06V10/82; G06V10/764; G06V10/774; G06V10/75; G06T5/70; G06T7/20

AI Tagging

Application Domain

Image enhancement; Image analysis


AI Technical Summary

Technical Problem

Traditional vision-based robot dog human motion interaction methods rely on a monocular camera, which lacks depth information, resulting in poor robustness to occlusion and interference, making it difficult to accurately calculate the absolute distance between the user and the robot dog, thus affecting the interaction effect.

Method used

A binocular vision module mounted on the robot dog's torso is used to acquire environmental video streams. Through parallel inference of distortion correction and pose estimation networks, the disparity values and initial confidence of skeletal key points are calculated. Combined with an action recognition network, a dynamic scale factor is constructed to generate three-dimensional skeletal key point coordinates and perform spatial normalization processing. The displacement vector field is extracted and nonlinearly amplified to output the action category.

Benefits of technology

It effectively eliminates the impact of changes in the relative distance between the user and the robot dog on action recognition, improves the robustness and accuracy of three-dimensional skeletal coordinates, and realizes high-precision real-time perception and interactive control of the robot dog in complex environments.


[Figure: CN122244939A abstract drawing]

Abstract

This invention relates to the field of artificial intelligence technology and discloses a vision-based method and system for human motion interaction with a robot dog. The method includes: acquiring the user's skeletal key points and their corresponding initial confidence levels; calculating the disparity values of the skeletal key points between the left and right views to extract target skeletal key points, and performing weighted fusion of the target skeletal key points with the initial confidence levels as weights to generate a three-dimensional skeletal key point coordinate sequence for the user; performing spatial normalization on the three-dimensional skeletal key point coordinates to obtain normalized coordinates; inputting the normalized coordinates into an action recognition network, calculating the displacement vector field of the target skeletal key points through the motion feature enhancement layer of the action recognition network, and nonlinearly amplifying the displacement vector field to output the category confidence of the user's action category; and generating and executing the corresponding motion control command when the category confidence is greater than a preset confidence threshold. This invention can improve the interaction effect between the robot dog and the user.


Description

Technical Field

[0001] This invention relates to a vision-based robotic dog human motion interaction method and system, belonging to the field of artificial intelligence technology.

Background Technology

[0002] Vision-based human-machine interaction in robotic dogs refers to an intelligent interaction mode in which robotic dogs use onboard visual sensors to perceive their environment, capture, identify, and understand the user's human posture and movement intentions in real time through computer vision algorithms, and then make autonomous decisions and give corresponding feedback. Its significance lies in breaking through the distance and semantic limitations of traditional remote control and voice interaction, giving robots human-like non-contact perception capabilities and empathetic interaction experiences, greatly improving the naturalness, convenience, and immersion of human-machine collaboration, and is a key technological path to realize the transformation of robotic dogs from automated tools to intelligent partners.

[0003] Traditional vision-based robot dog human motion interaction methods usually rely on a monocular camera to acquire two-dimensional images, use a general pose estimation model to extract skeletal points directly, and perform simple spatiotemporal feature classification. The drawback is that monocular vision lacks depth information, which makes it difficult to accurately calculate the absolute distance between the user and the robot dog and gives poor robustness to occlusion and interference, leading to poor interaction between the robot dog and the user.

Summary of the Invention

[0004] This invention provides a vision-based robot dog human motion interaction method and system, the main purpose of which is to improve the interaction effect between the robot dog and the user.

[0005] To achieve the above objectives, the present invention provides a vision-based human motion interaction method for robotic dogs, comprising:

[0006] A video stream of the user's environment is acquired by a binocular vision module mounted on the robot dog's torso, and distortion correction is performed on the environmental video stream to obtain an image frame sequence; the left and right views of the image frame sequence are respectively input into a pose estimation network for parallel inference to obtain the user's skeletal key points and the corresponding initial confidence levels; the disparity values of the skeletal key points between the left and right views are calculated to extract target skeletal key points from the skeletal key points, and the target skeletal key points are weighted and fused with the initial confidence levels as weights to generate the user's three-dimensional skeletal key point coordinate sequence; based on the three-dimensional skeletal key point coordinate sequence, the absolute distance between the user and the robot dog is calculated, and a dynamic scale factor is constructed based on a preset action recognition network and the absolute distance to perform spatial normalization on the three-dimensional skeletal key point coordinates, obtaining normalized coordinates; the normalized coordinates are input into the action recognition network, the motion feature enhancement layer of the action recognition network calculates the displacement vector field of the target skeletal key points, and the displacement vector field is nonlinearly amplified to output the category confidence of the user's action category; when the category confidence is greater than a preset confidence threshold, the corresponding motion control command is generated and executed.

[0007] Optionally, the left and right views of the image frame sequence are respectively input into the pose estimation network for parallel inference to obtain the user's skeletal keypoints and corresponding initial confidence scores, including: The high-dimensional semantic features of the left and right views are extracted using the feature extraction backbone network in the pose estimation network. The high-dimensional semantic features are input into the spatial attention module of the pose estimation network to generate a first attention mask for the user's human torso region. The high-dimensional semantic features are then weighted using the first attention mask to obtain weighted high-dimensional semantic features. The weighted high-dimensional semantic features are concatenated by the multi-scale feature fusion layer in the pose estimation network to generate the human body heatmap features of the left and right views; Using the aforementioned human body heatmap features, the user's skeletal key points and corresponding initial confidence levels are analyzed.

[0008] Optionally, the user's skeletal key points and corresponding initial confidence levels are analyzed using the human body heatmap features, including: Based on the human body heatmap features, peak coordinates are located in a preset set of skeletal key point categories using a non-maximum suppression algorithm, and the peak coordinates are mapped back to the original image coordinate system to obtain the user's skeletal key points. The response values of the skeletal key points in the human body heat map are read, and the response values are normalized to obtain the initial confidence level of the skeletal key points.

[0009] Optionally, the binocular vision module includes a left-eye camera and a right-eye camera, the left-eye camera and the right-eye camera are arranged horizontally parallel, the optical axis distance between the left-eye camera and the right-eye camera is less than the width of the robot dog's torso, and the shooting fields of the left-eye camera and the right-eye camera form an overlapping field of view in front of the robot dog.

[0010] Optionally, calculating the disparity values of the skeletal key points between the left and right views includes: mapping the skeletal key points of the left view to the right view to generate the corresponding horizontal epipolar lines; based on the stereo epipolar geometric constraint, extracting multiple right-view candidate feature points within a preset search window on the horizontal epipolar line, and calculating the matching cost between each right-view candidate feature point and the skeletal key point in the left view; selecting the right-view candidate feature point with the smallest matching cost as the final right-view matching point; and calculating the absolute difference between the x-coordinate of the skeletal key point in the left view and the x-coordinate of the right-view matching point, and using the absolute difference as the disparity value of the skeletal key point.

[0011] Optionally, the target skeletal keypoints are weighted and fused using the initial confidence level as weight to generate a three-dimensional skeletal keypoint coordinate sequence for the user, including: The initial confidence level and the target skeletal key points are input into a preset temporal sliding window to obtain a set of key points to be fused. Based on the initial confidence distribution, the three-dimensional coordinates of the set of key points to be fused are fused using a weighted average algorithm to generate the initial three-dimensional skeleton key point coordinates. The calculation formula for the weighted average algorithm is as follows:

[0012] $$P_{\text{init}} = \frac{\sum_{i=1}^{N} c_i \, P_i}{\sum_{i=1}^{N} c_i}$$

where $P_{\text{init}}$ represents the initial three-dimensional skeletal key point coordinates, $N$ represents the total number of historical moments covered by the temporal sliding window, $c_i$ represents the initial confidence level of the target skeletal key point at the $i$-th moment, and $P_i$ represents the three-dimensional coordinates of the target skeletal key point at the $i$-th moment (the symbols are restored from these definitions, since the original formula image did not survive extraction). The initial three-dimensional skeletal key point coordinates are then optimized to obtain the final three-dimensional skeletal key point coordinate sequence.

[0013] Optionally, the absolute distance between the user and the robot dog is calculated based on the three-dimensional skeletal keypoint coordinate sequence, including: Select preset human torso key points from the three-dimensional skeletal key point coordinate sequence; Calculate the average three-dimensional coordinates of the key points of the human torso in the camera coordinate system, and determine the average three-dimensional coordinates as the center point coordinates of the user center point; Calculate the Euclidean distance between the center point coordinates and the origin of the camera coordinate system, and determine the Euclidean distance as the absolute distance between the user and the robot dog.

[0014] Optionally, a dynamic scaling factor is constructed based on a pre-set action recognition network and the absolute distance, including: Obtain depth distribution data corresponding to multiple samples in the training dataset of the action recognition network, perform statistical analysis on the depth distribution data, and obtain the average depth value; The average depth value is converted into the physical distance unit of the camera to obtain the standard reference distance; The ratio of the standard reference distance to the absolute distance is calculated to generate the dynamic scale factor.
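The distance and scale-factor steps in [0013] and [0014] can be sketched as follows. This is only an illustration under stated assumptions: the key point names, the reference distance of 2.0 m, and the sample coordinates are hypothetical, and the way the scale factor is applied to the coordinates (scaling center-relative coordinates) is an assumption, since the text only says the factor is used for spatial normalization.

```python
import math

def torso_center(keypoints, torso_names):
    """Average the camera-frame 3D coordinates of the selected torso key points."""
    pts = [keypoints[name] for name in torso_names]
    n = len(pts)
    return tuple(sum(p[axis] for p in pts) / n for axis in range(3))

def absolute_distance(center):
    """Euclidean distance from the camera-frame origin to the user center point."""
    return math.sqrt(sum(c * c for c in center))

def dynamic_scale_factor(reference_distance, distance):
    """Ratio of the standard reference distance to the measured absolute distance."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return reference_distance / distance

def normalize(keypoints, center, scale):
    """ASSUMPTION: normalization scales coordinates taken relative to the center."""
    return {
        name: tuple(scale * (p[i] - center[i]) for i in range(3))
        for name, p in keypoints.items()
    }

# Hypothetical sample skeleton, in metres, in the camera coordinate system.
skeleton = {
    "left_shoulder": (-0.2, -0.4, 3.0),
    "right_shoulder": (0.2, -0.4, 3.0),
    "left_hip": (-0.15, 0.1, 3.0),
    "right_hip": (0.15, 0.1, 3.0),
    "right_wrist": (0.5, -0.6, 2.8),
}
TORSO = ["left_shoulder", "right_shoulder", "left_hip", "right_hip"]

center = torso_center(skeleton, TORSO)
distance = absolute_distance(center)
scale = dynamic_scale_factor(2.0, distance)  # 2.0 m is a hypothetical reference
normalized = normalize(skeleton, center, scale)
```

In this sketch the reference distance stands in for the average depth of the training samples converted to the camera's physical unit, as described in [0014].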

[0015] Optionally, the normalized coordinates are input into the action recognition network, and the displacement vector field of the target skeleton key points is calculated through the motion feature enhancement layer of the action recognition network, including: The normalized coordinates within a preset time window are convolved by the temporal convolution unit in the motion feature enhancement layer to extract the three-dimensional coordinate difference features of the same target bone key points between adjacent frames. The graph convolutional units in the motion feature enhancement layer, combined with the topological connectivity of the human skeleton, perform graph structure mapping on the three-dimensional coordinate difference features to generate skeletal motion vectors. The temporal feature aggregation module in the motion feature enhancement layer performs weighted processing on the skeletal motion vector along the time dimension, and outputs the displacement vector field that represents the spatiotemporal motion trend.

[0016] To address the above problems, the present invention also provides a vision-based robot dog human motion interaction system, the system comprising: an environmental image acquisition module, used to acquire the user's environmental video stream through a binocular vision module mounted on the robot dog's torso and to perform distortion correction on the environmental video stream to obtain an image frame sequence; a key point determination module, used to input the left and right views of the image frame sequence into the pose estimation network for parallel inference to obtain the user's skeletal key points and the corresponding initial confidence levels; a key point coordinate analysis module, used to calculate the disparity values of the skeletal key points between the left and right views to extract the target skeletal key points, and to perform weighted fusion of the target skeletal key points with the initial confidence levels as weights to generate the user's three-dimensional skeletal key point coordinate sequence; a coordinate normalization module, used to calculate the absolute distance between the user and the robot dog based on the three-dimensional skeletal key point coordinate sequence, and to construct a dynamic scale factor based on the preset action recognition network and the absolute distance to perform spatial normalization on the three-dimensional skeletal key point coordinates, obtaining normalized coordinates; and a motion control command generation module, used to input the normalized coordinates into the action recognition network, calculate the displacement vector field of the target skeletal key points through the motion feature enhancement layer of the action recognition network, nonlinearly amplify the displacement vector field to output the category confidence of the user's action category, and, when the category confidence is greater than a preset confidence threshold, generate and execute the corresponding motion control command.

[0017] This invention employs binocular vision to calculate absolute distance and construct a dynamic scale factor, effectively eliminating the impact of changes in the relative distance between the user and the robot dog on the accuracy of action recognition and solving the problem of difficulty in recognizing subtle movements at long distances. Simultaneously, by combining disparity value filtering and weighted fusion techniques, the robustness and accuracy of the 3D skeletal coordinates are improved. Furthermore, by extracting the displacement vector field through a motion feature enhancement layer and performing nonlinear amplification, weak motion features are significantly enhanced, thereby achieving high-precision real-time perception and interactive control of user actions by the robot dog in complex environments. Therefore, this invention can improve the interaction effect between the robot dog and the user.

Attached Figure Description

[0018] Figure 1 is a flowchart of the vision-based robot dog human motion interaction method provided in an embodiment of the present invention; Figure 2 is a schematic diagram of the modules for implementing the vision-based robot dog human motion interaction method provided in an embodiment of the present invention; Figure 3 is a schematic diagram of a computer device for the vision-based robot dog human motion interaction method provided in an embodiment of the present invention. The objectives, features, and advantages of this invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Implementation

[0019] It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0020] This application provides a vision-based robot dog human motion interaction method. The execution subject of the vision-based robot dog human motion interaction method includes, but is not limited to, at least one of the following electronic devices that can be configured to execute the method provided in this application embodiment: a server, a terminal, etc. In other words, the vision-based robot dog human motion interaction method can be executed by software or hardware installed on a terminal device or a server device. The server includes, but is not limited to, a single server, a server cluster, a cloud server, or a cloud server cluster.

[0021] Figure 1 is a flowchart of the vision-based robot dog human motion interaction method provided in an embodiment of the present invention. In this embodiment, the vision-based robot dog human motion interaction method includes: S1. The user's environmental video stream is acquired by a binocular vision module mounted on the robot dog's torso, and distortion correction is performed on the environmental video stream to obtain an image frame sequence.

[0022] This invention utilizes a binocular vision module mounted on the robot dog's torso to actively acquire environmental video streams containing depth information, providing foundational data for subsequent 3D interaction. The robot dog's torso refers to its main mechanical structure, including the head, neck, back, and chest areas. The user refers to an interactive object within the robot dog's perception range who sends commands or interacts with the robot dog through limb movements and posture changes. The environmental video stream refers to a continuous temporal image signal, continuously acquired and output by the binocular vision module, containing information about the robot dog's surrounding environment and the user's image.

[0023] It should be noted that the binocular vision module includes a left eye camera and a right eye camera. The left eye camera and the right eye camera are arranged horizontally and parallel to each other. The optical axis distance between the left eye camera and the right eye camera is less than the width of the robot dog's torso, and the shooting fields of the left eye camera and the right eye camera form an overlapping field of view in front of the robot dog.

[0024] Wherein, the left eye camera refers to the imaging unit in the binocular vision module used to acquire images from the left side, the right eye camera refers to the imaging unit in the binocular vision module used to acquire images from the right side, the optical axis spacing refers to the straight-line distance between the optical centers of the left eye camera and the right eye camera, the field of view refers to the range of spatial angles that the left eye camera or the right eye camera can perceive and acquire at a specific moment, and the overlapping field of view refers to the common area formed by the overlapping of the field of view of the left eye camera and the field of view of the right eye camera in the space in front of the robot dog.

[0025] This invention performs distortion correction on the environmental video stream, resulting in an image frame sequence. Distortion correction eliminates image distortion caused by the wide-angle lens of the binocular vision module, ensuring the geometric realism of the image frame sequence and guaranteeing the coordinate accuracy of subsequent skeletal keypoint extraction, thereby improving the accuracy of the robot dog's perception of user actions. The image frame sequence refers to a continuous set of static images obtained by decomposing the environmental video stream in temporal order and correcting it using mathematical models for radial and tangential distortion. Specifically, the distortion correction of the environmental video stream is mainly achieved through a nonlinear geometric distortion correction technique based on camera calibration.
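The radial and tangential distortion model mentioned in [0025] can be illustrated with a minimal sketch. The text does not give the exact model or coefficients, so this assumes the commonly used Brown-Conrady form with two radial coefficients (`k1`, `k2`) and two tangential coefficients (`p1`, `p2`), applied to normalized image coordinates; the coefficient values below are hypothetical, and in practice they would come from camera calibration.

```python
def distort(x, y, k1, k2, p1, p2):
    """Apply radial and tangential distortion to an ideal normalized point."""
    r2 = x * x + y * y
    radial = 1 + k1 * r2 + k2 * r2 * r2
    xd = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    yd = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    return xd, yd

def undistort(xd, yd, k1, k2, p1, p2, iterations=20):
    """Invert the model by fixed-point iteration (adequate for mild distortion)."""
    x, y = xd, yd  # start from the distorted point
    for _ in range(iterations):
        r2 = x * x + y * y
        radial = 1 + k1 * r2 + k2 * r2 * r2
        dx = 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
        dy = p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
        x = (xd - dx) / radial
        y = (yd - dy) / radial
    return x, y

# Hypothetical calibration coefficients and a sample normalized point.
COEFFS = dict(k1=-0.2, k2=0.05, p1=0.001, p2=-0.0005)
ideal = (0.3, -0.2)
observed = distort(*ideal, **COEFFS)
recovered = undistort(*observed, **COEFFS)
```

Here `recovered` should agree with `ideal` to within numerical tolerance, which is the property the correction step relies on before key point coordinates are extracted.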

[0026] S2. Input the left and right views of the image frame sequence into the pose estimation network for parallel inference to obtain the user's skeletal key points and corresponding initial confidence levels.

[0027] This invention inputs the left and right views of the image frame sequence into a pose estimation network for parallel inference to obtain the user's skeletal key points and corresponding initial confidence scores. It makes full use of the multi-view information of binocular vision, effectively solves the problem of key point loss caused by limb self-occlusion or light interference under the user's single view, and enhances the robustness of the robot dog in extracting human motion features in complex backgrounds.

[0028] Specifically, the step of inputting the left and right views of the image frame sequence into the pose estimation network for parallel inference to obtain the user's skeletal keypoints and corresponding initial confidence scores includes: The high-dimensional semantic features of the left and right views are extracted using the feature extraction backbone network in the pose estimation network. The high-dimensional semantic features are input into the spatial attention module of the pose estimation network to generate a first attention mask for the user's human torso region. The high-dimensional semantic features are then weighted using the first attention mask to obtain weighted high-dimensional semantic features. The weighted high-dimensional semantic features are concatenated by the multi-scale feature fusion layer in the pose estimation network to generate the human body heatmap features of the left and right views; Using the aforementioned human body heatmap features, the user's skeletal key points and corresponding initial confidence levels are analyzed.

[0029] The pose estimation network refers to an algorithm architecture used to automatically detect and locate human joint points from image data and output the corresponding two-dimensional coordinates and probability values. The feature extraction backbone network is used to extract low-level texture and edge information, as well as high-level shape and semantic abstraction information, from the image frame sequence. The left and right views refer to image pairs composed of image data acquired by the left-eye camera and image data acquired by the right-eye camera in the binocular vision module. The high-dimensional semantic features refer to the multi-dimensional tensor data output by the feature extraction backbone network. The spatial attention module refers to an attention mechanism component used to learn the importance weights of different spatial locations in the image. The user's human torso region refers to the connected pixel region containing the user's main body torso. The first attention mask refers to the weight matrix generated by the spatial attention module with the same size as the high-dimensional semantic features. The weighted high-dimensional semantic features refer to the new feature map obtained by multiplying the value of each channel of the high-dimensional semantic feature map by the corresponding weight value in the first attention mask. The human body heatmap features refer to the set of feature maps finally output by the network, which contains the probability distribution of each preset human joint point. The skeletal key points refer to the coordinates of the center positions of specific anatomical parts or joints that make up the human skeleton, such as the shoulder, elbow, wrist, hip, knee, and ankle. The initial confidence level refers to the reliability score given by the pose estimation network for the position of each detected skeletal key point.

[0030] Furthermore, the step of analyzing the user's skeletal key points and corresponding initial confidence levels using the human body heatmap features includes: Based on the human body heatmap features, peak coordinates are located in a preset set of skeletal key point categories using a non-maximum suppression algorithm, and the peak coordinates are mapped back to the original image coordinate system to obtain the user's skeletal key points. The response values of the skeletal key points in the human body heat map are read, and the response values are normalized to obtain the initial confidence level of the skeletal key points.

[0031] The non-maximum suppression algorithm refers to the algorithm that finds the pixel with the largest response value within a specific search window of the human body heatmap as the peak point. The preset skeletal key point category set refers to a predefined dataset containing specific joint or part categories of the human body, used to limit the types and number of key points that the network needs to detect and output. The original image coordinate system refers to a two-dimensional plane coordinate system established with the upper left corner of the original image captured by the camera in the binocular vision module as the origin, and pixels as the unit. The response value refers to the numerical value at a specific pixel position in the body heatmap, which reflects the probability strength that the position belongs to a specific skeletal key point. The initial confidence level is used to quantitatively evaluate the accuracy of the detected skeletal key point position. The closer the value is to 1, the more reliable the key point detection is.
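The heatmap decoding described in [0030] and [0031] can be sketched as below. This is a simplified illustration, not the patent's implementation: it assumes a single user, so the search window is the whole heatmap and the peak is the global maximum; the `stride` between heatmap cells and image pixels and the clamping of the response to [0, 1] are assumptions, since the text does not specify how the response value is normalized.

```python
def find_peak(heatmap):
    """Return (row, col, response) of the strongest response in one heatmap."""
    best = (0, 0, heatmap[0][0])
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best[2]:
                best = (r, c, value)
    return best

def to_image_coords(row, col, stride):
    """Map a heatmap cell centre back to original image pixel coordinates (x, y)."""
    return ((col + 0.5) * stride, (row + 0.5) * stride)

def normalize_response(value):
    """ASSUMPTION: clamp the raw response into [0, 1] to get the confidence."""
    return min(1.0, max(0.0, value))

def decode_keypoints(heatmaps, stride=4):
    """Decode one (position, confidence) pair per key point category."""
    decoded = {}
    for name, heatmap in heatmaps.items():
        row, col, response = find_peak(heatmap)
        decoded[name] = (to_image_coords(row, col, stride),
                         normalize_response(response))
    return decoded

# Hypothetical 3x3 heatmap for one key point category.
heatmaps = {
    "left_wrist": [
        [0.0, 0.1, 0.0],
        [0.2, 0.9, 0.3],
        [0.0, 0.1, 0.0],
    ],
}
decoded = decode_keypoints(heatmaps, stride=4)
# decoded["left_wrist"] is ((6.0, 6.0), 0.9)
```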

[0032] S3. Calculate the disparity value of the left and right views of the skeletal key points to extract the target skeletal key points of the skeletal key points, and perform weighted fusion of the target skeletal key points with the initial confidence level as the weight to generate the user's three-dimensional skeletal key point coordinate sequence.

[0033] This invention calculates the disparity values of the left and right views of the skeletal keypoints to extract the target skeletal keypoints. This effectively filters out interference data with mismatched left and right views, retaining only highly consistent target skeletal keypoints. The target skeletal keypoints refer to the skeletal keypoints from the detected set of all skeletal keypoints, excluding abnormal points whose initial confidence level is below a preset threshold and whose calculated disparity value is invalid or exceeds a preset disparity range.
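The filtering in [0033] can be illustrated with a short sketch. Two points are assumptions rather than statements from the text: a key point is dropped here if either its confidence is too low or its disparity is missing or out of range, and the surviving points are converted to camera-frame coordinates with the standard rectified-stereo relation (depth = focal length × baseline / disparity). The thresholds, camera parameters, and detections are hypothetical.

```python
def select_target_keypoints(keypoints, min_confidence, disparity_range):
    """Keep key points with sufficient confidence and a valid in-range disparity."""
    low, high = disparity_range
    targets = {}
    for name, kp in keypoints.items():
        disparity = kp["disparity"]
        if kp["confidence"] < min_confidence:
            continue  # unreliable detection
        if disparity is None or not (low <= disparity <= high):
            continue  # failed or implausible left/right match
        targets[name] = kp
    return targets

def back_project(u, v, disparity, focal_px, baseline_m, cx, cy):
    """Standard rectified-stereo triangulation into camera coordinates (metres)."""
    z = focal_px * baseline_m / disparity
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)

# Hypothetical detections: pixel position in the left view, confidence, disparity.
detections = {
    "right_wrist": {"uv": (420.0, 240.0), "confidence": 0.9, "disparity": 20.0},
    "right_elbow": {"uv": (400.0, 260.0), "confidence": 0.2, "disparity": 21.0},
    "left_knee": {"uv": (300.0, 400.0), "confidence": 0.8, "disparity": None},
    "left_ankle": {"uv": (305.0, 460.0), "confidence": 0.8, "disparity": 300.0},
}
targets = select_target_keypoints(detections, 0.5, (1.0, 128.0))
wrist = targets["right_wrist"]
wrist_xyz = back_project(*wrist["uv"], wrist["disparity"],
                         focal_px=600.0, baseline_m=0.1, cx=320.0, cy=240.0)
```

With these sample values only `right_wrist` survives the filter, and it lands roughly 3 m in front of the camera.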

[0034] Specifically, calculating the disparity values of the left and right views of the skeletal key points includes: Map the skeletal key points of the left view to the right view of the left and right views to generate the corresponding horizontal epipolar lines. Based on the geometric constraints of the stereo epipolar line, multiple candidate feature points for the right view are extracted within the preset search window of the horizontal epipolar line, and the matching cost between the candidate feature points for the right view and the skeletal key points in the left view is calculated. The candidate feature point for the right view with the smallest matching cost is selected as the final matching point for the right view. Calculate the absolute difference between the x-coordinate of the skeletal key point in the left view and the x-coordinate of the matching point in the right view, and use the absolute difference as the disparity value of the skeletal key point.

[0035] The left and right views refer to a pair of images acquired simultaneously by the binocular vision module, including the left-eye image acquired by the left-eye camera and the right-eye image acquired by the right-eye camera. The left view is the left-eye image, serving as the reference image for stereo matching. The right view is the right-eye image, serving as the target image for stereo matching. The horizontal epipolar line refers to the horizontal straight line in the stereo-rectified image pair on which, according to the epipolar geometry principle, the corresponding point in the right view must lie. The stereo epipolar geometric constraint refers to a constraint established from the geometric configuration of the binocular cameras; it restricts the possible matching positions in the right view of a point in the left view to the corresponding epipolar line, thus simplifying the two-dimensional search to a one-dimensional search. The search window refers to the range on the horizontal epipolar line that is centered on the theoretical corresponding position and extends a preset pixel distance to the left and right. The right-view candidate feature points refer to the set of right-view pixels on the horizontal epipolar line within the search window that are selected as potential matches. The matching cost refers to a numerical index used to measure the similarity between a right-view candidate feature point and the skeletal key point in the left view. The right-view matching point refers to the unique corresponding point, among all right-view candidate feature points, that is determined to be most similar to the skeletal key point in the left view after the matching cost is calculated. The absolute difference is the result of subtracting the x-coordinate of the right-view matching point from the x-coordinate of the skeletal key point in the left view and taking the absolute value. The disparity value refers to the horizontal pixel distance between the imaging positions of the same physical point in the left and right views.

[0036] Optionally, the matching cost between the right view candidate feature points and the skeletal key points in the left view can be obtained with the ZNCC algorithm, that is, by calculating the zero-mean normalized cross-correlation coefficient between the feature descriptors of the left view and the feature descriptors of the right view.
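The one-dimensional search described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: raw grayscale patches stand in for the feature descriptors, the cost is taken as 1 minus the ZNCC coefficient so that the smallest cost corresponds to the most similar candidate, and the names `zncc`, `match_on_epipolar_line`, `x_expected`, `radius`, and `half_patch` are all assumptions introduced here.

```python
import numpy as np

def zncc(a, b, eps=1e-8):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # A flat patch has no texture to correlate; treat it as uncorrelated.
    return float((a * b).sum() / denom) if denom > eps else 0.0

def match_on_epipolar_line(left, right, x, y, x_expected, radius, half_patch=3):
    """Search row y of the rectified right image for the best match to the
    left-view key point (x, y). Returns (x_right, disparity) or None."""
    h, w = left.shape
    p = half_patch
    if not (p <= x < w - p and p <= y < h - p):
        return None  # reference patch would fall outside the image
    ref = left[y - p:y + p + 1, x - p:x + p + 1]
    # Search window: centered on the expected position, +/- radius pixels,
    # clipped so that every candidate patch stays inside the image.
    lo = max(p, x_expected - radius)
    hi = min(w - p - 1, x_expected + radius)
    best_x, best_cost = None, np.inf
    for xr in range(lo, hi + 1):
        cand = right[y - p:y + p + 1, xr - p:xr + p + 1]
        cost = 1.0 - zncc(ref, cand)  # assumed cost: lower means more similar
        if cost < best_cost:
            best_x, best_cost = xr, cost
    if best_x is None:
        return None
    # Disparity: absolute difference of the two abscissa values.
    return best_x, abs(x - best_x)
```

Because the image pair is stereo-corrected, only the row of the key point is scanned, which is what reduces the search from two dimensions to one.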

[0037] The present invention uses the initial confidence level as a weight to perform weighted fusion of the target skeletal key points, generating a three-dimensional skeletal key point coordinate sequence of the user that can reflect the real human posture, significantly improving the accuracy of the robot dog in solving the spatial position of the user's limbs in dynamic interaction.

[0038] Specifically, the step of weighted fusion of the target skeletal key points using the initial confidence level as the weight to generate the user's three-dimensional skeletal key point coordinate sequence includes: inputting the initial confidence level and the target skeletal key points into a preset temporal sliding window to obtain a set of key points to be fused; and, based on the initial confidence distribution, fusing the three-dimensional coordinates of the set of key points to be fused using a weighted average algorithm to generate the initial three-dimensional skeleton key point coordinates. The calculation formula for the weighted average algorithm is as follows:

$$P_{\text{init}} = \frac{\sum_{i=1}^{N} c_i \, p_i}{\sum_{i=1}^{N} c_i}$$

[0039] where $P_{\text{init}}$ represents the initial three-dimensional skeleton key point coordinates, $N$ represents the total number of historical moments corresponding to the temporal sliding window, $c_i$ represents the initial confidence level of the target skeleton key point at the $i$-th moment, and $p_i$ represents the three-dimensional coordinates of the target skeleton key point at the $i$-th moment. The initial three-dimensional skeleton key point coordinates are then optimized to obtain the final three-dimensional skeleton key point coordinate sequence.
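A confidence-weighted average over the sliding window might look like the sketch below. It assumes the weights are normalized by the sum of the confidences within the window and falls back to a plain mean when all confidences are zero; both choices, along with the name `fuse_keypoint`, are assumptions for illustration.

```python
import numpy as np

def fuse_keypoint(coords, confidences, eps=1e-8):
    """Fuse the 3D positions of one key point over a temporal sliding window.

    coords:      (T, 3) array-like, one 3D position per historical moment
    confidences: (T,) array-like, initial confidence at each moment
    """
    coords = np.asarray(coords, dtype=np.float64)
    weights = np.asarray(confidences, dtype=np.float64)
    total = weights.sum()
    if total < eps:
        # Assumed fallback: with no usable confidence, use an unweighted mean.
        return coords.mean(axis=0)
    # Weighted average: low-confidence moments contribute less.
    return (weights[:, None] * coords).sum(axis=0) / total

window = [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0], [4.0, 0.0, 2.0]]
fused = fuse_keypoint(window, [0.5, 0.5, 1.0])
```

In this example the third moment carries twice the weight of each of the others, so the fused x-coordinate is pulled toward 4.0.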

[0040] The temporal sliding window refers to a dynamic data structure used to store and process the initial confidence level and the target skeletal keypoints. The set of keypoints to be fused refers to a subset of data extracted from the temporal sliding window during the weighted fusion operation. This set contains the target skeletal keypoints and their 3D coordinates at the current moment. The initial 3D skeletal keypoint coordinates refer to the 3D coordinate values obtained after preliminary fusion of the 3D coordinates in the set of keypoints to be fused using the weighted average algorithm. The 3D skeletal keypoint coordinate sequence refers to a data sequence containing multiple 3D skeletal keypoint coordinates arranged in chronological order.

[0041] It should be noted that the optimization of the initial 3D skeletal keypoint coordinates to obtain the final 3D skeletal keypoint coordinate sequence is achieved by smoothing the initial 3D skeletal keypoint coordinates using a smoothing factor based on human kinematic constraints. This smoothing factor is calculated by taking the difference between the predicted length and the standard length of each adjacent bone of the 3D skeletal keypoint.

[0042] S4. Based on the three-dimensional skeleton key point coordinate sequence, calculate the absolute distance between the user and the robot dog, construct a dynamic scale factor based on the preset action recognition network and the absolute distance, and perform spatial normalization processing on the three-dimensional skeleton key point coordinates to obtain normalized coordinates.

[0043] The present invention calculates the absolute distance between the user and the robot dog based on the coordinate sequence of the three-dimensional skeleton key points, which serves as the basis for constructing the dynamic scale factor in the later stage.

[0044] Specifically, calculating the absolute distance between the user and the robot dog based on the coordinate sequence of the three-dimensional skeletal key points includes: Select preset human torso key points from the three-dimensional skeletal key point coordinate sequence; Calculate the average three-dimensional coordinates of the key points of the human torso in the camera coordinate system, and determine the average three-dimensional coordinates as the center point coordinates of the user center point; Calculate the Euclidean distance between the center point coordinates and the origin of the camera coordinate system, and determine the Euclidean distance as the absolute distance between the user and the robot dog.

[0045] The human torso key points refer to a set of specific joints that represent the user's position and have relatively small range of motion during interaction, including at least two key points from the left shoulder, right shoulder, left hip, and right hip. The camera coordinate system refers to a three-dimensional rectangular coordinate system established with the optical center of the binocular vision module as the origin, the optical axis as the Z-axis, the horizontal direction of the imaging plane as the X-axis, and the vertical direction as the Y-axis. The three-dimensional coordinate average value refers to the set of values obtained by arithmetically averaging the coordinate values of all selected human torso key points in the X-axis, Y-axis, and Z-axis directions. The center point coordinates refer to the coordinate point generated based on the three-dimensional coordinate average value, used to represent the overall geometric center position of the user in three-dimensional space. The Euclidean distance refers to the straight line length connecting the center point coordinates and the origin of the camera coordinate system. The absolute distance is used to reflect the actual distance between the user and the robot dog.
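The three steps above (select torso key points, average them, take the Euclidean distance to the camera origin) can be sketched as follows. The joint names, the dictionary layout, and the function name `absolute_distance` are hypothetical; coordinates are assumed to be expressed in the camera coordinate system in meters.

```python
import numpy as np

# Hypothetical joint names; the text only requires at least two of these four.
TORSO_JOINTS = ("left_shoulder", "right_shoulder", "left_hip", "right_hip")

def absolute_distance(keypoints, torso_joints=TORSO_JOINTS):
    """Return (center_point, distance) for one frame.

    keypoints: dict mapping joint name -> (x, y, z) in the camera frame.
    """
    pts = np.array(
        [keypoints[name] for name in torso_joints if name in keypoints],
        dtype=np.float64,
    )
    if len(pts) < 2:
        raise ValueError("at least two torso key points are required")
    # Arithmetic mean along each axis gives the user's center point.
    center = pts.mean(axis=0)
    # The camera origin is (0, 0, 0), so the distance is the vector norm.
    return center, float(np.linalg.norm(center))

frame = {
    "left_shoulder": (-0.2, -0.3, 2.0),
    "right_shoulder": (0.2, -0.3, 2.0),
    "left_hip": (-0.15, 0.3, 2.0),
    "right_hip": (0.15, 0.3, 2.0),
}
center, distance = absolute_distance(frame)
```

Here the torso is centered on the optical axis two meters in front of the camera, so the distance equals the Z-coordinate of the center point.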

[0046] This invention, based on a pre-set action recognition network and the absolute distance, constructs a dynamic scale factor to eliminate the influence of changes in the distance between the user and the robot dog on action representation, and unifies the three-dimensional coordinates at different distances into the same standard space.

[0047] Specifically, the construction of the dynamic scale factor based on the pre-set action recognition network and the absolute distance includes: Obtain depth distribution data corresponding to multiple samples in the training dataset of the action recognition network, perform statistical analysis on the depth distribution data, and obtain the average depth value; The average depth value is converted into the physical distance unit of the camera to obtain the standard reference distance; The ratio of the standard reference distance to the absolute distance is calculated to generate the dynamic scale factor.

[0048] The action recognition network refers to a neural network model that receives skeletal keypoint coordinates as input and extracts spatiotemporal features through deep learning algorithms to identify human action categories. Specifically, the construction of the action recognition network includes: first, collecting sample data to build a training set and statistically obtaining a standard reference distance; then, building a neural network architecture integrating a motion feature enhancement layer, which is used to calculate and non-linearly amplify the keypoint displacement vector field; next, using the standard reference distance to preprocess the data and supervise the training of the network to optimize parameters; and finally, by setting a confidence threshold, constructing an action recognition network that can output action confidence and trigger control commands. The training dataset refers to the set of sample data and their corresponding action labels used in constructing the action recognition network, containing a large number of different human postures, viewpoints, and distance conditions. The depth distribution data refers to the set of values extracted from the training dataset reflecting the distance between the user and the acquisition device in all samples. The depth average value refers to the value obtained by performing an arithmetic average operation on the depth distribution data. The standard reference distance is used to define the optimal interaction distance. The dynamic scale factor is a coefficient used to scale and map the 3D skeletal keypoint coordinates at the actual distance to the scale space corresponding to the standard reference distance, thereby eliminating the influence of distance changes on action recognition.

[0049] It should be explained that the spatial normalization processing of the three-dimensional skeleton key point coordinates to obtain normalized coordinates includes: determining whether the absolute distance is within the preset effective interaction distance range; if the absolute distance is within the effective interaction distance range, multiplying the three-dimensional skeleton key point coordinates by the dynamic scale factor to obtain the normalized coordinates; and if the absolute distance is outside the effective interaction distance range, limiting the dynamic scale factor to a preset fixed extreme value, so as to prevent the coordinate values from being distorted when the user and the robot dog are too close together or too far apart.

[0050] The effective interaction distance range refers to a range of distance values pre-set based on the imaging clarity range of the binocular vision module, baseline distance limitations, and the sensitivity of the action recognition network to the input coordinate scale. The fixed extreme value refers to the maximum scaling factor pre-set to forcibly replace the dynamic scale factor when the absolute distance exceeds the effective interaction distance range, in order to prevent overflow or severe distortion of the normalized coordinate values due to excessively large or small dynamic scale factors. The normalized coordinates refer to the 3D coordinate data obtained by multiplying the original 3D skeletal keypoint coordinates by the dynamic scale factor, eliminating the perspective projection size differences caused by changes in the relative distance between the user and the robot dog.
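The scale factor and the range check can be sketched together as below. One possible reading of the "fixed extreme value" is used here: the factor is held at the value it takes at the nearest boundary of the effective interaction range. That reading, the unit conversion factor, and all function and parameter names are assumptions.

```python
import numpy as np

def standard_reference_distance(sample_depths, unit_scale=0.001):
    """Average depth of the training samples, converted to camera units.

    unit_scale is an assumed conversion, e.g. millimeters -> meters.
    """
    return float(np.mean(sample_depths)) * unit_scale

def dynamic_scale_factor(abs_distance, ref_distance, d_min, d_max):
    """Ratio of reference distance to absolute distance, limited at the
    boundaries of the effective interaction range [d_min, d_max]."""
    if abs_distance <= 0:
        raise ValueError("absolute distance must be positive")
    # Outside the range, reuse the factor of the nearest boundary
    # (assumed interpretation of the preset fixed extreme value).
    clamped = min(max(abs_distance, d_min), d_max)
    return ref_distance / clamped

def normalize_coordinates(coords, abs_distance, ref_distance, d_min, d_max):
    """Multiply every 3D key point coordinate by the dynamic scale factor."""
    scale = dynamic_scale_factor(abs_distance, ref_distance, d_min, d_max)
    return np.asarray(coords, dtype=np.float64) * scale

ref = standard_reference_distance([1500, 2000, 2500])  # depths in mm
normalized = normalize_coordinates([[0.4, 0.8, 4.0]], 4.0, ref, 0.5, 5.0)
```

A user standing at twice the reference distance is scaled by one half, which maps the skeleton back to the scale the recognition network saw during training.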

[0051] S5. Input the normalized coordinates into the action recognition network, calculate the displacement vector field of the target skeleton key points through the motion feature enhancement layer of the action recognition network, and nonlinearly amplify the displacement vector field to output the category confidence of the user's action category. When the category confidence is greater than the preset confidence threshold, generate and execute the corresponding motion control command.

[0052] This invention inputs the normalized coordinates into the action recognition network, and calculates the displacement vector field of the target skeleton key points through the motion feature enhancement layer of the action recognition network. This can more accurately reflect the dynamic process and trend of the user's actions, providing high-quality feature input for the subsequent category confidence calculation of the network, thereby improving the overall accuracy and real-time performance of action recognition.

[0053] Specifically, the step of inputting the normalized coordinates into the action recognition network and calculating the displacement vector field of the target skeletal key points through the motion feature enhancement layer of the action recognition network includes: The normalized coordinates within a preset time window are convolved by the temporal convolution unit in the motion feature enhancement layer to extract the three-dimensional coordinate difference features of the same target bone key points between adjacent frames. The graph convolutional units in the motion feature enhancement layer, combined with the topological connectivity of the human skeleton, perform graph structure mapping on the three-dimensional coordinate difference features to generate skeletal motion vectors. The temporal feature aggregation module in the motion feature enhancement layer performs weighted processing on the skeletal motion vector along the time dimension, and outputs the displacement vector field that represents the spatiotemporal motion trend.

[0054] The temporal convolutional unit is used to capture the local dependencies of skeletal keypoints in the time series. The three-dimensional coordinate difference feature quantifies the displacement velocity and direction of skeletal keypoints in a very short time. The graph convolutional unit is a unit that can aggregate the feature information of neighboring nodes according to a preset skeletal connection structure. The human skeleton is usually abstracted in computer vision as a skeleton model containing a specific number of joints to represent the posture of the human body. The topological connection relationship refers to the natural physical connection between the joints in the human skeleton, such as "wrist" connecting "elbow" and "elbow" connecting "shoulder". The skeletal motion vector is a vector representation that combines the motion of the joint itself and the motion information of its neighboring joints after being processed by the graph convolutional unit. The temporal feature aggregation module is a feature processing component at the back end of the motion feature enhancement layer, used to integrate the skeletal motion vectors distributed along the time dimension. The displacement vector field represents the overall trend of user actions in the spatiotemporal dimension.

[0055] It should be noted that the temporal feature aggregation module aims to weight and fuse motion features at different times into a fixed-dimensional feature representation.
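The data flow through the motion feature enhancement layer can be illustrated with a fixed, non-learned stand-in. In the sketch below, frame-to-frame differences replace the temporal convolution, averaging over skeleton neighbors replaces the graph convolution, and a weighted sum over time replaces the temporal feature aggregation module. A trained network would use learned weights at each stage; the function name, the edge list format, and the uniform default weights are assumptions.

```python
import numpy as np

def displacement_field(sequence, edges, time_weights=None):
    """Non-learned sketch of the three stages of the enhancement layer.

    sequence:     (T, J, 3) normalized coordinates, T frames and J joints
    edges:        list of (i, j) joint index pairs describing the skeleton
    time_weights: optional (T-1,) weights for the temporal aggregation
    """
    seq = np.asarray(sequence, dtype=np.float64)
    n_frames, n_joints, _ = seq.shape
    # Stage 1: coordinate difference of each joint between adjacent frames.
    diffs = seq[1:] - seq[:-1]                      # (T-1, J, 3)
    # Stage 2: mix each joint with its skeleton neighbors using a
    # row-normalized adjacency matrix with self-connections.
    adj = np.eye(n_joints)
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    adj /= adj.sum(axis=1, keepdims=True)
    motion = np.einsum("ij,tjc->tic", adj, diffs)   # (T-1, J, 3)
    # Stage 3: weighted aggregation along time into a fixed-size output.
    if time_weights is None:
        time_weights = np.full(n_frames - 1, 1.0 / (n_frames - 1))
    weights = np.asarray(time_weights, dtype=np.float64)
    return np.einsum("t,tjc->jc", weights, motion)  # (J, 3)

# Three joints (shoulder-elbow-wrist) all moving 0.1 along X per frame.
frames = np.zeros((3, 3, 3))
frames[:, :, 0] = np.array([0.0, 0.1, 0.2])[:, None]
field = displacement_field(frames, edges=[(0, 1), (1, 2)])
```

Regardless of the number of input frames, the output has one vector per joint, which matches the fixed-dimensional representation described for the aggregation module.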

[0056] This invention nonlinearly amplifies the displacement vector field to output the category confidence score of the user's action, which significantly enhances weak feature signals in distant or minute movements, solving the problem that the movement amplitude is too small to be recognized in the image due to the distance. The category confidence score refers to the probability value used to quantify whether the user's action belongs to each category in a preset action set.

[0057] In detail, the nonlinear amplification of the displacement vector field to output the category confidence of the user's action category is achieved through a nonlinear activation function.

[0058] Finally, when the category confidence level is greater than a preset confidence threshold, the present invention generates and executes corresponding motion control commands, achieving efficient interaction between the user and the robot dog. Specifically, generating and executing corresponding motion control commands when the category confidence level is greater than the preset confidence threshold includes: if the category confidence level is determined to be greater than the preset confidence threshold, then according to the determined target action category, querying the corresponding control parameters in a preset action-command mapping table, and generating corresponding motion control commands; wherein, the action-command mapping table defines the correspondence between different user actions and robot dog movement behaviors, for example, user actions include waving, forward gestures, and stepping in place. The motion control commands include motion type identifiers: forward, backward, lateral, rotating in place, squatting, standing, running, climbing stairs, dance movements, etc.; target speed parameters: linear velocity and angular velocity; posture control parameters: lying down, raising head, turning sideways, etc.; gait planning parameters: stride length, stride frequency, leg lift height, and foot contact force, etc.
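The threshold check and table lookup can be sketched as follows. The table entries, the velocity values, the threshold of 0.8, and every identifier are hypothetical placeholders; the text does not specify which user action maps to which robot dog behavior.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class MotionCommand:
    motion_type: str          # e.g. "forward", "rotate_in_place"
    linear_velocity: float    # m/s (assumed unit)
    angular_velocity: float   # rad/s (assumed unit)

# Hypothetical action-command mapping table.
ACTION_COMMANDS: Dict[str, MotionCommand] = {
    "wave": MotionCommand("stand", 0.0, 0.0),
    "forward_gesture": MotionCommand("forward", 0.5, 0.0),
    "step_in_place": MotionCommand("rotate_in_place", 0.0, 0.8),
}

def select_command(
    confidences: Dict[str, float],
    threshold: float = 0.8,
    table: Dict[str, MotionCommand] = ACTION_COMMANDS,
) -> Optional[MotionCommand]:
    """Return the command for the most confident action, or None when the
    confidence does not exceed the threshold or the action is unmapped."""
    if not confidences:
        return None
    action = max(confidences, key=confidences.get)
    if confidences[action] <= threshold:
        return None  # not confident enough: issue no command
    return table.get(action)

command = select_command(
    {"wave": 0.10, "forward_gesture": 0.85, "step_in_place": 0.05}
)
```

Returning `None` below the threshold means an ambiguous gesture leaves the robot dog in its current state instead of triggering a possibly wrong movement.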

[0059] This invention employs binocular vision to calculate absolute distance and construct a dynamic scale factor, effectively eliminating the impact of changes in the relative distance between the user and the robot dog on the accuracy of action recognition and solving the problem of difficulty in recognizing subtle movements at long distances. Simultaneously, by combining disparity value filtering and weighted fusion techniques, the robustness and accuracy of the 3D skeletal coordinates are improved. Furthermore, by extracting the displacement vector field through a motion feature enhancement layer and performing nonlinear amplification, weak motion features are significantly enhanced, thereby achieving high-precision real-time perception and interactive control of user actions by the robot dog in complex environments. Therefore, this invention can improve the interaction effect between the robot dog and the user.

[0060] Figure 2 is a functional block diagram of the vision-based robot dog human motion interaction system of the present invention.

[0061] The vision-based robot dog human motion interaction system 200 described in this invention can be installed in an electronic device. Depending on the functions implemented, the vision-based robot dog human motion interaction system may include an environmental image acquisition module 201, a key point determination module 202, a key point coordinate analysis module 203, a coordinate normalization module 204, and a motion control command generation module 205. The module described in this invention can also be called a unit, which refers to a series of computer program segments that can be executed by the processor of an electronic device and can perform a fixed function, and are stored in the memory of the electronic device.

[0062] In this embodiment of the invention, the functions of each module / unit are as follows: The environmental image acquisition module 201 is used to acquire the user's environmental video stream through a binocular vision module mounted on the robot dog's torso and to perform distortion correction on the environmental video stream to obtain an image frame sequence. The key point determination module 202 is used to input the left and right views of the image frame sequence into the pose estimation network for parallel inference to obtain the user's skeletal key points and the corresponding initial confidence scores. The key point coordinate analysis module 203 is used to calculate the disparity values of the skeletal key points between the left and right views, so as to extract the target skeletal key points from the skeletal key points, and to perform weighted fusion of the target skeletal key points with the initial confidence as the weight to generate the user's three-dimensional skeletal key point coordinate sequence. The coordinate normalization module 204 is used to calculate the absolute distance between the user and the robot dog based on the three-dimensional skeleton key point coordinate sequence, and to construct a dynamic scale factor based on the preset action recognition network and the absolute distance so as to perform spatial normalization processing on the three-dimensional skeleton key point coordinates to obtain normalized coordinates. The motion control command generation module 205 is used to input the normalized coordinates into the action recognition network, calculate the displacement vector field of the target skeletal key points through the motion feature enhancement layer of the action recognition network, nonlinearly amplify the displacement vector field to output the category confidence of the user's action category, and, when the category confidence is greater than a preset confidence threshold, generate and execute the corresponding motion control command.

[0063] In detail, when in use, the modules in the vision-based robot dog human motion interaction system 200 described in this embodiment of the invention employ the same technical means as the vision-based robot dog human motion interaction method shown in Figure 1 and can produce the same technical effects, which are not repeated here.

[0064] In one embodiment, a computer device is provided, which may be a server or a client, and its internal structure may be as shown in Figure 3. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile and / or volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The network interface is used to communicate with external clients via a network connection. When the computer program is executed by the processor, it implements the functions or steps of a vision-based robot dog human motion interaction method on the server or client side.

[0065] In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to perform the following steps: The robot dog acquires a video stream of the user's environment by using a binocular vision module mounted on its torso, and performs distortion correction on the video stream to obtain an image frame sequence. The left and right views of the image frame sequence are respectively input into the pose estimation network for parallel inference to obtain the user's skeletal key points and the corresponding initial confidence scores; Calculate the disparity values of the left and right views of the skeletal key points to extract the target skeletal key points of the skeletal key points, and perform weighted fusion of the target skeletal key points with the initial confidence as the weight to generate the user's three-dimensional skeletal key point coordinate sequence. Based on the three-dimensional skeleton key point coordinate sequence, the absolute distance between the user and the robot dog is calculated. A dynamic scale factor is constructed based on the preset action recognition network and the absolute distance to perform spatial normalization processing on the three-dimensional skeleton key point coordinates to obtain normalized coordinates. The normalized coordinates are input into the action recognition network. The motion feature enhancement layer of the action recognition network calculates the displacement vector field of the target skeletal key points and performs nonlinear amplification on the displacement vector field to output the category confidence of the user's action category. When the category confidence is greater than a preset confidence threshold, the corresponding motion control command is generated and executed.

[0066] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, the computer program performing the following steps when executed by a processor: The robot dog acquires a video stream of the user's environment by using a binocular vision module mounted on its torso, and performs distortion correction on the video stream to obtain an image frame sequence. The left and right views of the image frame sequence are respectively input into the pose estimation network for parallel inference to obtain the user's skeletal key points and the corresponding initial confidence scores; Calculate the disparity values of the left and right views of the skeletal key points to extract the target skeletal key points of the skeletal key points, and perform weighted fusion of the target skeletal key points with the initial confidence as the weight to generate the user's three-dimensional skeletal key point coordinate sequence. Based on the three-dimensional skeleton key point coordinate sequence, the absolute distance between the user and the robot dog is calculated. A dynamic scale factor is constructed based on the preset action recognition network and the absolute distance to perform spatial normalization processing on the three-dimensional skeleton key point coordinates to obtain normalized coordinates. The normalized coordinates are input into the action recognition network. The motion feature enhancement layer of the action recognition network calculates the displacement vector field of the target skeletal key points and performs nonlinear amplification on the displacement vector field to output the category confidence of the user's action category. When the category confidence is greater than a preset confidence threshold, the corresponding motion control command is generated and executed.

[0067] It should be noted that the functions or steps that can be implemented by the computer-readable storage medium or computer device described above can be referred to the relevant descriptions on the server side and client side in the foregoing method embodiments. To avoid repetition, they will not be described one by one here.

[0068] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and / or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0069] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

[0070] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the present invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the present invention.

[0071] Finally, it should be noted that in the above embodiments, each embodiment can be combined with each other or independent. Deleting any one of them will not affect the technical implementation of other embodiments. The above embodiments are only used to illustrate the technical solutions of the present invention and not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A vision-based human motion interaction method for a robotic dog, characterized in that the method includes: acquiring, by the robot dog, a video stream of the user's environment using a binocular vision module mounted on its torso, and performing distortion correction on the video stream to obtain an image frame sequence; inputting the left and right views of the image frame sequence respectively into the pose estimation network for parallel inference to obtain the user's skeletal key points and the corresponding initial confidence scores; calculating the disparity values of the skeletal key points between the left and right views to extract the target skeletal key points from the skeletal key points, and performing weighted fusion of the target skeletal key points with the initial confidence as the weight to generate the user's three-dimensional skeletal key point coordinate sequence; calculating the absolute distance between the user and the robot dog based on the three-dimensional skeleton key point coordinate sequence, and constructing a dynamic scale factor based on the preset action recognition network and the absolute distance to perform spatial normalization processing on the three-dimensional skeleton key point coordinates to obtain normalized coordinates; and inputting the normalized coordinates into the action recognition network, calculating the displacement vector field of the target skeletal key points through the motion feature enhancement layer of the action recognition network, nonlinearly amplifying the displacement vector field to output the category confidence of the user's action category, and, when the category confidence is greater than a preset confidence threshold, generating and executing the corresponding motion control command.

2. The vision-based human motion interaction method for a robotic dog as described in claim 1, characterized in that, The left and right views of the image frame sequence are respectively input into the pose estimation network for parallel inference to obtain the user's skeletal keypoints and corresponding initial confidence scores, including: The high-dimensional semantic features of the left and right views are extracted using the feature extraction backbone network in the pose estimation network. The high-dimensional semantic features are input into the spatial attention module of the pose estimation network to generate a first attention mask for the user's human torso region. The high-dimensional semantic features are then weighted using the first attention mask to obtain weighted high-dimensional semantic features. The weighted high-dimensional semantic features are concatenated by the multi-scale feature fusion layer in the pose estimation network to generate the human body heatmap features of the left and right views; Using the aforementioned human body heatmap features, the user's skeletal key points and corresponding initial confidence levels are analyzed.

3. The vision-based human motion interaction method for a robotic dog as described in claim 2, characterized in that, Using the aforementioned human body heatmap features, the user's skeletal key points and corresponding initial confidence levels are analyzed, including: Based on the human body heatmap features, peak coordinates are located in a preset set of skeletal key point categories using a non-maximum suppression algorithm, and the peak coordinates are mapped back to the original image coordinate system to obtain the user's skeletal key points. The response values of the skeletal key points in the human body heat map are read, and the response values are normalized to obtain the initial confidence level of the skeletal key points.

4. The vision-based human motion interaction method for a robotic dog as described in claim 1, characterized in that, The binocular vision module includes a left camera and a right camera. The left camera and the right camera are arranged horizontally and parallel to each other. The optical axis distance between the left camera and the right camera is less than the width of the robot dog's torso, and the shooting fields of the left camera and the right camera form an overlapping field of view in front of the robot dog.

5. The vision-based human motion interaction method for a robot dog according to claim 1, characterized in that calculating the disparity values of the skeletal key points between the left and right views comprises:
mapping the skeletal key points of the left view into the right view to generate the corresponding horizontal epipolar lines;
extracting, based on the stereo epipolar geometric constraint, multiple candidate feature points in the right view within a preset search window along the horizontal epipolar line, and calculating the matching cost between each candidate feature point in the right view and the skeletal key point in the left view;
selecting the candidate feature point with the smallest matching cost as the final matching point in the right view; and
calculating the absolute difference between the x-coordinate of the skeletal key point in the left view and the x-coordinate of the matching point in the right view, and using the absolute difference as the disparity value of the skeletal key point.
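
A minimal sketch of the epipolar search in claim 5 is given below, assuming rectified grayscale views stored as nested lists so that the epipolar line of a left-view key point is the same image row in the right view. The sum of absolute differences over a small patch is used as the matching cost; the claims do not name a particular cost, so that choice, and the parameters `max_disparity` and `half`, are assumptions.

```python
from typing import List

Image = List[List[float]]


def patch_sad(left: Image, right: Image, y: int, x_left: int, x_right: int, half: int) -> float:
    """Sum of absolute differences between two square patches on the same row."""
    cost = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            cost += abs(left[y + dy][x_left + dx] - right[y + dy][x_right + dx])
    return cost


def keypoint_disparity(
    left: Image, right: Image, x_left: int, y: int, max_disparity: int, half: int = 1
) -> int:
    """Return |x_left - x_right| for the lowest-cost match along the epipolar row."""
    height, width = len(left), len(left[0])
    if not (half <= y < height - half and half <= x_left < width - half):
        raise ValueError("key point too close to the image border for this patch size")
    # Search window: candidates to the left of x_left, at most max_disparity pixels away.
    lo = max(half, x_left - max_disparity)
    hi = min(width - 1 - half, x_left)
    best_x, best_cost = None, float("inf")
    for x_right in range(lo, hi + 1):
        cost = patch_sad(left, right, y, x_left, x_right, half)
        if cost < best_cost:
            best_x, best_cost = x_right, cost
    if best_x is None:
        raise ValueError("empty search window")
    return abs(x_left - best_x)
```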

6. The vision-based human motion interaction method for a robot dog according to claim 1, characterized in that performing weighted fusion of the target skeletal key points with the initial confidence as the weight to generate the user's three-dimensional skeletal key point coordinate sequence comprises:
inputting the initial confidence and the target skeletal key points into a preset temporal sliding window to obtain a set of key points to be fused;
fusing, based on the initial confidence distribution, the three-dimensional coordinates of the set of key points to be fused using a weighted average algorithm to generate the initial three-dimensional skeletal key point coordinates, the weighted average algorithm being calculated as:

$$P_{\text{init}} = \frac{\sum_{t=1}^{T} c_t \, P_t}{\sum_{t=1}^{T} c_t}$$

where $P_{\text{init}}$ denotes the initial three-dimensional skeletal key point coordinates, $T$ denotes the total number of historical moments covered by the temporal sliding window, $c_t$ denotes the initial confidence of the target skeletal key point at the $t$-th moment, and $P_t$ denotes the three-dimensional coordinates of the target skeletal key point at the $t$-th moment; and
optimizing the initial three-dimensional skeletal key point coordinates to obtain the final three-dimensional skeletal key point coordinate sequence.
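
The confidence-weighted average over a temporal sliding window in claim 6 might look like the following sketch for a single key point. The window length of 5 and the name `fuse_keypoint` are illustrative assumptions; the subsequent optimization step is not specified in the claim and is not shown.

```python
from collections import deque
from typing import Deque, Iterable, Tuple

Point3D = Tuple[float, float, float]
Observation = Tuple[float, Point3D]  # (initial confidence, 3D coordinates)


def fuse_keypoint(window: Iterable[Observation]) -> Point3D:
    """Confidence-weighted average of one key point over the sliding window."""
    observations = list(window)
    total = sum(conf for conf, _ in observations)
    if total <= 0:
        raise ValueError("window has no observation with positive confidence")
    x, y, z = (
        sum(conf * point[axis] for conf, point in observations) / total
        for axis in range(3)
    )
    return (x, y, z)


# A bounded deque drops the oldest observation automatically (window length assumed).
window: Deque[Observation] = deque(maxlen=5)
window.append((0.9, (1.0, 2.0, 3.0)))
window.append((0.1, (2.0, 2.0, 3.0)))
fused = fuse_keypoint(window)
```

Low-confidence observations, for example from a briefly occluded joint, contribute little to the fused position.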

7. The vision-based human motion interaction method for a robot dog according to claim 1, characterized in that calculating the absolute distance between the user and the robot dog based on the three-dimensional skeletal key point coordinate sequence comprises:
selecting preset human torso key points from the three-dimensional skeletal key point coordinate sequence;
calculating the average three-dimensional coordinates of the human torso key points in the camera coordinate system, and taking the average three-dimensional coordinates as the coordinates of the user's center point; and
calculating the Euclidean distance between the center point coordinates and the origin of the camera coordinate system, and taking the Euclidean distance as the absolute distance between the user and the robot dog.
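
The distance computation in claim 7 reduces to a centroid followed by a Euclidean norm. A short sketch, assuming the torso key points are already expressed in metres in the camera coordinate system (the function name and units are assumptions):

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]


def absolute_distance(torso_points: Sequence[Point3D]) -> float:
    """Distance from the camera origin to the mean of the torso key points."""
    if not torso_points:
        raise ValueError("at least one torso key point is required")
    count = len(torso_points)
    center = tuple(sum(p[axis] for p in torso_points) / count for axis in range(3))
    return math.dist(center, (0.0, 0.0, 0.0))


distance = absolute_distance([(2.0, 0.0, 4.0), (4.0, 0.0, 4.0)])
```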

8. The vision-based human motion interaction method for a robot dog according to claim 1, characterized in that constructing the dynamic scale factor based on the preset action recognition network and the absolute distance comprises:
obtaining depth distribution data corresponding to multiple samples in the training dataset of the action recognition network, and performing statistical analysis on the depth distribution data to obtain an average depth value;
converting the average depth value into the physical distance unit of the camera to obtain a standard reference distance; and
calculating the ratio of the standard reference distance to the absolute distance to generate the dynamic scale factor.
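
To make the ratio in claim 8 concrete, the sketch below assumes that training-set depths are recorded in millimetres and that the camera reports distances in metres. Both units, and the conversion constant `mm_per_m`, are assumptions chosen for illustration.

```python
from statistics import mean
from typing import Sequence


def dynamic_scale_factor(
    training_depths_mm: Sequence[float],
    absolute_distance_m: float,
    mm_per_m: float = 1000.0,
) -> float:
    """Ratio of the standard reference distance to the measured absolute distance."""
    if absolute_distance_m <= 0:
        raise ValueError("absolute distance must be positive")
    mean_depth_mm = mean(training_depths_mm)       # statistical analysis: plain mean
    reference_distance_m = mean_depth_mm / mm_per_m  # convert to the camera's unit
    return reference_distance_m / absolute_distance_m


factor = dynamic_scale_factor([1500.0, 2500.0], absolute_distance_m=4.0)
```

With a reference distance of 2 m and a user standing 4 m away, the factor is 0.5; a user standing closer than the reference distance yields a factor above 1.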

9. The vision-based human motion interaction method for a robot dog according to claim 1, characterized in that inputting the normalized coordinates into the action recognition network and calculating the displacement vector field of the target skeletal key points through the motion feature enhancement layer of the action recognition network comprises:
convolving the normalized coordinates within a preset time window using the temporal convolution unit of the motion feature enhancement layer to extract the three-dimensional coordinate difference features of the same target skeletal key point between adjacent frames;
performing, by the graph convolution unit of the motion feature enhancement layer and in combination with the topological connectivity of the human skeleton, graph structure mapping on the three-dimensional coordinate difference features to generate skeletal motion vectors; and
weighting the skeletal motion vectors along the time dimension using the temporal feature aggregation module of the motion feature enhancement layer, and outputting the displacement vector field that represents the spatiotemporal motion trend.
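
The three stages in claim 9 are learned network layers, so the sketch below is only a fixed-weight stand-in that shows the data flow: a temporal kernel of [-1, +1] for the adjacent-frame differences, neighbour averaging over the skeleton edges in place of a trained graph convolution, and a weighted mean over time in place of the temporal feature aggregation module. All function names, the edge list, and the weights are assumptions.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Frame = List[Vec3]  # one 3D coordinate per skeletal key point


def temporal_difference(frames: Sequence[Frame]) -> List[Frame]:
    """Per-key-point coordinate differences between adjacent frames."""
    return [
        [tuple(b[i] - a[i] for i in range(3)) for a, b in zip(prev, cur)]
        for prev, cur in zip(frames, frames[1:])
    ]


def graph_smooth(diffs: Sequence[Frame], edges: Sequence[Tuple[int, int]]) -> List[Frame]:
    """Average each key point's difference with its skeleton neighbours."""
    joints = len(diffs[0])
    neighbours = {j: [j] for j in range(joints)}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    return [
        [
            tuple(sum(frame[k][i] for k in neighbours[j]) / len(neighbours[j]) for i in range(3))
            for j in range(joints)
        ]
        for frame in diffs
    ]


def temporal_aggregate(vectors: Sequence[Frame], weights: Sequence[float]) -> Frame:
    """Weighted mean of the motion vectors along the time dimension."""
    total = sum(weights)
    joints = len(vectors[0])
    return [
        tuple(sum(w * frame[j][i] for w, frame in zip(weights, vectors)) / total for i in range(3))
        for j in range(joints)
    ]


frames = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(2.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
]
motion = graph_smooth(temporal_difference(frames), edges=[(0, 1)])
field = temporal_aggregate(motion, weights=[1.0, 1.0])
```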

10. A vision-based system for human motion interaction with a robot dog, characterized in that the system comprises:
an environmental image acquisition module, configured to acquire an environmental video stream of the user through a binocular vision module mounted on the robot dog's torso, and to perform distortion correction on the environmental video stream to obtain an image frame sequence;
a key point determination module, configured to input the left and right views of the image frame sequence into the pose estimation network for parallel inference to obtain the user's skeletal key points and the corresponding initial confidence;
a key point coordinate analysis module, configured to calculate the disparity values of the skeletal key points between the left and right views to extract target skeletal key points from the skeletal key points, and to perform weighted fusion of the target skeletal key points with the initial confidence as the weight to generate the user's three-dimensional skeletal key point coordinate sequence;
a coordinate normalization module, configured to calculate the absolute distance between the user and the robot dog based on the three-dimensional skeletal key point coordinate sequence, and to construct a dynamic scale factor based on the preset action recognition network and the absolute distance so as to perform spatial normalization on the three-dimensional skeletal key point coordinates to obtain normalized coordinates; and
a motion control command generation module, configured to input the normalized coordinates into the action recognition network, calculate the displacement vector field of the target skeletal key points through the motion feature enhancement layer of the action recognition network, and nonlinearly amplify the displacement vector field to output the category confidence of the user's action category, wherein, when the category confidence is greater than a preset confidence threshold, the corresponding motion control command is generated and executed.
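
The final thresholding step of claim 10 might be sketched as follows. The claim does not specify how the nonlinear amplification produces the category confidence, so a softmax over per-category scores is used here purely as a stand-in. The action names, command names, and the threshold value of 0.8 are hypothetical.

```python
import math
from typing import Dict, Optional

# Hypothetical mapping from recognized action category to motion control command.
COMMANDS: Dict[str, str] = {"wave": "approach_user", "squat": "sit_down"}


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert raw per-category scores into confidences that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(value - peak) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def select_command(scores: Dict[str, float], threshold: float = 0.8) -> Optional[str]:
    """Return a command only when the top category confidence exceeds the threshold."""
    confidences = softmax(scores)
    action = max(confidences, key=confidences.get)
    if confidences[action] > threshold:
        return COMMANDS.get(action)
    return None  # ambiguous gesture: issue no command
```

Returning no command for low-confidence predictions keeps the robot dog from reacting to ambiguous or partially observed gestures.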