Robot motion control method and apparatus, electronic device, and readable storage medium

By generating motion control signals through feature extraction and prediction networks based on channel attention mechanisms, the problem of mismatch between robot actions and environment is solved, and the accuracy and adaptability of robot motion control in complex environments are improved.

WO2026123416A1 | PCT designated stage | Publication Date: 2026-06-18 | UBTECH ROBOTICS CORP LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
UBTECH ROBOTICS CORP LTD
Filing Date
2024-12-28
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

In existing technologies, robot movements do not match the actual environment, resulting in poor motion performance.

Method used

A pre-trained target feature extraction network based on a channel attention mechanism extracts environmental features from the image. These features are input, together with the joint position state, into a prediction network to generate an initial motion control signal, which is used to control the robot's movement.

Benefits of technology

It improves the accuracy and adaptability of robots' visual perception and motion control in complex environments, reduces mismatches between actions and the actual environment, and enables precise execution of motion tasks.

Abstract

A robot motion control method and apparatus, an electronic device, and a readable storage medium. The method comprises: obtaining a first environment image currently collected by a target robot and a current first joint position state of the target robot; using a target feature extraction network based on a channel attention mechanism to perform environment feature extraction on the first environment image, to obtain a first environment feature; inputting the first environment feature and the first joint position state into a prediction network, to obtain an initial motion control signal, the prediction network being obtained by performing training on the basis of sample environment features, sample joint position states, and sample motion control signals respectively corresponding to a plurality of moments; and controlling the motion of the target robot on the basis of the initial motion control signal. In this way, an important feature in a visual input is accurately extracted, and is then combined with a joint position state to generate a signal for performing motion control on a robot, thereby reducing situations where actions of the robot do not match actual environments.

Description

Robot motion control methods, devices, electronic devices and readable storage media

[0001] This application claims priority to Chinese Patent Application No. 202411804453.X, filed on December 9, 2024, entitled "Robot Motion Control Method, Apparatus, Electronic Device and Readable Storage Medium", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of robotics, and more specifically, to a robot motion control method, device, electronic device, and readable storage medium.

Background Technology

[0003] Currently, robots are generally controlled through program instructions that specify fixed actions. However, because the robot's actions are fixed, this can lead to a mismatch between the robot's movements and the actual environment, resulting in poor robot performance.

Technical Problem

[0004] This application provides a robot motion control method, device, electronic device, and readable storage medium, which can accurately extract important features from visual input and then combine them with joint position states to generate signals for motion control of the robot, thereby reducing the mismatch between the robot's actions and the actual environment.

Technical Solution

[0005] The embodiments of this application can be implemented as follows:

[0006] In a first aspect, embodiments of this application provide a robot motion control method, the method comprising:

[0007] Obtain the first environmental image currently acquired by the target robot and the current first joint position state of the target robot;

[0008] The first environmental features are obtained by using a pre-trained target feature extraction network based on channel attention mechanism to extract environmental features from the first environmental image.

[0009] The first environmental feature and the first joint position state are input into the prediction network to obtain the initial motion control signal. The prediction network is trained based on the sample environmental features, sample joint position states and sample motion control signals corresponding to multiple time points. The sample environmental features and joint position states are used as samples, and the sample motion control signals are used as the labels corresponding to the samples.

[0010] The target robot is controlled to move according to the initial motion control signal.

[0011] In a second aspect, embodiments of this application provide a robot motion control device, the device comprising:

[0012] The acquisition module is used to obtain the first environmental image currently acquired by the target robot and the current first joint position state of the target robot;

[0013] The processing module is used to extract environmental features from the first environmental image using a pre-trained target feature extraction network based on channel attention mechanism, so as to obtain the first environmental features;

[0014] The processing module is further configured to input the first environmental features and the first joint position state into the prediction network to obtain an initial motion control signal. The prediction network is trained based on the sample environmental features, sample joint position state and sample motion control signal corresponding to multiple time points. The sample environmental features and joint position state are used as samples, and the sample motion control signal is used as the label corresponding to the sample.

[0015] The control module is used to control the movement of the target robot according to the initial motion control signal.

[0016] In a third aspect, embodiments of this application provide an electronic device, including a processor and a memory, wherein the memory stores machine-executable instructions that can be executed by the processor, and the processor can execute the machine-executable instructions to implement the robot motion control method described in the foregoing embodiments.

[0017] In a fourth aspect, embodiments of this application provide a readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the robot motion control method as described in the foregoing embodiments.

Beneficial Effects

[0018] The robot motion control method, apparatus, electronic device, and readable storage medium provided in this application first obtain a first environmental image currently acquired by the target robot and the current first joint position state of the target robot. Then, a pre-trained target feature extraction network based on a channel attention mechanism is used to extract environmental features from the first environmental image to obtain first environmental features. Next, the first environmental features and the first joint position state are input into a prediction network to obtain an initial motion control signal. The prediction network is trained based on sample environmental features, sample joint position states, and sample motion control signals corresponding to multiple time points. The sample environmental features and sample joint position states are used as samples, and the sample motion control signals are used as the labels corresponding to the samples. Finally, the target robot is controlled to move according to the initial motion control signal. In this way, the target feature extraction network can accurately extract important features from the visual input, and the prediction network then combines them with the joint position state to generate a signal for motion control of the robot. This reduces the mismatch between the robot's actions and the actual environment and helps drive the robot to perform precise motion tasks.

Brief Description of the Drawings

[0019] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 is a block diagram of the electronic device provided in an embodiment of this application;

[0021] Figure 2 is a first schematic flowchart of the robot motion control method provided in the embodiments of this application;

[0022] Figure 3 is a schematic diagram of the principle of the robot motion control method provided in the embodiment of this application;

[0023] Figure 4 is a second schematic flowchart of the robot motion control method provided in the embodiments of this application;

[0024] Figure 5 is a first block diagram of the robot motion control device provided in the embodiments of this application;

[0025] Figure 6 is a second block diagram of the robot motion control device provided in the embodiments of this application.

[0026] Reference numerals: 100 - Electronic device; 110 - Memory; 120 - Processor; 130 - Communication unit; 200 - Robot motion control device; 201 - Training module; 210 - Acquisition module; 220 - Processing module; 230 - Control module.

Embodiments of the Present Invention

[0027] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. The components of the embodiments of this application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.

[0028] Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0029] It should be noted that relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0030] The following detailed description of some embodiments of this application is provided in conjunction with the accompanying drawings. Unless otherwise specified, the following embodiments and features can be combined with each other.

[0031] Please refer to Figure 1, which is a block diagram of an electronic device 100 provided in an embodiment of this application. The electronic device 100 may be, but is not limited to, a server, a robot, etc. The electronic device 100 includes a memory 110, a processor 120, and a communication unit 130. The memory 110, processor 120, and communication unit 130 are electrically connected to each other directly or indirectly to achieve data transmission or interaction. For example, these components can be electrically connected to each other through one or more communication buses or signal lines.

[0032] The memory 110 is used to store programs or data. The memory 110 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), etc.

[0033] The processor 120 is used to read / write data or programs stored in the memory 110 and execute corresponding functions. For example, the memory 110 stores a robot motion control device 200, which includes at least one software function module that can be stored in the memory 110 in the form of software or firmware. The processor 120 executes various functional applications and data processing by running the software programs and modules stored in the memory 110, such as the robot motion control device 200 in this embodiment, thereby realizing the robot motion control method in this embodiment.

[0034] The communication unit 130 is used to establish a communication connection between the electronic device 100 and other communication terminals through the network, and to send and receive data through the network.

[0035] It should be understood that the structure shown in Figure 1 is only a schematic diagram of the electronic device 100. The electronic device 100 may also include more or fewer components than shown in Figure 1, or have a different configuration than shown in Figure 1. The components shown in Figure 1 may be implemented using hardware, software, or a combination thereof.

[0036] Please refer to Figure 2, which is a first schematic flowchart of the robot motion control method provided in an embodiment of this application. The method is applied to the aforementioned electronic device 100. The specific flow of the robot motion control method is described in detail below. In this embodiment, the method may include steps S110 to S140.

[0037] Step S110: Obtain the first environmental image currently acquired by the target robot and the current first joint position state of the target robot.

[0038] In this embodiment, the target robot is the robot to which the present motion control is applied. The target robot may include a camera; the specific installation location and number of cameras can be determined based on actual needs and are not specifically limited here. The image currently captured by the camera can be used as the first environmental image currently acquired by the target robot. The current joint position state of the target robot can also be obtained as the first joint position state. The first joint position state indicates the current position of each joint of the target robot (i.e., the angle of the motor at each joint). Together, the first environmental image and the first joint position state describe the environment and the joint state of the target robot at the same moment.

[0039] Step S120: Use a pre-trained target feature extraction network based on channel attention mechanism to extract environmental features from the first environmental image to obtain the first environmental features.

[0040] Step S130: Input the first environmental features and the first joint position state into the prediction network to obtain the initial motion control signal.

[0041] In this embodiment, the inventors of this application found that if the first environmental image and the first joint position state are input directly into the prediction network to predict the control signal, subtle features of a complex environment may not be extracted accurately, which can lead to a poor predicted control signal. The channel attention mechanism, by adjusting the importance of each channel, improves the ability to focus on key features, enabling the network to better process the channel-level information of the visual input.

[0042] Based on the above considerations, in this embodiment, upon obtaining the first environmental image, the first environmental image is input into a pre-trained target feature extraction network using a channel attention mechanism, and the output of the target feature extraction network based on the first environmental image is used as the first environmental feature. The specific target feature extraction network can be determined according to actual needs and is not specifically limited here.

[0043] Having obtained the first environmental feature, the first environmental feature and the first joint position state can be input together into the prediction network, and the initial motion control signal can be obtained from the output of the prediction network for the first environmental feature and the first joint position state. The prediction network is trained based on sample environmental features, sample joint position states, and sample motion control signals corresponding to multiple time points, with the sample environmental features and joint position states serving as samples and the sample motion control signals serving as labels for the samples. That is, the prediction network is trained based on sample environmental features and joint position states as samples and sample motion control signals as labels for the samples, using these parameters at multiple time points.

[0044] Step S140: Control the target robot to move according to the initial motion control signal.

[0045] In this embodiment, upon obtaining the initial motion control signal, it can be directly used as the signal for control during operation, controlling the target robot's movement so that the joint position state of the target robot at the next moment is the joint position state represented by the initial motion control signal. Alternatively, the initial motion control signal can be adjusted, and the adjusted signal can be used as the signal for control during operation, controlling the target robot's movement so that the joint position state of the target robot at the next moment is the joint position state represented by the adjusted initial motion control signal. The specific method of controlling the target robot's movement based on the initial motion control signal can be determined according to actual needs and is not specifically limited here.
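Steps S110 to S140 can be sketched as a single control cycle. The following is only an illustrative sketch: the function name control_step and all parameter names are hypothetical, and the feature extraction network and prediction network are passed in as opaque callables rather than implemented.

```python
from typing import Callable, List

Image = List[List[float]]      # hypothetical image type: rows of pixel values
JointState = List[float]       # one position (motor angle) per joint


def control_step(
    capture_image: Callable[[], Image],
    read_joints: Callable[[], JointState],
    extract_features: Callable[[Image], List[float]],
    predict: Callable[[List[float], JointState], JointState],
    send_command: Callable[[JointState], None],
) -> JointState:
    """Run one S110-S140 cycle and return the signal that was sent."""
    image = capture_image()               # S110: first environmental image
    joints = read_joints()                # S110: first joint position state
    features = extract_features(image)    # S120: first environmental feature
    signal = predict(features, joints)    # S130: initial motion control signal
    send_command(signal)                  # S140: control the robot's movement
    return signal
```

Passing the networks in as callables keeps the cycle independent of any particular model, which matches the statement that the specific networks can be chosen according to actual needs.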

[0046] Optionally, as a possible implementation, the target robot's head and / or wrist are equipped with cameras. Optionally, the wrist camera acquires a single image from one viewpoint, or it acquires one image from each of two different viewpoints (i.e., the wrist camera can acquire two images simultaneously). Correspondingly, the image currently acquired by the robot's head camera and / or the image currently acquired by the robot's wrist camera can be used as the first environmental image currently acquired by the target robot; that is, the first environmental image includes image data acquired by the head camera and / or image data acquired by the wrist camera. It is understood that the type of image data corresponding to the sample environmental features used in training the prediction network is the same as the type of image data in the first environmental image. For example, both are image data acquired by the head camera, or both are image data acquired by the wrist camera, or both include image data acquired by the head camera and by the wrist camera. This makes it possible to control the movement of the target robot's robotic arm using the image data acquired by the head camera and / or the wrist camera.

[0047] The current positions of each joint of the target robot can also be obtained in any way to obtain the first joint position state. It is understood that if the target robot includes only one joint, the first joint position state only includes the joint position of that joint; if the target robot includes multiple joints, the first joint position state includes the joint positions of each of the multiple joints. The first environmental image and the first joint position state are used to describe the current environment and the state of the target robot. If it is desired to control the movement of the target robot's robotic arm, the first joint position state may include the joint positions of each joint of the robotic arm.

[0048] Optionally, the target feature extraction network is a SENet network or an ECA-Net network.

[0049] In one example implementation, as shown in Figure 3, the target feature extraction network is a SENet network, i.e., a SENet encoder. The SENet encoder encodes the input image data x (i.e., the first environmental image mentioned above, or Image data in Figure 3) through a channel attention mechanism, generating a feature vector U (i.e., the first environmental feature mentioned above). This feature vector can capture important visual information in the environment. SENet adaptively enhances the weights of important channels through two steps, "Squeeze" and "Excitation," and extracts key visual features through "Re-weighting." Specifically, the Squeeze operation compresses the spatial information of channel features into a global description; the Excitation operation enhances important channel features through adaptive weight adjustment; and the Re-weighting operation uses the learned weights to reweight the feature map, obtaining a new feature map. The SENet encoding process can be represented as $U = f_{tr}(x)$, where $f_{tr}$ denotes the extraction of the environmental feature U by the SENet encoder.
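The three operations described above (Squeeze, Excitation, Re-weighting) can be sketched in plain Python on a small feature map. This is a minimal sketch under stated assumptions, not the network of the application: the function names, the nested-list feature map layout, and the weight matrices w1 and w2 are hypothetical, bias terms are omitted, and in a real network the weights would be learned.

```python
import math
from typing import List, Sequence

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def squeeze(x: FeatureMap) -> List[float]:
    """Squeeze: global average pooling, one descriptor per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def excitation(
    s: Sequence[float],
    w1: Sequence[Sequence[float]],
    w2: Sequence[Sequence[float]],
) -> List[float]:
    """Excitation: FC -> ReLU -> FC -> sigmoid, one gate in (0, 1) per channel."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, s))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [1.0 / (1.0 + math.exp(-t)) for t in logits]


def reweight(x: FeatureMap, gates: Sequence[float]) -> FeatureMap:
    """Re-weighting: scale every value in a channel by that channel's gate."""
    return [[[g * v for v in row] for row in ch] for ch, g in zip(x, gates)]


def se_block(x: FeatureMap, w1, w2) -> FeatureMap:
    """Apply Squeeze, Excitation and Re-weighting in sequence."""
    return reweight(x, excitation(squeeze(x), w1, w2))


# Two 2x2 channels, one hidden unit (illustrative weights, not learned).
x = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[1.0, 1.0]]        # 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]     # 1 hidden unit -> 2 channel gates
out = se_block(x, w1, w2)
```

With these illustrative weights the first channel receives a gate above 0.5 and the second a gate below 0.5, so the first channel is emphasized relative to the second in the re-weighted map.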

[0050] In this embodiment, the prediction network can be specifically determined based on actual needs. As one possible implementation, the prediction network is a generative model, and the specific generative model can be determined based on actual needs. The inventors of this application have discovered that a Conditional Variational Autoencoder (CVAE) can generate output data that meets specific requirements based on input conditional information, making it very suitable for robot state estimation and control command generation tasks. Therefore, the prediction network can be a CVAE.

[0051] In this embodiment, the prediction network may include an encoder and a decoder. The first environmental feature and the first joint position state can be input into the encoder to generate latent variables based on the first environmental feature and the first joint position state. Then, the latent variables can be input into the decoder to decode based on the generated latent variables, thereby generating the initial motion control signal.

[0052] As one implementation, as shown in Figure 3, the prediction network is a CVAE network. The first environmental feature U and the first joint position state y (i.e., joint(p) in Figure 3) are input into the CVAE encoder to generate a latent variable z. This latent variable z can be regarded as an abstract representation of the first environmental image and the first joint position state. This process can be expressed by the following formula: $z \sim q(z \mid U, y) = \mathcal{N}(\mu(U, y), \sigma^2(U, y))$, where $\mu(U, y)$ represents the mean calculated by the encoder (i.e., the mean of the latent variable) and $\sigma^2(U, y)$ represents the variance calculated by the encoder (i.e., the variance of the latent variable); together, $\mu(U, y)$ and $\sigma^2(U, y)$ reflect the latent distribution of the visual features and joint position states. Then, the initial motion control signal is generated by decoding the latent variable z with the CVAE decoder. Both the CVAE encoder and the CVAE decoder are Transformer networks.
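Drawing the latent variable z from the Gaussian produced by the encoder can be sketched as follows. The sketch assumes a diagonal Gaussian and uses the common reparameterized form z = mean + sqrt(variance) * eps; the text above does not state how sampling is implemented, and the function name sample_latent is hypothetical. The encoder outputs are passed in as plain lists.

```python
import random
from typing import List, Sequence


def sample_latent(
    mu: Sequence[float], var: Sequence[float], rng: random.Random
) -> List[float]:
    """Draw z ~ N(mu, var) per dimension as z = mu + sqrt(var) * eps, eps ~ N(0, 1)."""
    return [m + (v ** 0.5) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


rng = random.Random(0)                      # fixed seed for repeatability
z = sample_latent([0.0, 1.0], [1.0, 0.25], rng)
```

When the variance is zero the sample equals the mean, which is a quick sanity check on the formula.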

[0053] Optionally, after obtaining the initial motion control signal, to avoid jitter, a target motion control signal can be calculated by smoothing multiple historical motion control signals and the initial motion control signal, and then the target robot's motion can be controlled according to the target motion control signal. The joint position state of the target robot at the next moment is the state indicated by the target motion control signal. In this way, the generated initial motion control signal can be optimized to obtain the target motion control signal, ensuring the motion accuracy and stability of the target robot. Afterwards, the target robot can be driven to perform actions according to the target motion control signal corresponding to the current environment and joint position state.

[0054] Optionally, as shown in Figure 3, a moving average filter can be used for smoothing to obtain the target motion control signal. By using the moving average filter for smoothing, a preset number of recent historical motion control signals and the initial motion control signal can be obtained. The average joint position indicated by these motion control signals is then used as the joint position indicated by the target motion control signal.
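The moving average smoothing described above can be sketched as follows. This is a minimal sketch under one assumption that the text leaves open: the stored history consists of earlier initial motion control signals (not earlier smoothed outputs). The class name MovingAverageFilter and the parameter history_size are hypothetical.

```python
from collections import deque
from typing import Deque, List, Sequence


class MovingAverageFilter:
    """Average the newest signal with up to `history_size` earlier signals."""

    def __init__(self, history_size: int) -> None:
        # The window holds the historical signals plus the newest one.
        self._window: Deque[List[float]] = deque(maxlen=history_size + 1)

    def update(self, initial_signal: Sequence[float]) -> List[float]:
        """Store the new initial signal and return the per-joint average."""
        self._window.append(list(initial_signal))
        n = len(self._window)
        return [
            sum(sig[j] for sig in self._window) / n
            for j in range(len(initial_signal))
        ]


smoother = MovingAverageFilter(history_size=2)
print(smoother.update([0.0, 0.0]))  # [0.0, 0.0]
print(smoother.update([3.0, 6.0]))  # [1.5, 3.0]
print(smoother.update([6.0, 0.0]))  # [3.0, 2.0]
print(smoother.update([0.0, 3.0]))  # [3.0, 3.0] (oldest signal dropped)
```

A larger history_size gives smoother motion at the cost of a slower response to changes in the predicted signal.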

[0055] In this embodiment, SENet can be used to encode the image data currently acquired by the camera, extract important features from the image, and input them along with the current robot joint position state into a CVAE model to generate a latent representation. Then, the CVAE model decodes the latent representation to generate preliminary motion control signals, which are further processed to generate final joint control signals, thereby driving the robot to perform precise motion tasks. This approach is a visual motion control method based on a conditional variational autoencoder (CVAE) and a channel attention mechanism (SENet). Combining SENet's channel attention mechanism with the generation capabilities of CVAE, it is expected to improve the accuracy and adaptability of the robot's visual perception and motion control in complex environments, thereby enhancing the robot's motion control precision in complex environments.

[0056] The aforementioned visual motion control method based on conditional variational autoencoder (CVAE) and channel attention mechanism (SENet) can generate high-precision control commands by enhancing the extraction and analysis of important features of image channels and combining them with robot joint state data. It is suitable for robot motion control in complex environments.

[0057] Please refer to Figure 4, which is a second schematic flowchart of the robot motion control method provided in this embodiment. In this embodiment, before step S110, the method may further include steps S101 to S102.

[0058] Step S101: Obtain multiple sample data.

[0059] Step S102: Train the initial model based on the multiple sample data to obtain the control strategy generation model.

[0060] In this embodiment, a large amount of raw image data captured by the robot's head camera, together with the corresponding raw joint position states, can be collected. This data is then preprocessed to obtain multiple sample data. Each sample data includes a sample environment image (a second environment image), a sample joint position state (a second joint position state), and a sample motion control signal, all corresponding to the same moment. That is, one sample data represents the second environment image, second joint position state, and motion control signal of a robot at time T. The second environment image at time T describes the environment at time T, the second joint position state at time T describes the robot's state at time T, and the motion control signal at time T is the motion control signal the robot will execute at time T. This data can be used to train an initial model to obtain a control strategy generation model, so that the control strategy generation model can generate high-precision motion control signals in different environments. During training, the second environment image and second joint position state in one sample data are used as the sample, and the sample motion control signal in that sample data is used as the label corresponding to the sample. The control strategy generation model includes the target feature extraction network and the prediction network. That is, a control strategy generation model including the target feature extraction network and the prediction network is trained end-to-end based on the above sample data.
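The layout of one sample data item, and its split into a training input and a label, can be sketched as a small record type. The class name Sample, its field names, and the helper to_training_pair are hypothetical; the preprocessing that produces these records is not specified above and is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[float]]  # hypothetical image type: rows of pixel values


@dataclass(frozen=True)
class Sample:
    """One sample data item: everything recorded for a single time T."""
    time_index: int
    environment_image: Image            # second environment image at time T
    joint_position_state: List[float]   # second joint position state at time T
    motion_control_signal: List[float]  # signal the robot executes at time T


def to_training_pair(s: Sample) -> Tuple[Tuple[Image, List[float]], List[float]]:
    """Return ((image, joint state), label): the model input and its label."""
    return (s.environment_image, s.joint_position_state), s.motion_control_signal


sample = Sample(0, [[0.0, 1.0]], [0.1, 0.2], [0.15, 0.25])
inputs, label = to_training_pair(sample)
```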

[0061] In this embodiment, the target feature extraction network is the SENet network, which can be trained to extract important features from the image. SENet compresses the spatial information of each channel into a global feature description through the "Squeeze" operation, and then dynamically adjusts the weights according to the importance of different channels through the "Excitation" operation to optimize the feature extraction capability. Finally, the feature vector U output by SENet is the compressed feature representation of the image.

[0062] In this embodiment, the prediction network is a CVAE model. The CVAE model generates a latent variable z by learning the joint distribution of visual features U and joint position states y, and then generates motion control signals through a decoder. The training objective of the CVAE is to minimize the sum of the reconstruction loss and the KL divergence loss. With the reconstruction loss taken as the negative of the expected log-likelihood term, this is equivalent to maximizing the following objective:

$$\mathcal{L}_{CVAE} = \mathbb{E}_{q(z \mid U, y)}\left[\log p(y \mid z)\right] - D_{KL}\left(q(z \mid U, y) \,\|\, p(z)\right)$$

[0063] Among them, the reconstruction loss is used to ensure that the joint control signal generated by the CVAE decoder is as close as possible to the real joint control signal; the KL divergence loss is used to minimize the difference between the latent variable distribution q(z|U,y) and the prior distribution p(z).

[0064] During training, the reconstruction loss and KL divergence loss can be calculated, and the current initial model can be adjusted based on the obtained reconstruction loss and KL divergence loss to obtain the control strategy generation model.
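The two loss terms can be sketched numerically as follows. Several choices here are assumptions not stated in the text: the reconstruction loss is taken as mean squared error, the prior p(z) is a standard normal, the encoder distribution is a diagonal Gaussian (which gives the closed-form KL term used below), and the two terms are summed with equal weight. All function names are hypothetical.

```python
import math
from typing import Sequence


def reconstruction_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between generated and real joint control signals."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def kl_to_standard_normal(mu: Sequence[float], var: Sequence[float]) -> float:
    """KL( N(mu, var) || N(0, 1) ), summed over independent latent dimensions."""
    return 0.5 * sum(m * m + v - 1.0 - math.log(v) for m, v in zip(mu, var))


def cvae_loss(pred, target, mu, var) -> float:
    """Total training loss: reconstruction term plus KL term."""
    return reconstruction_loss(pred, target) + kl_to_standard_normal(mu, var)


print(reconstruction_loss([1.0, 2.0], [1.0, 4.0]))      # 2.0
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))    # 0.0
print(cvae_loss([1.0, 2.0], [1.0, 4.0], [1.0], [1.0]))  # 2.5
```

The KL term is zero exactly when the encoder distribution already equals the assumed prior, so it only penalizes latent distributions that drift away from it.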

[0065] Joint training can be performed based on the aforementioned sample data. By optimizing the parameters of the SENet and CVAE models, the model can generate high-precision control signals in different environments. SENet is responsible for extracting important features from the image data, while the CVAE model generates motion control signals, ensuring that the robot can make optimal decisions based on the current environment and joint states. Multiple sets of experiments can be conducted to verify the effectiveness and robustness of this method in complex environments. The accuracy and performance of the model can be evaluated by measuring the error (e.g., mean square error) between the generated control signals and the real signals.

[0066] To perform the corresponding steps in the above embodiments and various possible methods, an implementation of a robot motion control device 200 is given below. Optionally, the robot motion control device 200 can adopt the device structure of the electronic device 100 shown in FIG1. ​​Further, please refer to FIG5, which is one of the block diagrams of the robot motion control device 200 provided in the embodiments of this application. It should be noted that the basic principle and technical effects of the robot motion control device 200 provided in this embodiment are the same as those in the above embodiments. For the sake of brevity, parts not mentioned in this embodiment can be referred to the corresponding content in the above embodiments. In this embodiment, the robot motion control device 200 may include: a data acquisition module 210, a processing module 220, and a control module 230.

[0067] The acquisition module 210 is used to obtain the first environmental image currently acquired by the target robot and the current first joint position state of the target robot.

[0068] The processing module 220 is used to extract environmental features from the first environmental image using a pre-trained target feature extraction network based on a channel attention mechanism, so as to obtain the first environmental features.

[0069] The processing module 220 is further configured to input the first environmental features and the first joint position state into the prediction network to obtain an initial motion control signal. The prediction network is trained based on sample environmental features, sample joint position states, and sample motion control signals corresponding to multiple time points, with the sample environmental features and joint position states serving as samples, and the sample motion control signals serving as the labels corresponding to the samples.

[0070] The control module 230 is used to control the movement of the target robot according to the initial motion control signal.
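The division of work among the acquisition module 210, the processing module 220, and the control module 230 can be sketched as a single control step. This is a structural sketch only. The class name, the method name `step`, and the four callables are hypothetical placeholders; in a real device they would wrap the cameras and joint encoders, the pre-trained feature extraction network, the prediction network, and the motor interface.

```python
from typing import Callable, Sequence, Tuple

Image = Sequence[Sequence[float]]   # placeholder image type
Vector = Sequence[float]            # features, joint states, or control signals


class RobotMotionControlDevice:
    """Wires together acquisition, processing, and control (hypothetical API)."""

    def __init__(
        self,
        acquire: Callable[[], Tuple[Image, Vector]],
        extract_features: Callable[[Image], Vector],
        predict: Callable[[Vector, Vector], Vector],
        actuate: Callable[[Vector], None],
    ) -> None:
        self.acquire = acquire                    # acquisition module 210
        self.extract_features = extract_features  # processing module 220, step 1
        self.predict = predict                    # processing module 220, step 2
        self.actuate = actuate                    # control module 230

    def step(self) -> Vector:
        """Run one cycle: image and joint state in, control signal out."""
        image, joint_state = self.acquire()
        features = self.extract_features(image)
        control_signal = self.predict(features, joint_state)
        self.actuate(control_signal)
        return control_signal
```

Passing the networks in as callables keeps the sketch independent of any particular deep learning framework, which the text does not specify.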

[0071] Please refer to FIG. 6, which is the second of the block diagrams of the robot motion control device 200 provided in the embodiments of this application. In this embodiment, the robot motion control device 200 may further include a training module 201.

[0072] The training module 201 is used to: obtain multiple pieces of sample data, wherein each piece of sample data includes a sample environment image, a sample joint position state, and a sample motion control signal corresponding to one moment; and train an initial model based on the multiple pieces of sample data to obtain a control strategy generation model, wherein the control strategy generation model includes the target feature extraction network and the prediction network.
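One way to organize the sample data handled by the training module 201 is sketched below. This is an assumption-laden illustration: the class `SampleData`, its field names, and the helper `split_inputs_and_labels` are hypothetical, and images are represented as nested lists only to keep the example free of external libraries.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SampleData:
    """One piece of sample data recorded at a single moment (hypothetical layout)."""
    environment_image: Sequence[Sequence[float]]
    joint_position_state: Sequence[float]
    motion_control_signal: Sequence[float]


def split_inputs_and_labels(
    samples: Sequence[SampleData],
) -> Tuple[List[tuple], List[Sequence[float]]]:
    """Separate model inputs from labels.

    Inputs are (image, joint state) pairs; labels are the recorded
    motion control signals for the same moments.
    """
    inputs = [(s.environment_image, s.joint_position_state) for s in samples]
    labels = [s.motion_control_signal for s in samples]
    return inputs, labels
```

Because the control strategy generation model contains the feature extraction network, the raw image is kept as the input here; the sample environmental features would then be produced inside the model during training.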

[0073] Optionally, the above-mentioned modules can be stored in the memory 110 shown in FIG. 1 in the form of software or firmware, or embedded in the operating system (OS) of the electronic device 100, and can be executed by the processor 120 in FIG. 1. In addition, the data, program code, etc. required to execute the above-mentioned modules can be stored in the memory 110.

[0074] This application also provides a readable storage medium storing a computer program thereon, which, when executed by a processor, implements the robot motion control method described above.

[0075] In summary, this application provides a robot motion control method, device, electronic device, and readable storage medium. First, a first environmental image and the current first joint position state of the target robot are obtained. Then, a pre-trained target feature extraction network based on a channel attention mechanism is used to extract environmental features from the first environmental image to obtain first environmental features. Next, the first environmental features and the first joint position state are input into a prediction network to obtain an initial motion control signal. The prediction network is trained based on sample environmental features, sample joint position states, and sample motion control signals corresponding to multiple time points. The sample environmental features and joint position states are used as samples, and the sample motion control signals are used as labels corresponding to the samples. Finally, the target robot is controlled to move according to the initial motion control signal. In this way, the target feature extraction network can accurately extract important features from the visual input, and then, combined with the joint position state, a signal for motion control of the robot is generated through the prediction network. This improves the adaptability of the robot's visual perception and motion control strategy in complex environments, reduces the mismatch between the robot's actions and the actual environment, and facilitates driving the robot to perform precise motion tasks.

[0076] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can also be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram and / or flowchart, and combinations of blocks in block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0077] In addition, the functional modules in the various embodiments of this application can be integrated together to form an independent part, or each module can exist independently, or two or more modules can be integrated to form an independent part.

[0078] If the aforementioned functions are implemented as software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes over the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0079] The above description is merely an optional embodiment of this application and is not intended to limit this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

Claims

1. A robot motion control method, characterized in that the method includes: obtaining a first environmental image currently acquired by a target robot and a current first joint position state of the target robot; extracting environmental features from the first environmental image using a pre-trained target feature extraction network based on a channel attention mechanism, to obtain first environmental features; inputting the first environmental features and the first joint position state into a prediction network to obtain an initial motion control signal, wherein the prediction network is trained based on sample environmental features, sample joint position states, and sample motion control signals corresponding to multiple time points, the sample environmental features and the sample joint position states being used as samples, and the sample motion control signals being used as the labels corresponding to the samples; and controlling the target robot to move according to the initial motion control signal.

2. The method according to claim 1, characterized in that the prediction network includes an encoder and a decoder, and the step of inputting the first environmental features and the first joint position state into the prediction network to obtain an initial motion control signal includes: generating, by the encoder, latent variables based on the first environmental features and the first joint position state; and generating, by the decoder, the initial motion control signal based on the generated latent variables.

3. The method according to claim 1, characterized in that the step of controlling the target robot to move according to the initial motion control signal includes: calculating a target motion control signal through smoothing processing based on multiple historical motion control signals and the initial motion control signal; and controlling the target robot to move according to the target motion control signal, wherein the joint position state of the target robot at the next moment is the state indicated by the target motion control signal.

4. The method according to claim 1, characterized in that the method further includes: obtaining multiple pieces of sample data, each of which includes a sample environment image, a sample joint position state, and a sample motion control signal corresponding to one moment; and training an initial model based on the multiple pieces of sample data to obtain a control strategy generation model, wherein the control strategy generation model includes the target feature extraction network and the prediction network.

5. The method according to claim 4, characterized in that the prediction network is a conditional variational autoencoder, and the step of training the initial model based on the multiple sample data to obtain the control strategy generation model includes: during training, calculating a reconstruction loss and a KL divergence loss, and adjusting the current initial model based on the obtained reconstruction loss and KL divergence loss to obtain the control strategy generation model.

6. The method according to claim 1, characterized in that the prediction network is a conditional variational autoencoder, and/or the target feature extraction network is an SEnet network or an ECA-Net network.

7. The method according to any one of claims 1-6, characterized in that the first environmental image and the image corresponding to the sample environmental features both include images obtained by the robot's head camera and/or images obtained by the robot's wrist camera.

8. A robot motion control device, characterized in that the device includes: an acquisition module, used to obtain a first environmental image currently acquired by a target robot and a current first joint position state of the target robot; a processing module, used to extract environmental features from the first environmental image using a pre-trained target feature extraction network based on a channel attention mechanism, so as to obtain first environmental features, the processing module being further configured to input the first environmental features and the first joint position state into a prediction network to obtain an initial motion control signal, wherein the prediction network is trained based on sample environmental features, sample joint position states, and sample motion control signals corresponding to multiple time points, the sample environmental features and the sample joint position states being used as samples, and the sample motion control signals being used as the labels corresponding to the samples; and a control module, used to control the movement of the target robot according to the initial motion control signal.

9. An electronic device, characterized in that it includes a processor and a memory, the memory storing machine-executable instructions that can be executed by the processor, the processor executing the machine-executable instructions to implement the robot motion control method according to any one of claims 1-7.

10. A readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the robot motion control method according to any one of claims 1-7.