Imitation learning robot arm grasping method and device based on multi-scale sequence model

By using a multi-scale sequence model imitation learning method, and utilizing a robotic arm hardware platform to collect motion trajectories and visual information, multi-scale feature extraction and integration are performed. This solves the problems of algorithm complexity and robustness in traditional robotic arm grasping tasks, and achieves more efficient motion control and environmental adaptation.

CN116901071B · Active · Publication Date: 2026-06-19 · JIANGNAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
JIANGNAN UNIV
Filing Date
2023-07-31
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Traditional robotic arm grasping algorithms are complex, and their serial structure leads to poor robustness. Furthermore, imitation learning lacks generalization ability and environmental flexibility.

Method used

An imitation learning method based on multi-scale sequence models is adopted. Motion trajectories and visual information are collected through a robotic arm hardware platform. Attention mechanisms and conditional autoencoders are used to extract and integrate multi-scale features, establish a motion generation strategy model, and achieve end-to-end task training.

Benefits of technology

It improves the continuity and accuracy of robotic arm grasping tasks, enhances the model's generalization ability and environmental adaptability, simplifies the algorithm structure, and improves robustness.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a robotic arm grasping method and apparatus based on a multi-scale sequence model using imitation learning. The method includes: collecting disordered grasping robotic arm motion trajectory data through a robotic arm hardware platform to establish a dataset; establishing a sequence multi-scale module based on an attention mechanism to extract multi-scale features from the dataset; integrating the extracted multi-scale features based on a teach-and-learn strategy to obtain a final corrected feature vector; establishing a motion generation strategy model based on a conditional autoencoder to obtain prediction results at different scales; and integrating the prediction results at different scales to obtain the final corrected predicted motion. This invention improves the efficiency of robotic arm motion trajectory planning. Furthermore, the addition of the multi-scale module allows the model to better learn information from different dimensions, increasing the model's generalization ability. Therefore, compared with traditional imitation learning algorithms, this method is more flexible and can better adapt to different control tasks.

Description

Technical Field

[0001] This invention relates to the field of automatic grasping technology for robotic arms, and in particular to a robotic arm grasping method and device based on multi-scale sequence model imitation learning.

Background Technology

[0002] Traditional robotic arm grasping tasks typically require object detection and recognition, collision detection and planning, grasping strategy planning, controller design, and feedback control to complete a single task. This not only results in complex algorithms but also leads to poor robustness due to the sequential algorithm structure.

[0003] Reinforcement learning-based strategies address these issues. Using reinforcement learning to perform robotic arm tasks requires only an end-to-end model, making it easier for developers to debug and improve. The main process of building a reinforcement learning model involves establishing the environment and reward function. Guided by the reward function, the model interacts with the environment to continuously improve task performance. However, the setting of the reward function is crucial, and its parameters are highly sensitive. It is difficult to establish a good reward function for complex multi-task or long-scale tasks, and low sample efficiency is also a drawback of reinforcement learning.

[0004] Imitation learning, a branch of reinforcement learning, avoids the aforementioned problems. By establishing a dataset for the target task, it transforms model training into a supervised learning paradigm. Developers no longer need to explicitly define reward functions, and this data is expert data specific to the task, which enables the model to be trained better.

[0005] However, imitation learning also has some drawbacks, such as a lack of generalization ability and a lack of flexibility and adaptability to the environment. As a result, if traditional imitation learning strategies are used for robotic arm operation tasks, movement continuity is poor and cumulative errors occur easily.

Summary of the Invention

[0006] The purpose of this section is to outline some aspects of embodiments of the present invention and to briefly describe some preferred embodiments. Simplifications or omissions may be made in this section, as well as in the abstract and title of this application, to avoid obscuring the purpose of these documents; however, such simplifications or omissions should not be construed as limiting the scope of the invention.

[0007] In view of the above-mentioned problems, the present invention is proposed.

[0008] Therefore, the technical problem solved by this invention is that traditional robotic arm grasping tasks not only have complex algorithms, but also suffer from poor robustness due to their serial algorithm structure.

[0009] To solve the above-mentioned technical problems, the present invention provides the following technical solution:

[0010] In a first aspect, embodiments of the present invention provide a robotic arm grasping method based on a multi-scale sequence model, comprising:

[0011] The robotic arm hardware platform collects unordered grasping motion trajectory data, acquires visual information of the current environment through a camera, and preprocesses the trajectory data and visual information to establish a dataset.

[0012] A sequence multi-scale module based on an attention mechanism is established to extract multi-scale features from the dataset;

[0013] Based on the teach-and-learn strategy, the extracted multi-scale features are integrated to obtain the final corrected feature vector.

[0014] A conditional autoencoder-based action generation strategy model is established, trained using the final corrected feature vector, and multi-scale feature maps are used as input to obtain prediction results at different scales.

[0015] The prediction results at different scales are integrated to obtain the final corrected prediction action.

[0016] As a preferred approach for imitation learning robotic arm grasping methods based on multi-scale sequence models, where:

[0017] The robotic arm hardware platform collects unordered grasping motion trajectory data, acquires visual information of the current environment through a camera, and preprocesses the trajectory data and visual information to establish a dataset.

[0018] A sequence multi-scale module based on an attention mechanism is established to extract multi-scale features from the dataset;

[0019] Based on the teach-and-learn strategy, the extracted multi-scale features are integrated to obtain the final corrected feature vector.

[0020] A conditional autoencoder-based action generation strategy model is established, trained using the final corrected feature vector, and multi-scale feature maps are used as input to obtain prediction results at different scales.

[0021] The prediction results at different scales are integrated to obtain the final corrected prediction action.

[0022] As a preferred approach for imitation learning robotic arm grasping methods based on multi-scale sequence models, where:

[0023] The process of collecting disordered grasping robotic arm motion trajectory data through a robotic arm hardware platform and acquiring visual information of the current environment through a camera includes:

[0024] A robotic arm is used to perform a series of tasks in a real environment, and the robotic arm's motion trajectory data and multi-angle image information of the working environment are recorded. The motion trajectory data is preprocessed, and a dataset is constructed within a traditional imitation learning algorithm framework. Specifically, the composition of the state space is determined, which includes visual, force, and robotic arm joint posture information: the visual information I_v is captured by a camera and has dimensions H×W, and the robotic arm joint pose information I_p consists of position information, velocity information, torque information, acceleration information, etc. Using relevant sensors, the environmental and motion information of the robotic arm is recorded in real time to generate trajectory state data.

[0025] As a preferred approach for imitation learning robotic arm grasping methods based on multi-scale sequence models, where:

[0026] The establishment of the attention-based sequence multi-scale module for multi-scale feature extraction of the dataset includes:

[0027] When extracting multi-scale features from trajectory data, the N*P*T sequence of robotic arm joint posture information is divided into time slices according to different time scales. The selection of time slices starts with the feature at the current time node as the feature endpoint, expanding to the left. If there is insufficient data on the left, empty data is filled in to obtain time slice data of a specified length. The original robotic arm joint posture information is preprocessed. After the forward propagation process of the time series module, the multi-scale features of the robotic arm joint posture information are extracted. Through the processing of multi-scale feature information, feature vectors that can represent image features are generated. The final feature representation is used as the input of the control strategy algorithm for robotic arm control.

[0028] As a preferred approach for imitation learning robotic arm grasping methods based on multi-scale sequence models, where:

[0029] The step of establishing a sequence multi-scale module based on an attention mechanism to extract multi-scale features from the dataset further includes:

[0030] When performing multi-scale feature extraction on visual information, the H×W×T sequence image information is divided into time slices according to different time scales. The selection of time slices starts with the feature at the current time node as the feature endpoint and expands to the left. If there is not enough data on the left, empty data is filled in to obtain time slice data of a specified length. The original image is resampled to generate images at different scales, forming an image pyramid. Convolutional neural networks are used to extract features from the images in the image pyramid. Through the processing of multi-scale feature information, feature vectors that can represent image features are generated.

[0031] As a preferred approach for imitation learning robotic arm grasping methods based on multi-scale sequence models, where:

[0032] The establishment of the action generation strategy model based on conditional autoencoder includes:

[0033] A conditional autoencoder (CAE) model is designed for imitation learning and motion control. The CAE model consists of an encoder and a decoder. The encoder takes an action sequence and camera image as input and maps them to latent variables in a latent space. The decoder reconstructs the action sequence and camera image from the latent variables in the latent space. The CAE model is trained using the final corrected feature vector. During training, the model is optimized by minimizing the reconstruction error and the KL divergence of the latent variables. The reconstruction error measures the difference between the reconstructed action sequence and camera image and the original input. After model training, the model can be used for motion control of a robotic arm. Given a target action sequence and camera image, the encoder maps them to latent variables in the latent space. Depending on the task objective, the latent variables are modified in the latent space, and then the decoder decodes the modified latent variables into the output of motion control.

[0034] As a preferred approach for imitation learning robotic arm grasping methods based on multi-scale sequence models, where:

[0035] The process of integrating prediction results at different scales to obtain the final corrected prediction includes:

[0036] The feature maps at the first and second time scales are used as inputs to a feature integration network based on the teach-and-learn strategy to obtain the predicted value θ, where θ is the imparting rate of feature learning at different scales.

[0037] θ is a 1×3 matrix, representing Ar1 obtained by imparting knowledge from A2 to A1, Ar2 obtained by imparting knowledge from A3 to Ar1, and Ar3 obtained by imparting knowledge from A4 to Ar2.

[0038] Where A1, A2, A3, and A4 are predictions of the action at the next moment at different scales, and Ar1, Ar2, and Ar3 are the correction values of the action results after three knowledge transfers.

[0039] The knowledge transfer process uses linear interpolation, which can be represented as:

[0040] Ar1 = A1 + θ1(A2 - A1)

[0041] Ar2 = Ar1 + θ2(A3 - A2)

[0042] Ar3 = Ar2 + θ3(A4 - A3)

[0043] The obtained Ar3 feature output is mapped to the actual physical action space, and clipped to obtain the final action output, represented as:

[0044] Action = clip(map(Ar3)).

[0045] In a second aspect, embodiments of the present invention provide a robotic arm grasping system based on a multi-scale sequence model for imitation learning, characterized in that it includes:

[0046] The preprocessing module is used to collect the disordered grasping motion trajectory data of the robotic arm through the robotic arm hardware platform, acquire visual information of the current environment through the camera, and preprocess the trajectory data and visual information to establish a dataset.

[0047] The extraction module is used to establish a sequence multi-scale module based on an attention mechanism to extract multi-scale features from the dataset.

[0048] The integration module is used to integrate the extracted multi-scale features based on the teach-and-learn strategy to obtain the final corrected feature vector.

[0049] The action generation module is used to establish an action generation strategy model based on a conditional autoencoder. It is trained using the final corrected feature vector, and multi-scale feature maps are used as input to obtain prediction results at different scales. The prediction results at different scales are integrated to obtain the final corrected predicted action.

[0050] In a third aspect, embodiments of the present invention provide a computing device, including:

[0051] Memory and processor;

[0052] The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions. When the one or more programs are executed by the one or more processors, the one or more processors implement the imitation learning robotic arm grasping method based on a multi-scale sequence model as described in any embodiment of the present invention.

[0053] In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the aforementioned imitation learning robotic arm grasping method based on a multi-scale sequence model.

[0054] The beneficial effects of this invention are as follows: This invention utilizes decision sequences to extract and integrate features across multiple scales, balancing the continuity of actions over long time scales with the accuracy of actions over short time scales, thereby improving the efficiency of robotic arm trajectory planning. Simultaneously, the inclusion of multi-scale modules allows the model to better learn information from different dimensions, increasing the model's generalization ability. Therefore, compared to traditional imitation learning algorithms, this method is more flexible and adapts better to different control tasks.

Description of the Drawings

[0055] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. Wherein:

[0056] Figure 1 is an overall flowchart of the imitation learning robotic arm grasping method based on a multi-scale sequence model as described in the first embodiment of the present invention;

[0057] Figure 2 is a schematic diagram of the multi-scale extraction of visual information in the imitation learning robotic arm grasping method based on a multi-scale sequence model as described in the first embodiment of the present invention;

[0058] Figure 3 is the overall architecture diagram of the multi-scale image feature extraction in the imitation learning robotic arm grasping method based on a multi-scale sequence model as described in the first embodiment of the present invention;

[0059] Figure 4 is a schematic diagram illustrating the teaching and learning process in a simulation example of the imitation learning robotic arm grasping method based on a multi-scale sequence model as described in the second embodiment of the present invention;

[0060] Figure 5 is a schematic diagram of the encoder structure of the conditional autoencoder action generation strategy model in a simulation example of the imitation learning robotic arm grasping method based on a multi-scale sequence model as described in the second embodiment of the present invention;

[0061] Figure 6 is a schematic diagram of the decoder structure of the conditional autoencoder action generation strategy model in a simulation example of the imitation learning robotic arm grasping method based on a multi-scale sequence model as described in the second embodiment of the present invention;

[0062] Figure 7 is a schematic diagram comparing the training loss of the imitation learning robotic arm grasping method based on a multi-scale sequence model, after adding the multi-scale module, with that of the original algorithm, in a simulation example described in the second embodiment of the present invention.

Detailed Implementation

[0063] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the protection scope of the present invention.

[0064] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.

[0065] Secondly, the term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor is it a single or selective embodiment that is mutually exclusive with other embodiments.

[0066] This invention is described in detail with reference to the schematic diagrams. When detailing the embodiments of this invention, for ease of explanation, the cross-sectional views illustrating the device structure may be partially enlarged, not adhering to the usual scale. Furthermore, the schematic diagrams are merely examples and should not be construed as limiting the scope of protection of this invention. In actual fabrication, the three-dimensional spatial dimensions of length, width, and depth should be included.

[0067] Furthermore, in the description of this invention, it should be noted that the terms "upper," "lower," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. These terms are used solely for the convenience of describing the invention and for simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. In addition, the terms "first," "second," or "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

[0068] Unless otherwise explicitly specified and limited, the terms "installation," "connection," and "joining" in this invention should be interpreted broadly. For example, they can refer to fixed connections, detachable connections, or integral connections; similarly, they can refer to mechanical connections, electrical connections, or direct connections, or indirect connections through an intermediate medium, or internal connections between two components. Those skilled in the art can understand the specific meaning of the above terms in this invention based on the specific circumstances.

[0069] Example 1

[0070] Referring to Figures 1-3, the first embodiment of the present invention provides a robotic arm grasping method based on multi-scale sequence models, including:

[0071] S1: Collect unordered grasping robot arm motion trajectory data through the robot arm hardware platform, acquire visual information of the current environment through the camera, and preprocess the trajectory data and visual information to establish a dataset;

[0072] Specifically, the process of collecting unordered grasping robot arm motion trajectory data through the robot arm hardware platform and acquiring visual information of the current environment through a camera includes:

[0073] A robotic arm is used to perform a series of tasks in a real environment, and the robotic arm's motion trajectory data and multi-angle image information of the working environment are recorded. The motion trajectory data is preprocessed, and a dataset is constructed within a traditional imitation learning algorithm framework. Specifically, the composition of the state space is determined, which includes visual, force, and robotic arm joint posture information: the visual information I_v is captured by a camera and has dimensions H×W, and the robotic arm joint pose information I_p consists of position information, velocity information, torque information, acceleration information, etc. Using relevant sensors, the environmental and motion information of the robotic arm is recorded in real time to generate trajectory state data.
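As a non-limiting illustration, the state composition described above can be sketched as a simple record type. The class name `TrajectoryState` and all field names are hypothetical and are not taken from the patent; the sketch only shows one way to hold the visual information I_v, the force reading, and the joint pose information I_p for a single time step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrajectoryState:
    """One recorded time step of a demonstration (hypothetical layout)."""
    image: List[List[float]]         # visual information I_v, H rows by W columns
    joint_position: List[float]      # components of joint pose information I_p
    joint_velocity: List[float]
    joint_torque: List[float]
    joint_acceleration: List[float]
    force: List[float]               # force reading included in the state space

    def pose_vector(self) -> List[float]:
        """Concatenate the joint pose components into one flat vector I_p."""
        return (self.joint_position + self.joint_velocity
                + self.joint_torque + self.joint_acceleration)


# Example: a 2-joint arm observed through a 2x3 image.
state = TrajectoryState(
    image=[[0.0, 0.1, 0.2], [0.3, 0.4, 0.5]],
    joint_position=[0.1, -0.2],
    joint_velocity=[0.0, 0.05],
    joint_torque=[1.2, 0.8],
    joint_acceleration=[0.0, 0.0],
    force=[0.0, 0.0, 9.8],
)
```

A list of such records, ordered by time, would form one trajectory in the dataset.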

[0074] S2: Establish a sequence multi-scale module based on an attention mechanism to extract multi-scale features from the dataset;

[0075] Specifically, the establishment of an attention-based sequence multi-scale module for multi-scale feature extraction of the dataset includes:

[0076] When extracting multi-scale features from trajectory data, the N×P×T sequence of robotic arm joint posture information is divided into time slices according to different time scales. The selection of time slices starts with the feature at the current time node as the feature endpoint, and expands to the left. If there is not enough data on the left, empty data is filled in to obtain time slice data of a specified length. The original robotic arm joint posture information is preprocessed. After the forward propagation process of the time series module, the multi-scale features of the robotic arm joint posture information are extracted. Through the processing of multi-scale feature information, feature vectors that can represent image features are generated. The final feature representation is used as the input of the control strategy algorithm for robotic arm control.
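The time-slice selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `time_slice` and `multi_scale_slices` are hypothetical, "empty data" is interpreted as zero vectors, and the slice lengths (2, 4, 8) are arbitrary example values, none of which are specified in the patent.

```python
from typing import Dict, List, Sequence


def time_slice(sequence: Sequence[List[float]], t: int, length: int,
               pad_value: float = 0.0) -> List[List[float]]:
    """Return `length` steps ending at index `t` (inclusive), padded on the left."""
    if not 0 <= t < len(sequence):
        raise IndexError("t must index an existing time step")
    start = t - length + 1
    # Expand to the left from the current step; stop at the sequence start.
    window = [list(step) for step in sequence[max(start, 0): t + 1]]
    missing = length - len(window)
    feature_dim = len(sequence[t])
    # Assumption: missing history is filled with constant (zero) vectors.
    padding = [[pad_value] * feature_dim for _ in range(missing)]
    return padding + window


def multi_scale_slices(sequence: Sequence[List[float]], t: int,
                       scales: Sequence[int] = (2, 4, 8)) -> Dict[int, List[List[float]]]:
    """Build one slice per time scale, all ending at the current step t."""
    return {scale: time_slice(sequence, t, scale) for scale in scales}


# Example: 3 recorded steps of a 2-dimensional pose vector.
poses = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
slices = multi_scale_slices(poses, t=2, scales=(2, 4))
```

With only three recorded steps, the slice of length 4 receives one padded step on the left, while the slice of length 2 is taken entirely from recorded data.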

[0077] Furthermore, when extracting multi-scale features from visual information, the H×W×T sequence image information is divided into time slices according to different time scales. The selection of time slices starts with the feature at the current time node as the feature endpoint and expands to the left. If there is not enough data on the left, empty data is filled in to obtain time slice data of a specified length. The original image is resampled to generate images at different scales, forming an image pyramid. Convolutional neural networks are used to extract features from the images in the image pyramid. Through the processing of multi-scale feature information, feature vectors that can represent image features are generated.
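The image pyramid step can be illustrated with the sketch below. The patent does not state the resampling method, so 2×2 average pooling is used here purely as an assumption; the function names `downsample_2x` and `image_pyramid` are hypothetical, and the convolutional feature extraction that follows the pyramid is not shown.

```python
from typing import List

Image = List[List[float]]


def downsample_2x(image: Image) -> Image:
    """Halve height and width by averaging each 2x2 block (odd edges are dropped)."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def image_pyramid(image: Image, levels: int) -> List[Image]:
    """Return [original, 1/2 scale, 1/4 scale, ...] with at most `levels` entries."""
    pyramid = [image]
    for _ in range(levels - 1):
        current = pyramid[-1]
        if len(current) < 2 or len(current[0]) < 2:
            break  # too small to downsample further
        pyramid.append(downsample_2x(current))
    return pyramid


# Example: a 4x4 frame with pixel values 0..15.
frame = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pyramid = image_pyramid(frame, levels=3)
```

Each level of the pyramid would then be passed to a feature extractor, so that both coarse and fine image structure contribute to the final feature vector.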

[0078] It should be noted that this invention utilizes an attention mechanism to extract features from image information at multiple scales in order to obtain better target detection results, and utilizes a time series model to extract features from the joint posture information of the robotic arm at multiple scales in order to improve the continuity and stability of the robotic arm's movements.
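The patent does not specify the form of the attention mechanism. As one common possibility, the sketch below shows scaled dot-product attention for a single query vector over a set of per-scale feature vectors; the function names `softmax` and `attend` and the choice of this attention variant are assumptions made for illustration only.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: List[float], keys: List[List[float]],
           values: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention for one query; returns (output, weights)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Example: a zero query scores every key equally, so the output is the mean value.
output, weights = attend(
    query=[0.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[2.0, 0.0], [0.0, 4.0]],
)
```

In a multi-scale setting, the keys and values could be the feature vectors extracted at each scale, so that the weights express how much each scale contributes.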

[0079] S3: Based on the teach-and-learn strategy, the extracted multi-scale features are integrated to obtain the final corrected feature vector;

[0080] Specifically, the process of integrating the extracted multi-scale features to obtain the final corrected feature vector includes:

[0081] The extracted multi-scale features are represented in the following form:

[0082]

[0083] Where s_t and a_t (t = 1, 2, ..., T) represent the state and action, respectively, and the reward term is the sum of all rewards for the trajectory after time t, defined as follows:

[0084]

[0085] Position encoding is applied to the trajectory. It is used only to distinguish different time points, and elements at the same time point share the same position encoding. After processing, feature vectors at different time scales are obtained.
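The shared position encoding can be sketched as follows. The patent does not state which encoding is used, so the standard sinusoidal encoding is assumed here; the function names are hypothetical, and the tokens of one time step (for example a state token and an action token) are assumed to have already been embedded to a common, even dimension.

```python
import math
from typing import List


def sinusoidal_encoding(t: int, dim: int) -> List[float]:
    """Standard sinusoidal encoding of time index t (dim must be even)."""
    enc: List[float] = []
    for i in range(0, dim, 2):
        angle = t / (10000 ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc


def encode_trajectory(tokens_per_step: List[List[List[float]]]) -> List[List[float]]:
    """Add the same time encoding to every token of a step, then flatten the steps."""
    encoded = []
    for t, tokens in enumerate(tokens_per_step):
        for token in tokens:
            pos = sinusoidal_encoding(t, len(token))  # identical for all tokens at time t
            encoded.append([x + p for x, p in zip(token, pos)])
    return encoded


# Example: two time steps, each with a state token and an action token of dimension 2.
steps = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 1.0]],
]
encoded = encode_trajectory(steps)
```

Because the encoding depends only on the time index, tokens from the same time point differ only by their content, while tokens from different time points are distinguishable even when their content is identical.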

[0086] S4: Establish an action generation strategy model based on a conditional autoencoder, train it using the final corrected feature vector, take multi-scale feature maps as input to obtain prediction results at different scales, integrate the prediction results at different scales to obtain the final corrected predicted action.

[0087] Specifically, the establishment of the action generation strategy model based on conditional autoencoders includes:

[0088] A conditional autoencoder (CAE) model is designed for imitation learning and motion control. The CAE model consists of an encoder and a decoder. The encoder takes an action sequence and camera image as input and maps them to latent variables in a latent space. The decoder reconstructs the action sequence and camera image from the latent variables in the latent space. The CAE model is trained using the final corrected feature vector. During training, the model is optimized by minimizing the reconstruction error and the KL divergence of the latent variables. The reconstruction error measures the difference between the reconstructed action sequence and camera image and the original input. After model training, the model can be used for motion control of a robotic arm. Given a target action sequence and camera image, the encoder maps them to latent variables in the latent space. Depending on the task objective, the latent variables are modified in the latent space, and then the decoder decodes the modified latent variables into the output of motion control.

[0089] Furthermore, the process of integrating prediction results from different scales to obtain the final corrected prediction includes:

[0090] The feature maps at the first and second time scales are used as inputs to a feature integration network based on the teach-and-learn strategy to obtain the predicted value θ, where θ is the imparting rate of feature learning at different scales.

[0091] θ is a 1×3 matrix, representing Ar1 obtained by imparting knowledge from A2 to A1, Ar2 obtained by imparting knowledge from A3 to Ar1, and Ar3 obtained by imparting knowledge from A4 to Ar2.

[0092] Where A1, A2, A3, and A4 are predictions of the action at the next moment at different scales, and Ar1, Ar2, and Ar3 are the correction values of the action results after three knowledge transfers.

[0093] The knowledge transfer process uses linear interpolation, which can be represented as:

[0094] Ar1 = A1 + θ1(A2 - A1)

[0095] Ar2 = Ar1 + θ2(A3 - A2)

[0096] Ar3 = Ar2 + θ3(A4 - A3)

[0097] The obtained Ar3 feature output is mapped to the actual physical action space, and clipped to obtain the final action output, represented as:

[0098] Action = clip(map(Ar3)).
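The three knowledge-transfer steps and the final output mapping above can be sketched as follows. The update rules follow the formulas exactly as written in this description. The patent does not define map() or the clipping bounds, so the affine map from [-1, 1] to per-joint limits and the function names `transfer`, `integrate_predictions`, `map_to_action_space`, and `clip` are assumptions made for illustration.

```python
from typing import List, Sequence


def transfer(base: Sequence[float], step_from: Sequence[float],
             step_to: Sequence[float], rate: float) -> List[float]:
    """Return base + rate * (step_to - step_from), element by element."""
    return [b + rate * (hi - lo) for b, lo, hi in zip(base, step_from, step_to)]


def integrate_predictions(a1, a2, a3, a4, theta: Sequence[float]) -> List[float]:
    """Apply the three knowledge transfers and return Ar3."""
    ar1 = transfer(a1, a1, a2, theta[0])   # Ar1 = A1 + θ1(A2 - A1)
    ar2 = transfer(ar1, a2, a3, theta[1])  # Ar2 = Ar1 + θ2(A3 - A2)
    ar3 = transfer(ar2, a3, a4, theta[2])  # Ar3 = Ar2 + θ3(A4 - A3)
    return ar3


def map_to_action_space(values, low, high) -> List[float]:
    """Hypothetical map(): affine map from [-1, 1] to [low, high] per joint."""
    return [lo + (v + 1.0) * 0.5 * (hi - lo) for v, lo, hi in zip(values, low, high)]


def clip(values, low, high) -> List[float]:
    """Clamp each component to its physical limits."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(values, low, high)]


# Example with two joints and a transfer rate of 0.5 at every step.
a1, a2, a3, a4 = [0.0, 0.0], [1.0, 0.5], [1.0, 1.0], [3.0, 1.0]
ar3 = integrate_predictions(a1, a2, a3, a4, theta=[0.5, 0.5, 0.5])
low, high = [-2.0, -2.0], [2.0, 2.0]
action = clip(map_to_action_space(ar3, low, high), low, high)
```

In this example the first joint's corrected value falls outside the normalized range, so the mapped command exceeds the joint limit and is clipped to it, which is the role of the clip step.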

[0099] It should be noted that this invention designs a CVAE model for imitation learning and motion control. The CVAE consists of an encoder and a decoder. The encoder takes the action sequence and camera images as input and maps them to latent variables in a latent space. The decoder reconstructs the action sequence and camera images from the latent variables in the latent space. Note that the sampling of the latent variables should take into account the conditional information of the input.

[0100] The CVAE model was trained using a preprocessed dataset. During training, the model was optimized by minimizing the reconstruction error and the KL divergence of the latent variables. The reconstruction error measures the difference between the reconstructed action sequence and camera image and the original input, while the KL divergence measures the difference between the distribution in the latent space and the standard normal distribution.
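The training objective described above can be sketched as the sum of a reconstruction term and a KL term. Mean squared error is assumed as the reconstruction error, the encoder is assumed to output a diagonal Gaussian parameterized by a mean and a log-variance, and the weighting factor `kl_weight` is a hypothetical parameter; the patent does not specify any of these details.

```python
import math
from typing import Sequence


def reconstruction_error(original: Sequence[float],
                         reconstructed: Sequence[float]) -> float:
    """Mean squared error between the input and its reconstruction (assumed metric)."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def cvae_loss(original, reconstructed, mu, log_var, kl_weight: float = 1.0) -> float:
    """Reconstruction error plus weighted KL divergence of the latent variables."""
    return (reconstruction_error(original, reconstructed)
            + kl_weight * kl_to_standard_normal(mu, log_var))


# Example: a 2-dimensional action with a 1-dimensional latent variable.
loss = cvae_loss(original=[1.0, 2.0], reconstructed=[1.0, 4.0],
                 mu=[1.0], log_var=[0.0])
```

The KL term is zero when the encoder output matches the standard normal distribution exactly, and grows as the latent mean or variance moves away from it.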

[0101] After model training is complete, the CVAE model can be used for motion control of the robotic arm. Given a target action sequence and camera images, an encoder maps them to latent variables in a latent space. Then, according to the required task objectives, the latent variables in the latent space can be modified, and a decoder decodes the modified latent variables into motion control outputs, i.e., control commands for the robotic arm. In this way, the robotic arm can perform corresponding motion control based on the input action sequence and camera image information.

[0102] The model is evaluated and tuned as needed. Its performance and accuracy can be assessed by comparing it to real robotic arm control. Based on the evaluation results, the model can be improved and optimized, such as adjusting the model architecture, increasing the amount of training data, or adjusting hyperparameters.

[0103] Through the above steps, the CVAE (conditional variational autoencoder) is used for imitation learning, applying motion sequences and multi-view camera image information to the motion control of the robotic arm. This method enables the robotic arm to generate corresponding motion control commands based on the input motion sequence and camera image information, achieving imitation and adaptive motion capabilities.

[0104] The above is an illustrative scheme of a robotic arm grasping method based on multi-scale sequence model imitation learning according to this embodiment. It should be noted that the technical solution of the robotic arm grasping system based on multi-scale sequence model imitation learning is based on the same concept as the above-described robotic arm grasping method based on multi-scale sequence model imitation learning. Details not described in detail in the technical solution of the robotic arm grasping system based on multi-scale sequence model imitation learning in this embodiment can be found in the description of the above-described robotic arm grasping method based on multi-scale sequence model imitation learning.

[0105] The robotic arm grasping system based on multi-scale sequence model imitation learning in this embodiment is characterized by comprising:

[0106] The preprocessing module is used to collect the disordered grasping motion trajectory data of the robotic arm through the robotic arm hardware platform, acquire visual information of the current environment through the camera, and preprocess the trajectory data and visual information to establish a dataset.

[0107] The extraction module is used to establish a sequence multi-scale module based on an attention mechanism to extract multi-scale features from the dataset.

[0108] The integration module is used to integrate the extracted multi-scale features based on the teach-and-learn strategy to obtain the final corrected feature vector.

[0109] The action generation module is used to establish an action generation strategy model based on a conditional autoencoder. It is trained using the final corrected feature vector, and multi-scale feature maps are used as input to obtain prediction results at different scales. The prediction results at different scales are integrated to obtain the final corrected predicted action.

[0110] This embodiment also provides a computing device suitable for imitation learning robotic arm grasping methods based on multi-scale sequence models, including:

[0111] The system includes a memory and a processor. The memory stores computer-executable instructions, and the processor executes these instructions to implement the imitation learning robotic arm grasping method based on a multi-scale sequence model as proposed in the above embodiments.

[0112] This embodiment also provides a storage medium on which a computer program is stored. When the program is executed by a processor, it implements the imitation learning robotic arm grasping method based on a multi-scale sequence model as proposed in the above embodiments.

[0113] The storage medium proposed in this embodiment belongs to the same inventive concept as the imitation learning robotic arm grasping method based on multi-scale sequence models proposed in the above embodiments. Technical details not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.

[0114] Example 2

[0115] Referring to Figures 4-7, as an embodiment of the present invention, a robotic arm grasping method based on multi-scale sequence model imitation learning is provided. To verify the beneficial effects of the present invention, a simulation experiment is conducted for scientific demonstration.

[0116] Generate training data for motion trajectories: Use a robotic arm to perform a series of tasks in a real environment, recording the robotic arm's motion trajectory data and multi-angle image information of the working environment. Preprocess the motion trajectory data and establish a dataset to obtain the raw training data.

[0117] Multi-scale feature extraction: Using an attention-based multi-scale module, multi-scale features are extracted from the original training data to obtain a multi-dimensional feature map.

[0118] In this embodiment, when the H×W×T sequence image information is divided into time slices according to different time scales, feature extraction is performed at four time scales, with time slice lengths of s1=1, s2=10, s3=50, and s4=100. The selection of time slices starts with the feature at the current time node as the feature endpoint and expands to the left (filling in empty data if there is not enough data on the left) to obtain time slice data of a specified length.
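The time-slice selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the sequence is stored with time on the first axis (the text writes H×W×T), and "empty data" is taken to mean zero padding. The function names `time_slice` and `multi_scale_slices` are hypothetical.

```python
import numpy as np

SCALES = (1, 10, 50, 100)  # time-slice lengths s1..s4 from this embodiment

def time_slice(seq, t, length, pad_value=0.0):
    """Return the `length` frames of `seq` that end at index t (inclusive).

    seq has shape (T, ...) with time on the first axis (assumption).
    If fewer than `length` frames exist up to t, the left side is filled
    with `pad_value` (zero padding is an assumption for "empty data").
    """
    start = t - length + 1
    window = seq[max(start, 0): t + 1]
    missing = length - window.shape[0]
    if missing > 0:
        pad = np.full((missing,) + seq.shape[1:], pad_value, dtype=seq.dtype)
        window = np.concatenate([pad, window], axis=0)
    return window

def multi_scale_slices(seq, t, scales=SCALES):
    """Collect one left-padded slice per time scale, all ending at time t."""
    return {s: time_slice(seq, t, s) for s in scales}
```

Every slice ends at the current time node, so the most recent frame is always the last element regardless of scale; only the amount of history (and padding) differs.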

[0119] Feature integration based on teach-and-learn: Based on the teach-and-learn strategy, features at multiple scales are integrated to obtain the final corrected feature vector.
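Claim 5 states that, during this integration step, position encoding only distinguishes time instants and that elements at the same time instant share one position encoding. A minimal sketch of that indexing follows. It assumes three elements per time instant (for example a reward term, a state, and an action) and a lookup table of encodings; both the count `tokens_per_step` and the table are illustrative assumptions, not details given in the text.

```python
import numpy as np

def shared_position_ids(num_steps, tokens_per_step=3):
    """Position index per token; tokens of the same time instant share one index."""
    return np.repeat(np.arange(num_steps), tokens_per_step)

def add_position_encoding(tokens, table, tokens_per_step=3):
    """Add a per-time-instant encoding to a flattened token sequence.

    tokens: (num_steps * tokens_per_step, d) token features, ordered by time
    table : (max_steps, d) position encodings, one row per time instant
            (a hypothetical lookup table; how it is obtained is not specified)
    """
    num_steps = tokens.shape[0] // tokens_per_step
    ids = shared_position_ids(num_steps, tokens_per_step)
    return tokens + table[ids]
```

For example, `shared_position_ids(3)` yields the indices 0, 0, 0, 1, 1, 1, 2, 2, 2, so the encoding changes only when the time instant changes.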

[0120] Decision sequence model training: Establish an action generation policy model based on conditional autoencoder, train it using the above features, and obtain the final prediction result. Use multi-scale feature maps as input to obtain prediction results in different dimensions.

[0121] Using the teach-and-learn concept, prediction results at different scales are integrated to obtain the final corrected predicted action.
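The integration of prediction results can be sketched using the linear interpolation and the final step Action = clip(map(Ar3)) stated in claim 7. The interpolation below follows the claim's formulas term by term. The function `map` is not defined in the text, so the sketch assumes an affine rescaling from [-1, 1] to hypothetical actuator limits `low` and `high`; the function names are also illustrative.

```python
import numpy as np

def integrate_predictions(A1, A2, A3, A4, theta):
    """Apply the three knowledge transfers of claim 7 and return Ar3.

    A1..A4: next-step action predictions at the four scales (same shape)
    theta : the three rates (theta1, theta2, theta3)
    """
    t1, t2, t3 = theta
    Ar1 = A1 + t1 * (A2 - A1)
    Ar2 = Ar1 + t2 * (A3 - A2)
    Ar3 = Ar2 + t3 * (A4 - A3)
    return Ar3

def to_action(Ar3, low, high):
    """clip(map(Ar3)) with an ASSUMED affine map from [-1, 1] to [low, high]."""
    mapped = low + (Ar3 + 1.0) * 0.5 * (high - low)
    return np.clip(mapped, low, high)
```

With all rates equal to 0 the result is the first-scale prediction A1, and with all rates equal to 1 it is the fourth-scale prediction A4, so θ controls how far the output moves toward the other scales.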

[0122] Figure 7 shows the loss curves of the training process for this algorithm and for the traditional imitation learning algorithm. This algorithm converges faster and reaches a lower loss. Over 1000 training cycles, its loss value (about 0.05) is one-tenth that of the traditional algorithm (about 0.5), and its fluctuation range is also very small.

[0123] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. An imitation learning robotic arm grasping method based on a multi-scale sequence model, characterized by comprising: collecting unordered grasping motion trajectory data of the robotic arm through a robotic arm hardware platform, acquiring visual information of the current environment through a camera, and preprocessing the trajectory data and visual information to establish a dataset; establishing a sequence multi-scale module based on an attention mechanism to extract multi-scale features from the dataset; integrating the extracted multi-scale features based on a teach-and-learn strategy to obtain the final corrected feature vector; establishing an action generation strategy model based on a conditional autoencoder, training it using the final corrected feature vector, and using multi-scale feature maps as input to obtain prediction results at different scales; and integrating the prediction results at different scales to obtain the final corrected predicted action.

2. The multi-scale sequence model based imitation learning robot grasping method of claim 1, wherein the process of collecting disordered grasping robotic arm motion trajectory data through a robotic arm hardware platform and acquiring visual information of the current environment through a camera includes: A robotic arm is used to perform a series of tasks in a real environment, recording the robotic arm's motion trajectory data and multi-angle image information of the working environment; the motion trajectory data is preprocessed, and a dataset is constructed within a traditional imitation learning algorithm framework; specifically, the composition of the state space is determined, which includes visual, force, and robotic arm joint posture information; the visual information I_v is captured by a camera, with dimensions H×W; the robotic arm joint pose information I_p consists of position information, velocity information, torque information, and acceleration information; relevant sensors record the environmental and motion information of the robotic arm in real time and generate trajectory state data.

3. The multi-scale sequence model based imitation learning robot grasping method of claim 2, wherein, The establishment of the attention-based sequence multi-scale module for multi-scale feature extraction of the dataset includes: When extracting multi-scale features from trajectory data, the N * P * T sequence of robotic arm joint posture information is divided into time slices according to different time scales. The selection of time slices starts with the feature at the current time node as the feature endpoint, and expands to the left. If there is not enough data on the left, empty data is filled in to obtain time slice data of a specified length. The original robotic arm joint posture information is preprocessed. After the forward propagation process of the time series module, the multi-scale features of the robotic arm joint posture information are extracted. Through the processing of multi-scale feature information, feature vectors that can represent image features are generated. The final feature representation is used as the input of the control strategy algorithm for robotic arm control.

4. The imitation learning robotic arm grasping method based on a multi-scale sequence model as described in claim 3, characterized in that, The step of establishing a sequence multi-scale module based on an attention mechanism to extract multi-scale features from the dataset further includes: When performing multi-scale feature extraction on visual information, the H×W×T sequence image information is divided into time slices according to different time scales. The selection of time slices starts with the feature at the current time node as the feature endpoint and expands to the left. If there is not enough data on the left, empty data is filled in to obtain time slice data of a specified length. The original image is resampled to generate images at different scales, forming an image pyramid. Convolutional neural networks are used to extract features from the images in the image pyramid. Through the processing of multi-scale feature information, feature vectors that can represent image features are generated.

5. The multi-scale sequence model based imitation learning robot grasping method of claim 4, wherein the process of integrating the extracted multi-scale features to obtain the final corrected feature vector includes: The extracted multi-scale features are represented as a trajectory [formula missing in source], whose elements include the state and the action, together with the sum of all rewards of the trajectory after time t [definition missing in source]. Position encoding is performed on the trajectory, where position encoding is only used to distinguish different time instants, and elements at the same time instant share one position encoding. After processing, the feature vectors at different time scales are obtained.

6. The multi-scale sequence model based imitation learning robot grasping method of claim 5, wherein, The establishment of the action generation strategy model based on conditional autoencoder includes: A conditional autoencoder (CAE) model is designed for imitation learning and motion control. The CAE model consists of an encoder and a decoder. The encoder takes an action sequence and camera image as input and maps them to latent variables in a latent space. The decoder reconstructs the action sequence and camera image from the latent variables in the latent space. The CAE model is trained using the final corrected feature vector. During training, the model is optimized by minimizing the reconstruction error and the KL divergence of the latent variables. The reconstruction error measures the difference between the reconstructed action sequence and camera image and the original input. After model training, the model can be used for motion control of a robotic arm. Given a target action sequence and camera image, the encoder maps them to latent variables in the latent space. Depending on the task objective, the latent variables are modified in the latent space, and then the decoder decodes the modified latent variables into the output of motion control.

7. The imitation learning robotic arm grasping method based on a multi-scale sequence model as described in claim 6, characterized in that the process of integrating prediction results at different scales to obtain the final corrected prediction includes: The feature maps at the first and second time scales are used as inputs to a feature integration network based on teach-and-learn to obtain the predicted value θ, where θ is the imparting rate of feature learning at different scales. θ is a 1×3 matrix whose entries correspond to the three knowledge transfers: Ar1 is obtained by imparting knowledge from A2 to A1, Ar2 is obtained by imparting knowledge from A3 to Ar1, and Ar3 is obtained by imparting knowledge from A4 to Ar2. Here A1, A2, A3, and A4 are the predictions of the action at the next moment at different scales, and Ar1, Ar2, and Ar3 are the corrected action results after the three knowledge transfers. The knowledge transfer process uses linear interpolation, which can be represented as:

Ar1 = A1 + θ1(A2 - A1)

Ar2 = Ar1 + θ2(A3 - A2)

Ar3 = Ar2 + θ3(A4 - A3)

The obtained Ar3 feature output is mapped to the actual physical action space and clipped to obtain the final action output, represented as: Action = clip(map(Ar3)).

8. A robotic arm grasping system based on multi-scale sequence model imitation learning, characterized by comprising: a preprocessing module, used to collect the disordered grasping motion trajectory data of the robotic arm through the robotic arm hardware platform, acquire visual information of the current environment through the camera, and preprocess the trajectory data and visual information to establish a dataset; an extraction module, used to establish a sequence multi-scale module based on an attention mechanism to extract multi-scale features from the dataset; an integration module, used to integrate the extracted multi-scale features based on the teach-and-learn strategy to obtain the final corrected feature vector; and an action generation module, used to build an action generation strategy model based on a conditional autoencoder, which is trained using the final corrected feature vector and takes multi-scale feature maps as input to obtain prediction results at different scales, the prediction results at different scales being integrated to obtain the final corrected predicted action.

9. A computing device, comprising a memory and a processor, wherein the memory is used to store computer-executable instructions and the processor is used to execute the computer-executable instructions; when executed by the processor, the computer-executable instructions implement the steps of the imitation learning robotic arm grasping method based on the multi-scale sequence model as described in any one of claims 1 to 7.

10. A computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the imitation learning robotic arm grasping method based on a multi-scale sequence model as described in any one of claims 1 to 7.