Robot grasping prediction method based on attention mechanism and low-rank tensor fusion

By extracting and fusing tactile sensor features through attention mechanisms and low-rank tensor fusion methods, the problem of insufficient feature extraction and modality fusion in robot grasping prediction is solved, and more efficient grasping prediction results are achieved.

CN118003326B · Active · Publication Date: 2026-06-30 · SHENZHEN INST OF ADVANCED TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENZHEN INST OF ADVANCED TECH
Filing Date
2024-03-04
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Current technologies lack effective extraction and modal fusion of tactile modal features in predicting whether a robot will successfully grasp an object, resulting in low prediction accuracy.

Method used

We employ an attention mechanism and low-rank tensor fusion approach. We preprocess the robot grasping image dataset, extract features using an attention mechanism model network, and fuse the features from two tactile sensors in a multimodal manner using a low-rank tensor fusion module. Finally, a classifier outputs a prediction of whether the grasping was successful.

Benefits of technology

It improves the accuracy and robustness of robot grasping prediction, enabling it to better capture key features and important information, generate richer fusion features, and enhance grasping precision.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a robot grasping prediction method based on attention mechanism and low-rank tensor fusion. The method includes: acquiring a dataset of grasping images of a robot grasping multiple different objects; preprocessing the grasping image dataset and dividing it into a training set and a test set; training and testing a model based on attention mechanism and low-rank tensor fusion using the training set and the test set to obtain a trained model; acquiring tactile images of the target object grasped by the robot; inputting the tactile images into the trained model; and outputting a prediction result of whether the target object was successfully grasped. This invention enables the model to better capture key features and important information, achieves more comprehensive and efficient feature extraction by fusing multimodal information, generates richer fused features, and effectively predicts whether grasping is successful.

Description

Technical Field

[0001] This invention relates to the field of robotics, and in particular to a robot grasping prediction method, system, terminal, and computer-readable storage medium based on attention mechanism and low-rank tensor fusion.

Background Technology

[0002] Machine grasping tasks are a crucial component of robotic applications, and their successful execution relies on accurate environmental perception and precise extraction of object features. While significant progress has been made in haptic-based applications, these studies have unfortunately not focused adequately on extracting haptic modal features. Due to the lack of effective extraction of these features, systems may struggle to capture crucial haptic information, potentially leading to the omission of important features in the environment or task. Furthermore, the lack of effective modal fusion methods may prevent systems from fully utilizing information from multiple haptic sensors. This situation may limit the system's ability to comprehensively understand its surroundings, thus affecting its adaptability and robustness in complex environments.

[0003] With the widespread application of automation across various fields, robots not only need to perform tasks but also need to accurately grasp and manipulate objects. This has driven research into stable grasping, enabling robots to predict and achieve stable grasping actions in different scenarios. Robots can acquire environmental information through visual perception technology and then use advanced learning algorithms to predict the position, shape, and texture of objects, providing accurate input for stable grasping. This technological advancement endows robots with higher perception and intelligent decision-making capabilities. The significance of predictive stable grasping lies in improving the robot's operational accuracy and success rate. It not only enables robots to perform tasks in complex environments but also increases their adaptability when handling objects of different shapes, sizes, and textures. In industrial production, warehouse management, and service sectors, this technology helps improve the efficiency of automated systems, reduce errors and damage, and enhance the reliability of overall workflows.

[0004] For example, existing object grasping methods based on temporal tactile data processing first determine the optimal grasping area using the object's position information. Then, when the robotic arm reaches the optimal grasping area, it closes the gripper with a preset force and collects data from tactile sensors. This temporal tactile data is converted into tactile images, which are fed iteratively through a pre-trained network to predict future temporal tactile images. Finally, the predicted frame sequence is input into an LSTM classification network, which outputs a classification result indicating whether the grasping is stable, thereby enabling automatic control of the robotic arm to grasp or release objects.

[0005] For example, one improved slip detection method involves capturing multiple consecutive frames of visual-tactile detection images in real time using a binocular camera when the visual-tactile sensor comes into contact with the target object. These images are then processed using an edge feature extraction algorithm to obtain the edge point regions of each frame. Slip detection results are generated based on the overlap, grayscale parameter continuity, and depth parameter continuity of these regions. This method improves the accuracy and robustness of slip detection and avoids the problem of relying solely on tactile data.

[0006] For example, a robot grasping method based on multi-source information fusion first acquires the RGB image, optical flow image, and depth image of the object to be grasped, and extracts features from them. Then, a multi-source information fusion module fuses the RGB features, optical flow features, and depth features to obtain the fused features of the object to be grasped. Next, the fused features are input into an object pose prediction module for classification and grasping position prediction. Based on the predicted grasping position information, the grasping action is executed. Tactile information is perceived through a tactile sensor to determine whether the grasping was successful.

[0007] While the above approaches propose methods to predict grasping success from different perspectives, they haven't adequately focused on extracting tactile modal features. This lack of effective feature extraction may prevent the system from capturing crucial tactile information, potentially leading to the omission of important features in the environment or task. Furthermore, the absence of effective modal fusion methods means the system may not fully utilize information from multiple tactile sensors. This could limit the system's ability to comprehensively understand its surroundings, impacting its adaptability and robustness in complex environments.

[0008] Therefore, existing technologies still need to be improved and developed.

Summary of the Invention

[0009] The main objective of this invention is to provide a robot grasping prediction method, system, terminal, and computer-readable storage medium based on attention mechanism and low-rank tensor fusion. It aims to solve two problems in the prior art when predicting whether a robot has successfully grasped an object: the lack of effective feature extraction, and the lack of effective modal fusion of the information provided by tactile sensors, both of which result in low prediction accuracy.

[0010] To achieve the above objectives, this invention provides a robot grasping prediction method based on attention mechanism and low-rank tensor fusion, which includes the following steps:

[0011] A dataset of images of a robot grasping multiple different objects is obtained, and the grasping image dataset is preprocessed and divided into a training set and a test set.

[0012] The attention mechanism and low-rank tensor fusion model is trained and tested using the training set and the test set to obtain a trained attention mechanism and low-rank tensor fusion model. The attention mechanism and low-rank tensor fusion model includes an attention mechanism model network, a tactile feature fusion module, and a classifier.

[0013] The robot acquires tactile images of the target object it grasps, inputs these images into a trained model based on attention mechanism and low-rank tensor fusion, and outputs a prediction of whether the target object has been successfully grasped.

[0014] Optionally, the robot grasping prediction method based on attention mechanism and low-rank tensor fusion, wherein acquiring a grasping image dataset of multiple different objects by the robot, and dividing the grasping image dataset into a training set and a test set after preprocessing, specifically includes:

[0015] A grasping image dataset of multiple objects is obtained, collected by the robot through its robotic arm, parallel gripper, two photoelectric tactile sensors, and two depth camera sensors.

[0016] The grasping image dataset is preprocessed, including random horizontal flipping, random cropping, and normalization. The preprocessed grasping image dataset is then divided proportionally to obtain a training set and a test set.

[0017] Optionally, in the robot grasping prediction method based on attention mechanism and low-rank tensor fusion, the acquisition process of the grasping image dataset includes:

[0018] At time Ta, acquire the initial state image of the object before it is grasped;

[0019] At time Tb, acquire the contact state image of the grasped object, taken when the gripper has finished closing its fingers and the object is still on the ground;

[0020] At time Tc, acquire the grasping state image of the grasped object successfully staying in the air for a preset time;

[0021] Whether the object is successfully grasped at time Tc is used as the label information of the experiment.
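One way to picture a single grasp trial from the steps above is as a record that holds the images taken at the three time points together with the outcome label. The sketch below is only an assumption about layout, not the dataset's actual file format; `GraspTrial`, its field names, and the file names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical record for one grasp trial; all names are illustrative only.
@dataclass
class GraspTrial:
    object_id: str
    # Image paths keyed by capture time: "Ta" (before), "Tb" (contact), "Tc" (lifted).
    images: Dict[str, str]
    success: bool  # label: was the object successfully grasped at time Tc?

    def label(self) -> int:
        """Binary class index used for training: 1 = success, 0 = failure."""
        return int(self.success)

trial = GraspTrial(
    object_id="obj_001",
    images={"Ta": "ta.png", "Tb": "tb.png", "Tc": "tc.png"},
    success=True,
)
print(trial.label())
```

Keeping the label next to the images makes it harder to mismatch a trial with its outcome when the data is later shuffled or split.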

[0022] Optionally, the robot grasping prediction method based on attention mechanism and low-rank tensor fusion, wherein the step of training and testing the attention mechanism and low-rank tensor fusion model using the training set and the test set to obtain a trained attention mechanism and low-rank tensor fusion model specifically includes:

[0023] The attention-based model network extracts features from the tactile images in the training set, and obtains a one-dimensional feature vector through global average pooling. The one-dimensional feature vector is then input into the tactile feature fusion module.

[0024] The tactile feature fusion module performs low-rank tensor multimodal fusion of the features of the left tactile sensor and the features of the right tactile sensor in the one-dimensional feature vector to obtain fused features, and inputs the fused features into the classifier;

[0025] The classifier classifies the fused features and outputs a prediction result of whether the object was successfully grasped.

[0026] Once all tactile images in the training set have been processed, the test set is input into the attention-based and low-rank tensor fusion model for testing. If the test results meet the preset requirements, the trained attention-based and low-rank tensor fusion model is obtained.

[0027] Optionally, in the robot grasping prediction method based on attention mechanism and low-rank tensor fusion, the attention mechanism-based model network comprises four attention residual blocks, each of which includes a convolutional layer and an attention module.

[0028] The tactile images in the training set are input into the attention-based model network, where features are extracted through the four attention residual blocks and then reduced to a one-dimensional feature vector through global average pooling.

[0029] Optionally, in the robot grasping prediction method based on attention mechanism and low-rank tensor fusion, the one-dimensional feature vector includes the feature $Z_L$ of the left tactile sensor and the feature $Z_R$ of the right tactile sensor;

[0030] The tactile feature fusion module performs low-rank tensor multimodal fusion on the feature $Z_L$ of the left tactile sensor and the feature $Z_R$ of the right tactile sensor to obtain the fused feature $F$:

[0031]

[0032] where the first set of factors are the low-rank factors corresponding to the left tactile sensor modality, the second set are the low-rank factors corresponding to the right tactile sensor modality, $\circ$ denotes element-wise multiplication of vectors, and $r$ is the rank of the low-rank decomposition.
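The formula for paragraph [0031] did not survive in this text. A hedged reconstruction, assuming the standard two-modality low-rank fusion form and using the hypothetical symbols $W_L^{(i)}$ and $W_R^{(i)}$ for the low-rank factors (the patent's own symbols are not recoverable here), might read:

```latex
F = \sum_{i=1}^{r} \left( W_L^{(i)} Z_L \right) \circ \left( W_R^{(i)} Z_R \right)
```

Under this reading, each $W_L^{(i)}$ and $W_R^{(i)}$ maps its sensor's feature vector to the dimension of $F$, so the full fusion weight tensor is never formed. Whether the patent places the sum over $i$ outside the product, as here, or inside each bracket cannot be confirmed from the text.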

[0033] Optionally, in the robot grasping prediction method based on attention mechanism and low-rank tensor fusion, the classifier includes a first fully connected layer and a second fully connected layer.

[0034] The fused feature F is input into a classifier, which performs classification processing on the fused feature F through the first fully connected layer and the second fully connected layer, and outputs a prediction result of whether the object is successfully grasped.

[0035] Furthermore, to achieve the above objectives, the present invention also provides a robot grasping prediction system based on attention mechanism and low-rank tensor fusion, wherein the robot grasping prediction system based on attention mechanism and low-rank tensor fusion includes:

[0036] The data acquisition module is used to acquire a dataset of images of the robot grasping multiple different objects, and to divide the grasping image dataset into a training set and a test set after preprocessing.

[0037] The model building module is used to train and test the attention mechanism and low-rank tensor fusion model using the training set and the test set to obtain the trained attention mechanism and low-rank tensor fusion model. The attention mechanism and low-rank tensor fusion model includes an attention mechanism model network, a tactile feature fusion module and a classifier.

[0038] The grasping prediction module is used to acquire tactile images of the target object grasped by the robot, input the tactile images into a trained model based on attention mechanism and low-rank tensor fusion, and output a prediction result of whether the target object has been successfully grasped.

[0039] Furthermore, to achieve the above objectives, the present invention also provides a terminal, wherein the terminal includes: a memory, a processor, and a robot grasping prediction program based on attention mechanism and low-rank tensor fusion stored in the memory and executable on the processor, wherein when the robot grasping prediction program based on attention mechanism and low-rank tensor fusion is executed by the processor, it implements the steps of the robot grasping prediction method based on attention mechanism and low-rank tensor fusion as described above.

[0040] Furthermore, to achieve the above objectives, the present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a robot grasping prediction program based on attention mechanism and low-rank tensor fusion, which, when executed by a processor, implements the steps of the robot grasping prediction method based on attention mechanism and low-rank tensor fusion as described above.

[0041] In this invention, a dataset of images of a robot grasping multiple different objects is obtained. This dataset is preprocessed and divided into a training set and a test set. The training and test sets are used to train and test an attention-based and low-rank tensor fusion model, resulting in a trained model. This model includes an attention-based network, a tactile feature fusion module, and a classifier. Tactile images of the robot grasping the target object are acquired and input into the trained model. The model then outputs a prediction of whether the object was successfully grasped. This invention enables the model to better capture key features and important information. By fusing multimodal information, it achieves more comprehensive and efficient feature extraction, generating richer fused features and effectively predicting grasp success.

Attached Figure Description

[0042] Figure 1 is a flowchart of a preferred embodiment of the robot grasping prediction method based on attention mechanism and low-rank tensor fusion of the present invention;

[0043] Figure 2 is a schematic diagram of the structure of the attention mechanism-based model network in a preferred embodiment of the robot grasping prediction method based on attention mechanism and low-rank tensor fusion of the present invention;

[0044] Figure 3 is a schematic diagram of the tactile feature fusion module performing feature fusion in a preferred embodiment of the robot grasping prediction method based on attention mechanism and low-rank tensor fusion of the present invention;

[0045] Figure 4 is a flowchart illustrating the training and testing process of the attention mechanism and low-rank tensor fusion model in a preferred embodiment of the robot grasping prediction method based on attention mechanism and low-rank tensor fusion of the present invention;

[0046] Figure 5 is a schematic diagram of a preferred embodiment of the robot grasping prediction system based on attention mechanism and low-rank tensor fusion of the present invention;

[0047] Figure 6 is a schematic diagram of the operating environment of a preferred embodiment of the terminal of the present invention.

Detailed Implementation

[0048] This invention proposes a method for predicting the stability of robot grasping, based on the CBAM attention mechanism and low-rank multimodal fusion. By introducing a CBAM attention module into the convolutional neural network, the feature extraction capability of the neural network is enhanced from both channel and spatial perspectives. The low-rank multimodal fusion method effectively solves the problem of insufficient feature complementarity and information redundancy between two tactile modalities in traditional feature concatenation methods. The proposed method is a novel tactile-based robot grasping stability prediction method. This method can effectively predict whether grasping is successful, and its framework and core ideas are independent of tactile sensors and robot hand types, exhibiting good generalization ability. The model of this invention extracts key features from two tactile sensors and more effectively fuses information from both sensors. By fully utilizing these technologies, the robot can achieve stronger perception capabilities, thereby improving the accuracy of grasping.

[0049] To make the objectives, technical solutions, and advantages of this invention clearer and more explicit, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0050] In a preferred embodiment of the present invention, as shown in Figure 1, the robot grasping prediction method based on attention mechanism and low-rank tensor fusion includes the following steps:

[0051] Step S10: Obtain a dataset of images of the robot grasping multiple different objects, and divide the dataset into a training set and a test set after preprocessing.

[0052] Specifically, a dataset of grasping images of multiple objects collected by the robot through its robotic arm, parallel gripper, two photoelectric tactile sensors, and two depth camera sensors is obtained (tactile information covering the robot's grasping dataset is obtained by downloading a publicly available dataset online). The grasping image dataset is preprocessed, including random horizontal flipping, random cropping, and normalization. The preprocessed grasping image dataset is then divided proportionally to obtain a training set and a test set.

[0053] The dataset used in this invention was acquired using a 7-DOF robotic arm (e.g., the Sawyer robotic arm, which is lightweight and suitable for fine operations in confined workspaces), a parallel gripper (e.g., the Weiss WSG-50 parallel gripper, which uses the parallel gripping principle and can grip objects of various shapes and sizes), two GelSight photoelectric tactile sensors, and a Microsoft Kinect 2 depth camera sensor. In each experiment, images were acquired in three steps: (1) at time Ta, the initial state image of the object before grasping; (2) at time Tb, the contact state image of the grasped object, taken when the gripper has finished closing its fingers but the object is still on the ground; (3) at time Tc, the grasping state image of the grasped object successfully staying in the air for a preset time (2 seconds). Whether the object is successfully grasped at time Tc is used as the label information for the experiment. Thus, each trial includes six images and their corresponding grasp result label. To enhance the generalization performance of the model, the images are resized to 256×256, and 224×224 crops are then randomly sampled from them and randomly rotated. This augmentation increases the diversity of the data and improves the robustness and generalization ability of the model.
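The augmentation described above can be sketched in a few lines of NumPy. This is a minimal illustration that follows the flip, crop, and normalization steps listed for preprocessing; it assumes the image has already been resized to 256×256, and the function name `augment` and the `mean` and `std` values are placeholders that do not come from the source.

```python
import numpy as np

def augment(image, rng, crop=224, mean=0.5, std=0.5):
    """Random crop, random horizontal flip, then normalization.

    `image` is an H x W x C float array in [0, 1], assumed already resized
    to 256 x 256. `mean` and `std` are placeholder values (assumptions).
    """
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)    # random top-left corner of the crop
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]          # horizontal flip
    return (patch - mean) / std

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))          # stand-in for a resized tactile image
out = augment(image, rng)
print(out.shape)  # (224, 224, 3)
```

Random transforms of this kind are normally applied only to training samples; test images would typically get a deterministic crop so that results are repeatable.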

[0054] After obtaining the image dataset, it was divided into a training set and a test set. In this process, sample data from 70 items were randomly selected as the training set, while sample data from the remaining 36 items were used as the test set.
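Splitting by object rather than by individual sample keeps every test object unseen during training. A minimal sketch of such a split, using the 70 and 36 object counts from the text, is shown below; the function name `split_by_object`, the seed, and the object identifiers are hypothetical.

```python
import random

def split_by_object(object_ids, n_train=70, seed=0):
    """Shuffle the unique object ids and split them so no object is in both sets."""
    ids = sorted(set(object_ids))
    random.Random(seed).shuffle(ids)
    return set(ids[:n_train]), set(ids[n_train:])

# 106 hypothetical object ids (70 + 36, matching the counts in the text).
all_ids = [f"obj_{i:03d}" for i in range(106)]
train_ids, test_ids = split_by_object(all_ids)
print(len(train_ids), len(test_ids))  # 70 36
```

Each grasp sample would then be routed to the training or test set according to which set its object id belongs to.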

[0055] Step S20: Train and test the attention mechanism-based and low-rank tensor fusion model using the training set and the test set to obtain the trained attention mechanism-based and low-rank tensor fusion model. The attention mechanism-based and low-rank tensor fusion model includes an attention mechanism-based model network, a tactile feature fusion module, and a classifier.

[0056] Specifically, the attention-based model network extracts features from the tactile images in the training set and obtains a one-dimensional feature vector through global average pooling. This one-dimensional feature vector is then input into the tactile feature fusion module. The tactile feature fusion module performs low-rank tensor multimodal fusion on the features from the left and right tactile sensors in the one-dimensional feature vector to obtain a fused feature, which is then input into the classifier. The classifier performs classification processing on the fused feature and outputs a prediction result indicating whether the object was successfully grasped. After all tactile images in the training set have been processed, the test set is input into the attention-based and low-rank tensor fusion model for testing. If the test results meet preset requirements, the trained attention-based and low-rank tensor fusion model is obtained.

[0057] As shown in Figure 2, an attention module based on CBAM (Convolutional Block Attention Module) is added to the residual structure of the ResNet50 network. This attention mechanism-based model network includes four attention residual blocks, each containing a convolutional layer and an attention module. Within each attention residual block, the CBAM attention module is applied after the three convolutional operations, and the resulting weighted convolutional output is added to the bottleneck input through the residual connection. The same bottleneck structure is stacked multiple times to obtain a deeper network. The tactile images from the training set are input to this attention mechanism-based model network (the CBAM-based ResNet50 network), where features are extracted through the four attention residual blocks and then subjected to global average pooling to obtain a one-dimensional feature vector for subsequent feature fusion.
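To make the attention residual block more concrete, here is a small NumPy sketch of CBAM applied to a convolution output and then added to the block input. It follows the commonly published CBAM design (channel attention from pooled descriptors through a shared two-layer MLP, then spatial attention from a 7×7 convolution); the toy sizes, random weights, reduction ratio, and function names are assumptions, and the three bottleneck convolutions are replaced by a random stand-in tensor.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f, w1, w2):
    """f: (C, H, W). Shared two-layer MLP on avg- and max-pooled channel descriptors."""
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)
    return sigmoid(mlp(f.mean(axis=(1, 2))) + mlp(f.max(axis=(1, 2))))  # (C,)

def spatial_attention(f, kernel):
    """f: (C, H, W); kernel: (2, k, k). Convolve stacked channel-wise avg/max maps."""
    stacked = np.stack([f.mean(axis=0), f.max(axis=0)])      # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    _, h, w = f.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)                                      # (H, W)

def cbam_residual(x, conv_out, w1, w2, kernel):
    """Weight the convolution output with CBAM, then add the block input."""
    f = conv_out * channel_attention(conv_out, w1, w2)[:, None, None]
    f = f * spatial_attention(f, kernel)[None, :, :]
    return np.maximum(x + f, 0.0)                            # ReLU after the sum

rng = np.random.default_rng(0)
C, H, W, reduction = 8, 6, 6, 4                # toy sizes, not from the source
x = rng.standard_normal((C, H, W))             # block input
conv_out = rng.standard_normal((C, H, W))      # stand-in for the three convolutions
w1 = rng.standard_normal((C // reduction, C))
w2 = rng.standard_normal((C, C // reduction))
kernel = rng.standard_normal((2, 7, 7))
y = cbam_residual(x, conv_out, w1, w2, kernel)
pooled = y.mean(axis=(1, 2))                   # global average pooling
print(y.shape, pooled.shape)
```

The final `mean` over the spatial axes corresponds to the global average pooling that turns the last feature map into a one-dimensional feature vector.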

[0058] As shown in Figure 3, the one-dimensional feature vector includes the feature $Z_L$ of the left tactile sensor and the feature $Z_R$ of the right tactile sensor. The tactile feature fusion module performs low-rank tensor multimodal fusion on $Z_L$ and $Z_R$ to obtain the fused feature $F$:

[0059]

[0060] where the first set of factors are the low-rank factors corresponding to the left tactile sensor modality, the second set are the low-rank factors corresponding to the right tactile sensor modality, $\circ$ denotes element-wise multiplication of vectors, and $r$ is the rank of the low-rank decomposition.
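The idea behind low-rank fusion can be checked numerically. The sketch below assumes the usual formulation in which the fusion weight tensor is a sum of `rank` outer products of per-modality factors; it then confirms that projecting each sensor's feature with its own factors and multiplying element-wise gives the same result as contracting the full tensor with the outer product of the two features. All sizes, names, and random values are illustrative, and the constant 1 that some formulations append to each feature is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_l, d_r, d_out, rank = 5, 5, 4, 3   # toy sizes; the source fuses into 4096 dims

z_l = rng.standard_normal(d_l)       # left-sensor feature vector
z_r = rng.standard_normal(d_r)       # right-sensor feature vector

# One factor per modality and per rank index: shape (rank, d_out, d_modality).
w_l = rng.standard_normal((rank, d_out, d_l))
w_r = rng.standard_normal((rank, d_out, d_r))

def low_rank_fusion(z_l, z_r, w_l, w_r):
    """Sum over rank of the element-wise product of per-modality projections."""
    proj_l = w_l @ z_l               # (rank, d_out)
    proj_r = w_r @ z_r               # (rank, d_out)
    return (proj_l * proj_r).sum(axis=0)

def full_tensor_fusion(z_l, z_r, w_l, w_r):
    """Reference: build the full weight tensor and contract it with the outer product."""
    weight = np.einsum("iol,ior->olr", w_l, w_r)   # (d_out, d_l, d_r)
    outer = np.outer(z_l, z_r)                     # (d_l, d_r)
    return np.einsum("olr,lr->o", weight, outer)

fused = low_rank_fusion(z_l, z_r, w_l, w_r)
reference = full_tensor_fusion(z_l, z_r, w_l, w_r)
print(np.allclose(fused, reference))
```

The low-rank path never materializes the `(d_out, d_l, d_r)` tensor, which is the source of the savings when the feature dimensions are in the thousands.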

[0061] The classifier includes a first fully connected layer and a second fully connected layer. A ResNet50 based on CBAM is used to extract features from the two tactile sensors. Then, through low-rank multimodal fusion, these features are integrated into a 4096-dimensional vector. The fused feature F is input into the classifier, which performs classification processing on the fused feature F through the first fully connected layer (FC) and the second fully connected layer (FC), and outputs a prediction result of whether the object was successfully grasped. The output dimension of the first FC layer is 1024, followed by a ReLU function, while the output of the second FC layer is 2-dimensional and used for the final prediction of successful grasping.
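The classifier dimensions given above (4096 to 1024 with ReLU, then 1024 to 2) can be sketched as a plain forward pass. The weights here are random placeholders rather than trained values, and the final softmax is an assumption, since the source only states that the second layer's 2-dimensional output is used for prediction.

```python
import numpy as np

def classifier(fused, w1, b1, w2, b2):
    """FC(4096 -> 1024) + ReLU, then FC(1024 -> 2); returns class probabilities."""
    hidden = np.maximum(w1 @ fused + b1, 0.0)
    logits = w2 @ hidden + b2
    exp = np.exp(logits - logits.max())   # numerically stable softmax (assumed)
    return exp / exp.sum()

rng = np.random.default_rng(0)
fused = rng.standard_normal(4096)                 # stand-in for the fused feature F
w1 = rng.standard_normal((1024, 4096)) * 0.01     # placeholder weights
b1 = np.zeros(1024)
w2 = rng.standard_normal((2, 1024)) * 0.01
b2 = np.zeros(2)
probs = classifier(fused, w1, b1, w2, b2)
print(probs.shape, int(probs.argmax()))
```

Which of the two output indices denotes a successful grasp is a labeling convention that the source does not specify.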

[0062] Specifically, as shown in Figure 4, after data collection, data processing and dataset partitioning are performed to obtain training and test sets. The training set is input into an attention-based model network for processing to obtain two tactile feature vectors. Then, the two tactile feature vectors are fused using a low-rank tensor multimodal fusion module to obtain fused features. The fused features are then input into the classifier for classification to obtain the overall model. The test set is then input into the overall model for testing (e.g., through accuracy, recall, F1 score, etc.). If the test results meet the preset requirements, the trained attention-based and low-rank tensor fusion model is obtained.
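The test metrics mentioned above can be computed directly from predicted and true labels. The following is a small self-contained sketch with made-up labels (1 meaning a successful grasp); the function name `binary_metrics` and the sample data are illustrative, and the acceptance threshold for the "preset requirements" is not given in the source.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = success)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels for eight test grasps.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these labels there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out to 0.75.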

[0063] Step S30: Obtain tactile images of the target object grasped by the robot, input the tactile images into a trained model based on attention mechanism and low-rank tensor fusion, and output a prediction result of whether the target object has been successfully grasped.

[0064] Specifically, after obtaining a trained attention-based and low-rank tensor fusion model through the training and testing sets, the trained attention-based and low-rank tensor fusion model can be directly used to predict the tactile images of new target objects. The tactile images of the target objects grasped by the robot can be directly input into the trained attention-based and low-rank tensor fusion model to directly output the prediction result of whether the target object has been successfully grasped.

[0065] Beneficial effects:

[0066] (1) This invention proposes a method based on CBAM (Convolutional Block Attention Module) attention mechanism and low-rank multimodal fusion to quickly and accurately predict the results of grasping objects.

[0067] (2) By applying a spatial and channel-based attention module in the neural network, the present invention can further enhance the feature extraction capability of tactile modalities, thereby enabling the model to better capture key features and important information.

[0068] (3) This invention applies low-rank multimodal fusion to the two tactile sensors and achieves more comprehensive and efficient feature extraction by fusing multimodal information, which can generate more information-rich fused features.

[0069] Compared with existing grasping prediction methods, this invention uses the CBAM attention mechanism to consider both channel and spatial attention. This combined attention enables the network to capture complex patterns and features in tactile data efficiently, which improves the overall performance of the model. In addition, a low-rank multimodal fusion method is introduced for feature-level fusion of the two tactile sensors. It decomposes the weight tensor used in the fusion process into low-rank weight factors, thereby avoiding the computation of high-dimensional tensors, reducing computational complexity, yielding better fused features, and reducing the risk of model overfitting.
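A rough parameter count shows why the low-rank factorization matters. The arithmetic below assumes 2048-dimensional pooled features per sensor (the usual ResNet-50 output size, not stated in the source), the 4096-dimensional fused vector from the text, and a hypothetical rank of 4; bias terms and any appended constants are ignored.

```python
def full_tensor_params(d_l, d_r, d_out):
    """Weights needed to map the outer product of two features to d_out values."""
    return d_l * d_r * d_out

def low_rank_params(d_l, d_r, d_out, rank):
    """Weights needed when each modality has `rank` factors of shape (d_out, d)."""
    return rank * (d_l + d_r) * d_out

d_l = d_r = 2048   # assumed pooled feature size per tactile sensor
d_out = 4096       # fused feature size given in the text
rank = 4           # hypothetical rank, not given in the source

full = full_tensor_params(d_l, d_r, d_out)
low = low_rank_params(d_l, d_r, d_out, rank)
print(full, low, full // low)  # 17179869184 67108864 256
```

Under these assumptions the factorized form needs 256 times fewer fusion weights, which is consistent with the claimed reduction in computation and overfitting risk.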

[0070] The network structure proposed in this invention has been well validated on the Calandra public dataset, and its prediction performance is improved compared with the original model, proving that the method is feasible.

[0071] Furthermore, the backbone network used in this invention can be replaced by networks other than ResNet-50, such as VGG, AlexNet, etc. The visual-based tactile modal information input in this invention can theoretically be replaced by other modal data, such as visual image information.

[0072] Furthermore, as shown in Figure 5, based on the above-mentioned robot grasping prediction method based on attention mechanism and low-rank tensor fusion, the present invention also provides a robot grasping prediction system based on attention mechanism and low-rank tensor fusion, which includes:

[0073] The data acquisition module 51 is used to acquire a dataset of images of the robot grasping multiple different objects, and to divide the grasping image dataset into a training set and a test set after preprocessing.

[0074] The model building module 52 is used to train and test the attention mechanism and low-rank tensor fusion model using the training set and the test set to obtain the trained attention mechanism and low-rank tensor fusion model. The attention mechanism and low-rank tensor fusion model includes an attention mechanism model network, a tactile feature fusion module and a classifier.

[0075] The grasping prediction module 53 is used to acquire tactile images of the target object grasped by the robot, input the tactile images into a trained model based on attention mechanism and low-rank tensor fusion, and output a prediction result of whether the target object has been successfully grasped.

[0076] Furthermore, as shown in Figure 6, based on the above-mentioned robot grasping prediction method and system based on attention mechanism and low-rank tensor fusion, the present invention also provides a terminal, which includes a processor 10, a memory 20 and a display 30. Figure 6 shows only some of the terminal components; however, it should be understood that it is not required to implement all of the components shown, and more or fewer components may be implemented instead.

[0077] In some embodiments, the memory 20 may be an internal storage unit of the terminal, such as a hard disk or memory. In other embodiments, the memory 20 may be an external storage device of the terminal, such as a plug-in hard disk, smart media card (SMC), secure digital card (SD), flash card, etc. Further, the memory 20 may include both internal and external storage devices. The memory 20 is used to store application software and various types of data installed on the terminal, such as the program code installed on the terminal. The memory 20 can also be used to temporarily store data that has been output or will be output. In one embodiment, the memory 20 stores a robot grasping prediction program 40 based on attention mechanism and low-rank tensor fusion, which can be executed by the processor 10 to implement the robot grasping prediction method based on attention mechanism and low-rank tensor fusion in this application.

[0078] In some embodiments, the processor 10 may be a central processing unit (CPU), a microprocessor, or other data processing chip, used to run program code stored in the memory 20 or process data, such as executing the robot grasping prediction method based on attention mechanism and low-rank tensor fusion.

[0079] In some embodiments, the display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an OLED (Organic Light-Emitting Diode) touchscreen. The display 30 is used to display information on the terminal and to display a visual user interface. The components 10-30 of the terminal communicate with each other via a system bus.

[0080] In one embodiment, when the processor 10 executes the robot grasping prediction program 40 based on attention mechanism and low-rank tensor fusion in the memory 20, it implements the steps of the robot grasping prediction method based on attention mechanism and low-rank tensor fusion as described above.

[0081] The present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a robot grasping prediction program based on attention mechanism and low-rank tensor fusion, wherein when the robot grasping prediction program based on attention mechanism and low-rank tensor fusion is executed by a processor, it implements the steps of the robot grasping prediction method based on attention mechanism and low-rank tensor fusion as described above.

[0082] In summary, this invention provides a robot grasping prediction method and related equipment based on attention mechanism and low-rank tensor fusion. The method includes: acquiring a dataset of grasping images of a robot grasping multiple different objects; preprocessing the grasping image dataset and dividing it into a training set and a test set; training and testing an attention mechanism and low-rank tensor fusion model using the training set and the test set to obtain a trained attention mechanism and low-rank tensor fusion model, wherein the attention mechanism and low-rank tensor fusion model includes an attention mechanism model network, a tactile feature fusion module, and a classifier; acquiring tactile images of the robot grasping the target object; inputting the tactile images into the trained attention mechanism and low-rank tensor fusion model; and outputting a prediction result of whether the target object was successfully grasped. This invention enables the model to better capture key features and important information, achieves more comprehensive and efficient feature extraction by fusing multimodal information, generates richer fused features, and effectively predicts whether grasping is successful.

[0083] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, object, or terminal that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, object, or terminal. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, object, or terminal that includes that element.

[0084] Of course, those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware (such as a processor, controller, etc.). The program can be stored in a computer-readable storage medium, and when executed, it can include the processes described in the above method embodiments. The computer-readable storage medium can be a memory, magnetic disk, optical disk, etc.

[0085] It should be understood that the application of the present invention is not limited to the examples above. Those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications should fall within the protection scope of the appended claims.

Claims

1. A robot grasping prediction method based on attention mechanism and low-rank tensor fusion, characterized in that the method includes:
acquiring a dataset of grasping images of a robot grasping multiple different objects, preprocessing the grasping image dataset, and dividing it into a training set and a test set;
training and testing an attention mechanism and low-rank tensor fusion model using the training set and the test set to obtain a trained attention mechanism and low-rank tensor fusion model, wherein the attention mechanism and low-rank tensor fusion model includes an attention mechanism model network, a tactile feature fusion module, and a classifier;
acquiring tactile images of the target object grasped by the robot, inputting the tactile images into the trained attention mechanism and low-rank tensor fusion model, and outputting a prediction result of whether the target object has been successfully grasped;
wherein acquiring the dataset of grasping images, preprocessing the grasping image dataset, and dividing it into a training set and a test set specifically includes:
acquiring the dataset of grasping images of the robot grasping multiple different objects by means of a robotic arm, a parallel gripper, two photoelectric tactile sensors, and two depth camera sensors;
preprocessing the grasping image dataset, the preprocessing including random horizontal flipping, random cropping, and normalization;
dividing the preprocessed grasping image dataset proportionally to obtain the training set and the test set;
wherein training and testing the attention mechanism and low-rank tensor fusion model using the training set and the test set to obtain the trained attention mechanism and low-rank tensor fusion model specifically includes:
extracting, by the attention mechanism model network, features from the tactile images in the training set, and obtaining a one-dimensional feature vector through global average pooling;
inputting the one-dimensional feature vector into the tactile feature fusion module, which performs low-rank tensor multimodal fusion of the left tactile sensor features and the right tactile sensor features in the one-dimensional feature vector to obtain fused features, and inputs the fused features into the classifier;
classifying, by the classifier, the fused features, and outputting a prediction result of whether the object was successfully grasped;
once all tactile images in the training set have been predicted, inputting the test set into the attention mechanism and low-rank tensor fusion model for testing, and, if the test results meet the preset requirements, obtaining the trained attention mechanism and low-rank tensor fusion model.
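For illustration only, the preprocessing and proportional division recited in claim 1 might be sketched as follows. The claim names the operations (random horizontal flipping, random cropping, normalization, proportional division) but not their parameters, so the crop size, the mean and standard deviation, the 0.8 ratio, and every function name below are assumptions rather than values from the patent.

```python
import random

def random_horizontal_flip(img, rng, p=0.5):
    """Mirror every row with probability p. img is a list of rows of floats."""
    if rng.random() < p:
        return [row[::-1] for row in img]
    return img

def random_crop(img, size, rng):
    """Cut a size x size window at a random position inside the image."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)    # randint is inclusive at both ends
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]

def normalize(img, mean, std):
    """Shift and scale every pixel: (x - mean) / std."""
    return [[(x - mean) / std for x in row] for row in img]

def preprocess(img, size, rng, mean=0.5, std=0.5):
    """Apply flip, crop, and normalization in sequence (assumed order)."""
    img = random_horizontal_flip(img, rng)
    img = random_crop(img, size, rng)
    return normalize(img, mean, std)

def split_dataset(samples, train_ratio, rng):
    """Shuffle a copy of the samples and divide it proportionally."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[(r * 8 + c) / 63 for c in range(8)] for r in range(8)]  # toy 8x8 image in [0, 1]
out = preprocess(image, size=6, rng=rng)
train, test = split_dataset(range(10), train_ratio=0.8, rng=rng)
```

In practice the random flip and crop would normally be applied to training images only, with the test images merely resized and normalized; the claim does not state this either way.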

2. The robot grasping prediction method based on attention mechanism and low-rank tensor fusion according to claim 1, characterized in that the process of acquiring the grasping image dataset includes: at time Ta, acquiring an initial state image of the object before it is grasped; at time Tb, acquiring a contact state image of the grasped object, taken when the gripper has completed finger closure and the object is still on the ground; at time Tc, acquiring a grasping state image of the grasped object after it has been held in the air for a preset time; and using whether the object is successfully grasped at time Tc as the label information of the experiment.

3. The robot grasping prediction method based on attention mechanism and low-rank tensor fusion according to claim 1, characterized in that the attention mechanism model network includes four attention residual blocks, each of which includes a convolutional layer and an attention module; the tactile images in the training set are input into the attention mechanism model network, where features are extracted through the four attention residual blocks and then reduced to a one-dimensional feature vector through global average pooling.
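The data flow of claim 3 (four attention residual blocks followed by global average pooling) might be sketched as below. This is only a structural illustration: the claim does not specify the convolution or the form of the attention module, so `conv_stand_in` (a plain scaling) and `channel_attention` (a sigmoid gate on each channel's mean) are hypothetical placeholders, as are all names and the scale values.

```python
import math

def conv_stand_in(fmap, scale):
    """Hypothetical placeholder for a convolutional layer: scales every value."""
    return [[[v * scale for v in row] for row in ch] for ch in fmap]

def channel_mean(ch):
    """Average of all values in one H x W channel."""
    return sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))

def channel_attention(fmap):
    """Assumed attention form: reweight each channel by sigmoid(channel mean)."""
    out = []
    for ch in fmap:
        gate = 1.0 / (1.0 + math.exp(-channel_mean(ch)))
        out.append([[v * gate for v in row] for row in ch])
    return out

def attention_residual_block(fmap, scale):
    """Convolution, then attention, then a skip connection back to the input."""
    branch = channel_attention(conv_stand_in(fmap, scale))
    return [[[x + b for x, b in zip(x_row, b_row)]
             for x_row, b_row in zip(x_ch, b_ch)]
            for x_ch, b_ch in zip(fmap, branch)]

def global_average_pool(fmap):
    """Collapse each channel to a single number, giving a 1-D feature vector."""
    return [channel_mean(ch) for ch in fmap]

def extract_features(fmap, scales=(0.5, 0.5, 0.5, 0.5)):
    """Run the four blocks in sequence, then pool."""
    for scale in scales:
        fmap = attention_residual_block(fmap, scale)
    return global_average_pool(fmap)

# Toy input: 2 channels of size 2x2.
fmap = [[[0.0, 0.0], [0.0, 0.0]],
        [[1.0, 2.0], [3.0, 4.0]]]
features = extract_features(fmap)  # one value per channel
```

The point of the sketch is the shape change: a C x H x W feature map goes in, and a vector of length C comes out, which is what the fusion module consumes.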

4. The robot grasping prediction method based on attention mechanism and low-rank tensor fusion according to claim 3, characterized in that the one-dimensional feature vector includes features of the left tactile sensor and features of the right tactile sensor; the tactile feature fusion module performs low-rank tensor multimodal fusion of the features of the left tactile sensor and the features of the right tactile sensor to obtain the fused feature F, which is computed from the low-rank factors corresponding to the left tactile sensor modality and the low-rank factors corresponding to the right tactile sensor modality by element-wise multiplication of vectors, where r denotes the rank.
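As a hedged illustration of the fusion in claim 4, the sketch below follows the commonly used low-rank multimodal fusion form, in which each modality's feature vector is projected by its own rank-indexed factors and the projections are multiplied element by element: F = Σ_{i=1..r} (W_l^(i)ᵀ z_l) ∘ (W_r^(i)ᵀ z_r). The symbols z_l, z_r, W_l^(i), W_r^(i), the appended constant 1, the placement of the sum over the rank, and all numeric values are assumptions borrowed from that general technique; they are not taken from the patent's formula.

```python
def project(z, factor):
    """Compute factor^T z, where factor has shape [len(z)][d_out]."""
    d_out = len(factor[0])
    return [sum(z[k] * factor[k][j] for k in range(len(z))) for j in range(d_out)]

def low_rank_fusion(z_left, z_right, factors_left, factors_right):
    """Fuse two feature vectors using r pairs of modality-specific factors.

    factors_left[i] and factors_right[i] are the i-th low-rank factors of the
    left and right tactile modalities; r is the number of such pairs.
    """
    zl = z_left + [1.0]   # appended constant (assumption from the standard technique)
    zr = z_right + [1.0]
    d_out = len(factors_left[0][0])
    fused = [0.0] * d_out
    for w_l, w_r in zip(factors_left, factors_right):  # one pass per rank index
        p_l = project(zl, w_l)
        p_r = project(zr, w_r)
        for j in range(d_out):
            fused[j] += p_l[j] * p_r[j]                # element-wise product
    return fused

# Toy example with rank r = 2 and a 2-dimensional fused feature.
z_left = [1.0, 2.0]
z_right = [3.0]
factors_left = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [0.0, 0.0], [0.0, 1.0]],
]
factors_right = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
fused = low_rank_fusion(z_left, z_right, factors_left, factors_right)  # -> [6.5, 7.5]
```

The benefit of this form is that the full outer-product tensor of the two feature vectors is never built; the cost grows with r times the feature sizes rather than with their product.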

5. The robot grasping prediction method based on attention mechanism and low-rank tensor fusion according to claim 4, characterized in that the classifier includes a first fully connected layer and a second fully connected layer; the fused feature F is input into the classifier, which classifies the fused feature F through the first fully connected layer and the second fully connected layer, and outputs a prediction result of whether the object is successfully grasped.
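A minimal sketch of a two-layer fully connected classifier of the kind recited in claim 5 is given below. The claim specifies only the two fully connected layers; the ReLU between them, the sigmoid output, the 0.5 decision threshold, the layer sizes, and the weight values are all assumptions made for illustration.

```python
import math

def linear(x, weights, bias):
    """Fully connected layer; weights has shape [out][in]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    """Assumed activation between the two layers."""
    return [max(0.0, v) for v in x]

def classify(fused, w1, b1, w2, b2, threshold=0.5):
    """Return (probability of success, boolean prediction)."""
    hidden = relu(linear(fused, w1, b1))      # first fully connected layer
    logit = linear(hidden, w2, b2)[0]         # second fully connected layer, one output
    prob = 1.0 / (1.0 + math.exp(-logit))     # assumed sigmoid output
    return prob, prob >= threshold

# Hypothetical fused feature and hand-picked weights.
fused_feature = [6.5, 7.5]
w1 = [[0.2, -0.1], [-0.3, 0.1]]
b1 = [0.0, 0.0]
w2 = [[1.0, 1.0]]
b2 = [-0.5]
prob, success = classify(fused_feature, w1, b1, w2, b2)
```

A two-output layer followed by softmax would serve equally well for the success/failure decision; the single sigmoid output is used here only to keep the sketch short.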

6. A robot grasping prediction system based on attention mechanism and low-rank tensor fusion, characterized in that the system is used to implement the robot grasping prediction method based on attention mechanism and low-rank tensor fusion as described in any one of claims 1-5, and comprises:
a data acquisition module, used to acquire a dataset of grasping images of a robot grasping multiple different objects, preprocess the grasping image dataset, and divide it into a training set and a test set;
a model building module, used to train and test the attention mechanism and low-rank tensor fusion model using the training set and the test set to obtain the trained attention mechanism and low-rank tensor fusion model, wherein the attention mechanism and low-rank tensor fusion model includes an attention mechanism model network, a tactile feature fusion module, and a classifier;
a grasping prediction module, used to acquire tactile images of the target object grasped by the robot, input the tactile images into the trained attention mechanism and low-rank tensor fusion model, and output a prediction result of whether the target object has been successfully grasped.

7. A terminal, characterized in that the terminal includes: a memory, a processor, and a robot grasping prediction program based on attention mechanism and low-rank tensor fusion that is stored in the memory and executable on the processor; when the robot grasping prediction program based on attention mechanism and low-rank tensor fusion is executed by the processor, it implements the steps of the robot grasping prediction method based on attention mechanism and low-rank tensor fusion as described in any one of claims 1-5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a robot grasping prediction program based on attention mechanism and low-rank tensor fusion, which, when executed by a processor, implements the steps of the robot grasping prediction method based on attention mechanism and low-rank tensor fusion as described in any one of claims 1-5.