A method for acquiring a robot autonomous picking and placing pose in a cluttered scene

By using reinforcement learning and deep feature template matching, the robot can autonomously pick up and place target objects in cluttered scenes, solving the problem of low picking and placing efficiency in traditional methods and achieving autonomous picking and placing with a high success rate.

CN118081758B · Active · Publication Date: 2026-06-26 · HUNAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUNAN UNIV
Filing Date
2024-03-25
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In cluttered scenes, traditional robots struggle to autonomously explore and pick up occluded target objects and place them in specific poses. Existing methods are inefficient when dealing with unknown quantities, shapes, and occluded objects.

Method used

We employ a reinforcement learning-based approach, utilizing an actor-critic deep reinforcement learning model, combined with semantic segmentation and deep feature template matching. The autonomous pose acquisition model removes obstacles through unpredictable action sequences, while the target pose acquisition model achieves high-precision placement through deep feature template matching.

Benefits of technology

In cluttered scenarios, the robot achieved a target pickup success rate of over 80% and a placement success rate of over 90%, improving its flexibility and adaptability in unstructured environments and enhancing the efficiency and accuracy of pickup pose acquisition tasks.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a method for acquiring a robot's autonomous picking and placing poses in cluttered scenes. A color image of the work scene captured by a camera is used as input, and a semantic segmentation model extracts target information and strengthens the representational power of the input. An actor-critic deep reinforcement learning method then autonomously removes obstacles and explores the pose of the target object, after which the target placement pose is obtained by deep feature template matching. Finally, the object is picked up and placed in the specified pose. The scheme can explore an occluded target object in a complex scene and place it in a specific pose. It achieves high-precision positioning of the target placement pose based on the pickup pose, and the target pickup success rate in cluttered scenes can exceed 80%, which improves the overall success rate and stability of the target placement pose acquisition task.

Description

Technical Field

[0001] This invention relates to the field of robotic arms grasping and placing objects in cluttered scenes, specifically a method for obtaining the pose of a robot autonomously picking up and placing objects in cluttered scenes.

Background Technology

[0002] Traditional robotic grasping and placement tasks have been extensively studied and have achieved great success in structured scenarios. Traditional systems utilize prior knowledge of known objects, such as the manipulator's 3D model and the object's physical properties, to find stable force closures for grasping known objects, followed by script planning and motion control. While these systems are robust in structured environments (such as manufacturing), in unstructured environments (such as logistics, agriculture, and homes), target objects are often obscured by obstacles, making traditional systems difficult to deploy. This new method, however, adopts an approach of first exploring and picking up the target before matching and placing it in cluttered scenarios, enabling effective deployment in unstructured environments.

[0003] In the target-picking task, recent data-driven approaches leverage learning algorithms and data (collected from humans or physical experiments) to map visual observations directly to action representations. These approaches are model-agnostic, and the learning model is trained through self-supervision. To mitigate uncertainty and collisions in cluttered scenes, both model-driven and data-driven approaches have investigated non-grasping (non-prehensile) operations, such as pushing, to avoid collisions. Push-grabbing systems have improved with increasing push frequency. Like these approaches, this method learns a cooperative push-grab strategy that rearranges objects in a cluttered scene for collision-free target picking, but it also takes scene complexity into account: instead of assuming that the target is initially visible, it uses instance pushing to explore target instances in challenging cluttered scenes.

[0004] In matching and placement tasks, object-centric representations are typically used. In visual servoing control, extensive research has been conducted on object detection and pose estimation. These methods usually require object-specific training data. Alternative representations, including keypoints and dense descriptors, have proven capable of class-level generalization and representing deformable objects, but still struggle to represent scenes with an unknown number of objects (e.g., piles of small objects) or occluded objects. This method achieves sample-efficient end-to-end learning without object-centric representations and demonstrates that this object placement pose acquisition model better handles tasks requiring precise placement, multi-step sequencing, and closed-loop visual feedback, and can generalize to tasks with unseen objects, variable numbers of objects, and objects of different shapes.

Summary of the Invention

[0005] This invention provides a method for obtaining the pose of autonomous picking and placing of robots in cluttered scenes based on reinforcement learning, enabling the robot to explore occluded target objects in complex scenes and place them in specific poses.

[0006] The technical solution proposed in this invention is as follows:

[0007] On the one hand, a method for obtaining the pose of a robot autonomously picking up objects in cluttered scenes based on reinforcement learning includes the following steps:

[0008] S1. Acquire an image of a scene in the workspace where targets are randomly and haphazardly piled up and obscured by obstacles;

[0009] S2. Using the known camera extrinsic parameters, perform orthogonal projection and semantic segmentation on the image obtained in S1 to obtain the color height map, the depth height map, and the target mask. Then rotate the color height map, depth height map, and target mask by equal angles to obtain the images used as the visual observation.

[0010] Using known camera extrinsic parameters, the image is orthogonally projected to obtain color and depth height maps. A robust semantic segmentation module is used to annotate objects of interest and detect the presence of targets to obtain target prediction masks. The color height map, depth height map, and target prediction mask are rotated at equal angles to obtain height images at different angles for visual observation.

[0011] S3. Construct and train a target picking pose acquisition model based on actor-critic deep reinforcement learning;

[0012] The target picking pose acquisition model based on actor-critic deep reinforcement learning consists of a critic network and two actor executors, namely a Bayesian-based actor executor and a classifier-based actor executor.

[0013] During the training of the target pickup pose acquisition model, visual observation is used as the state representation in reinforcement learning. The critic network evaluates all potential actions based on the state information. The actor actuator executes the best action in that state based on the scores of all actions and experiential knowledge, changes the current state to obtain the next state, and repeats this process to obtain the sequence of actions executed by the actor actuator. After removing the obstacle, the pose corresponding to the next action executed by the actor actuator is used as the final optimal pickup pose. In each iteration of training, visual observation is the input information, and the output information is the non-grasping action and / or grasping action made by the robot based on the current state.

[0014] Potential actions refer to the pre-defined standard actions of a robot corresponding to a pre-defined pixel.

[0015] A state is a visual observation at a given moment. Non-grasping actions refer to pushing, while grasping actions refer to grabbing. Because the scene changes after each action is performed, the robot performs actions based on the changed scene. The unpredictability of scene changes makes the actions unpredictable, so the robot is autonomous.

[0016] S4. The visual observation obtained by processing the real-time acquired image according to S2 is input into the trained target picking pose acquisition model, and a series of unpredictable action sequences are output. The cluttered objects around the target are rearranged through the unpredictable action sequences to remove obstacles and explore the target until the space around the target satisfies the collision-free grasping condition. The optimal target picking pose corresponding to the action executed in this state is then obtained.

[0017] Furthermore, the critic network maps visual observations to the expected rewards of robot actions to measure the Q-value of all executable actions. A larger Q-value indicates a greater reward for the robot after performing an action in that pose.

[0018] The actor's actuator selects the best action to execute based on the Q-values of all executable actions derived from the critic network and pre-defined empirical knowledge.

[0019] If the predicted mask image output by the semantic segmentation module does not contain the target, the target is determined to be invisible. The Bayesian-based actor executor will then predict the best exploration push based on the prior probability of obstacles and the Q value of the general push action to explore the target.

[0020] Conversely, if the predicted mask image output by the semantic segmentation module contains the target, the target is determined to be visible. The actor executor based on the classifier will predict and coordinate the push and grab actions towards the target based on the Q-value of the grab action and the Q-value of the push action.

[0021] The depth height map is rotated counterclockwise 16 times at equal angles to obtain 16 height maps at different angles, which are used to represent different directions of the action.
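The equal-angle rotation can be sketched as follows. This is a minimal NumPy illustration, assuming nearest-neighbor resampling, rotation about the image center, and zero fill outside the image; the patent does not specify an interpolation scheme, and the function names `rotate_nearest` and `rotated_views` are hypothetical.

```python
import numpy as np

def rotate_nearest(img, angle_deg):
    """Rotate a 2D array counterclockwise about its center.

    Nearest-neighbor resampling with zero fill (an assumption made for
    this sketch, not something the patent specifies).
    """
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - cx, ys - cy
    # Inverse mapping: for every output pixel, look up its source pixel.
    src_x = np.rint(c * dx - s * dy + cx).astype(int)
    src_y = np.rint(s * dx + c * dy + cy).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(img)
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

def rotated_views(height_map, num_rotations=16):
    """Return num_rotations copies of the map, rotated in equal steps."""
    step = 360.0 / num_rotations  # 22.5 degrees for 16 rotations
    return [rotate_nearest(height_map, k * step) for k in range(num_rotations)]
```

With 16 rotations, view `k` corresponds to an action direction of `k * 22.5` degrees; the same routine would be applied to the color height map (per channel) and the target mask.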

[0022] Furthermore, the critic network employs a deep Q-function reinforcement learning network, including convolutional layers, feature extraction layers, and push / grab network layers (fully convolutional network). The critic network uses color height images, depth height images, and target prediction masks from different angles as reinforcement learning state representations and as inputs, and outputs a pixel-wise mapping from the visual state space to the action space, i.e., the Q-value of each executable action.

[0023] The Q-value is an effective measure of the quality of the actor actuator's actions.

[0024] Furthermore, the actor actuator combines the Q-values of the executable actions output by the critic network with empirical knowledge to obtain the pushing or grabbing actions that the two actor actuators need to perform in different scenarios.

[0025] The Bayesian-based actor actuator uses the product of the general push action Q-value distribution and the obstacle prior probability distribution as the prior probability of the exploration action. It constructs a multimodal Gaussian kernel with a low peak value based on the poses of the three most recent failed target exploration actions. The kernel function represents the probability of the previous failed exploration action. Each execution uses the probability of the previous failed exploration action as a condition to obtain the posterior probability of the exploration action. The robot executes the exploration action according to the posterior probability.

[0026] Here, the general push action Q-value distribution is obtained by feeding a constant all-one mask into the critic network; the constant all-one mask treats all objects in the workspace as potential targets. The obstacle prior is obtained by encoding obstacles as a probability map: the obstacle prior probability distribution encodes the prior about obstacle edges in the expected push direction.
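The Bayesian exploration step can be sketched as follows, for a single push direction. This is a minimal illustration under stated assumptions, not the patented implementation: the prior is the element-wise product of the general push Q-value map and the obstacle prior map, and the failed-push term is modeled as a likelihood that dips toward zero around each of the three most recent failed push positions. The Gaussian width `sigma`, the dip depth `penalty`, and the function names are hypothetical.

```python
import numpy as np

def gaussian_bump(shape, center, sigma):
    """Unnormalized 2D Gaussian with peak value 1 at `center` (row, col)."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def exploration_posterior(q_push, obstacle_prior, failed_pushes,
                          sigma=5.0, penalty=0.9):
    """Posterior over exploration push positions.

    q_push         : general push Q-value map (all-one target mask)
    obstacle_prior : obstacle prior probability map, same shape
    failed_pushes  : list of (row, col) positions of failed pushes
    """
    # Prior: product of the general push Q-values and the obstacle prior.
    prior = np.clip(q_push, 0.0, None) * obstacle_prior
    # Likelihood: close to 1 everywhere, low near the 3 latest failures.
    likelihood = np.ones_like(prior)
    for center in failed_pushes[-3:]:
        likelihood *= 1.0 - penalty * gaussian_bump(prior.shape, center, sigma)
    posterior = prior * likelihood
    total = posterior.sum()
    return posterior / total if total > 0 else posterior
```

The exploration push would then be taken at the position of the posterior maximum, for example `np.unravel_index(posterior.argmax(), posterior.shape)`.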

[0027] The actor executor based on the binary classifier takes the maximum push action Q value, the maximum grab action Q value, the target boundary occupancy rate, the target boundary occupancy threshold, and the number of consecutive grab failures as input. If the target is visible, the actor executor based on the binary classifier will select and execute the best push action or the best grab action.
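To make the role of the five inputs concrete, the sketch below uses a hand-written decision rule as a stand-in for the trained binary classifier. The patent text here does not give the classifier's form or parameters, so the rule, the `max_failures` limit, and the function name `choose_action` are assumptions for illustration only.

```python
def choose_action(max_q_push, max_q_grasp, border_occupancy,
                  occupancy_threshold, grasp_failures, max_failures=2):
    """Pick "push" or "grasp" for a visible target.

    Hypothetical rule standing in for the binary classifier:
    push when the target boundary is crowded or grasps keep failing,
    otherwise follow the larger of the two maximum Q-values.
    """
    crowded = border_occupancy > occupancy_threshold
    stuck = grasp_failures >= max_failures
    if crowded or stuck:
        return "push"
    return "grasp" if max_q_grasp >= max_q_push else "push"
```

For example, `choose_action(0.4, 0.7, 0.1, 0.3, 0)` selects a grasp, while raising the boundary occupancy above the threshold switches the choice to a push that frees space around the target.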

[0028] The two actuators produce different action sequences: one an exploratory push action sequence, the other a combined push-grab action sequence.

[0029] On the other hand, a method for obtaining the autonomous placement pose of a robot in a cluttered scene based on reinforcement learning is characterized by first using the aforementioned method for obtaining the autonomous picking pose of a robot in a cluttered scene based on reinforcement learning to obtain the optimal picking pose of the target.

[0030] Next, the acquired placement area image is cropped with the target's optimal pickup pose as the center to obtain the local region of the target object. After feature extraction of the local region of the target object and the placement area image, the spatial displacement of the target is predicted by depth feature template matching to obtain the target's optimal placement pose.

[0031] Furthermore, preprocessing is performed on the placement area image before feature extraction. Specifically, the depth image of the placement area is processed and mapped to a three-dimensional point cloud using the depth information. For each pixel, the depth value is used to map it to the corresponding three-dimensional spatial coordinates to form point cloud data. Then, orthogonal projection is used to map the generated three-dimensional point cloud data onto a two-dimensional plane, where each pixel represents a fixed window in three-dimensional space, used to correspond to the pre-set standard actions of the robot.
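The preprocessing above can be sketched as follows, assuming a pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy` and a 4×4 camera-to-workspace transform; these parameters, the workspace bounds, and the function names are hypothetical and are not given in the patent text.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project every depth pixel to a 3D point (camera frame)."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    x = (us - cx) * depth / fx
    y = (vs - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def to_workspace(points, cam_to_world):
    """Apply a 4x4 homogeneous transform to an (N, 3) point array."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (homogeneous @ cam_to_world.T)[:, :3]

def points_to_heightmap(points, bounds, pixel_size):
    """Orthographic top-down projection: each pixel covers a fixed
    pixel_size x pixel_size window and stores the highest point in it."""
    (xmin, xmax), (ymin, ymax), (zmin, zmax) = bounds
    width = int(round((xmax - xmin) / pixel_size))
    height = int(round((ymax - ymin) / pixel_size))
    heightmap = np.zeros((height, width))
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    inside = ((x >= xmin) & (x < xmax) & (y >= ymin) & (y < ymax)
              & (z >= zmin) & (z < zmax))
    cols = np.clip(((x[inside] - xmin) / pixel_size).astype(int), 0, width - 1)
    rows = np.clip(((y[inside] - ymin) / pixel_size).astype(int), 0, height - 1)
    # Keep the maximum height when several points fall into one pixel.
    np.maximum.at(heightmap, (rows, cols), z[inside] - zmin)
    return heightmap
```

Because every pixel covers the same physical window, a pixel index in the height map corresponds directly to a position at which a pre-set standard action can be executed.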

[0032] Furthermore, predicting target spatial displacement by depth feature template matching means that after extracting the local depth features of the target object's local region, the target's local depth features are rotated into multiple directions to serve as templates to match the depth features of the placement area. The target's local depth features are superimposed one by one onto the depth features of the placement area, and the region with the highest feature correlation is found through convolution operation to match the optimal placement pose of the target.

[0033] In a specific example, the local features to be cropped are treated as convolution kernels. The depth features of the placement region are multiplied element-wise with the convolution kernels, and the results are summed to generate the output local feature map of optimal placement.
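The template matching described above can be sketched with plain cross-correlation. This is a simplified illustration, assuming single-channel feature maps and templates rotated only in 90-degree steps via `np.rot90`; the patent's deep features and rotation set are richer, and the function names are hypothetical.

```python
import numpy as np

def correlate_valid(feature_map, kernel):
    """Slide `kernel` over `feature_map`; at each offset multiply
    element-wise and sum (cross-correlation, no padding)."""
    H, W = feature_map.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feature_map[i:i + kh, j:j + kw] * kernel)
    return out

def best_placement(place_features, crop_features, num_rotations=4):
    """Return ((row, col), rotation_index) with the highest correlation."""
    best_score, best_center, best_rot = -np.inf, None, None
    for k in range(num_rotations):
        template = np.rot90(crop_features, k)  # rotated template
        scores = correlate_valid(place_features, template)
        idx = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[idx] > best_score:
            best_score = scores[idx]
            # Report the center of the matched window, not its corner.
            best_center = (int(idx[0]) + template.shape[0] // 2,
                           int(idx[1]) + template.shape[1] // 2)
            best_rot = k
    return best_center, best_rot
```

The returned center plays the role of the optimal placement position, and the rotation index gives the placement orientation relative to the pickup crop.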

[0034] The target local image is specifically a cropped region of size c centered on the optimal picking pose. Then, features are extracted to obtain the features of the target local region.

[0035] Furthermore, the feature extraction network is a dual-stream feedforward FCN network, the input data is the visual observation information of the picking and placement areas, the visual observation information of the picking and placement areas includes the depth image of the picking local area and the depth image of the placement area, and the output is the depth feature of the target local area and the depth feature of the placement area.

[0036] The dual-stream feedforward FCN network uses an hourglass encoder-decoder architecture: each stream is a 43-layer ResNet with a stride of 8, containing 12 residual blocks. The encoder has three stride-2 convolutional layers and the decoder has three bilinear upsampling layers, followed by a softmax over the entire image. Every convolutional layer after the first uses dilation, and ReLU activation functions are interleaved between the layers before the last layer.

[0037] A stride of 8 was chosen to balance maximizing the receptive field coverage of each pixel prediction against minimizing the resolution loss of the network's latent mid-level features.

[0038] Furthermore, during the training process, the feature extraction network decomposes each action into two training labels, which are used to generate binary one-hot pixel maps respectively; the training loss is the cross-entropy between all one-hot pixel maps and the pick-and-place prediction success rate, and Huber loss is used for training on each regression channel.
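The cross-entropy term can be illustrated as follows. This is a minimal NumPy sketch, assuming the network outputs one unnormalized score per pixel and that the softmax is taken over the whole image; the function names are hypothetical, and the Huber term for the regression channels is not shown.

```python
import numpy as np

def pixel_cross_entropy(logits, target_pixel):
    """Cross-entropy between an image-wide softmax over `logits`
    and a one-hot pixel map whose single 1 is at `target_pixel`."""
    flat = logits.ravel()
    flat = flat - flat.max()                       # numerical stability
    log_probs = flat - np.log(np.exp(flat).sum())  # log-softmax
    idx = np.ravel_multi_index(target_pixel, logits.shape)
    return -log_probs[idx]

def pick_place_loss(pick_logits, pick_pixel, place_logits, place_pixel):
    """Each action yields two labels (pick and place); sum both losses."""
    return (pixel_cross_entropy(pick_logits, pick_pixel)
            + pixel_cross_entropy(place_logits, place_pixel))
```

With a one-hot label, the cross-entropy reduces to the negative log-probability that the softmax assigns to the labeled pixel, so the loss falls as the network concentrates probability on the demonstrated pick or place location.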

[0039] Furthermore, the prediction of target spatial displacement through depth feature template matching refers to treating the correlation calculation of pixel-level values related to successful placement as a convolution operation, treating the cropped local features as the convolution kernel, multiplying the depth features of the placement area and the convolution kernel element by element, and summing the results to generate the output optimal placement local feature map, with the center of the placement local area being the optimal placement pose.

[0040] Beneficial effects

[0041] This invention provides a method for autonomously acquiring the picking and placing poses of a robot in cluttered scenes. The method takes visual information of the work scene acquired by a camera as input, and this information is preprocessed before being fed to the models. Specifically, a semantic segmentation model segments a target mask that forms part of the input to the target picking pose acquisition model, which effectively improves the agent's perception of the picking environment and enhances the efficiency and accuracy of the picking pose acquisition task. The target picking pose acquisition model employs a self-designed actor-critic deep reinforcement learning method, enabling the robot to autonomously remove obstacles and explore target object poses in high-dimensional state and action spaces, without manual feature extraction or complex rule design in unstructured scenes. This enhances the system's flexibility, adaptability, and generalization ability. As shown in Figure 11, the scheme can achieve a target pickup success rate of over 80% in cluttered scenes. The target placement pose acquisition model is designed as a deep feature template matching scheme: features extracted by deep learning are matched against predefined templates, which achieves high-precision positioning of the target placement pose based on the pickup pose and thereby improves the success rate and stability of the target placement pose acquisition task.

[0042] The above technical solutions were tested in both simulation and real-world environments to verify the effectiveness of the proposed methods in challenging environments. As shown in Figure 14, in a real environment with the same number of training steps, the success rate of placement and picking using the method of the present invention can reach over 90%, while the success rate of other methods is below 50%.

Attached Figure Description

[0043] Figure 1 is a schematic diagram of the overall system provided in an embodiment of the present invention;

[0044] Figure 2 illustrates the exploration and picking solution provided in an embodiment of the present invention, wherein (a) is an example of a grabbing action on a non-target object (obstacle) in a scene where the target is occluded, (b) is an example of a pushing action to explore the target and / or remove obstacles, and (c) is a picking action facing the target object;

[0045] Figure 3 is a flowchart of the critic network provided in an embodiment of the present invention;

[0046] Figure 4 is a visualization example of the semantic segmentation model prediction provided in an embodiment of the present invention, wherein (a) is an example color image of randomly stacked obstacles and target objects to be segmented, and (b) is an example of the segmented target objects;

[0047] Figure 5 is a visual illustration of the reward setting provided in an embodiment of the present invention, wherein (a) is an example of a predetermined push vector passing through a target mask, (b) is an example of the grabbing space around the target being freed after the push action is performed, (c) is an example of a predetermined grabbing vector passing through a target mask without obstruction, and (d) is an example of the target being successfully picked up;

[0048] Figure 6 is a flowchart of the explorer actuator provided in an embodiment of the present invention;

[0049] Figure 7 is a visualization of the experiential knowledge setting provided in an embodiment of the present invention, wherein (a) is an example of the obstacle prior setting, (b) is an example of the pre-set general push probability, (c) is an example of the multimodal Gaussian kernel setting, and (d) is an example of an exploration push operation performed based on the posterior probability;

[0050] Figure 8 is a flowchart of the coordinating actuator provided in an embodiment of the present invention;

[0051] Figure 9 is an example diagram of a training scenario for the target picking pose acquisition model provided in an embodiment of the present invention;

[0052] Figure 10 is a schematic diagram of the training structure of the target picking pose acquisition model provided in an embodiment of the present invention;

[0053] Figure 11 is a schematic diagram of the training results of the target picking pose acquisition model provided in an embodiment of the present invention;

[0054] Figure 12 is a flowchart of the target placement pose acquisition model provided in an embodiment of the present invention;

[0055] Figure 13 is a visual schematic diagram of the deep template matching process provided in an embodiment of the present invention;

[0056] Figure 14 is a comparative diagram showing the performance of various methods provided in an embodiment of the present invention;

[0057] Figure 15 is a schematic diagram of a real-world experimental scenario provided in an embodiment of the present invention.

Detailed Implementation

[0058] The technical solution of the present invention will be further explained and described below with reference to the accompanying drawings and specific examples.

[0059] Example 1

[0060] This embodiment discloses a method for autonomously acquiring pickup and placement poses in cluttered scenes based on reinforcement learning. The specific steps include:

[0061] S1. Acquire an image of a scene in the workspace where targets are randomly and haphazardly piled up and obscured by obstacles;

[0062] S2. Use known camera extrinsic parameters to orthogonally project the image to obtain color and depth height maps, and use a robust semantic segmentation module to annotate the object of interest to obtain a prediction mask. Detect the existence of the target by using the prediction mask, and rotate the color height map, depth height map and target prediction mask at equal angles to obtain height images at different angles for visual observation.

[0063] S3. Construct and train a target picking pose acquisition model based on actor-critic deep reinforcement learning;

[0064] The target picking pose acquisition model based on actor-critic deep reinforcement learning consists of a critic network and two actor executors, namely a Bayesian-based actor executor and a classifier-based actor executor.

[0065] During the training of the target pickup pose acquisition model, visual observation is used as the state representation in reinforcement learning. The critic network evaluates all potential actions based on the state information. The actor actuator executes the best action in that state based on the scores of all actions and experiential knowledge, changes the current state to obtain the next state, and repeats this process to obtain the sequence of actions executed by the actor actuator. After removing the obstacle, the pose corresponding to the next action executed by the actor actuator is used as the final optimal pickup pose. In each iteration of training, visual observation is the input information, and the output information is the non-grasping action and / or grasping action made by the robot based on the current state.

[0066] Potential actions refer to the pre-defined standard actions of a robot corresponding to a pre-defined pixel.

[0067] A state is a visual observation at a given moment. Non-grasping actions refer to pushing, while grasping actions refer to grabbing. Because the scene changes after each action is performed, the robot performs actions based on the changed scene. The unpredictability of scene changes makes the actions unpredictable, so the robot is autonomous.

[0068] S4. The visual observation obtained by processing the real-time acquired image according to S2 is input into the trained target picking pose acquisition model, and a series of unpredictable action sequences are output. The cluttered objects around the target are rearranged through the unpredictable action sequences to remove obstacles and explore the target until the space around the target satisfies the collision-free grasping condition. The optimal target picking pose corresponding to the action (the action with the maximum Q value) in this state is obtained.

[0069] Because the strategies trained through reinforcement learning are constantly changing according to the environment, the next state is different after each action, so the next action is unpredictable.

[0070] As shown in Figure 1, one embodiment of the solution specifically includes acquiring an image of the cluttered items in the workspace (the scene shown in Figure 2(a)), including an RGB color image and an RGB-D depth image. The color image and depth image are input into the semantic segmentation module to determine the presence of the target and obtain the target mask. The color image, depth image, and target mask are orthogonally projected into the corresponding height maps, which serve as the reinforcement learning state representation. These height maps are continuously input into the target pickup pose acquisition model to obtain a series of unpredictable actions. Through this series of actions, the model explores the target and rearranges the surrounding cluttered objects to obtain the target pickup pose, as shown in Figure 2(b) and (c). The target pickup pose and the placement region image are then input into the target placement pose acquisition model. The model extracts features from the cropped target local region and the placement region, and performs deep feature template matching between the target local region features and the placement region features to output the optimal placement pose prediction. Finally, the target pickup pose prediction and the placement pose prediction are used to parameterize the robot operation, and the robot performs actions to complete the pickup and placement task.

[0071] This invention proposes an actor-critic deep reinforcement learning framework to solve the problem of exploring and predicting the optimal pickup pose when the target is occluded. The critic network is constructed using an action-value function, and it evaluates the scores $Q_p$ and $Q_g$ of the actuator's actions based on visual input and a predetermined reward mechanism. The actor actuator receives the expected return $Q$ ($Q_p$, $Q_g$) from the critic network and, together with its experiential knowledge $D$, performs actions that change the environment in different scenarios. If the target is not visible, the actor actuator performs a pushing action to explore the target. If the target is visible, the classifier-based actor actuator coordinates the selection of pushing and grasping actions. The robotic arm executes either a pushing or a grasping action according to the selected action and the critic network's score $Q$ for that action; that is, the agent's picking strategy is $\pi = f(Q, D)$. When the target can be grasped without collision, the optimal picking pose is output.

[0072] After the optimal target pickup pose is determined from the actions specified by the pickup strategy, the local region centered on the target is cropped. The target local region and the RGB-D image of the placement area are projected to a 3D point cloud and then rendered as an orthographic projection, from which the data of the target local region is extracted. Using template matching, the pixel-level features of the centered crop are overlaid on top of the features of the placement region at each candidate pose, where $o_t$ denotes the observation of the placement area before pickup. The target placement pose acquisition model matches the locally cropped region against a set of template poses to find its optimal placement pose, that is, the one with the highest feature correlation. Finally, the robot executes the pick-and-place action $A_t$ according to the pickup pose and the placement pose, each of which refers to a canonical pick or place action that can be performed at every pixel.

[0073]

[0074] The key aspects of this embodiment lie in the design of the critic network, the actor actuator, and the target placement pose acquisition process:

[0075] 1. Critic Network Design

[0076] Input: RGB image and RGB-D image of the selected region

[0077] Output: push map $Q_p$ and grasp map $Q_g$

[0078] The critic network employs a deep Q-function reinforcement learning network, comprising convolutional layers, feature extraction layers, and push / grab network layers (fully convolutional network). The critic network uses color height images, depth height images, and target prediction masks from different angles as reinforcement learning state representations and as inputs, and outputs a pixel-wise mapping from the visual state space to the action space, i.e., the Q-value of each executable action.

[0079] This embodiment models the critic network as a Markov decision process: in state $s_t$, action $a_t$ is executed, the system transitions to state $s_{t+1}$, and the corresponding reward $R(s_t, a_t, s_{t+1})$ is received. The goal of the critic network is to learn an action-value function. The mapping $Q$ predicts the value (Q-value) of a push or grasp action $a$ in state $s$ under policy $\pi$.

[0080] As shown in Figure 3, a fixed RGB-D camera first captures images (RGB and RGB-D) of a predefined 44.8 cm × 44.8 cm pickup area. The RGB and RGB-D images are fed into a pre-trained semantic segmentation module to obtain a target mask. Next, the RGB, RGB-D, and target mask images are orthogonally projected along the gravity direction according to the known camera extrinsic parameters to construct an RGB color height map $c_t$, an RGB-D depth height map $d_t$, and a target mask height map $m_t$. Each state $s_t$ is represented as the RGB-D-mask height map at time t, i.e., $s_t = (c_t, d_t, m_t)$. The RGB height map, RGB-D depth height map, and mask height map are each rotated counterclockwise 16 times at equal angles to obtain 16 RGB-D-mask height maps at different angles, representing different action directions. These height maps are fed into corresponding two-layer residual networks for feature extraction, and the output features are input into a DenseNet-121 network pre-trained on ImageNet for pixel-level feature extraction, yielding pixel-level feature maps. The push network and the grasp network take the pixel-level feature maps as input and output the push map $Q_p$ and the grasp map $Q_g$. Each pixel in $Q_p$ and $Q_g$ parameterizes a push or grasp primitive, so there is a direct mapping from $Q_p$ and $Q_g$ to the motion primitives, in which each 2D pixel is mapped to a 3D action execution pose via the depth height map. The push map $Q_p$ and the grasp map $Q_g$ can therefore effectively evaluate the scores of the candidate actions.
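The mapping from a pixel of a Q-value map to a 3D action pose can be sketched as follows. The sketch assumes a 224 × 224 height map at 2 mm per pixel (which would cover the stated 44.8 cm × 44.8 cm area), a workspace origin at the map's first pixel, and Q-value maps that have already been rotated back into the workspace frame; none of these details are given in the patent text, and the function names are hypothetical.

```python
import numpy as np

def pixel_to_pose(rotation_idx, row, col, depth_heightmap,
                  workspace_origin=(0.0, 0.0, 0.0),
                  pixel_size=0.002, num_rotations=16):
    """Map (rotation, row, col) to (x, y, z, angle) in workspace units.

    pixel_size is an assumed resolution in meters per pixel.
    """
    x = workspace_origin[0] + col * pixel_size
    y = workspace_origin[1] + row * pixel_size
    z = workspace_origin[2] + depth_heightmap[row, col]
    angle = np.deg2rad(rotation_idx * 360.0 / num_rotations)
    return float(x), float(y), float(z), float(angle)

def best_action(q_maps, depth_heightmap):
    """q_maps has shape (num_rotations, H, W); pick the top-scoring pixel."""
    rot, row, col = np.unravel_index(np.argmax(q_maps), q_maps.shape)
    return pixel_to_pose(rot, row, col, depth_heightmap,
                         num_rotations=q_maps.shape[0])
```

Running `best_action` once on the push maps and once on the grasp maps would give the highest-scoring push pose and grasp pose, whose Q-values the actor then compares.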

[0081] In this embodiment, Light-Weight RefineNet is used as the semantic segmentation module and is pre-trained on the dataset of this embodiment. The training dataset is synthesized from a small amount of labeled data and covers all target candidate instances, object pose variations, and occlusions. The pre-trained semantic segmentation model can robustly segment cluttered scenes, as shown in Figure 4. If the target is segmented, in whole or in part, it is considered visible; otherwise, it is considered invisible.

[0082] The push network and the grasp network share the same fully convolutional structure: a three-layer residual network followed by bilinear upsampling. The three-layer residual network learns image features, and bilinear upsampling then enlarges the feature map so that the output maps back onto the input image.

[0083] The critic network is trained by minimizing the temporal-difference (TD) error δ_t:

[0084]

[0085] Training is performed using the Huber loss function:

[0086]

[0087] where θ_t denotes the parameters of the critic network at time t, and the parameters of the target network are held fixed between iterations. At time t, the gradient is propagated only through the single pixel of the executed motion primitive, while all other pixels backpropagate zero loss. Q(θ_t; s_t, a_t) is the Q-value of taking action a_t in state s_t, a ranges over the set of all actions, and γ is the discount factor. The remaining terms are the Q-value of the action taken in the next state s_{t+1} and the reward obtained by executing action a_t in state s_t and reaching state s_{t+1}. The discount factor lies between 0 and 1 and weighs the immediate against the future benefit of an action; in this embodiment it is set to 0.5.
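As a rough sketch of the training signal described in [0083] to [0087], the following code computes a standard one-step TD target with the stated discount factor of 0.5 and applies a Huber loss at a single pixel. The exact form of the omitted equations is not reproduced here; the standard Q-learning target, the Huber transition point of 1, and all function names are assumptions.

```python
import numpy as np

GAMMA = 0.5  # discount factor stated in the text


def td_target(reward, q_next_target):
    """Standard one-step target: R + gamma * max over next-state Q-values.

    q_next_target is the Q-map produced by the (fixed) target network.
    """
    return reward + GAMMA * float(np.max(q_next_target))


def huber(delta, kappa=1.0):
    """Huber loss; the transition point kappa = 1 is an assumption."""
    abs_d = abs(delta)
    if abs_d < kappa:
        return 0.5 * delta ** 2
    return kappa * (abs_d - 0.5 * kappa)


def single_pixel_loss(q_map, action_index, reward, q_next_target):
    """Loss map that is zero everywhere except at the executed action's pixel."""
    loss_map = np.zeros_like(q_map, dtype=float)
    delta = q_map[action_index] - td_target(reward, q_next_target)
    loss_map[action_index] = huber(delta)
    return loss_map
```

Because every other entry of the loss map is zero, only the pixel (and rotation) of the executed primitive contributes a gradient, which matches the single-pixel backpropagation described above.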

[0088] The reward scheme of the critic network is divided into a pre-action phase and a post-action phase, and the larger of the two phase rewards is used as the reward. The pre-action reward checks whether the action is target-oriented; the post-action reward is given if the action achieves the expected result. The pre-action reward is designed to help the pixel-level learning process, whereas the post-action reward can be relatively sparse, being given only when specific conditions are met.

[0089] If the expected push vector passes through the target mask m_t (as shown in Figure 5(a)), the push reward is R_p(s_t, s_{t+1}) = 0.25. If more space appears around the target object after the push (as shown in Figure 5(b)), the push reward is R_p(s_t, s_{t+1}) = 0.5. To detect this, m_t is dilated to construct the target boundary mask m_b (displayed as a light red mask). If the boundary occupancy value o_b (defined as the number of pixels in m_b above ground level) decreases by more than a certain threshold n_b, an increase in space has been detected, meaning that the push has freed space around the target (as shown in Figure 5(c)), and a reward of 0.5 is given. Similarly, for a grasp whose expected grasp pose lies within m_t, the grasp reward is R_g(s_t, s_{t+1}) = 0.5; if the target is successfully grasped (as shown in Figure 5(d)), R_g(s_t, s_{t+1}) = 1.
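The two-phase reward scheme of [0088] and [0089] can be illustrated with the short sketch below. The reward values come from the text; the function names, the ground-level tolerance ground_eps, and the strict "greater than" comparison against n_b are assumptions made for illustration.

```python
import numpy as np


def boundary_occupancy(depth_heightmap, boundary_mask, ground_eps=0.01):
    """o_b: number of boundary-mask pixels above ground level.

    ground_eps (metres) is an assumed tolerance for "above ground".
    """
    return int(np.count_nonzero((depth_heightmap > ground_eps) & boundary_mask))


def push_reward(passes_through_mask, occupancy_before, occupancy_after, n_b):
    """Larger of the pre-action (0.25) and post-action (0.5) push rewards."""
    pre = 0.25 if passes_through_mask else 0.0
    post = 0.5 if (occupancy_before - occupancy_after) > n_b else 0.0
    return max(pre, post)


def grasp_reward(pose_in_mask, target_grasped):
    """Larger of the pre-action (0.5) and post-action (1.0) grasp rewards."""
    pre = 0.5 if pose_in_mask else 0.0
    post = 1.0 if target_grasped else 0.0
    return max(pre, post)
```

For example, a push that crosses the target mask but frees no space would score 0.25 under this sketch, while the same push freeing more than n_b boundary pixels would score 0.5.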

[0090] To handle sparse rewards, such as those for grasping, the critic network is trained with hindsight prioritized experience replay. Specifically, if a non-target object is grasped at time t, the executed action a_t, the state s_t, the mask m'_t of the grasped object, and the hindsight-relabeled reward are saved for further experience replay training.

[0091] 2. Actor executor design

[0092] Input: push Q-map Q_p and grasp Q-map Q_g

[0093] Output: Robot's actions and optimal pickup pose

[0094] To address the issue of targets being occluded in complex scenes, the actor is designed as two executors: an exploration executor and a coordination executor.

[0095] As shown in Figure 6, when the target is not visible, the exploration executor takes the push map Q_p output by the critic network and the obstacle prior P_c as input and outputs the optimal exploration action. P_c is obtained by encoding, in the form of a probability map, the prior about obstacle edges in the expected push direction (as shown in Figure 7(a)). To explore the pickup region effectively, a constant all-ones mask is fed into the push network to obtain the general push probability map Q_p; the all-ones mask treats every object in the workspace as a potential target (as shown in Figure 7(b)). The product of the general push probability map Q_p and the obstacle prior P_c serves as the prior probability of the exploration action. To keep the robot from getting stuck in a local area, past failures are taken into account: a multimodal Gaussian kernel K_G with a low peak value is constructed from the poses of the three most recent failed actions. The kernel K_G represents the probability of the recent failed exploration actions (as shown in Figure 7(c)). The exploration policy π_e (as shown in Figure 7(d)) acts on the posterior probability:

[0096]

[0097] where * denotes the Hadamard (element-wise) product and a ranges over the set of all exploration actions. The exploration executor then acts according to the policy π_e.
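The exploration policy of [0095] to [0097] might be sketched as below for a single rotation (the 16 orientations are omitted for brevity). The equation itself is not reproduced in the text, so the way the failure kernel K_G enters the posterior is an assumption: here it suppresses recently failed locations through the factor (1 - K_G). The kernel width sigma, the peak value, and the function names are likewise hypothetical.

```python
import numpy as np


def failure_kernel(shape, failed_pixels, sigma=10.0, peak=0.5):
    """Multimodal Gaussian over the three most recent failed push pixels.

    sigma and peak are assumed values; the text only says the peak is low.
    """
    rows, cols = np.indices(shape)
    kernel = np.zeros(shape)
    for r, c in failed_pixels[-3:]:
        bump = peak * np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))
        kernel = np.maximum(kernel, bump)
    return kernel


def explore_action(q_push, obstacle_prior, failed_pixels):
    """Return the pixel maximizing the (assumed) posterior of the exploration push."""
    prior = q_push * obstacle_prior  # Hadamard product of Q_p and P_c
    posterior = prior * (1.0 - failure_kernel(q_push.shape, failed_pixels))
    idx = np.unravel_index(np.argmax(posterior), posterior.shape)
    return tuple(int(i) for i in idx)
```

With no recorded failures the sketch reduces to the argmax of Q_p * P_c; after a failure at the current best pixel, the choice shifts to the next-best region away from that failure.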

[0098] The obstacle prior is encoded by shifting the height map 25 pixels along a fixed axis (approximately twice the closing width of the gripper). Pixels with a sufficient depth difference between the original and shifted height maps are recorded as 1, and all others as 0. This binary image is then filtered with a 25×25 all-ones kernel to obtain a pixel-wise probability map. As with the Q-maps, the height map is rotated into 16 directions to construct 16 probability maps.
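A minimal sketch of this obstacle prior encoding for one orientation is given below. The 25-pixel shift and the 25×25 all-ones kernel come from the text; the depth threshold, the sign convention of the depth difference, the zero padding at the borders, and the normalization by the kernel area are assumptions.

```python
import numpy as np

SHIFT_PX = 25    # shift stated in the text
KERNEL_PX = 25   # side of the all-ones kernel stated in the text


def obstacle_prior(heightmap, depth_threshold=0.01):
    """Binary edge map from a shifted height map, smoothed by a box filter.

    depth_threshold (metres) is an assumed value for "sufficient" difference.
    """
    shifted = np.zeros_like(heightmap)
    shifted[:, SHIFT_PX:] = heightmap[:, :-SHIFT_PX]  # shift along the column axis
    edges = ((heightmap - shifted) > depth_threshold).astype(float)

    # Box filter with an all-ones kernel via an integral image, same-size output.
    pad = KERNEL_PX // 2
    padded = np.pad(edges, pad)
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    h, w = edges.shape
    sums = (integral[KERNEL_PX:KERNEL_PX + h, KERNEL_PX:KERNEL_PX + w]
            - integral[:h, KERNEL_PX:KERNEL_PX + w]
            - integral[KERNEL_PX:KERNEL_PX + h, :w]
            + integral[:h, :w])
    return sums / (KERNEL_PX * KERNEL_PX)  # assumed normalization to [0, 1]
```

Applying the same function to each of the 16 rotated height maps would give the 16 probability maps mentioned above.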

[0099] As shown in Figure 8, when the target is visible, a classifier-based coordinator executes target-oriented actions (push and grasp). Unlike a greedy deterministic strategy, this method uses an action classifier to coordinate pushing and grasping. The binary classifier takes as input the maximum push value q_p = max Q_p, the maximum grasp value q_g = max Q_g, the target boundary occupancy rate r_b, the target boundary occupancy threshold n_b, and the number of consecutive failed grasps c_g. The values q_p and q_g reflect the grasp success rate to some extent. The empirical quantities r_b, n_b, and c_g are included as classifier inputs because (1) r_b and n_b indicate the obstacles around the target, which the network has difficulty learning directly, and (2) if the robot keeps failing to grasp, pushing should be encouraged.

[0100] In each training iteration, if the target is successfully grasped, the program automatically labels the success probability as 1; if the grasp pose lies within the mask m_t but the grasp fails, this indicates dense obstacles around the target, and the program automatically labels the success probability as 0. The coordination executor selects the action type based on the predicted probability and executes the action with the highest Q-value. Denoting the action classifier as f_a, the coordination executor policy π_c is expressed as:

[0101]

[0102]

[0103] where f_a is modeled as a function approximator consisting of three fully connected layers, with batch normalization (BN) layers and ReLU activation layers. The influence of each input variable is represented by the weights, and unimportant variables are discarded by the ReLU activation function.
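The coordination step of [0099] to [0103] can be illustrated as follows. This is a sketch under stated assumptions: the hidden-layer widths, the sigmoid output, inference-mode batch normalization with stored statistics, the 0.5 decision threshold, and the names action_classifier and coordinate are not given in the text.

```python
import numpy as np


def action_classifier(features, layers):
    """Three fully connected layers with BN and ReLU, then a sigmoid output.

    features: [q_p, q_g, r_b, n_b, c_g]. layers: list of three dicts with
    "W" and "b"; the first two also hold BN statistics "mean" and "var".
    """
    h = np.asarray(features, dtype=float)
    for layer in layers[:-1]:
        h = layer["W"] @ h + layer["b"]
        h = (h - layer["mean"]) / np.sqrt(layer["var"] + 1e-5)  # BN, inference mode
        h = np.maximum(h, 0.0)                                  # ReLU
    out = layers[-1]["W"] @ h + layers[-1]["b"]
    return float(1.0 / (1.0 + np.exp(-out[0])))


def coordinate(q_push, q_grasp, grasp_success_prob, threshold=0.5):
    """Choose the action type from the predicted probability, then the best pixel."""
    if grasp_success_prob >= threshold:  # assumed decision rule
        kind, q_map = "grasp", q_grasp
    else:
        kind, q_map = "push", q_push
    idx = np.unravel_index(np.argmax(q_map), q_map.shape)
    return kind, tuple(int(i) for i in idx)
```

A typical use would feed the output of action_classifier into coordinate, so that a low predicted grasp success leads to a push at the highest-valued push pixel.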

[0104] The coordination executor is trained using the binary cross-entropy loss:

[0105]

[0106] where y is the predicted value of the classifier and the other term is the ground-truth label.
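For reference, the standard binary cross-entropy loss can be written as in the sketch below. The source equation is not reproduced in the text, so the averaging over samples and the clipping constant eps are assumptions, as is the function name.

```python
import numpy as np


def binary_cross_entropy(pred, label, eps=1e-7):
    """Mean binary cross-entropy between predictions in (0, 1) and 0/1 labels."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    label = np.asarray(label, dtype=float)
    losses = -(label * np.log(pred) + (1.0 - label) * np.log(1.0 - pred))
    return float(np.mean(losses))
```

The labels here correspond to the automatically generated success labels (1 for a successful grasp, 0 for a failed grasp inside the target mask) described in [0100].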

[0107] During training, n target candidate instances (i.e., objects that the semantic segmentation module can detect) and m ordinary objects (obstacles) are randomly selected and randomly dropped into the pickup workspace (as shown in Figure 9). The robot is given a randomly selected target to grasp; once it successfully grasps the target, a new target is assigned. If no target candidates remain in the workspace, the objects are randomly placed again, and training continues iteratively.

[0108] Multi-stage learning is used to train the target pickup pose acquisition model, as shown in Figure 10. In the first stage, only the critic network is trained to reach a good initial state, and the robot executes an ε-greedy policy π_ε in the cluttered environment. The number of ordinary objects m is then gradually increased according to the difficulty of pushing or grasping. Initially, m = 3 is used to teach basic pushing and grasping. Then m is increased to 8 and training switches to the coordination executor, which learns a coordination policy π_c in dense clutter; meanwhile, the critic network continues to be trained (fine-tuned). Concretely, in the first phase (the first 1000 iterations), only the critic network is trained, with the policy π_ε used to explore, until a high instance grasp success rate is achieved (defined as...). After 1000 iterations, the coordination executor policy π_c is trained. A total of 3000 training iterations were performed, and the training results are shown in Figure 11. The trained model explores for the target through a series of unpredictable pushing actions and uses a series of unpredictable pushing and grasping actions to remove obstacles around the target and obtain the optimal pickup pose.

[0109] 3. Design of the target placement pose acquisition process

[0110] Input: Target optimal pickup pose and placement area image

[0111] Output: Optimal placement pose of the target

[0112] The target placement pose acquisition process uses a fully convolutional network (FCN) to model the action-value function Q_p((u, v) | o_t) associated with successful pickup (here the subscript p denotes pickup):

[0113]

[0114] The visual observation o_t is a projected image reconstructed from the RGB-D image of the scene, defined on a regular pixel grid {(u, v)} at time step t of the sequential rearrangement task. Through camera-to-robot calibration, each pixel in o_t corresponds to the pick-and-place action at that pose. Template matching is performed by cross-correlating two pixel-level feature embeddings, ψ(·) and φ(·):

[0115]

[0116]

[0117] where the first term is a two-dimensional coordinate; the second is the action-value function associated with successful placement, which is also modeled by a fully convolutional network (FCN); and the third covers the space of all possible placement poses. An FCN is inherently translation-equivariant: if the object to be picked is translated in the scene, the picking pose translates accordingly.

[0118] Specifically, as shown in Figure 12, for the pick-and-place task the visual observation o_t is a top-down orthographic view of the pickup and placement areas, generated from 480×640 RGB-D images captured by a camera with known, calibrated intrinsic and extrinsic parameters. The top-down visual observation o_t has a pixel resolution of 160×320, and each pixel represents a 3.125 mm × 3.125 mm vertical column in the three-dimensional workspace.
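The top-down view described above can be approximated by binning 3D points into a 160×320 grid of 3.125 mm cells, which corresponds to a workspace of about 0.5 m × 1.0 m. The sketch below assumes the points are already expressed in the robot frame; the workspace origin, the assignment of x to columns and y to rows, and the choice of keeping the maximum height per cell are assumptions.

```python
import numpy as np

PIXEL_SIZE_M = 0.003125   # 3.125 mm per pixel, from the text
GRID_SHAPE = (160, 320)   # rows x columns, from the text


def points_to_heightmap(points_xyz, origin_xy=(0.0, 0.0)):
    """Project robot-frame 3D points top-down; each cell keeps the max height.

    points_xyz: array of shape (N, 3). origin_xy is an assumed workspace corner.
    """
    heightmap = np.zeros(GRID_SHAPE)
    cols = np.floor((points_xyz[:, 0] - origin_xy[0]) / PIXEL_SIZE_M).astype(int)
    rows = np.floor((points_xyz[:, 1] - origin_xy[1]) / PIXEL_SIZE_M).astype(int)
    inside = ((rows >= 0) & (rows < GRID_SHAPE[0])
              & (cols >= 0) & (cols < GRID_SHAPE[1]))
    # Unbuffered maximum so that several points in one cell are handled correctly.
    np.maximum.at(heightmap, (rows[inside], cols[inside]), points_xyz[inside, 2])
    return heightmap
```

The point cloud itself would come from unprojecting the 480×640 depth image with the calibrated intrinsics and transforming it with the extrinsics, which is not shown here.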

[0119] In this embodiment, the fully convolutional network is a two-stream feedforward FCN in which each stream is a 43-layer ResNet with an overall stride of 8, and the last layer has no non-linear activation. Given the target pickup pose and the visual observation of the pickup and placement areas as input, it outputs the optimal placement pose. During training, the target placement pose acquisition process decomposes each action into two training labels, one for picking and one for placing, which are used to generate binary one-hot pixel maps. For a given pick label, the expected value is obtained by summing over all pixel positions, weighted by the predicted probability of picking at each position. For a given place label, the expected value is obtained in the same way from the predicted probability of placing at each position. These two expected values are used to compute the training loss, which combines the cross-entropy of the pick operation and the cross-entropy of the place operation:

[0120]

[0121] Training is performed on each regression channel using Huber loss, which is capable of learning the action distribution in a multimodal anisotropic space.
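The one-hot label maps and the cross-entropy terms of [0119] can be sketched as below. The image-wide softmax follows the architecture described in the claims; summing the pick and place terms into one loss is an assumption, since the source equation is not reproduced in the text, and the function names are hypothetical.

```python
import numpy as np


def one_hot_map(shape, pixel):
    """Binary one-hot pixel map with a single 1 at the labeled pixel."""
    label = np.zeros(shape)
    label[pixel] = 1.0
    return label


def image_softmax(logits):
    """Softmax taken over every pixel of the image."""
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()


def pixel_cross_entropy(logits, label_pixel):
    """Cross-entropy between the one-hot label map and the predicted distribution."""
    probs = image_softmax(logits)
    label = one_hot_map(logits.shape, label_pixel)
    return float(-(label * np.log(probs + 1e-12)).sum())


def pick_place_loss(pick_logits, pick_pixel, place_logits, place_pixel):
    """Assumed combination: sum of the pick and place cross-entropy terms."""
    return (pixel_cross_entropy(pick_logits, pick_pixel)
            + pixel_cross_entropy(place_logits, place_pixel))
```

Because the label is one-hot, each expectation reduces to the negative log-probability that the network assigns to the labeled pixel.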

[0122] The local features of the target, extracted from a local region of size c centered on the target pickup pose, are cross-correlated with the scene feature map φ(o_t) extracted from the placement area to output pixel-level values associated with successful placement, from which the optimal placement pose is determined.

[0123] A local region centered on the target pickup pose is cropped out. The target local region and the RGB-D image of the placement area are projected into a 3D point cloud and then rendered as an orthographic projection, from which the pixel-level features of the cropped region and of the placement area are extracted. Template matching then overlays the features of the cropped region on the features of the placement area at each candidate pose, where o_t is the observation of the placement area before pickup. Using a set of template poses, the target placement pose acquisition model matches the local cropped region against the placement area to find its optimal placement pose, i.e., the one with the highest feature correlation, as shown in Figure 13.
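The deep feature template matching of [0122] and [0123] amounts to cropping a feature patch around the pickup pixel, rotating it into several template orientations, and cross-correlating each template with the placement-area features. The sketch below uses plain NumPy, an explicit loop instead of a convolution layer, and only four 90-degree rotations as a stand-in for the full set of template poses; the crop size and function names are assumptions.

```python
import numpy as np


def crop_template(features, center, size):
    """Crop a size x size patch of an (H, W, C) feature map around center."""
    r, c = center
    half = size // 2
    return features[r - half:r + half + 1, c - half:c + half + 1]


def cross_correlate(scene, template):
    """Valid cross-correlation: element-wise product and sum at each offset."""
    th, tw = template.shape[:2]
    out = np.zeros((scene.shape[0] - th + 1, scene.shape[1] - tw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(scene[r:r + th, c:c + tw] * template)
    return out


def best_placement(pick_features, pick_center, place_features, size=5):
    """Return ((row, col), rotation_deg) with the highest feature correlation."""
    template = crop_template(pick_features, pick_center, size)
    best_score, best_pixel, best_angle = -np.inf, None, None
    for k in range(4):  # simplified: four 90-degree template rotations
        rotated = np.rot90(template, k, axes=(0, 1))
        scores = cross_correlate(place_features, rotated)
        r, c = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[r, c] > best_score:
            best_score = scores[r, c]
            best_pixel = (int(r) + size // 2, int(c) + size // 2)
            best_angle = k * 90
    return best_pixel, best_angle
```

In practice the feature maps would be the outputs of the two FCN streams, and a finer set of rotation angles would be used; the loop over offsets is what a convolution with the cropped features as the kernel computes in a single pass.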

[0124] In summary, the robot receives the optimal pickup pose and the optimal placement pose, and uses robot kinematics to compute and plan the motions that complete the pick-and-place task. To evaluate the effectiveness of this method, Form2Fit, ConvMLP, and Transporter (the method of this disclosure) are trained in the same simulation environment with a fixed learning rate of 10^-5. As shown in Figure 14, the method provided in this embodiment generally converges faster and performs well after 3000 training iterations, while the other two methods are far inferior in both convergence speed and pick-and-place success rate. The real-world experiment used a UR5 general-purpose robot, a Linux PC, an RG2 gripper, and a depth camera statically mounted above the workstation. The Photoneo camera provided depth images (0.1 mm rated depth accuracy) and grayscale infrared images, both at a resolution of 1032×772. The Kinect camera provided depth and RGB color images at a resolution of 1280×720. The camera was calibrated in the robot coordinate system with a two-step procedure. First, the camera intrinsic parameters were calibrated by capturing multiple images of a large planar QR code panel in different orientations and computing the intrinsics with OpenCV. Second, to calibrate the extrinsic parameters, QR code tags were attached to the UR5 wrist joints and multiple images of the robot were captured at random end-effector poses; these images were then used to solve for the pose of the robot base and the offsets of the QR code tags relative to their respective joints. The Linux system embedded in the robot can collect data from the robot and its cameras, as shown in Figure 15. In the real-world environment, the method provided by this embodiment achieves an 85.5% pick-and-place success rate using a gripper end effector in environments where the target is not visible.

[0125] Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned readable storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0126] It should be understood that, in the embodiments of the present invention, the processor may be a Central Processing Unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor. The memory may include read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.

[0127] It should be emphasized that the examples described in this invention are illustrative rather than limiting. Therefore, this invention is not limited to the examples described in the specific embodiments. Any other embodiments derived by those skilled in the art based on the technical solutions of this invention, without departing from the spirit and scope of this invention, whether modifications or substitutions, are also within the protection scope of this invention.

Claims

1. A method for autonomously acquiring the pose of a robot in a cluttered scene, characterized in that, Includes the following steps: S1. Acquire an image of a scene in the workspace where targets are randomly and haphazardly piled up and obscured by obstacles; S2. Using the known camera extrinsic parameters, perform orthogonal projection and semantic segmentation on the image obtained in S1 to obtain color, depth height map and target mask. Then, perform isoangular rotation on the color, depth height map and target mask to obtain an image for visual observation. S3. Construct and train a target picking pose acquisition model based on actor-critic deep reinforcement learning; The target picking pose acquisition model based on actor-critic deep reinforcement learning consists of a critic network and two actor executors, namely a Bayesian-based actor executor and a binary classifier-based actor executor. During the training of the target pickup pose acquisition model, visual observation is used as the state representation in reinforcement learning. The critic network evaluates all potential actions based on the state information. The actor actuator executes the best action in that state based on the scores of all actions and experiential knowledge, changes the current state to obtain the next state, and repeats this process to obtain the sequence of actions executed by the actor actuator. After removing the obstacle, the pose corresponding to the next action executed by the actor actuator is used as the final optimal pickup pose. In each iteration of training, visual observation is the input information, and the output information is the non-grasping action and / or grasping action made by the robot based on the current state. S4. The visual observation obtained by processing the real-time acquired image according to S2 is input into the trained target picking pose acquisition model, and a series of unpredictable action sequences are output. 
The cluttered objects around the target are rearranged through the unpredictable action sequences to remove obstacles and explore the target until the space around the target satisfies the collision-free grasping condition. The optimal target picking pose corresponding to the action executed in this state is then obtained. The actor actuator combines the Q-value of the executable actions output by the critic network with empirical knowledge to obtain the push or grab actions that the two actor actuators need to perform in different scenarios. The Bayesian-based actor actuator uses the product of the general push action Q-value distribution and the obstacle prior probability distribution as the prior probability of the exploration action. It constructs a multimodal Gaussian kernel with a low peak value based on the poses of the three most recent failed target exploration actions. The kernel function represents the probability of the previous failed exploration action. Each execution uses the probability of the previous failed exploration action as a condition to obtain the posterior probability of the exploration action. The robot executes the exploration action according to the posterior probability. Among them, the general push action Q-value distribution is obtained by feeding a constant all-one mask into the critic network. The probability constant all-one mask represents all objects in the workspace as potential targets. The obstacle prior is obtained by encoding obstacles in the form of a probability map. The obstacle prior probability distribution encodes the prior about the edge of the obstacle in the expected push direction. The actor executor based on the binary classifier takes the maximum push action Q value, the maximum grab action Q value, the target boundary occupancy rate, the target boundary occupancy threshold, and the number of consecutive grab failures as input. 
If the target is visible, the actor executor based on the binary classifier will select and execute the best push action or the best grab action.

2. The method according to claim 1, characterized in that, The critic network maps visual observations to the expected rewards of robot actions to measure the Q-value of all executable actions. The larger the Q-value, the greater the reward the robot receives after performing the action in that pose. The actor actuator selects the best action to execute based on the Q-values of all executable actions derived from the critic network and pre-defined empirical knowledge. If the predicted mask image output by the semantic segmentation module does not contain the target, the target is determined to be invisible. The Bayesian-based actor executor will then predict and execute the best exploration push action based on the prior probability of obstacles and the Q value of the general push action to explore the target. Conversely, if the predicted mask image output by the semantic segmentation module contains the target, the target is determined to be visible. The actor executor based on the binary classifier will predict and coordinate the push and grab actions towards the target based on the Q value of the grab action and the Q value of the push action.

3. The method according to claim 1, characterized in that, The critic network employs a deep Q-function reinforcement learning network, including convolutional layers, feature extraction layers, and push / grab network layers. The critic network uses color and depth-height images at different angles and target prediction masks as reinforcement learning state representations and inputs, and outputs a pixel-wise mapping from the visual state space to the action space, i.e., the Q-value of each executable action.

4. A method for obtaining the autonomous placement pose of a robot in a cluttered scene, characterized in that, First, the optimal pickup pose of the target is obtained by using the method described in any one of claims 1-3; Next, the acquired placement area image is cropped with the target's optimal pickup pose as the center to obtain the local region of the target object. After feature extraction of the local region of the target object and the placement area image, the spatial displacement of the target is predicted by depth feature template matching to obtain the target's optimal placement pose.

5. The method according to claim 4, characterized in that, Before feature extraction from the placement area image, preprocessing is performed. Specifically, the depth image of the placement area is processed and mapped to a 3D point cloud using the depth information. For each pixel, the depth value is used to map it to the corresponding 3D spatial coordinates to form point cloud data. Then, orthogonal projection is used to map the generated 3D point cloud data onto a 2D plane, where each pixel represents a fixed window in 3D space, used to correspond to the pre-set standard actions of the robot.

6. The method according to claim 4, characterized in that, Predicting target spatial displacement by depth feature template matching refers to extracting local depth features from a local area of the target object, rotating these local depth features into multiple directions to serve as templates for matching the depth features of the placement area, and then superimposing the target local depth features onto the placement area depth features one by one. Convolution operations are then used to find the region with the highest feature correlation in order to match the optimal placement pose of the target.

7. The method according to claim 4, characterized in that, The feature extraction network is a dual-stream feedforward FCN network. The input data is the visual observation information of the picking and placement areas, which includes the depth image of the picking local area and the depth image of the placement area. The output is the depth features of the target local area and the depth features of the placement area. The dual-stream feedforward FCN network uses an hourglass encoder-decoder architecture: each stream is a 43-layer ResNet with 12 residual blocks and an overall stride of 8. The encoder has three stride-2 convolutional layers, the decoder has three bilinear upsampling layers, followed by a softmax layer over the entire image. Every convolutional layer after the first uses dilation, and ReLU activation functions are interleaved between the layers before the last layer.

8. The method according to claim 7, characterized in that, During training, the feature extraction network decomposes each action into two training labels, which are used to generate binary one-hot pixel maps. The training loss is the cross-entropy between all one-hot pixel maps and the pick-and-place prediction success rate. Huber loss is used for training on each regression channel.

9. The method according to claim 6, characterized in that, The method of predicting target spatial displacement by matching deep feature templates refers to treating the correlation calculation of pixel-level values related to successful placement as a convolution operation, treating the cropped local features as the convolution kernel, multiplying the depth features of the placement area and the convolution kernel element by element, and summing the results to generate the output optimal placement local feature map, with the center of the placement local area being the optimal placement pose.