A flexible object manipulation method and device based on key point detection
A key point detection method, combined with a dual-arm robot and Transformer networks, enables efficient and precise fabric manipulation. It addresses the accuracy, adaptability, and cost problems of existing fabric-handling techniques and broadens the range of applications.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TONGJI UNIV
- Filing Date
- 2024-02-19
- Publication Date
- 2026-06-23
Figure CN117901114B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of intelligent robot technology and computer vision, and in particular to a method and apparatus for manipulating flexible objects based on key point detection.
Background Technology
[0002] With the rapid development of artificial intelligence (AI) and machine learning (ML) technologies, intelligent robots are being used more and more widely in home life. Service robots in particular are increasingly used for household chores, with functions including, but not limited to, folding fabrics, making beds, and organizing clothes. These robots are designed to improve the convenience and efficiency of home life, and are particularly important for the elderly or those requiring special care.
[0003] Fabric manipulation has always been a common yet challenging task in fields such as textile manufacturing, logistics, home laundry, healthcare, and hospitality. The efficient completion of these tasks is limited by the complex dynamics and infinite degrees of freedom of fabrics, traditionally relying primarily on manual operation. Early robotic fabric handling was mainly based on scripted strategies; while these methods were effective for specific types of clothing, configurations, and tasks, they were generally slow and difficult to apply to diverse fabric configurations.
[0004] Single-arm robot manipulation and complex iterative algorithms have been introduced to attempt to automate these tasks. However, these methods still have limitations in terms of operational accuracy and adaptability, and are time-consuming. Due to the infinite degrees of freedom and complex dynamics of cloth, robot manipulation faces significant challenges; even slight errors can lead to irreversible wrinkles. In recent years, learning-based methods have begun to show potential in cloth handling, particularly the application of model-free reinforcement learning, which aims to improve the efficiency and versatility of robot manipulation. These methods typically combine images of the current cloth state with images of the target shape to guide robot operations. While these learning-based methods have made progress in specific tasks such as unfolding, smoothing, and folding, they often rely on expensive human demonstrations and annotations, increasing implementation costs. Overall, although cloth handling has broad application prospects in multiple fields, existing technologies still face challenges in terms of operational accuracy, adaptability, efficiency, and generalization to more cloth handling tasks. This invention aims to develop an improved cloth handling technology to address these problems and enhance robot performance in complex cloth manipulation tasks.
Summary of the Invention
[0005] The purpose of this invention is to overcome the shortcomings of the prior art by providing a flexible object manipulation method and apparatus based on key point detection.
[0006] The objective of this invention can be achieved through the following technical solutions:
[0007] This invention provides a method for manipulating flexible fabric based on key point detection, comprising the following steps:
[0008] Step 1: Construct a workspace for the dual-arm robot. The workspace includes a top depth camera, which captures real-time visual data of the fabric state and the surrounding environment within the manipulation area.
[0009] Step 2: Define multiple action primitives;
[0010] Step 3: Use VIT-Transformer to decode various action primitives, generate operation strategies, and output the actions to be executed;
[0011] Step 4: Use Swin-Transformer to detect and identify key points on the fabric, select an operation strategy, and generate motion commands;
[0012] Step 5: Send the motion command to the dual-arm robot, and have the dual-arm robot execute the motion command.
[0013] Furthermore, the various action primitives include pick-up and place action primitives; throw action primitives; drag action primitives; and fold action primitives.
[0014] Furthermore, the picking and placing action primitives are executed by a single-arm robot, accompanied by the second arm pressing the fabric at a point on the placement posture extension line.
[0015] Furthermore, the throwing action primitive is executed by a dual-arm robot, which grabs the clothing according to a given picking posture, lifts and stretches the clothing above the workspace until a preset force threshold is reached, and then the two arms dynamically swing while gradually lowering the height to bring the clothing toward the workspace, thus completing the throwing action.
[0016] Furthermore, the dragging action primitive is executed by a dual-arm robot, which drags the fabric away from its center by a fixed distance based on two pick-up points, using the friction between the fabric and the workspace to smooth wrinkles or adjust the position of the fabric.
[0017] Furthermore, the folding action primitive is executed by a dual-arm robot, and the picking and placing postures are determined by key point detection to achieve the folding action.
[0018] Furthermore, during the folding action, if the smoothness is greater than a specific threshold, the coverage is higher than the standard, and the number of detected key points is consistent with the preset number, then the folding action primitive is used to perform the folding operation of the fabric according to the position of the key points.
[0019] Furthermore, a reward function is constructed to measure smoothness, coverage, and number of keypoints, expressed by the following formula:
[0020] $$R_{all,t+1} = \lambda_1 R_{smooth,t+1} + \lambda_2 R_{cov,t+1} + \lambda_3 R_{key,t+1}$$
[0021] where $R_{all,t+1}$ is the reward function, $R_{smooth,t+1}$ is the smoothness term, $R_{cov,t+1}$ is the coverage term, $R_{key,t+1}$ is the key point count term, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weight parameters.
[0022] Furthermore, the smoothness is expressed by the following formula:
[0023]
[0024] where $d_{t,i}$ and $d_{t+1,i}$ represent the depth value of the i-th pixel at times $t$ and $t+1$, $D_t$ and $D_{t+1}$ represent the depth information of all pixels in the depth dataset $D$ at time steps $t$ and $t+1$, and $\mu$ is the average depth value.
[0025] The coverage rate is expressed by the following formula:
[0026] $$R_{cov,t+1} = cov_{t+1} - cov_t$$
[0027] where $cov_{t+1}$ represents the maximum coverage of the working plane occupied by the image acquired by the depth camera at time $t+1$, and $cov_t$ represents the maximum coverage of the working plane occupied by the image acquired by the depth camera at time $t$;
[0028] The number of key points is expressed by the following formula:
[0029]
[0030] Where k represents the number of keypoints detected at time t+1, and m is the preset number of keypoints.
[0031] Secondly, the present invention provides a flexible fabric manipulation device based on key point detection, comprising:
[0032] The visual data acquisition module is used to capture visual data of the fabric status and surrounding environment in the manipulation area in real time.
[0033] The action primitive generation module is used to define various action primitives;
[0034] The VIT-Transformer cloth feature processing module is used to decode action primitives, generate operation strategies, and output the actions to be executed.
[0035] The Swin-Transformer key point detection and recognition module is used to detect and recognize key points on the fabric, select operation strategies, and generate action instructions.
[0036] The operation strategy generation and execution module is used to execute action instructions.
[0037] Compared with the prior art, the present invention has the following beneficial effects:
[0038] (1) Improved operational efficiency: By adaptively selecting dynamic throwing and quasi-static actions, such as picking up and placing, the present invention can achieve higher fabric coverage with fewer interaction steps, significantly improving the overall efficiency of fabric handling.
[0039] (2) Enhanced adaptability: The key point-based clothing representation simplifies the infinite degree of freedom of fabrics. This invention can adapt to a variety of different types and sizes of fabrics, improving its application flexibility in complex scenarios.
[0040] (3) Improved accuracy: By using a top depth camera and a Swin-Transformer feature extraction module, the present invention can accurately detect the fabric state and key points, thereby ensuring the precise execution of actions, especially during the folding process.
[0041] (4) Reduced implementation costs: By using a self-supervised learning method, this invention reduces the reliance on expensive human demonstrations and annotations, while improving the efficiency and automation of the learning process.
[0042] (5) Wide range of applications: This invention is not only applicable to fabric processing in daily life, such as clothing folding, but can also be extended to textile manufacturing, logistics, healthcare and hospitality industries.
Attached Figure Description
[0043] Figure 1 is a schematic diagram of the steps of the present invention;
[0044] Figure 2 is a schematic diagram of the fabric feature processing module based on the Transformer structure of the present invention;
[0045] Figure 3 is a schematic diagram illustrating the unfolding and folding of the fabric according to the present invention.
Detailed Implementation
[0046] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.
[0047] Example
[0048] This embodiment provides a flexible object manipulation device based on key point detection. The device includes: a visual data acquisition module for real-time capture of visual data of the fabric state and surrounding environment within the manipulation area; an action primitive generation module for defining various action primitives; a VIT-Transformer fabric feature processing module for decoding the action primitives, outputting the actions to be executed, and flattening the wrinkled fabric; a Swin-Transformer key point detection and recognition module for extracting features from the visual data, reducing the infinite-degree-of-freedom fabric state to a finite key point representation, selecting an operation strategy, and generating action instructions; and an operation strategy generation and execution module for executing the action instructions through a dual-arm robot.
[0049] This embodiment provides a flexible object manipulation method based on key point detection. As shown in Figure 1, the specific method for unfolding and folding is implemented using the aforementioned device and includes the following steps:
[0050] Step 1: Construct a workspace for the dual-arm robot. This workspace is equipped with a top depth camera. The depth camera at the top of the workspace captures real-time visual data of the fabric state and the surrounding environment within the manipulation area.
[0051] In this invention, to achieve precise manipulation of deformable objects such as clothing, this embodiment provides a flexible object manipulation method based on key point detection. The learning objective of this method is to learn a deformable object manipulation strategy $\pi_\theta$ parameterized by $\theta$. Based on the visual observation $o_t \in \mathbb{R}^{W \times H \times C}$ of the clothing configuration $s$ at the current time $t$, this strategy generates and executes a series of actions $\{a_t\}$ (where $t = 0, 1, 2, \ldots, T$). These actions are calculated and executed in a closed-loop manner, with the aim of transferring the clothing from an arbitrary initial configuration to a user-defined target configuration. The state transitions are given by formulas (1) and (2), where $\tau$ represents the state transition transformation and $o_{t+1}$ represents the fabric state obtained after performing action $a_t$ on the visual observation $o_t$ at time $t$.
[0052] $$a_t \leftarrow \pi(o_t, s) \qquad (1)$$
[0053] $$o_{t+1} \leftarrow \tau(o_t, a_t) \qquad (2)$$
[0054] The manipulation method in this embodiment relies on a dual-arm robot equipped with parallel grippers, which executes an action type $m$ from a predefined set of action primitives $M$. Each action primitive is parameterized by two planar gripper poses, where $x$ and $y$ are coordinates in pixel space and $\theta$ is the end-effector rotation angle about the z-axis, as shown in formula (3):
[0055] $$a_t = \{m, (x_1, y_1, \theta_1), (x_2, y_2, \theta_2)\} \qquad (3)$$
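The action tuple of formula (3) and the closed loop of formulas (1) and (2) can be sketched as follows. This is a minimal illustration, not the patented implementation: the class and function names, the use of radians, and the fixed step budget `max_steps` are assumptions, and `policy` and `transition` stand in for the learned strategy and the real robot/camera loop.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, List, Tuple

import numpy as np


class Primitive(Enum):
    """The four action primitive types m named in this embodiment."""
    FLING = "fling"
    PICK_PLACE = "pick_place"
    DRAG = "drag"
    FOLD = "fold"


@dataclass(frozen=True)
class GripperPose:
    """Planar gripper pose: pixel coordinates plus rotation about the z-axis."""
    x: int
    y: int
    theta: float  # radians (the unit is an assumption)


@dataclass(frozen=True)
class Action:
    """a_t = {m, (x1, y1, theta1), (x2, y2, theta2)} as in formula (3)."""
    primitive: Primitive
    pose1: GripperPose
    pose2: GripperPose


def rollout(
    policy: Callable[[np.ndarray, Any], Action],
    transition: Callable[[np.ndarray, Action], np.ndarray],
    o0: np.ndarray,
    s: Any,
    max_steps: int,
) -> Tuple[np.ndarray, List[Action]]:
    """Closed loop: a_t <- pi(o_t, s), then o_{t+1} <- tau(o_t, a_t)."""
    observation = o0
    actions: List[Action] = []
    for _ in range(max_steps):
        action = policy(observation, s)                # formula (1)
        observation = transition(observation, action)  # formula (2)
        actions.append(action)
    return observation, actions
```

In a real system `transition` would execute the action on the robot and return the next camera frame; here it is only a callable so that the loop structure can be exercised in isolation.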
[0056] Step 2: Define multiple action primitives.
[0057] This embodiment employs a multi-primitive strategy to efficiently handle fabric unfolding and folding tasks. This method integrates quasi-static and dynamic action primitives, such as pick-up and place, and high-speed throwing, aiming to optimize the number and efficiency of operation steps while reducing the total number of required primitives, thus simplifying the entire operation process. These primitives effectively utilize the friction between the fabric and the contact surface, as well as the physical properties of the fabric and the dynamics of the action itself, thereby improving processing effectiveness and efficiency.
[0058] Specifically, this embodiment provides four types of action primitives, including the following:
[0059] Fling action primitive: This action is performed by the dual-arm robot, which grasps the clothing and stretches it above the workspace. Once a preset force threshold is reached, the arms perform a dynamic swing to throw the clothing toward a specific location within the workspace. This action is suitable for quickly unfolding fabric, especially in situations requiring rapid handling of large quantities of fabric.
[0060] Pick-and-place action primitive: One arm performs the pick and place while the second arm presses down on the fabric at a point along the extension line of the placement pose. This primitive enables the robot to correct localized errors, such as folded corners or sleeves on clothing, and is particularly useful for fine adjustments to fabric.
[0061] Drag action primitive: The robot drags the fabric a fixed distance away from its center based on two pick-up points, using the friction between the fabric and the workspace to smooth wrinkles or adjust the fabric's position. This action is suitable for handling complex folds or wrinkles in fabric, improving the accuracy and efficiency of fabric handling.
[0062] Fold action primitive: The two robotic arms perform coordinated pick and place actions. Key point detection determines the picking and placing poses, enabling precise folding. This primitive applies to fine folding tasks on various fabrics, adapting to different sizes and materials, and provides efficient and accurate folding results.
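As an illustration of how a primitive turns pick points into motion targets, the drag primitive can be sketched geometrically. The text only states that the fabric is dragged a fixed distance away from its center based on two pick-up points; the choice of direction (from the fabric center through the midpoint of the two pick points) and the function name `drag_targets` are assumptions made for this sketch.

```python
import numpy as np


def drag_targets(pick1, pick2, center, distance):
    """Return the two release points for a drag.

    Both pick points are shifted by `distance` pixels along the direction
    from the fabric center to the midpoint of the pick points (assumed).
    """
    pick1 = np.asarray(pick1, dtype=float)
    pick2 = np.asarray(pick2, dtype=float)
    center = np.asarray(center, dtype=float)

    direction = (pick1 + pick2) / 2.0 - center
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("pick midpoint coincides with the fabric center")

    step = distance * direction / norm
    return pick1 + step, pick2 + step
```

For example, pick points (10, 0) and (10, 10) with the fabric center at (0, 5) and a distance of 5 pixels give release points (15, 0) and (15, 10): both grippers move straight away from the center.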
[0063] Step 3: Decode the defined action primitives using VIT-Transformer. The core of this invention lies in employing an operation policy generator based on VIT-Transformer (Vision Transformer, a Transformer-based image model). In this embodiment, this policy generator focuses on the flattening of the fabric.
[0064] A key component of this invention is the Transformer module, specifically designed to process and parse image data of fabric to generate precise operational instructions. This module includes the following main steps:
[0065] (1) Image preprocessing and segmentation:
[0066] First, the fabric image data is divided into equally sized image patches, which capture local features of the fabric, as shown in Figure 2.
[0067] (2) Linear mapping and embedding generation:
[0068] After each image patch is flattened, it is input into a linear mapping layer (fully connected layer), where it is mapped into high-dimensional vectors (embeddings) for subsequent depth processing.
[0069] (3) Introduction of Class Tokens:
[0070] To integrate global information, the system introduces a class token before the embedded image patch sequence. This is a learnable tensor with the same dimension as the image patch embedding.
[0071] (4) Application of position encoding:
[0072] The system assigns positional codes to the class token and to the features of each image patch. These codes are learnable, ensuring that the model can understand the relative position of the image patch within the entire fabric.
[0073] (5) Input to the Transformer encoder:
[0074] The sequence consisting of the class token and the image patch features, with positional encoding added, is fed into the Transformer encoder. The encoder processes this information through a multi-head self-attention mechanism and multilayer perceptron blocks.
[0075] (6) Extracting and applying the class token embedding:
[0076] In the encoder's output, the first embedding (i.e., the embedding corresponding to the class token) is extracted. This embedding contains overall cloth information and is input into the system's decoder. The decoder further parses this information to generate specific action primitives, such as the Fling, Drag, and Pick action primitives.
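Steps (1) to (6) above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the network of the invention: weights are random rather than learned, the encoder is reduced to a single residual single-head attention layer (the multi-head attention and multilayer perceptron blocks are omitted), and all function and parameter names are hypothetical.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Step (1): split an (H, W, C) image into flattened patch x patch tiles."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return tiles.reshape(-1, patch * patch * c)


def init_params(num_patches: int, patch_dim: int, dim: int, seed: int = 0) -> dict:
    """Random stand-ins for the learnable tensors (assumption: no training here)."""
    rng = np.random.default_rng(seed)
    return {
        "w_embed": rng.normal(0.0, 0.02, (patch_dim, dim)),  # linear mapping layer
        "cls": rng.normal(0.0, 0.02, (1, dim)),              # class token
        "pos": rng.normal(0.0, 0.02, (num_patches + 1, dim)),  # positional codes
        "w_q": rng.normal(0.0, 0.02, (dim, dim)),
        "w_k": rng.normal(0.0, 0.02, (dim, dim)),
        "w_v": rng.normal(0.0, 0.02, (dim, dim)),
    }


def self_attention(tokens: np.ndarray, params: dict) -> np.ndarray:
    """Single-head scaled dot-product self-attention over the token sequence."""
    q, k, v = tokens @ params["w_q"], tokens @ params["w_k"], tokens @ params["w_v"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def encode_cloth_image(image: np.ndarray, patch: int, params: dict) -> np.ndarray:
    """Return the class token embedding that summarizes the whole image."""
    patches = patchify(image, patch)                   # step (1)
    embeddings = patches @ params["w_embed"]           # step (2)
    tokens = np.vstack([params["cls"], embeddings])    # step (3)
    tokens = tokens + params["pos"]                    # step (4)
    encoded = tokens + self_attention(tokens, params)  # step (5), one layer only
    return encoded[0]                                  # step (6)
```

The returned vector plays the role of the class token embedding that the text passes on to the decoder; the decoder itself is not sketched here because the source does not describe its structure.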
[0077] Through the processing steps of this module, the Transformer structure of this invention can accurately analyze complex fabric images and generate detailed operational instructions, guiding the robot to perform unfolding and folding tasks in an efficient and accurate manner. This method not only improves operational efficiency but also enhances the robot's adaptability and accuracy when handling different types of fabrics.
[0078] Step 4: Use the Swin-Transformer to identify and label the key points of the fabric flattened in Step 3, select the folding action primitive with the highest probability value, and execute it through a dual-arm robot.
[0079] Specifically, in this invention, the key point recognition module plays a crucial role. By employing a key point-based clothing representation method, this embodiment extracts features from the visual data of the fabric, reducing the infinite-degree-of-freedom state of the fabric to a finite key point representation. This method significantly improves processing efficiency and system adaptability, enabling the robot to handle various types of fabrics more accurately and efficiently.
[0080] (1) Feature extraction and key point detection:
[0081] The core technology used in this step is keypoint detection and recognition via the Swin-Transformer feature extraction module. This Swin-Transformer-based architecture is particularly suitable for processing complex image data because it can effectively capture key information in the image and perform deep analysis. The image data processed by the Swin-Transformer is then fed into a convolutional layer, which is responsible for outputting the number of keypoints. This process ensures accurate keypoint recognition, providing crucial information for subsequent steps.
[0082] (2) Real-world data collection and annotation:
[0083] To train and optimize the keypoint detection model, this embodiment acquired images of various fabrics from real-world scenes. This process utilized a top-mounted depth camera to ensure high-quality and representative image data. The acquired images were then processed using LabelMe annotation software, where keypoints on the fabrics were manually annotated. This step is fundamental to training the model, ensuring data accuracy and model effectiveness.
[0084] (3) Training and application of key point detection model:
[0085] Using labeled data, this embodiment trained a specialized fabric keypoint detection model. This model can effectively detect keypoints on the flattened fabric and accurately obtain the number of keypoints. Based on the number of detected keypoints, the system can determine whether to initiate the folding procedure. When the number of keypoints matches the preset target, the system will enter the folding operation stage.
[0086] Step 5: Based on the detection and recognition results of the Swin-Transformer module, generate an operation strategy, specifically a folding action, which is performed by a robotic arm to fold the entire fabric.
[0087] As shown in Figure 3, the optimal action is chosen between dynamic throwing and quasi-static actions (such as picking up and placing, or picking up and dragging).
[0088] The generation of operational strategies is based on a self-supervised value network, which uses a VIT-Transformer to learn a value map, where actions are defined on a pixel grid. The value network guides the action selection process by predicting the expected effect of each potential action. Specifically, the value network regresses each pixel in the value map to incremental coverage (i.e., the change in fabric coverage before and after executing an action), incremental smoothness (i.e., the change in fabric surface smoothness before and after executing an action), and the change in the number of keypoints.
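Selecting an action from per-pixel value maps can be sketched as follows. The sketch assumes one 2D value map per primitive, with each pixel holding the predicted reward of acting at that location; the dictionary layout and the function name `select_action` are assumptions, and the learned network that produces the maps is not shown.

```python
import numpy as np


def select_action(value_maps):
    """Pick the primitive and pixel with the highest predicted value.

    value_maps: dict mapping a primitive name to an (H, W) array of
    predicted rewards (layout assumed for this sketch).
    Returns (primitive_name, (row, col), value).
    """
    best = None
    for name, vmap in value_maps.items():
        vmap = np.asarray(vmap, dtype=float)
        row, col = np.unravel_index(np.argmax(vmap), vmap.shape)
        value = float(vmap[row, col])
        if best is None or value > best[2]:
            best = (name, (int(row), int(col)), value)
    if best is None:
        raise ValueError("value_maps must contain at least one map")
    return best
```

Comparing the maxima across primitives is one simple way to choose between dynamic and quasi-static actions; a real system might also search over gripper rotations and spacings, which this sketch leaves out.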
[0089] During fabric processing, the system flattens the fabric using a series of predefined action primitives. These action primitives, such as adjusting fabric position or smoothing wrinkles, are parsed and selected by the decoder from the VIT-Transformer network output. The decoder selects the action with the highest probability value for each action primitive to maximize fabric coverage and smoothness, while also considering changes in the number of keypoints. When the fabric is successfully flattened and meets the following conditions, the system enters the folding operation phase: smoothness greater than a specific threshold s, coverage higher than a standard c, and the number of detected keypoints consistent with a preset target. In this phase, the system uses fold primitives to perform a folding operation on the fabric based on the keypoint positions.
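The switch from flattening to folding described above amounts to a three-part check, sketched below. It assumes that smoothness is available as a score where higher means smoother, as the threshold condition in the text implies; the function and parameter names are hypothetical.

```python
def ready_to_fold(smoothness, coverage, num_keypoints,
                  smooth_threshold, coverage_threshold, target_keypoints):
    """Return True when all three folding preconditions hold.

    Assumes `smoothness` is a higher-is-smoother score (an assumption);
    the thresholds correspond to s and c in the text.
    """
    return (
        smoothness > smooth_threshold
        and coverage > coverage_threshold
        and num_keypoints == target_keypoints
    )


def next_phase(smoothness, coverage, num_keypoints,
               smooth_threshold, coverage_threshold, target_keypoints):
    """Name the phase the system should run next: 'fold' or 'flatten'."""
    if ready_to_fold(smoothness, coverage, num_keypoints,
                     smooth_threshold, coverage_threshold, target_keypoints):
        return "fold"
    return "flatten"
```

If any one condition fails, for example when only three of four expected key points are detected, the system keeps applying flattening primitives.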
[0090] The reward function combines these three factors: the incremental smoothness $R_{smooth,t+1}$, the change in the number of key points $R_{key,t+1}$, and the change in coverage $R_{cov,t+1}$. It is expressed in formula (4), where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weight parameters used to adjust the relative importance of these three factors in the reward function.
[0091] $$R_{all,t+1} = \lambda_1 R_{smooth,t+1} + \lambda_2 R_{cov,t+1} + \lambda_3 R_{key,t+1} \qquad (4)$$
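Formula (4) is a plain weighted sum, which the following sketch makes explicit. The default weights of 1.0 are placeholders only; the source does not give values for the three weight parameters.

```python
def total_reward(r_smooth, r_cov, r_key, weights=(1.0, 1.0, 1.0)):
    """R_all = lambda1 * R_smooth + lambda2 * R_cov + lambda3 * R_key.

    The default weights are placeholders (assumption), not values from the source.
    """
    lambda1, lambda2, lambda3 = weights
    return lambda1 * r_smooth + lambda2 * r_cov + lambda3 * r_key
```

With weights (2.0, 1.0, 0.5) and terms 0.5, 0.2, and 1.0, the total is 2.0 × 0.5 + 1.0 × 0.2 + 0.5 × 1.0 = 1.7, so raising a weight makes the corresponding factor dominate action selection.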
[0092] Regarding smoothness, this embodiment measures fabric smoothness in a way that overcomes the analytical difficulties caused by the fabric images differing at each acquisition. The method uses depth images acquired from the top camera and determines the smoothness of the fabric surface by analyzing its flatness. Specifically, this embodiment selects the top n highest-resolution pixels in the depth image each time and calculates the variance of these points relative to the working surface, as shown in formula (5). Here $d$ is a single depth value, and at a given time step $t$ or $t+1$ the depth dataset $D$ contains multiple such values: $d_{t,i}$ and $d_{t+1,i}$ denote the depth value of the i-th pixel at time $t$ and $t+1$, $D_t$ and $D_{t+1}$ denote the depth information of all pixels at time step $t$ and $t+1$, and $\mu$ is the average depth value.
[0093] This variance value measures the smoothness of the fabric; a larger value indicates a less smooth surface, and vice versa. This method provides a simple and effective quantitative indicator for evaluating the processing status of the fabric, helping this embodiment make more accurate decisions in subsequent processing steps.
[0094]
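A variance-based smoothness measure of the kind described above might be computed as in the sketch below. Several points are assumptions, since the formula itself is not reproduced in this text: the n selected pixels are taken to be the ones closest to the top-down camera (the smallest depth values), the variance is taken about their mean μ, and the reward is taken as the variance at time t minus the variance at time t+1, so that a flatter fabric yields a positive value.

```python
import numpy as np


def depth_variance(depth, n):
    """Variance of the n smallest depth values (assumed: highest fabric points)."""
    values = np.sort(np.asarray(depth, dtype=float).ravel())
    if n <= 0 or n > values.size:
        raise ValueError("n must be between 1 and the number of pixels")
    selected = values[:n]
    mu = selected.mean()  # average depth value of the selected pixels
    return float(np.mean((selected - mu) ** 2))


def smoothness_reward(depth_t, depth_t1, n):
    """Assumed sign convention: positive when the variance decreases."""
    return depth_variance(depth_t, n) - depth_variance(depth_t1, n)
```

For a 2 × 2 depth image with values 1.0, 1.0, 0.8, and 0.6 and n = 2, the selected depths are 0.6 and 0.8, giving a variance of 0.01; if the next image is perfectly flat, the reward under these assumptions is 0.01.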
[0095] The calculation of keypoint reward and coverage is shown in formulas (6) and (7), where cov represents the maximum coverage of the working plane occupied by the image obtained by the top depth camera, k represents the number of keypoints detected at time t+1, and m is the number of predefined keypoints for the fabric.
[0096] $$R_{cov,t+1} = cov_{t+1} - cov_t \qquad (6)$$
[0097]
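The coverage term of formula (6) can be illustrated with binary fabric masks. The sketch assumes that coverage is the fraction of workspace pixels occupied by fabric in a top-down mask; how the mask is segmented from the depth image, and any normalization by a maximum coverage, are not specified here and are left out.

```python
import numpy as np


def coverage(mask):
    """Fraction of workspace pixels covered by fabric (boolean top-down mask)."""
    mask = np.asarray(mask, dtype=bool)
    if mask.size == 0:
        raise ValueError("mask must not be empty")
    return float(mask.mean())


def coverage_reward(mask_t, mask_t1):
    """R_cov,t+1 = cov_{t+1} - cov_t, as in formula (6)."""
    return coverage(mask_t1) - coverage(mask_t)
```

If the fabric covers one of four workspace cells before an action and three of four afterwards, the coverage reward is 0.75 − 0.25 = 0.5.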
[0098] Through this self-supervised learning method, the present invention effectively achieves the flattening and folding of fabric, providing a flexible and precise way to handle complex deformable objects. This strategy provides powerful support for robots performing delicate operations such as flattening and folding.
[0099] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.
Claims
1. A method for manipulating flexible objects based on key point detection, characterized in that it includes the following steps: Step 1: Construct a workspace for the dual-arm robot, the workspace including a top depth camera that captures real-time visual data of the fabric state and the surrounding environment within the manipulation area; Step 2: Define multiple action primitives; Step 3: Use a Vision Transformer (VIT-Transformer) to decode the various action primitives, generate operation strategies, and output the actions to be executed; Step 4: Use a shifted-window Transformer (Swin-Transformer) to detect and identify key points on the fabric, select an operation strategy, and generate action instructions; Step 5: Send the action instructions to the dual-arm robot, and have the dual-arm robot execute them; the various action primitives include pick-up and place, throw, drag, and fold action primitives; the folding action primitive is executed by a dual-arm robot, which determines the picking and placing postures through key point detection to achieve the folding action; a reward function is constructed to measure the smoothness, coverage, and number of key points of the folding action, expressed by the following formula: $R_{all,t+1} = \lambda_1 R_{smooth,t+1} + \lambda_2 R_{cov,t+1} + \lambda_3 R_{key,t+1}$, where $R_{all,t+1}$ is the reward function, $R_{smooth,t+1}$ is the smoothness reward term, $R_{cov,t+1}$ is the coverage reward term, $R_{key,t+1}$ is the key point count reward term, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weight parameters.
2. The flexible object manipulation method based on key point detection according to claim 1, characterized in that, The picking and placing action primitives are executed by a single-arm robot, accompanied by a second arm pressing the fabric at a point on the placement posture extension line.
3. The flexible object manipulation method based on key point detection according to claim 1, characterized in that, The throwing action primitive is executed by a dual-arm robot. It grabs the clothing according to the given picking posture, lifts and stretches the clothing above the workspace until it reaches the preset force threshold, and then the two arms swing dynamically while gradually lowering the height to bring the clothing toward the workspace, thus completing the throwing action.
4. The flexible object manipulation method based on key point detection according to claim 1, characterized in that, The dragging action primitive is executed by a dual-arm robot, which drags the fabric away from the center of the fabric by a fixed distance based on two pick-up points, and uses the friction between the fabric and the workspace to smooth wrinkles or adjust the position of the fabric.
5. The flexible object manipulation method based on key point detection according to claim 1, characterized in that the smoothness is expressed by a formula in which $d_{t,i}$ and $d_{t+1,i}$ indicate the depth value of the i-th pixel at times $t$ and $t+1$, $D_t$ and $D_{t+1}$ represent the depth information of all pixels in the depth dataset $D$ at time steps $t$ and $t+1$, and $\mu$ is the average depth value; the coverage rate is expressed by the formula $R_{cov,t+1} = cov_{t+1} - cov_t$, where $cov_{t+1}$ indicates the maximum coverage of the workspace occupied by the image obtained by the depth camera at time $t+1$, and $cov_t$ indicates the maximum coverage of the workspace occupied by the image obtained by the depth camera at time $t$; the number of key points is expressed by a formula in which $k$ indicates the number of key points detected at time $t+1$ and $m$ is the preset number of key points.
6. A flexible object manipulation device based on key point detection, characterized in that the device is used to implement the flexible object manipulation method based on key point detection as described in any one of claims 1-5, comprising: a visual data acquisition module, used to capture visual data of the fabric status and surrounding environment in the manipulation area in real time; an action primitive generation module, used to define various action primitives; a cloth feature processing module based on a Vision Transformer (VIT-Transformer), used to decode action primitives, generate operation strategies, and output the actions to be executed; a key point detection and recognition module based on a shifted-window Transformer (Swin-Transformer), used to detect and recognize key points on the fabric, select operation strategies, and generate action instructions; and an operation strategy generation and execution module, used to execute action instructions.