Randomly stacked material identification and positioning method and system based on 3D vision technology
The method combines 3D vision with a deep learning segmentation model (SAM) and point pair feature (PPF) matching to identify and locate disordered stacked materials accurately, making it suitable for automated grasping and assembly tasks in complex environments.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: SHANGHAI FLEXIV ROBOTICS TECH CO LTD
- Filing Date: 2025-09-11
- Publication Date: 2026-07-02
AI Technical Summary
Existing automated systems struggle to effectively identify and locate disordered stacked materials, especially in complex scenarios where material segmentation is difficult, material pose recognition is inaccurate, and they cannot adapt to dynamic scene changes, resulting in low processing efficiency.
By combining 3D vision technology with a deep learning model, the 3D model of the material object is converted into point cloud data, downsampling is performed to generate template point cloud, the SAM model is used for material segmentation, and 6D pose estimation of the material is achieved through PPF feature matching.
It improves the accuracy and efficiency of identifying disordered stacked materials, adapts to various materials and diverse application scenarios, and can perform automated grasping and assembly tasks in complex environments, achieving precise material positioning and instance segmentation.
Description
A Method and System for Identifying and Locating Disordered Materials Based on 3D Vision Technology
Technical Field
[0001] This invention relates to the field of robot automation technology, specifically to a method and system for identifying and locating disordered materials based on 3D vision technology.
Background Technology
[0002] In automated production and warehousing, robots are commonly used for material handling and assembly tasks. However, most existing automation systems rely on the orderly placement of materials or require complex teaching processes to specify the precise position and orientation of the materials. This not only increases the complexity of the operation but also reduces the flexibility and efficiency of the system.
[0003] Traditional 2D vision systems can identify surface features of objects, but their accuracy and robustness are significantly limited under complex stacking, partial occlusion, and disordered placement. Traditional 3D vision systems acquire depth information and can therefore build 3D models of materials and identify their position and orientation more accurately, yet they still face several limitations: (1) difficulty in segmenting materials in complex scenes, because disordered stacked materials often occlude each other and traditional rule-based methods become ineffective; (2) inaccurate material pose recognition, because existing systems often rely on manual labeling and teaching to determine object pose, which lowers processing efficiency, especially for materials with complex or irregular shapes; and (3) inability to handle dynamic scenarios, because the state of materials in production or warehousing environments may change constantly, making it difficult for existing systems to adjust gripping and assembly strategies in real time.
[0004] Therefore, how to effectively utilize 3D vision technology combined with deep learning models to solve the problem of identifying and locating disordered stacked materials has become an important challenge in the field of automation.
[0005] Patent document CN117001404A (application number: 202311051378.X) discloses an automatic loading and unloading device for machine tools, belonging to the field of mechanical processing technology. It includes a dual-station processing module, a material storage module, a vision module, and a quick-change gripper module. The material storage module includes a loading tray and an unloading tray. The vision module is a 3D camera, which uses a target recognition and positioning algorithm to match the CAD model of the material against the 3D image and thereby obtains the 3D pose of the material quickly and accurately. A six-axis robot is set between the dual-station processing modules.
Summary of the Invention
[0006] To address the shortcomings of existing technologies, the purpose of this invention is to provide a method and system for identifying and locating disordered materials based on 3D vision technology.
[0007] A method for identifying and locating disordered materials based on 3D vision technology according to the present invention includes:
[0008] Step S1: Convert the 3D model of the material object into point cloud data, perform downsampling on the point cloud data to generate a template point cloud of the material object, and calculate the PPF feature of the template point cloud of the material object.
[0009] Step S2: In the disordered stacked material scene, based on the acquired RGB image and depth image of the material scene, the SAM model is used to segment the material to obtain the material point cloud of each instance, and the PPF feature of the material point cloud in the material scene is calculated.
[0010] Step S3: Match the PPF features of the material point cloud in the material scene with the PPF features of the template point cloud to achieve 6D pose estimation for each material.
[0011] Preferably, step S1 includes:
[0012] Step S1.1: Convert the 3D model of the material object into point cloud data;
[0013] Step S1.2: The point cloud data is downsampled using the voxel grid method to obtain downsampled point cloud data;
[0014] Step S1.3: The downsampled point cloud data is downsampled using a geometric feature-based downsampling technique to obtain the processed point cloud data, which generates a template point cloud for the material object.
[0015] Step S1.4: Calculate the PPF features of the material object template point cloud based on the processed point cloud data.
[0016] Preferably, step S1.2 includes:
[0017] Step S1.2.1: Divide the point cloud data of the material object into multiple voxels according to the preset voxel size;
[0018] Step S1.2.2: Within each voxel, select the geometric center point within the voxel as the representative point to achieve downsampling processing;
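[0019] $C_j = \frac{1}{|S_j|}\sum_{(x_i, y_i, z_i) \in S_j} (x_i, y_i, z_i)$, where $C_j$ is the representative (center) point of the j-th voxel, $S_j$ denotes the set of points within that voxel, and $(x_i, y_i, z_i)$ represents the spatial coordinates of each point within the voxel.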
[0020] Preferably, step S1.3 includes:
[0021] Step S1.3.1: Calculate the normal vector $n_i$ for each point selected using the voxel grid method: $n_i = \frac{1}{|N_i|}\sum_{P_j \in N_i} n_j$
[0022] Where $N_i$ is the neighborhood point set of point $P_i$, $n_j$ is the normal vector of the neighborhood point $P_j$, and $|N_i|$ represents the number of neighboring points;
[0023] Step S1.3.2: Calculate the curvature $\kappa_i$ of each point selected using the voxel grid method: $\kappa_i = \frac{1}{r_i}$
[0024] Where $r_i$ is the radius of curvature at point $P_i$, which indicates the degree of curvature around that point;
[0025] Step S1.3.3: Based on the changes in normal vector and curvature, determine whether a point is redundant, delete redundant points, and retain points that meet the preset requirements to achieve downsampling processing;
[0026] The change of the normal vector is obtained by calculating the angle between the normal vectors of adjacent points: $\cos\theta = \frac{n_1 \cdot n_2}{\|n_1\|\,\|n_2\|}$
[0027] Based on the changes in the normal vector and the curvature, a composite metric is calculated: $S = \alpha \cdot \Delta n + \beta \cdot \kappa$. This composite metric is then compared to a set threshold; if the composite metric exceeds the threshold, the point is considered a redundant point.
[0028] Where α represents the weight of the change in the normal vector; β represents the curvature weight; Δn represents the change in the normal vector; and κ represents the curvature.
[0029] Preferably, step S1.4 includes:
[0030] Step S1.4.1: For the downsampled point cloud, calculate the normal vector $N(p_i)$ of each point $p_i$: $N(p_i) = \mathrm{normal}(p_i)$
[0031] Where normal denotes the normal vector at point $p_i$, obtained from the neighborhood $n_i$ of point $p_i$;
[0032] Step S1.4.2: For the downsampled point cloud, calculate the point pair feature (PPF) of any two points $(p_i, p_j)$: $F(m_1, m_2) = (\|d\|, \angle(n_1, d), \angle(n_2, d), \angle(n_1, n_2))$, and collect the features into the template $T_{PPF} = \{F(m_i, m_j) \mid (m_i, m_j) \in M, i \neq j\}$
[0033] Where $F(m_1, m_2)$ represents the point pair feature of any two points $(p_i, p_j)$; $d$ is the vector between the two points and $\|d\|$ is their Euclidean distance; $\angle(n_1, n_2)$ represents the angle between the two normal vectors; $\angle(n_1, d)$ and $\angle(n_2, d)$ represent the angles between each normal vector and the point-pair vector; and $M$ represents the set of points on the model.
[0034] Preferably, step S2 includes:
[0035] Step S2.1: Based on the RGB-D camera, acquire the RGB image and depth information of the material scene;
[0036] Step S2.2: Preprocess the acquired RGB image and depth information of the material scene to obtain the preprocessed RGB image and depth information of the material scene;
[0037] Step S2.3: The RGB image of the preprocessed material scene is used to identify each material object using the SAM model and generate a binary mask;
[0038] Step S2.4: Based on the binary mask of each material object and the depth information of the preprocessed material scene, calculate the depth information containing only the material area;
[0039] Step S2.5: Calculate point cloud information based on depth information containing only the material area and camera intrinsic parameters: $X = \frac{(u - c_x)\,Z}{f_x}$, $Y = \frac{(v - c_y)\,Z}{f_y}$, $Z = \mathrm{Depth}(u, v)$
[0040] Where $f_x$ and $f_y$ are the focal lengths of the camera in the x and y directions; $c_x$ and $c_y$ are the optical center coordinates of the image; $(X, Y, Z)$ represents the point cloud coordinates; and $\mathrm{Depth}(u, v)$ represents the depth value at pixel $(u, v)$, that is, the distance from the camera to the object surface.
[0041] Step S2.6: Calculate the PPF features of the current material point cloud.
[0042] Preferably, step S2.2 includes:
[0043] Step S2.2.1: Adjust the brightness distribution of the RGB image through histogram equalization so that the contrast of the adjusted RGB image meets the preset requirements;
[0044] Step S2.2.2: Normalize the depth information of each pixel to obtain normalized depth information;
[0045] $Z_{norm} = \frac{Z - Z_{min}}{Z_{max} - Z_{min}}$, where $Z$ represents the depth value of the pixel, $Z_{min}$ represents the minimum depth value, and $Z_{max}$ represents the maximum depth value;
[0046] Step S2.2.3: Map the G channel in the adjusted RGB pixels to the normalized depth information;
[0047] Step S2.2.4: Calculate the normal vector information of each pixel using the normalized depth information;
[0048] Where u′ represents the horizontal coordinate of the image and v′ represents the vertical coordinate of the image;
[0049] Step S2.2.5: Normalize the normal vector information and map it to the B channel of the adjusted RGB pixels.
[0050] Preferably, step S3 includes:
[0051] Step S3.1: Select any reference point $s_r$ from the surface of an object in the material scene and find the corresponding point $m_r$ in the material model;
[0052] Step S3.2: Define a local coordinate system of the material model relative to the reference point $s_r$, so that the position information and normal vector of the scene reference point $s_r$ are aligned with those of the corresponding point $m_r$ in the material model;
[0053] Step S3.3: Select point pairs $(s_r, s_i) \in S$ in the material scene and material model point pairs $(m_r, m_i) \in M$ that have similar feature vectors F;
[0054] Step S3.4: Based on the rotation matrix, align point $m_i$ in the material model with point $s_i$ in the material scene, so that the points in the material model and the points in the material scene are geometrically consistent: $s_i = T_{s \to g}^{-1}\, R_x(\alpha)\, T_{m \to g}\, m_i$
[0055] Where $T_{s \to g}^{-1}$ represents the inverse transformation matrix that converts points in the global coordinate system back to the scene's local coordinate system; it reverses the transformation from the scene to the global coordinate system, remapping globally aligned points back to the scene's local coordinate system. $R_x(\alpha)$ represents the rotation matrix that rotates about the x-axis by an angle α. $T_{m \to g}$ represents the transformation matrix that transforms material model points from the local coordinate system to the global coordinate system.
[0056] Step S3.5: Obtain the optimal local coordinate system through the generalized Hough voting method, so that the maximum number of points in the material scene match the material model, thereby determining the orientation of the material.
[0057] Preferably, step S3.5 includes:
[0058] Step S3.5.1: Define a two-dimensional accumulator array $(N_m, N_{angle})$, where $N_m$ represents the number of sampling points in the model and $N_{angle}$ represents the number of sampling steps of the rotation angle α;
[0059] Step S3.5.2: For a reference point $s_r$ in the material scene, select the other points in the scene to form point pairs $(s_r, s_i)$, and calculate the feature $F_s(s_r, s_i)$ for each point pair; the feature $F_s(s_r, s_i)$ includes the distance and relative direction between the points of the pair;
[0060] Step S3.5.3: Use the calculated feature $F_s$ as the key to search the global material model description and find the corresponding material model point pairs $(m_r, m_i)$ that have similar features;
[0061] Step S3.5.4: For each matched material model point pair $(m_r, m_i)$, calculate the rotation angle α required to align the material model with the material scene; for each calculated angle α, increment the corresponding cell in the accumulator array by 1; after the entire process is complete, determine the optimal local coordinate system from the cell with the most votes in the accumulator array, which gives the best combination of material model point and rotation angle $(m_r, \alpha)$ and thus the pose of the material in the scene.
[0062] The present invention provides a disordered material identification and positioning system based on 3D vision technology, comprising:
[0063] Module M1: Used to convert the 3D model of a material object into point cloud data, perform downsampling processing on the point cloud data, generate a template point cloud of the material object, and calculate the PPF feature of the template point cloud of the material object;
[0064] Module M2: Used to segment materials in disordered stacked material scenarios based on the acquired RGB and depth images of the material scenario using the SAM model, obtain the material point cloud for each instance, and calculate the PPF features of the material point cloud in the material scenario.
[0065] Module M3: Used to match the PPF features of the material point cloud in the material scene with the PPF features of the template point cloud to achieve 6D pose estimation for each material.
[0066] Compared with the prior art, the present invention has the following beneficial effects:
[0067] 1. This invention combines deep learning models with 3D vision technology to effectively improve the accuracy and efficiency of identifying disordered stacked materials, and can be widely used in automated grasping and assembly tasks in complex environments such as warehouses and production lines.
[0068] 2. This invention uses the SAM model to segment materials based on RGB data and depth data, significantly improving the segmentation effect of the SAM model;
[0069] 3. The present invention uses the SAM model to perform deep learning segmentation on the input RGB-D image and generate a binary mask for each material instance. By combining depth data, it obtains depth information containing only the material region, enabling the SAM model to accurately segment materials in complex and disordered stacking scenarios and extract the independent point cloud information of each material.
[0070] 4. In order to improve the accuracy and efficiency of segmentation, this invention adds image preprocessing before the SAM model segmentation. By fusing the RGB image with the Depth image, the RGB information is preserved while the depth and geometric information are added. Furthermore, the RGB image is processed to remove noise and enhance color, which effectively improves the segmentation effect of the SAM model.
[0071] 5. This invention can adapt to a variety of materials and diverse application scenarios. The PPF feature depends only on geometric information and is applicable to objects of various shapes and sizes, without being limited by specific material types. By dynamically constructing PPF templates, this invention can be widely applied in various scenarios such as industry and logistics.
[0072] 6. This invention can obtain accurate 6D pose information of materials through a single recognition, without relying on secondary positioning. After being grasped by a robot, subsequent assembly tasks can be carried out directly.
[0073] 7. This invention combines the voxel grid method with geometric feature-based downsampling technology. The voxel grid method is used to reduce redundant data over a large area, while geometric feature downsampling ensures that the key geometric features of the material are preserved by retaining points in important geometric feature regions.
Description of the Drawings
[0074] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
[0075] Figure 1 is a flowchart of a method for identifying and locating disordered materials based on 3D vision technology.
[0076] Figure 2 shows a comparison of the model point cloud before and after sampling.
[0077] Figure 3 is a schematic diagram of the calculation principle of the point pair feature for two oriented points.
[0078] Figure 4 shows the effect of material segmentation.
[0079] Figure 5 shows the point cloud segmentation effect.
[0080] Figure 6 is a diagram illustrating the calculation principle of the transformation between model and scene coordinates.
[0081] Figure 7 shows the effect of matching the model point cloud with the scene point cloud.
Detailed Implementation
[0082] The present invention will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make several changes and improvements without departing from the concept of the present invention. These all fall within the protection scope of the present invention.
[0083] Example 1
[0084] A method for identifying and locating disordered materials based on 3D vision technology according to the present invention, as shown in Figure 1, includes:
[0085] Step 1: Convert the 3D model of the material object into point cloud data, perform downsampling processing on the point cloud data to generate a template point cloud of the material object, and calculate the PPF feature of the template point cloud of the material object. In this embodiment, the 3D model includes CAD and mesh models.
[0086] Specifically, step 1 includes:
[0087] In this embodiment, the voxel grid method and the geometric feature-based downsampling technique are combined. The voxel grid method is used to reduce redundant data over a large area, while the geometric feature downsampling ensures that the key geometric features of the material are preserved by retaining points in important geometric feature regions.
[0088] More specifically, the point cloud space is first divided into multiple voxels according to a preset voxel size $V_{size}$, which determines the spatial extent of each voxel. Larger voxels reduce the number of representative points but may lose some geometric details; smaller voxels retain more details but increase computation. Then, within each voxel, the geometric center point is selected as the representative point. In the voxel grid method, each point $P_i = (x_i, y_i, z_i)$ in the point cloud data is mapped to the nearest voxel grid center; the point set within each voxel is $S_j$, where $j$ represents the voxel number, and the voxel center point $C_j$ can be calculated using the following formula: $C_j = \frac{1}{|S_j|}\sum_{P_i \in S_j} P_i$
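The voxel grid step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `voxel_downsample`, the use of plain tuples for points, and the floor-based voxel indexing are assumptions made here for the example.

```python
import math
from collections import defaultdict

def voxel_downsample(points, voxel_size):
    """Replace all points that fall in the same voxel by their centroid C_j."""
    voxels = defaultdict(list)  # voxel index j -> point set S_j
    for x, y, z in points:
        # Assumed indexing: integer cell coordinates from floor division.
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        voxels[key].append((x, y, z))
    centroids = []
    for members in voxels.values():
        count = len(members)
        centroids.append(tuple(sum(p[k] for p in members) / count
                               for k in range(3)))
    return centroids

# Two points share the first voxel, the third point sits in its own voxel.
cloud = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (1.2, 0.0, 0.0)]
reduced = voxel_downsample(cloud, voxel_size=1.0)
```

A larger `voxel_size` merges more points per voxel, which is the detail-versus-computation trade-off described in the paragraph.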
[0089] For all points selected through voxel sampling, calculate their normal vectors $n_i$. The normal vector reflects the geometric properties of the surface at that point, and its calculation can be accomplished by weighted averaging of local neighborhood points. The normal vector of each point is calculated using the following formula: $n_i = \frac{1}{|N_i|}\sum_{P_j \in N_i} n_j$
[0090] Where $N_i$ is the neighborhood point set of point $P_i$, $n_j$ is the normal vector of the neighborhood point $P_j$, and $|N_i|$ represents the number of neighboring points.
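As a small sketch of the neighborhood averaging described above, the function below averages the normals of the neighboring points with equal weights. The name `average_normal` is hypothetical, and the final re-normalization to unit length is an assumption added here so the result can be used directly in angle computations; the text does not state it.

```python
import math

def average_normal(neighbor_normals):
    """n_i = (1 / |N_i|) * sum of n_j over the neighborhood, then unit length."""
    count = len(neighbor_normals)
    mean = [sum(n[k] for n in neighbor_normals) / count for k in range(3)]
    length = math.sqrt(sum(c * c for c in mean))
    if length == 0.0:
        # Opposite normals cancel out; no direction can be recovered.
        raise ValueError("neighbor normals cancel out; normal is undefined")
    # Assumption: re-normalize so the averaged normal is a unit vector.
    return tuple(c / length for c in mean)

flat_patch = average_normal([(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)])
edge_patch = average_normal([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
```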
[0091] Then calculate the curvature $\kappa_i$ at each point. Curvature reflects the degree of bending of a surface at a given location; generally, the edges or corners of a material have greater curvature, while flat areas have less curvature. It is obtained through the curvature formula: $\kappa_i = \frac{1}{r_i}$
[0092] Where $r_i$ is the radius of curvature at point $P_i$, which indicates the degree of curvature around that point.
[0093] Based on the changes in the normal vector and curvature, determine whether a point is redundant; specifically:
[0094] The change in the normal vector is obtained by calculating the angle between the normal vectors of adjacent points using $\cos\theta = \frac{n_1 \cdot n_2}{\|n_1\|\,\|n_2\|}$. A composite metric is introduced to determine the importance of a point: $S = \alpha \cdot \Delta n + \beta \cdot \kappa$, where α is the weight of the change in normal vector, β is the curvature weight, Δn is the change in normal vector, and κ is the curvature. In this embodiment, the angle θ is used as Δn. By configuring different α and β, redundant points of materials with different geometric structures can be filtered out.
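The composite metric can be illustrated with a short sketch that takes the angle θ between two adjacent normals as Δn, as this embodiment does. The function names and the default weights of 1.0 are assumptions for illustration; the comparison against the threshold is left to the caller, since the threshold value is not given in the text.

```python
import math

def normal_angle(n1, n2):
    """Angle theta between two normals, from cos(theta) = n1.n2 / (|n1| |n2|)."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norms = math.sqrt(sum(a * a for a in n1)) * math.sqrt(sum(b * b for b in n2))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def composite_score(n1, n2, curvature, alpha=1.0, beta=1.0):
    """S = alpha * delta_n + beta * kappa, with delta_n taken as the angle theta."""
    return alpha * normal_angle(n1, n2) + beta * curvature

flat_score = composite_score((0.0, 0.0, 1.0), (0.0, 0.0, 1.0), curvature=0.0)
edge_score = composite_score((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), curvature=2.0)
```

In this sketch a flat region scores 0 and a sharp edge scores higher, so α and β control how strongly normal change and curvature each contribute.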
[0095] By combining the voxel grid method with geometry-based downsampling techniques, as many geometrically rich points as possible are preserved from the 3D model, giving the downsampled material point cloud $P = \{p_1, p_2, \ldots, p_n\}$, where $p_i$ represents each point in the material point cloud. Downsampling the point cloud data reduces redundant data and improves computational efficiency. Figure 2 shows a comparison of the example material before and after downsampling.
[0096] For the downsampled point cloud P, calculate the normal vector $N(p_i)$ of each point $p_i$. The calculation of normal vectors is based on the local geometry of the point cloud, typically by finding the neighboring points of the point and calculating the normal vector of the surface they form: $N(p_i) = \mathrm{normal}(p_i)$
[0097] Where normal denotes the normal vector at point $p_i$, obtained from the neighborhood $n_i$ of point $p_i$;
[0098] The point pair feature (PPF) of any two points $(p_i, p_j)$ in the material point cloud is shown in Figure 3 and specifically includes:
[0099] The Euclidean distance between the two points: $d_{12} = \|m_1 - m_2\|$
[0100] $|P_1P_2| = \sqrt{(X_1 - X_2)^2 + (Y_1 - Y_2)^2 + (Z_1 - Z_2)^2}$, where $|P_1P_2|$ represents the Euclidean distance between points $P_1$ and $P_2$, and $(X_1, Y_1, Z_1)$ and $(X_2, Y_2, Z_2)$ are the coordinates of the two points, respectively.
[0101] The angle between the two normal vectors: $\angle(n_1, n_2)$
[0102] The angles between each normal vector and the point-pair vector: $\angle(n_1, d)$ and $\angle(n_2, d)$.
[0103] Therefore, the point cloud information of the material can be represented as: F(m1,m2)=(||d||,∠(n1,d),∠(n2,d),∠(n1,n2))
[0104] All calculated PPF features $F(m_1, m_2)$ are stored in the template database to form a complete material template $T_{PPF}$: $T_{PPF} = \{F(m_i, m_j) \mid (m_i, m_j) \in M, i \neq j\}$
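To make the four-component feature concrete, the sketch below computes F for one oriented point pair and collects the features of all ordered pairs into a template. The helper names are hypothetical, d is taken as m2 − m1, and the template is stored as a plain dictionary keyed by the index pair (i, j); how the template database is actually organized is not specified in the text. Points are assumed to be distinct, since a zero-length d has no direction.

```python
import math

def _angle(u, v):
    """Angle between two 3D vectors, in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def point_pair_feature(m1, n1, m2, n2):
    """F(m1, m2) = (||d||, angle(n1, d), angle(n2, d), angle(n1, n2))."""
    d = tuple(b - a for a, b in zip(m1, m2))  # assumed convention: d = m2 - m1
    length = math.sqrt(sum(c * c for c in d))
    return (length, _angle(n1, d), _angle(n2, d), _angle(n1, n2))

def build_template(points, normals):
    """T_PPF: the features of all ordered pairs (m_i, m_j) with i != j."""
    template = {}
    for i in range(len(points)):
        for j in range(len(points)):
            if i != j:
                template[(i, j)] = point_pair_feature(
                    points[i], normals[i], points[j], normals[j])
    return template

pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
nrm = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
template = build_template(pts, nrm)
```

The number of pairs grows quadratically with the number of points, which is one reason the downsampling in the preceding steps matters.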
[0105] The material template registration is now complete.
[0106] Step 2: In the disordered stacked material scene, based on the acquired RGB image and depth image of the material scene, the SAM model is used to segment the material to obtain the material point cloud of each instance, and the PPF feature of the material point cloud in the material scene is calculated.
[0107] Specifically, step 2 includes:
[0108] The SAM model is used to segment the RGB-D image of the input scene, generating a binary mask for each material instance, and the corresponding material point cloud $P_{input}$ is extracted based on the binary mask.
[0109] Specifically, an RGB-D camera can simultaneously acquire RGB images and depth images of the material scene; the RGB image provides color information of the scene, while the depth image provides distance information from the object to the camera for each pixel, which provides the necessary data for point cloud generation.
[0110] The RGB image is preprocessed and then fed into the SAM model after being fused with Depth information. Specifically, the image brightness distribution is first adjusted by histogram equalization to improve the image contrast, so that the SAM model can better distinguish between the background and the material. Then, the RGB and Depth information are fused together.
[0111] Specifically, RGB and Depth were fused in two ways, including:
[0112] Pixel-level fusion: First, normalize the depth value of each pixel to the [0,1] range using the following formula: $Z_{norm} = \frac{Z - Z_{min}}{Z_{max} - Z_{min}}$, where $Z_{min}$ and $Z_{max}$ are the minimum and maximum depth values. Then, the G channel in the RGB pixels is mapped to the normalized depth information. This adds depth information to the RGB information.
[0113] Feature-level fusion: The normal vector information for each pixel is calculated using the depth image. When calculating the normal vector, the depth difference between adjacent pixels in the depth image can be utilized, and the normal vector of each pixel can be approximated using the following formula: The normal vector information is then normalized and mapped to the B channel of RGB pixels, which allows the geometric information in the image to be displayed better.
[0114] By using the two preprocessing methods described above, RGB and depth data can be combined, preserving RGB information while adding depth and geometric information. This can significantly improve the segmentation performance of the subsequent SAM model.
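The pixel-level fusion described above can be sketched as a min-max normalization of the depth image followed by writing the result into the G channel. The function name, the nested-list image layout, and the scaling of the normalized depth to the 0 to 255 range are assumptions made for this example; the text only states that the G channel is mapped to the normalized depth.

```python
def fuse_depth_into_green(rgb, depth):
    """rgb: rows of (r, g, b) ints in 0..255; depth: rows of floats, same shape."""
    values = [z for row in depth for z in row]
    z_min, z_max = min(values), max(values)
    span = z_max - z_min
    fused = []
    for rgb_row, depth_row in zip(rgb, depth):
        out_row = []
        for (r, _g, b), z in zip(rgb_row, depth_row):
            # Min-max normalization to [0, 1]; a constant depth map maps to 0.
            z_norm = (z - z_min) / span if span > 0 else 0.0
            # Assumption: scale to 8 bits so it fits a standard image channel.
            out_row.append((r, round(z_norm * 255), b))
        fused.append(out_row)
    return fused

image = [[(10, 20, 30), (40, 50, 60)]]
depth_map = [[0.5, 1.5]]
fused_image = fuse_depth_into_green(image, depth_map)
```

The R and B values pass through unchanged here; in the described method the B channel is separately overwritten with normal vector information.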
[0115] The SAM model is a deep learning-based segmentation model capable of pixel-level classification of images, identifying different material instances within them. In this embodiment, the SAM model processes the fused RGB image, identifies each material object, and generates a binary mask for it. The mask distinguishes the regions of interest, which in this embodiment are the material objects, from background or irrelevant areas. Figure 4 shows the resulting material segmentation.
[0116] Based on the material mask information output by the SAM model and the collected depth data of the material scene, depth information containing only the material region can be calculated. Point cloud information can then be obtained using the camera intrinsic parameters and the point cloud calculation formulas.
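[0117] Camera intrinsic parameters: $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$, where $f_x$ and $f_y$ are the focal lengths of the camera in the x and y directions, usually expressed in pixels, and $c_x$ and $c_y$ are the coordinates of the optical center of the image, usually located at the center of the image.
[0118] Calculate point cloud: $X = \frac{(u - c_x)\,Z}{f_x}$, $Y = \frac{(v - c_y)\,Z}{f_y}$, $Z = \mathrm{Depth}(u, v)$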
[0119] Where (u,v) is the position of the pixel in the image (usually two-dimensional pixel coordinates). Z = Depth(u,v) is the depth value of that pixel, representing the distance from the camera to the object surface. The final material point cloud data is shown in Figure 5.
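The mask-and-back-project step can be sketched with the standard pinhole model, using one binary mask to keep only the pixels of a single material instance. The function name and the nested-list layout are assumptions for illustration, and skipping pixels with non-positive depth is an added assumption for handling invalid sensor readings.

```python
def masked_depth_to_points(depth, mask, fx, fy, cx, cy):
    """Back-project the masked pixels of a depth image into 3D camera coordinates."""
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Assumption: depth <= 0 marks an invalid measurement and is skipped.
            if not keep or z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

depth_image = [[2.0, 2.0],
               [2.0, 0.0]]
instance_mask = [[1, 0],
                 [1, 1]]
cloud = masked_depth_to_points(depth_image, instance_mask,
                               fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

Running this once per mask yields one independent point cloud per material instance.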
[0120] The SAM model can accurately segment materials in complex and disordered stacking scenarios and extract the independent point cloud of each material.
[0121] Perform PPF feature calculation on material point clouds;
[0122] For each segmented material point cloud $P_{input}$, repeat the normal vector calculation and the point pair feature (PPF) calculation to obtain the PPF feature set $T_{input}$ of the material point cloud.
[0123] This process is consistent with the material template construction, generating PPF features of the material point cloud.
[0124] Step 3: Match the PPF features of the material point cloud in the material scene with the PPF features of the template point cloud to achieve 6D pose estimation for each material.
[0125] Step 3.1: Define a local coordinate system of the material model relative to any reference point $s_r$ on the surface of an object in the scene, so that the material model and the scene can be aligned on local features; the rotation matrix ensures that the material model and the materials in the scene are completely consistent in their geometric relationships.
[0126] Specifically, step 3.1 includes:
[0127] Local coordinate system definition: Select a reference point $s_r$ from the scene and assume it lies on the surface of the object. Then find a corresponding point $m_r$ in the model. The purpose of this process is to map points in the scene to points in the model. Next, to align the position information and normal vectors of these two points, the surface normal direction of the model must be aligned with the normal direction in the scene. The rigid body transformation from model space to scene space can therefore be described by the combination of a reference point in the model and a rotation angle, $(m_r, \alpha)$. In other words, defining the local coordinate system of the model relative to the reference point $s_r$ fixes both position and orientation, enabling the model and scene to align on specific local features.
[0128] Model-to-scene transformation: Given a reference point $s_r$, select point pairs $(s_r, s_i) \in S$ in the scene whose feature vector F is similar to that of a model point pair $(m_r, m_i) \in M$. Here, the feature vector F describes the "relative distance" and "relative direction" between two points. This feature is used to find pairs of points in the scene that are similar to their counterparts in the model for matching.
[0129] Specifically, the transformation matrix $T_{m \to g}$ moves point $m_r$ in the model to the origin of the local coordinate system. At the same time, the model is rotated so that its normal vector at $m_r$ is aligned with the x-axis of the local coordinate system. The purpose of this step is to normalize the reference point in the model so that it can be used as a reference for subsequent alignment.
[0130] The corresponding transformation matrix $T_{s \to g}$ applies similar processing to points in the scene, moving the scene reference point to the origin of the local coordinate system and making its normal direction consistent with the model's normal vector. This step ensures that the model and scene are compared under the same reference.
[0131] Finally, another point $m_i$ in the model is rotated about the x-axis so that it aligns with point $s_i$ in the scene. This rotation is performed by the rotation matrix $R_x(\alpha)$, the purpose of which is to ensure that the points in the model and the points in the scene are completely consistent in geometric relationship.
[0132] Through the above steps, the transformation from model to scene can finally be defined as follows, as shown in Figure 6: $s_i = T_{s \to g}^{-1}\, R_x(\alpha)\, T_{m \to g}\, m_i$, where $T_{m \to g}$ represents the transformation matrix that transforms model points from the local coordinate system to the global coordinate system. This transformation includes translation and rotation, aligning the model points to the global coordinate system. $R_x(\alpha)$ represents the rotation matrix that rotates about the x-axis by an angle α. $T_{s \to g}^{-1}$ represents the inverse transformation matrix that converts points in the global coordinate system back to the scene's local coordinate system. It reverses the transformation from the scene to the global coordinate system, remapping globally aligned points back to the scene's local coordinate system.
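The alignment described above can be sketched as follows: each reference point is moved to the origin with its normal turned onto the x-axis, and α is then the remaining rotation about x that brings the transformed model point onto the transformed scene point. All function names are hypothetical, normals are assumed to be unit vectors, and building the rotation with Rodrigues' formula is one possible choice that the text does not prescribe.

```python
import math

def rotation_to_x_axis(n):
    """Rotation matrix (rows) mapping the unit vector n onto +x (Rodrigues)."""
    nx, ny, nz = n
    s = math.hypot(ny, nz)  # sine of the angle between n and +x
    c = nx                  # cosine of that angle
    if s < 1e-12:           # n is already parallel to the x-axis
        if c > 0:
            return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
        return [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
    a = (0.0, nz / s, -ny / s)  # unit rotation axis, n cross (1, 0, 0)
    k = [[0.0, -a[2], a[1]],
         [a[2], 0.0, -a[0]],
         [-a[1], a[0], 0.0]]    # cross-product matrix of the axis
    return [[c * (i == j) + s * k[i][j] + (1.0 - c) * a[i] * a[j]
             for j in range(3)] for i in range(3)]

def to_local(point, ref_point, ref_normal):
    """Move ref_point to the origin and turn ref_normal onto +x."""
    rot = rotation_to_x_axis(ref_normal)
    shifted = [point[k] - ref_point[k] for k in range(3)]
    return tuple(sum(rot[i][k] * shifted[k] for k in range(3)) for i in range(3))

def rotate_about_x(p, alpha):
    """Apply R_x(alpha) to a point."""
    x, y, z = p
    return (x,
            y * math.cos(alpha) - z * math.sin(alpha),
            y * math.sin(alpha) + z * math.cos(alpha))

def alignment_angle(m_r, n_m, m_i, s_r, n_s, s_i):
    """Angle alpha that rotates the local model point onto the local scene point."""
    m_local = to_local(m_i, m_r, n_m)
    s_local = to_local(s_i, s_r, n_s)
    alpha = (math.atan2(s_local[2], s_local[1])
             - math.atan2(m_local[2], m_local[1]))
    return math.atan2(math.sin(alpha), math.cos(alpha))  # wrap to (-pi, pi]

# Scene pair is the model pair rotated by 90 degrees about the shared normal.
alpha = alignment_angle((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0),
                        (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0))
```

Because both reference points end up at the origin with their normals on the x-axis, only this single angle remains unknown, which is why a pose hypothesis reduces to a model point and a rotation angle.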
[0133] Step 3.2: Calculate the object pose. The optimal local coordinate system is found using the generalized Hough voting method, such that the number of scene points lying on the model is maximized; this determines the object's pose, that is, the optimal position and orientation of the model in the scene. After the local coordinate system is defined, a two-dimensional accumulator array is used: the number of rows, $N_m$, is the number of sampling points in the model, and the number of columns, $N_{angle}$, corresponds to the sampling steps of the rotation angle α.
[0134] This accumulator array resembles a two-dimensional table, in which each row represents a point in the model and each column represents a different rotation angle. For a reference point $s_r$ in the scene, the other points in the scene are selected to form point pairs $(s_r, s_i)$, and the feature $F_s(s_r, s_i)$ is calculated for each pair. This feature includes the distance and relative orientation between the two points and is used to find geometric features in the scene that can match the model.
[0135] The calculated feature $F_s$ is used as a key to search the global model description for the corresponding model point pairs $(m_r, m_i)$ that have similar features (including distance and normal vectors).
[0136] For each matched model point pair $(m_r, m_i)$, the rotation angle α required to align the model with the scene is calculated using the preceding formula. This is the rotation about the x-axis of the local coordinate system that the model needs in order to match the scene point pair. For each calculated angle α, a vote (+1) is cast in the corresponding cell of the accumulator array. This voting process is similar to accumulating the number of matches in a two-dimensional table to determine which combination of position and rotation angle best matches all the points. When the entire process is complete, the cell with the most votes in the accumulator array gives the optimal combination of model point and rotation angle $(m_r, \alpha)$, that is, the optimal local coordinate system. This combination can be used to represent the pose of the object in the scene.
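The voting scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the embodiment's implementation: the model description is assumed to be a dictionary from a quantized feature key to a list of `(model_point_index, alpha_m)` entries, each scene pair is assumed to arrive as `(key, alpha_s)`, and the voted angle is taken as `alpha_s - alpha_m`. The name `vote_for_pose` and the default of 30 angle bins are hypothetical.

```python
import numpy as np

def vote_for_pose(scene_pairs, model_table, n_model, n_angle=30):
    """Accumulate votes over (model reference point, rotation angle) cells.

    scene_pairs: iterable of (key, alpha_s) for pairs (s_r, s_i) sharing one s_r.
    model_table: dict mapping key -> list of (model_point_index, alpha_m).
    """
    acc = np.zeros((n_model, n_angle), dtype=np.int64)
    two_pi = 2.0 * np.pi
    for key, alpha_s in scene_pairs:
        for m_idx, alpha_m in model_table.get(key, ()):
            alpha = (alpha_s - alpha_m) % two_pi
            col = int(alpha / two_pi * n_angle) % n_angle
            acc[m_idx, col] += 1  # one vote per matched model pair
    m_best, col_best = np.unravel_index(np.argmax(acc), acc.shape)
    alpha_best = (col_best + 0.5) * two_pi / n_angle  # centre of the winning bin
    return int(m_best), float(alpha_best), acc
```

The returned angle is only as precise as the bin width $2\pi / N_{angle}$, so the choice of $N_{angle}$ trades angular resolution against how sharply the votes concentrate.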
[0137] This determines the optimal 6D pose of the material in the current scene, namely the rotation matrix R and the translation vector t, where the translation vector t is the difference between the XYZ coordinates of the midpoint of the model and the midpoint of the scene. This enables accurate recognition of the material's pose. Figure 7 shows the visualization of the point cloud after matching the input material A with the template point cloud of the downsampled material 3D model.
[0138] Example 2
[0139] Example 2 is a preferred example of Example 1.
[0140] A disordered material identification and positioning system based on 3D vision technology according to the present invention includes:
[0141] Module 1: Converts the 3D model of the material object into point cloud data, performs downsampling processing on the point cloud data to generate a template point cloud of the material object, and calculates the PPF feature of the template point cloud of the material object; in this embodiment, the 3D model includes CAD and Mesh models;
[0142] Specifically, module 1 includes:
[0143] In this embodiment, the voxel grid method and the geometric feature-based downsampling technique are combined. The voxel grid method is used to reduce redundant data over a large area, while the geometric feature downsampling ensures that the key geometric features of the material are preserved by retaining points in important geometric feature regions.
[0144] More specifically, the point cloud space is first divided into multiple voxels based on a preset voxel size $V_{size}$, which determines the spatial extent of each voxel. Larger voxels reduce the number of representative points but may lose some geometric details; smaller voxels retain more details but increase computation. Then, within each voxel, the geometric center point of the voxel is selected as the representative point. In the voxel grid method, each point $P_i = (x_i, y_i, z_i)$ in the point cloud data is mapped to the nearest voxel grid center; the set of points within a voxel is denoted $S_j$, where j is the voxel index. The voxel center point $C_j$ can be calculated using the following formula: $C_j = \frac{1}{|S_j|} \sum_{P_i \in S_j} (x_i, y_i, z_i)$
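A minimal NumPy sketch of voxel grid downsampling is given below, assuming that the representative point of a voxel is the mean of the points that fall inside it. The function name `voxel_downsample` is hypothetical, and voxels are indexed by flooring each coordinate divided by the voxel size.

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Replace all points that fall in the same voxel by their centroid."""
    points = np.asarray(points, dtype=float)
    idx = np.floor(points / voxel_size).astype(np.int64)  # voxel index per point
    _, inverse, counts = np.unique(
        idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)         # flat voxel id for every input point
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)      # per-voxel coordinate sums
    return sums / counts[:, None]
```

For example, `voxel_downsample(cloud, 2.0)` keeps at most one point per 2 x 2 x 2 cell (in the units of the cloud); doubling the voxel size reduces the point count further at the cost of geometric detail.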
[0145] For all points selected through voxel sampling, calculate the normal vector $n_i$. The normal vector reflects the geometric properties of the surface at that point, and it can be calculated by a weighted average over the points of the local neighborhood. The normal vector of each point is calculated using the following formula: $n_i = \frac{1}{|N_i|} \sum_{P_j \in N_i} n_j$
[0146] where $N_i$ is the set of neighborhood points of point $P_i$, $n_j$ is the normal vector of the neighborhood point $P_j$, and $|N_i|$ is the number of neighborhood points.
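The neighborhood average of normal vectors can be sketched as below. Several details are assumptions for illustration: the neighborhood $N_i$ is taken as all points within a fixed radius (including the point itself), the weights are uniform, initial per-point normals are assumed to be available (for example from the mesh faces), and the averaged vector is renormalized to unit length. The brute-force distance matrix is only suitable for small clouds.

```python
import numpy as np

def smooth_normals(points, normals, radius):
    """Average the normals of all neighbours within `radius`, then renormalize."""
    points = np.asarray(points, dtype=float)
    normals = np.asarray(normals, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)               # squared pairwise distances
    neighbours = d2 <= radius ** 2                # N_i as a boolean row per point
    summed = neighbours.astype(float) @ normals   # sum of n_j over N_i
    mean = summed / neighbours.sum(axis=1, keepdims=True)
    norm = np.linalg.norm(mean, axis=1, keepdims=True)
    return mean / np.where(norm > 0, norm, 1.0)
```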
[0147] Then calculate the curvature $\kappa_i$ at each point. Curvature reflects how strongly the surface bends at a given location; generally, the edges or corners of a material have greater curvature, while flat areas have less. It is obtained from the curvature formula: $\kappa_i = \frac{1}{r_i}$
[0148] where $r_i$ is the radius of curvature at point $P_i$, which indicates the degree of bending around that point.
[0149] Based on the changes in the normal vector and curvature, determine whether a point is redundant; specifically:
[0150] The change in normal vector is obtained by calculating the angle θ between the normal vectors of adjacent points, $\cos\theta = \frac{n_1 \cdot n_2}{\|n_1\| \|n_2\|}$. A composite metric is introduced to determine the importance of a point: $S = \alpha \cdot |\Delta n| + \beta \cdot \kappa$, where α is the weight of the change in normal vector, β is the curvature weight, Δn is the change in normal vector, and κ is the curvature. In this embodiment, the angle θ is used as Δn. By configuring different α and β, redundant points of materials with different geometric structures can be filtered out.
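The two quantities in this paragraph can be written directly in NumPy. The sketch below only evaluates the angle between normals and the composite metric S; the function names are hypothetical, the curvature values are assumed to be supplied by the caller, and how S is compared with a threshold is deliberately left outside the sketch.

```python
import numpy as np

def normal_angle(n1, n2):
    """Angle theta between two normals, from cos(theta) = n1.n2 / (|n1| |n2|)."""
    n1 = np.asarray(n1, dtype=float)
    n2 = np.asarray(n2, dtype=float)
    c = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))  # clip guards rounding errors

def composite_score(delta_n, kappa, alpha=1.0, beta=1.0):
    """S = alpha * |delta_n| + beta * kappa, evaluated element-wise."""
    delta_n = np.asarray(delta_n, dtype=float)
    kappa = np.asarray(kappa, dtype=float)
    return alpha * np.abs(delta_n) + beta * kappa
```

A larger α makes the score more sensitive to changes in surface orientation, while a larger β emphasizes strongly curved regions such as edges and corners.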
[0151] By combining the voxel grid method with geometric feature-based downsampling, as many geometrically rich points of the 3D model as possible are preserved, yielding the downsampled material point cloud $P = \{p_1, p_2, \ldots, p_n\}$, where $p_i$ denotes a point in the material point cloud. Downsampling the point cloud data reduces redundant data and improves computational efficiency. Figure 2 compares the example material before and after downsampling.
[0152] After the point cloud P has been sampled, calculate the normal vector $N(p_i)$ of each point $p_i$. The calculation of normal vectors is based on the local geometry of the point cloud, typically by finding the neighboring points of the point and calculating the normal vector of the surface they form: $N(p_i) = \mathrm{normal}(p_i)$
[0153] where normal denotes the normal vector at point $p_i$, obtained by calculating the normal vector over the neighborhood $n_i$ of point $p_i$;
[0154] The point-pair feature (PPF) of any two points $(p_i, p_j)$ in the material point cloud is shown in Figure 3 and specifically includes:
[0155] the Euclidean distance between the two points, $d_{12} = |m_1 - m_2|$;
[0156] where $|P_1P_2| = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2 + (Z_2 - Z_1)^2}$ is the Euclidean distance between points P1 and P2, and (X1, Y1, Z1) and (X2, Y2, Z2) are the coordinates of the two points, respectively;
[0157] the angle between the two normal vectors, ∠(n1, n2);
[0158] the angles between each normal vector and the point-pair vector d, ∠(n1, d) and ∠(n2, d).
[0159] Therefore, the point cloud information of the material can be represented as: $F(m_1, m_2) = (\|d\|, \angle(n_1, d), \angle(n_2, d), \angle(n_1, n_2))$
[0160] All calculated PPF features F(m1, m2) are stored in the template database to form the complete material template $T_{PPF}$: $T_{PPF} = \{F(m_i, m_j) \mid (m_i, m_j) \in M, i \neq j\}$
[0161] The material template registration is now complete.
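The template construction described above can be sketched in a few lines. The sketch assumes $d = m_2 - m_1$ as the point-pair vector, distinct input points, and a simple quantization of the four feature components into integer keys (the step sizes `dist_step` and `angle_step` are hypothetical parameters); the hash-table layout is an illustrative choice rather than the database format of this embodiment.

```python
import numpy as np
from collections import defaultdict

def angle(a, b):
    """Angle in [0, pi] between vectors a and b."""
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def ppf(m1, n1, m2, n2):
    """F(m1, m2) = (||d||, angle(n1, d), angle(n2, d), angle(n1, n2)), d = m2 - m1."""
    d = m2 - m1
    return (float(np.linalg.norm(d)), angle(n1, d), angle(n2, d), angle(n1, n2))

def build_template(points, normals, dist_step, angle_step):
    """Store every ordered pair (i, j), i != j, under its quantized PPF key."""
    table = defaultdict(list)
    for i in range(len(points)):
        for j in range(len(points)):
            if i == j:
                continue
            f = ppf(points[i], normals[i], points[j], normals[j])
            key = (int(f[0] / dist_step),) + tuple(int(a / angle_step) for a in f[1:])
            table[key].append((i, j))
    return table
```

The double loop is quadratic in the number of points, which is one reason the template point cloud is downsampled before its features are computed.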
[0162] Module 2: In the scenario of disordered stacked materials, based on the acquired RGB and depth images of the material scenario, the SAM model is used to segment the materials to obtain the material point cloud of each instance, and the PPF features of the material point cloud in the material scenario are calculated.
[0163] Specifically, module 2 includes:
[0164] The SAM model is used to segment the RGB-D image of the input scene, generating a binary mask for each material instance, and the corresponding material point cloud $P_{input}$ is extracted based on the binary mask.
[0165] Specifically, an RGB-D camera can simultaneously acquire RGB images and depth images of the material scene; the RGB image provides color information of the scene, while the depth image provides distance information from the object to the camera for each pixel, which provides the necessary data for point cloud generation.
[0166] The RGB image is preprocessed and then fed into the SAM model after being fused with Depth information. Specifically, the image brightness distribution is first adjusted by histogram equalization to improve the image contrast, so that the SAM model can better distinguish between the background and the material. Then, the RGB and Depth information are fused together.
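Histogram equalization of a single 8-bit channel can be sketched with NumPy alone. How the embodiment applies it to a three-channel RGB image (per channel, or on a luminance channel) is not stated, so this sketch handles one channel and leaves that choice to the caller; the function name `equalize_hist` is hypothetical.

```python
import numpy as np

def equalize_hist(channel):
    """Histogram-equalize one 8-bit channel given as a 2D uint8 array."""
    channel = np.asarray(channel, dtype=np.uint8)
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = np.cumsum(hist)
    cdf_min = cdf[np.nonzero(hist)[0][0]]   # CDF value at the darkest level present
    total = channel.size
    if total == cdf_min:                    # constant image: nothing to spread
        return channel.copy()
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[channel]                     # apply the lookup table per pixel
```

The lookup table stretches the cumulative distribution over the full 0 to 255 range, which is what raises the contrast between material and background.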
[0167] Specifically, RGB and Depth are fused in two ways:
[0168] Pixel-level fusion: First, normalize the depth value of each pixel to the [0,1] range using the following formula: $Z_{norm} = \frac{Z - Z_{min}}{Z_{max} - Z_{min}}$, where $Z_{min}$ and $Z_{max}$ are the minimum and maximum depth values. Then, the G channel of the RGB pixels is mapped to the normalized depth information. This adds depth information to the RGB information.
[0169] Feature-level fusion: The normal vector information of each pixel is calculated from the depth image. The normal vector of each pixel can be approximated from the depth differences between adjacent pixels in the depth image. The normal vector information is then normalized and mapped to the B channel of the RGB pixels, which makes the geometric information in the image more visible.
[0170] By using the two preprocessing methods described above, RGB and depth data can be combined, preserving RGB information while adding depth and geometric information. This can significantly improve the segmentation performance of the subsequent SAM model.
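The two fusion paths can be sketched together as follows. Two points are assumptions made purely for illustration, because the text does not fix them: the per-pixel normal is approximated as $(-\partial Z/\partial u, -\partial Z/\partial v, 1)$ normalized to unit length, and the single value written to the B channel is the z-component of that normal. The function name `fuse_rgbd` is hypothetical, and missing depth readings are not treated specially.

```python
import numpy as np

def fuse_rgbd(rgb, depth):
    """Return a uint8 image: R kept, G = normalized depth, B = normal-derived value."""
    rgb = np.asarray(rgb, dtype=np.uint8)
    depth = np.asarray(depth, dtype=float)

    # Pixel-level fusion: min-max normalize depth to [0, 1].
    z_min, z_max = depth.min(), depth.max()
    span = z_max - z_min
    z_norm = (depth - z_min) / span if span > 0 else np.zeros_like(depth)

    # Feature-level fusion (assumed form): n ~ (-dZ/du, -dZ/dv, 1) / norm.
    dz_dv, dz_du = np.gradient(depth)   # axis 0 = rows (v), axis 1 = columns (u)
    n_z = 1.0 / np.sqrt(dz_du ** 2 + dz_dv ** 2 + 1.0)  # z-component, in (0, 1]

    fused = rgb.copy()
    fused[..., 1] = np.round(z_norm * 255.0).astype(np.uint8)
    fused[..., 2] = np.round(n_z * 255.0).astype(np.uint8)
    return fused
```

With this encoding, flat surfaces facing the camera appear bright in B and steep surfaces appear dark, while G encodes how far each pixel is from the camera.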
[0171] The SAM model is a deep learning-based segmentation model capable of pixel-level classification of images, identifying the different material instances within them. In this embodiment, the SAM model processes the fused RGB image, identifies each material object, and generates a binary mask for it. The mask distinguishes the regions of interest, which in this embodiment are the material objects, from background or irrelevant areas. Figure 4 shows the segmentation result for the material.
[0172] Based on the material mask information output by the SAM model and the depth data of the collected sample material, depth information containing only the material region can be calculated. Point cloud information can then be obtained using camera intrinsic parameters and point cloud computing formulas.
[0173] Camera intrinsic parameters: $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$, where $f_x$ and $f_y$ are the focal lengths of the camera in the x and y directions, usually expressed in pixels, and $c_x$ and $c_y$ are the coordinates of the optical center of the image, usually located at the center of the image.
[0174] Calculate point cloud: $X = \frac{(u - c_x) Z}{f_x}$, $Y = \frac{(v - c_y) Z}{f_y}$, $Z = \mathrm{Depth}(u, v)$
[0175] where (u, v) is the two-dimensional pixel position in the image and Z = Depth(u, v) is the depth value of that pixel, representing the distance from the camera to the object surface. The final material point cloud data is shown in Figure 5.
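Combining the binary mask with the pinhole back-projection gives the per-instance point cloud. The sketch below assumes a depth image in metric units, a boolean mask of the same shape, and that pixels with zero depth carry no measurement and are dropped; the function name `mask_to_point_cloud` is hypothetical.

```python
import numpy as np

def mask_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project the masked pixels of a depth image to an (N, 3) point cloud."""
    depth = np.asarray(depth, dtype=float)
    valid = np.asarray(mask, dtype=bool) & (depth > 0)  # drop pixels without depth
    v, u = np.nonzero(valid)                            # row index = v, column = u
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

Calling this once per mask produced by the segmentation step yields one independent point cloud per material instance.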
[0176] The SAM model can accurately segment materials in complex and disordered stacking scenarios and extract the independent point cloud of each material.
[0177] Perform PPF feature calculation on material point clouds;
[0178] For each segmented material point cloud $P_{input}$, repeat the normal vector calculation and the point-pair feature (PPF) calculation to obtain the PPF feature set $T_{input}$ of the material point cloud.
[0179] This process is consistent with the material template construction, generating PPF features of the material point cloud.
[0180] Module 3: Match the PPF features of the material point cloud in the material scene with the PPF features of the template point cloud to achieve 6D pose estimation for each material;
[0181] Module 3.1: Defines the local coordinate system of the material model relative to any reference point $s_r$ on the surface of an object in the scene, so that the material model and the scene can be aligned on local features; the rotation matrix ensures that the geometric relationship of the material model is consistent with that of the material in the scene.
[0182] Specifically, module 3.1 includes:
[0183] Local coordinate system definition: Select a reference point $s_r$ from the scene and assume that it lies on the surface of the object. Then find a corresponding point $m_r$ in the model. The purpose of this process is to map points in the scene to points in the model. Next, so that both the position information and the normal vectors of these two points are aligned, the surface normal direction of the model must be aligned with the normal direction in the scene. The rigid body transformation from model space to scene space can therefore be described by the combination of a reference point in the model and a rotation angle, $(m_r, \alpha)$, which can be interpreted as the local coordinates of the model relative to the reference point $s_r$. Defining this local coordinate system settles both position and orientation, enabling the model and the scene to be aligned on specific local features.
[0184] Model-to-scene transformation: Given a reference point $s_r$, select point pairs $(s_r, s_i) \in S$ in the scene whose feature vector F is similar to that of a model point pair $(m_r, m_i) \in M$. Here, the feature vector F describes the "relative distance" and "relative direction" between two points. This feature is used to find point pairs in the scene that are similar to their counterparts in the model for matching.
[0185] Specifically, the transformation matrix $T_{m \to g}$ moves the model point $m_r$ to the origin of the local coordinate system and, at the same time, rotates the model so that the normal vector at $m_r$ is aligned with the x-axis of the local coordinate system. The purpose of this step is to normalize the reference point in the model so that it can serve as a reference for subsequent alignment.
[0186] The corresponding transformation matrix $T_{s \to g}$ applies the same processing to the scene: it moves the scene reference point $s_r$ to the origin of the local coordinate system and aligns its normal vector with the x-axis, so that its direction is consistent with that of the model's normal vector. This step ensures that the model and the scene are compared in the same reference frame.
[0187] Finally, another model point $m_i$ is rotated about the x-axis so that it aligns with the scene point $s_i$. This rotation is performed by the rotation matrix $R_x(\alpha)$, whose purpose is to make the geometric relationship of the model points consistent with that of the scene points.
[0188] Through the above modules, the transformation from model to scene can ultimately be defined as $s_i = T_{s \to g}^{-1} R_x(\alpha) T_{m \to g} m_i$, as shown in Figure 6. Here, $T_{m \to g}$ represents the transformation matrix that transforms model points from the local coordinate system to the global coordinate system; this transformation includes translation and rotation and aligns the model points to the global coordinate system. $R_x(\alpha)$ represents the rotation matrix that rotates about the x-axis by an angle α. $T_{s \to g}^{-1}$ represents the inverse transformation matrix that converts points in the global coordinate system back to the scene's local coordinate system: it reverses the transformation from the scene to the global coordinate system, remapping globally aligned points back to the scene's local coordinate system.
[0189] Module 3.2: Calculate the object pose. The optimal local coordinate system is found using the generalized Hough voting method, such that the number of scene points lying on the model is maximized; this determines the object's pose, that is, the optimal position and orientation of the model in the scene. After the local coordinate system is defined, a two-dimensional accumulator array is used: the number of rows, $N_m$, is the number of sampling points in the model, and the number of columns, $N_{angle}$, corresponds to the sampling steps of the rotation angle α.
[0190] This accumulator array resembles a two-dimensional table, in which each row represents a point in the model and each column represents a different rotation angle. For a reference point $s_r$ in the scene, the other points in the scene are selected to form point pairs $(s_r, s_i)$, and the feature $F_s(s_r, s_i)$ is calculated for each pair. This feature includes the distance and relative orientation between the two points and is used to find geometric features in the scene that can match the model.
[0191] The calculated feature $F_s$ is used as a key to search the global model description for the corresponding model point pairs $(m_r, m_i)$ that have similar features (including distance and normal vectors).
[0192] For each matched model point pair $(m_r, m_i)$, the rotation angle α required to align the model with the scene is calculated using the preceding formula. This is the rotation about the x-axis of the local coordinate system that the model needs in order to match the scene point pair. For each calculated angle α, a vote (+1) is cast in the corresponding cell of the accumulator array. This voting process is similar to accumulating the number of matches in a two-dimensional table to determine which combination of position and rotation angle best matches all the points. When the entire process is complete, the cell with the most votes in the accumulator array gives the optimal combination of model point and rotation angle $(m_r, \alpha)$, that is, the optimal local coordinate system. This combination can be used to represent the pose of the object in the scene.
[0193] This determines the optimal 6D pose of the material in the current scene, namely the rotation matrix R and the translation vector t, where the translation vector t is the difference between the XYZ coordinates of the midpoint of the model and the midpoint of the scene. This enables accurate recognition of the material's pose. Figure 7 shows the visualization of the point cloud after matching the input material A with the template point cloud of the downsampled material 3D model.
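Once a pose is available as a 4x4 homogeneous matrix, the rotation matrix R and translation vector t can be read off and applied to the template point cloud, for example to overlay it on the scene as in a visualization like Figure 7. The sketch below assumes the convention $p' = R p + t$ with the pose stored as $[R \mid t]$ in the top three rows; the helper names are hypothetical.

```python
import numpy as np

def split_pose(T):
    """Split a 4x4 homogeneous pose into rotation matrix R (3x3) and translation t (3,)."""
    T = np.asarray(T, dtype=float)
    return T[:3, :3], T[:3, 3]

def transform_points(points, R, t):
    """Apply p' = R p + t to every row of an (N, 3) array."""
    return np.asarray(points, dtype=float) @ R.T + t
```

Transforming the template with the estimated pose and comparing it with the segmented scene cloud is a simple way to inspect whether the match is plausible.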
[0194] This embodiment can quickly and accurately identify the 6D pose of each material in a disordered stacked material scenario, thereby realizing automated material gripping and assembly.
[0195] Those skilled in the art will understand that, in addition to implementing the system, apparatus, and their modules provided by this invention in purely computer-readable program code, the same program can be implemented in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers by logically programming the method steps. Therefore, the system, apparatus, and their modules provided by this invention can be considered a hardware component, and the modules included therein for implementing various programs can also be considered structures within the hardware component; alternatively, modules for implementing various functions can be considered both software programs implementing the method and structures within the hardware component.
[0196] Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art can make various changes or modifications within the scope of the claims, which do not affect the essence of the present invention. Unless otherwise specified, the embodiments and features described in this application can be arbitrarily combined with each other.
Claims
1. A method for identifying and locating disordered materials based on 3D vision technology, characterized in that it includes: Step S1: Convert the 3D model of the material object into point cloud data, perform downsampling on the point cloud data to generate a template point cloud of the material object, and calculate the PPF features of the template point cloud of the material object; Step S2: In the disordered stacked material scene, based on the acquired RGB image and depth image of the material scene, segment the material using the SAM model to obtain the material point cloud of each instance, and calculate the PPF features of the material point cloud in the material scene; Step S3: Match the PPF features of the material point cloud in the material scene with the PPF features of the template point cloud to obtain the 6D pose of each material; the identification and localization of disordered materials are achieved through a single matching.
2. The method for identifying and locating disordered materials based on 3D vision technology according to claim 1, characterized in that, Step S1 includes: Step S1.1: Convert the 3D model of the material object into point cloud data; Step S1.2: The point cloud data is downsampled using the voxel grid method to obtain downsampled point cloud data; Step S1.3: The downsampled point cloud data is downsampled using a geometric feature-based downsampling technique to obtain the processed point cloud data, which generates a template point cloud for the material object. Step S1.4: Calculate the PPF features of the material object template point cloud based on the processed point cloud data.
3. The method for identifying and locating disordered materials based on 3D vision technology according to claim 2, characterized in that Step S1.2 includes: Step S1.2.1: Divide the point cloud data of the material object into multiple voxels according to the preset voxel size; Step S1.2.2: Within each voxel, select the geometric center point of the voxel as the representative point to achieve downsampling processing: $C_j = \frac{1}{|S_j|} \sum_{P_i \in S_j} (x_i, y_i, z_i)$, where $S_j$ denotes the set of points within each voxel and $(x_i, y_i, z_i)$ represents the spatial coordinates of each point within a voxel.
4. The method for identifying and locating disordered materials based on 3D vision technology according to claim 2, characterized in that Step S1.3 includes: Step S1.3.1: Calculate the normal vector of each point selected using the voxel grid method: $n_i = \frac{1}{|N_i|} \sum_{P_j \in N_i} n_j$, where $N_i$ is the set of neighborhood points of point $P_i$, $n_j$ is the normal vector of the neighborhood point $P_j$, and $|N_i|$ represents the number of neighborhood points; Step S1.3.2: Calculate the curvature $\kappa_i$ of each point selected using the voxel grid method: $\kappa_i = \frac{1}{r_i}$, where $r_i$ is the radius of curvature at point $P_i$, which indicates the degree of bending around that point; Step S1.3.3: Based on the changes in normal vector and curvature, determine whether a point is redundant, delete redundant points, and retain points that meet the preset requirements to achieve downsampling processing; the change in the normal vector is obtained by calculating the angle between the normal vectors of adjacent points: $\cos\theta = \frac{n_1 \cdot n_2}{\|n_1\| \|n_2\|}$; the composite metric is calculated from the change in the normal vector and the curvature and is compared with a set threshold, and when the composite metric is greater than the set threshold, the point is judged to be a redundant point: $S = \alpha \cdot |\Delta n| + \beta \cdot \kappa$, where α represents the weight of the change in the normal vector; β represents the curvature weight; Δn represents the change in the normal vector; and κ represents the curvature.
5. The method for identifying and locating disordered materials based on 3D vision technology according to claim 2, characterized in that Step S1.4 includes: Step S1.4.1: For the downsampled point cloud, calculate the normal vector $N(p_i)$ of each point $p_i$: $N(p_i) = \mathrm{normal}(p_i)$, where normal denotes the normal vector at point $p_i$, obtained by calculating the normal vector over the neighborhood $n_i$ of point $p_i$; Step S1.4.2: For the downsampled point cloud, calculate the point-pair feature (PPF) of any two points $(p_i, p_j)$: $F(m_1, m_2) = (\|d\|, \angle(n_1, d), \angle(n_2, d), \angle(n_1, n_2))$, $T_{PPF} = \{F(m_i, m_j) \mid (m_i, m_j) \in M, i \neq j\}$, where $F(m_1, m_2)$ represents the point-pair feature (PPF) of any two points $(p_i, p_j)$; $\|d\|$ represents the Euclidean distance between the two points; ∠(n1, n2) represents the angle between the normal vectors; ∠(n1, d) and ∠(n2, d) represent the angles between each normal vector and the point-pair vector d; and M represents the points on the model.
6. The method for identifying and locating disordered materials based on 3D vision technology according to claim 1, characterized in that Step S2 includes: Step S2.1: Based on the RGB-D camera, acquire the RGB image and depth information of the material scene; Step S2.2: Preprocess the acquired RGB image and depth information of the material scene to obtain the preprocessed RGB image and depth information of the material scene; Step S2.3: Based on the RGB image of the preprocessed material scene, identify each material object using the SAM model and generate a binary mask; Step S2.4: Based on the binary mask of each material object and the depth information of the preprocessed material scene, calculate the depth information containing only the material region; Step S2.5: Calculate point cloud information based on the depth information containing only the material region and the camera intrinsic parameters: $X = \frac{(u - c_x) Z}{f_x}$, $Y = \frac{(v - c_y) Z}{f_y}$, $Z = \mathrm{Depth}(u, v)$, where $f_x$ and $f_y$ are the focal lengths of the camera in the x and y directions; $c_x$ and $c_y$ are the optical center coordinates of the image; (X, Y, Z) represents the point cloud coordinates; and Depth(u, v) represents the depth value at (u, v), that is, the distance from the camera to the object surface; Step S2.6: Calculate the PPF features of the current material point cloud.
7. The method for identifying and locating disordered materials based on 3D vision technology according to claim 6, characterized in that Step S2.2 includes: Step S2.2.1: Adjust the brightness distribution of the RGB image through histogram equalization so that the contrast of the adjusted RGB image meets the preset requirements; Step S2.2.2: Normalize the depth information of each pixel to obtain normalized depth information: $Z_{norm} = \frac{Z - Z_{min}}{Z_{max} - Z_{min}}$, where Z represents the depth value of the pixel, $Z_{min}$ represents the minimum depth value, and $Z_{max}$ represents the maximum depth value; Step S2.2.3: Map the G channel of the adjusted RGB pixels to the normalized depth information; Step S2.2.4: Calculate the normal vector information of each pixel using the normalized depth information, where u′ represents the horizontal coordinate of the image and v′ represents the vertical coordinate of the image; Step S2.2.5: Normalize the normal vector information and map it to the B channel of the adjusted RGB pixels.
8. The method for identifying and locating disordered materials based on 3D vision technology according to claim 1, characterized in that Step S3 includes: Step S3.1: Select any reference point $s_r$ on the surface of an object in the material scene and find the corresponding point $m_r$ in the material model; Step S3.2: Define the local coordinate system of the material model relative to the reference point $s_r$, so that the position information and normal vectors of the reference point $s_r$ selected in the material scene and the corresponding point $m_r$ found in the material model are aligned; Step S3.3: Select the point pairs $(s_r, s_i) \in S$ in the material scene and the material model point pairs $(m_r, m_i) \in M$ that have similar feature vectors F; Step S3.4: Based on the rotation matrix, align point $m_i$ in the material model with point $s_i$ in the material scene, so that the points in the material model and the points in the material scene are geometrically consistent: $s_i = T_{s \to g}^{-1} R_x(\alpha) T_{m \to g} m_i$, where $T_{s \to g}^{-1}$ represents the inverse transformation matrix that converts points in the global coordinate system back to the scene's local coordinate system, reversing the transformation from the scene to the global coordinate system and remapping globally aligned points back to the scene's local coordinate system; $R_x(\alpha)$ represents the rotation matrix that rotates about the x-axis by an angle α; and $T_{m \to g}$ represents the transformation matrix that transforms material model points from the local coordinate system to the global coordinate system; Step S3.5: Obtain the optimal local coordinate system through the generalized Hough voting method, so that the maximum number of points in the material scene match the material model, thereby determining the orientation of the material.
9. The method for identifying and locating disordered materials based on 3D vision technology according to claim 8, characterized in that Step S3.5 includes: Step S3.2.1: Define a two-dimensional accumulator array $(N_m, N_{angle})$, where $N_m$ represents the number of sampling points in the model and $N_{angle}$ corresponds to the sampling steps of the rotation angle α; Step S3.2.2: For the reference point $s_r$ in the material scene, select the other points in the scene besides the reference point $s_r$ to form point pairs $(s_r, s_i)$, and calculate the feature $F_s(s_r, s_i)$ for each pair of points; the feature $F_s(s_r, s_i)$ includes the distance and relative direction between the two points; Step S3.2.3: Use the calculated feature $F_s$ as the key to search within the global material model for the corresponding material model point pairs $(m_r, m_i)$ that have similar features; Step S3.2.4: For each matched material model point pair $(m_r, m_i)$, calculate the rotation angle α required to align the material model with the material scene; for each calculated angle α, increment the corresponding cell in the accumulator array by 1; after the entire process is complete, determine the optimal local coordinate system based on the position with the most votes in the accumulator array, and determine the best combination of material model point and rotation angle $(m_r, \alpha)$ to obtain the posture of the material in the scene.
10. A system for identifying and locating disordered materials based on 3D vision technology, characterized in that it includes: Module M1: Used to convert the 3D model of a material object into point cloud data, perform downsampling processing on the point cloud data, generate a template point cloud of the material object, and calculate the PPF features of the template point cloud of the material object; Module M2: Used to segment materials in disordered stacked material scenes, based on the acquired RGB and depth images of the material scene, using the SAM model, obtain the material point cloud of each instance, and calculate the PPF features of the material point cloud in the material scene; Module M3: Used to match the PPF features of the material point cloud in the material scene with the PPF features of the template point cloud to obtain the 6D pose of each material; the identification and localization of disordered materials are achieved through a single matching.