Semantic information-based point cloud completion method and apparatus, and point cloud completion model training method and apparatus
A point cloud completion method based on semantic information uses semantic segmentation and depth estimation networks to complete the point cloud. It addresses the point cloud voids caused by dark surfaces on welded workpieces, thereby improving weld recognition and welding quality.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- BEIJING XIAOYU INTELLISYS CO LTD
- Filing Date
- 2025-12-25
- Publication Date
- 2026-07-02
Abstract
Description
A point cloud completion method, model training method, and device based on semantic information
[0001] Cross-references to related applications
[0002] This disclosure is based on and claims priority to Chinese Patent Application No. 202411920702.1, filed on December 25, 2024, the entire contents of which are incorporated herein by reference.
Technical Field
[0003] This disclosure relates to the fields of 3D point cloud completion and welding technology, and in particular to a point cloud completion method, model training method and apparatus based on semantic information.
Background Technology
[0004] Welding robots are increasingly widely used in industry, and weld seam recognition, which acts as the "eyes" of these robots, is one of the key technologies for achieving efficient welding. In the related art, weld seam recognition relies primarily on 3D camera sensors such as line laser scanning 3D cameras, which are widely used because lasers offer excellent monochromaticity and resistance to interference. A common line laser 3D camera consists of a laser scanner and a pair of binocular cameras, and estimates workpiece depth through traditional binocular matching combined with line laser features.
[0005] However, in certain scenarios, such as when the surface of the workpiece being welded is dark, the laser may be absorbed, resulting in point cloud voids in the 3D camera's imaging. This prevents the robot from correctly identifying the weld seam, leading to welding failure. Therefore, solving the point cloud void problem has become an urgent issue.
Summary of the Invention
[0006] This disclosure provides a point cloud completion method, a model training method, and an apparatus based on semantic information.
[0007] According to a first aspect of the present disclosure, a point cloud completion method based on semantic information is provided, comprising:
[0008] Image processing is performed on the binocular image to be processed to obtain the semantic information of the target object in the binocular image to be processed and the first depth image of the binocular image to be processed.
[0009] Based on the semantic information, a coarse-grained completion operation is performed on the point cloud to be processed, and the point cloud after the coarse-grained completion operation is projected onto a two-dimensional space to obtain a second depth image corresponding to the point cloud; the point cloud to be processed is a point cloud including the target object, and the target object in the binocular image to be processed and the target object in the point cloud to be processed are the same target object.
[0010] A fusion operation is performed on the first depth image and the second depth image to obtain a depth-completed image;
[0011] Back-projecting the depth-completed image yields a completed 3D point cloud containing the target object.
[0012] According to a second aspect of the present disclosure, a method for training a point cloud completion model is provided, the point cloud completion model including a binocular depth estimation network and a depth fusion network, the training method comprising:
[0013] The binocular image samples are input into the binocular depth estimation network for semantic analysis and depth information extraction to obtain the semantic information of the target object samples in the binocular image samples and the first depth image of the binocular image samples;
[0014] Based on the semantic information, a coarse-grained completion operation is performed on the incomplete point cloud sample, and the point cloud after the coarse-grained completion operation is projected onto a two-dimensional space to obtain a second depth image corresponding to the incomplete point cloud sample; the incomplete point cloud sample is a point cloud including the target object sample, and the target object sample in the binocular image sample and the target object sample in the incomplete point cloud sample are the same target object.
[0015] Based on the semantic information, the first depth image, and the second depth image, and combined with the model loss function used by the binocular depth estimation network, the first model loss value is determined.
[0016] The first depth image and the second depth image are input into the depth fusion network for fusion operation to obtain the first depth-completed image;
[0017] Back-projecting the first depth-completed image yields a completed 3D point cloud containing the target object sample;
[0018] Based on the completed 3D point cloud containing the target object sample, the complete point cloud associated with the incomplete point cloud sample, and the target object sample label corresponding to the point cloud, the second model loss value is determined.
[0019] The binocular depth estimation network is trained based on the first model loss value, and the depth fusion network is trained based on the second model loss value.
[0020] According to a third aspect of the present disclosure, an end-to-end point cloud completion method is provided, comprising:
[0021] A binocular image to be processed and an incomplete point cloud to be processed are acquired. The incomplete point cloud to be processed is a point cloud including the target object. The target object in the binocular image to be processed and the target object in the incomplete point cloud to be processed are the same target object.
[0022] The binocular image to be processed and the incomplete point cloud to be processed are input into the point cloud completion model to obtain a completed 3D point cloud containing the target object, wherein the point cloud completion model is a model trained based on the method described in the second aspect above.
[0023] According to a fourth aspect of the present disclosure, a point cloud completion device based on semantic information is provided, comprising:
[0024] The semantic analysis module is used to perform image processing on the binocular image to be processed, and to obtain the semantic information of the target object in the binocular image to be processed and the first depth image of the binocular image to be processed.
[0025] The completion preprocessing module is used to perform coarse-grained completion operation on the point cloud to be processed based on the semantic information, and project the point cloud after the coarse-grained completion operation onto a two-dimensional space to obtain a second depth image corresponding to the point cloud; the point cloud to be processed is a point cloud including the target object, and the target object in the binocular image to be processed and the target object in the point cloud to be processed are the same target object.
[0026] The fusion module is used to perform a fusion operation on the first depth image and the second depth image to obtain a depth-completed image;
[0027] The back projection module is used to back project the depth-completed image to obtain a completed 3D point cloud containing the target object.
[0028] According to a fifth aspect of the present disclosure, a point cloud completion model training apparatus is provided, the point cloud completion model including a binocular depth estimation network and a depth fusion network, the training apparatus comprising:
[0029] The first acquisition module is used to input the binocular image samples into the binocular depth estimation network for semantic analysis and depth information extraction, so as to obtain the semantic information of the target object samples in the binocular image samples and the first depth image of the binocular image samples;
[0030] The completion preprocessing module is used to perform a coarse-grained completion operation on the incomplete point cloud sample based on the semantic information, and to project the point cloud after the coarse-grained completion operation onto a two-dimensional space to obtain a second depth image corresponding to the incomplete point cloud sample; the incomplete point cloud sample is a point cloud including the target object sample, and the target object sample in the binocular image sample and the target object sample in the incomplete point cloud sample are the same target object.
[0031] The first determining module is used to determine a first model loss value based on the semantic information, the first depth image, and the second depth image, combined with the model loss function adopted by the binocular depth estimation network.
[0032] The fusion module is used to input the first depth image and the second depth image into the depth fusion network for fusion operation to obtain the first depth-completed image;
[0033] The back projection module is used to back project the first depth-completed image to obtain a completed 3D point cloud containing the target object sample.
[0034] The second determining module is used to determine the second model loss value based on the completed 3D point cloud containing the target object sample, the complete point cloud associated with the incomplete point cloud sample, and the target object sample label corresponding to the point cloud.
[0035] The training module is used to train the binocular depth estimation network based on the first model loss value and to train the depth fusion network based on the second model loss value.
[0036] According to a sixth aspect of the present disclosure, an end-to-end point cloud completion device is provided, comprising:
[0037] The first acquisition module is used to acquire the binocular image to be processed and the incomplete point cloud to be processed. The incomplete point cloud to be processed is a point cloud including the target object. The target object in the binocular image to be processed and the target object in the incomplete point cloud to be processed are the same target object.
[0038] The second acquisition module is used to input the binocular image to be processed and the incomplete point cloud to be processed into the point cloud completion model to obtain a completed 3D point cloud containing the target object, wherein the point cloud completion model is a model trained based on the method described in the second aspect above.
[0039] According to a seventh aspect of the present disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the methods described in the first, second, or third aspects described above.
[0040] According to an eighth aspect of the present disclosure, a storage medium is provided that stores instructions that, when executed on an electronic device, cause the electronic device to perform the methods described in the first, second, or third aspects described above.
[0041] According to a ninth aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps of the methods described in the first, second, or third aspects.
[0042] According to the technical solution disclosed herein, complete and accurate point cloud completion can be achieved by combining semantic information, the depth image of the binocular image to be processed, and the depth image corresponding to the incomplete point cloud, which can significantly improve the accuracy and effect of point cloud completion. By way of example, the solution can be applied to weld seam recognition scenarios, improving weld seam recognition performance and thus enhancing the generalization ability, welding efficiency, and welding quality of welding robots.
[0043] Additional aspects and advantages of this disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this disclosure.
Attached Figure Description
[0044] The above and / or additional aspects and advantages of this disclosure will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
[0045] Figure 1A is a flowchart illustrating the point cloud completion method based on semantic information provided in this embodiment of the present disclosure;
[0046] Figure 1B is an example diagram of the point cloud void effect in the weld of the workpiece provided in the embodiment of this disclosure;
[0047] Figure 1C is a diagram showing the effect of completing the incomplete point cloud provided in the embodiment of this disclosure;
[0048] Figure 2 is a flowchart illustrating the point cloud completion method based on semantic information provided in this embodiment of the present disclosure;
[0049] Figure 3 is a flowchart illustrating the training method of the binocular depth estimation network provided in the embodiments of this disclosure;
[0050] Figure 4 is a schematic diagram of the point cloud completion preprocessing provided in the embodiments of this disclosure;
[0051] Figure 5 is a flowchart illustrating the training method for a depth fusion network provided in an embodiment of this disclosure;
[0052] Figure 6 is a flowchart illustrating the point cloud completion model training method provided in this embodiment of the present disclosure;
[0053] Figure 7 is a flowchart illustrating the point cloud completion model training method provided in this embodiment of the present disclosure;
[0054] Figure 8 is a flowchart illustrating the end-to-end point cloud completion method provided in the embodiments of this disclosure;
[0055] Figure 9 is a block diagram of the point cloud completion device based on semantic information provided in an embodiment of this disclosure;
[0056] Figure 10 is a block diagram of the point cloud completion model training device provided in an embodiment of this disclosure;
[0057] Figure 11 is a block diagram of the end-to-end point cloud completion device provided in an embodiment of this disclosure;
[0058] Figure 12 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Implementation
[0059] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.
[0060] This disclosure is not exhaustive, but merely illustrative of some embodiments, and is not intended to limit the scope of protection of this disclosure. Unless otherwise specified, each step in a particular embodiment can be implemented as an independent embodiment, and the steps can be arbitrarily combined. For example, a solution after removing some steps in a particular embodiment can also be implemented as an independent embodiment, and the order of the steps in a particular embodiment can be arbitrarily interchanged. Furthermore, the optional implementation methods in a particular embodiment can be arbitrarily combined; moreover, the embodiments can be arbitrarily combined, for example, some or all steps of different embodiments can be arbitrarily combined, and a particular embodiment can be arbitrarily combined with the optional implementation methods of other embodiments.
[0061] The prefixes "first," "second," etc., used in the embodiments of this disclosure are merely for distinguishing different descriptive objects and do not impose restrictions on the position, order, priority, quantity, or content of the descriptive objects. The description of the descriptive objects is found in the claims or the context of the embodiments, and the use of prefixes should not constitute unnecessary restrictions. For example, if the descriptive object is a "field," the ordinal numbers preceding "field" in "first field" and "second field" do not restrict the position or order of the "fields." "First" and "second" do not restrict whether the "fields" they modify are in the same message, nor do they restrict the order of "first field" and "second field." Similarly, if the descriptive object is a "level," the ordinal numbers preceding "level" in "first level" and "second level" do not restrict the priority between "levels." Furthermore, the number of descriptive objects is not limited by ordinal numbers and can be one or more. For example, in "first device," the number of "devices" can be one or more. Furthermore, the objects modified by different prefixes can be the same or different. For example, if the object being described is "device", then "first device" and "second device" can be the same device or different devices, and their types can be the same or different. Similarly, if the object being described is "information", then "first information" and "second information" can be the same information or different information, and their content can be the same or different.
[0062] In all embodiments disclosed herein, unless otherwise specified or logically conflicting, the terminology and / or descriptions are consistent across embodiments and can be referenced interchangeably. Technical features from different embodiments can be combined to form new embodiments based on their inherent logical relationships. The terminology used in the embodiments of this disclosure is for the purpose of describing specific embodiments only and is not intended to limit the scope of this disclosure.
[0063] It should be noted that the collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in the technical solution disclosed herein all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
[0064] It should also be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, data stored, data displayed, etc.) and signals involved in this disclosure are all authorized by the user or fully authorized by all parties, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
[0065] It is worth noting that in the embodiments disclosed herein, certain software, components, models, and other existing solutions in the industry may be mentioned. These should be considered as exemplary, and their purpose is only to illustrate the feasibility of implementing the technical solutions disclosed herein. However, this does not mean that the applicant has used or necessarily used such solutions.
[0066] Welding robots are increasingly widely used in industry, and weld seam recognition, which acts as the "eyes" of these robots, is one of the key technologies for achieving efficient welding. In the related art, weld seam recognition relies primarily on 3D camera sensors such as line laser scanning 3D cameras, which are widely used because lasers offer excellent monochromaticity and resistance to interference. A common line laser 3D camera consists of a laser scanner and a pair of binocular cameras, and estimates workpiece depth through traditional binocular matching combined with line laser features.
[0067] However, in certain scenarios, such as when the surface of the workpiece being welded is dark, the laser may be absorbed, resulting in point cloud voids in the 3D camera's imaging. This prevents the robot from correctly identifying the weld seam, leading to welding failure. Therefore, solving the point cloud void problem has become an urgent issue.
[0068] To address the aforementioned issues, this disclosure provides a point cloud completion method, a point cloud completion model training method, and an apparatus based on semantic information. This method can predict complete and accurate workpiece depth and semantic information to complete the point cloud, significantly improving the point cloud completion effect. This, in turn, significantly enhances the generalization ability, welding efficiency, and quality of welding robots.
[0069] The following describes, with reference to the accompanying drawings, a point cloud completion method based on semantic information, a point cloud completion model training method, and an apparatus according to embodiments of the present disclosure.
[0070] Figure 1A is a flowchart illustrating the point cloud completion method based on semantic information provided in this embodiment of the present disclosure. As shown in Figure 1A, the point cloud completion method based on semantic information may include, but is not limited to, the following steps 101 to 104.
[0071] In step 101, image processing is performed on the binocular image to be processed to obtain the semantic information of the target object in the binocular image to be processed and the first depth image of the binocular image to be processed.
[0072] In some embodiments, semantic segmentation techniques can be used to perform semantic analysis on the binocular image to be processed, thereby obtaining semantic information of the target object in the binocular image. For example, a neural network model with semantic segmentation capabilities, such as a binocular depth estimation network, can be used for this analysis.
[0073] In some embodiments, the binocular image to be processed can be a binocular image including a target object. In some embodiments, the binocular image to be processed can be an image obtained by capturing the target object with a binocular camera. In some embodiments, the target object can be a weld in a welded workpiece, but is not limited thereto. For example, the first image and the second image in the binocular image can both be RGB images, but are not limited thereto.
[0074] In some embodiments, the binocular image to be processed includes a first image and a second image. For example, the binocular image to be processed may be two viewpoint images of the same target object scene taken by a binocular camera, one on the left and one on the right, and a disparity map is obtained by using a stereo matching algorithm. Based on the disparity map, a first depth image of the binocular image to be processed is obtained.
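The conversion from a disparity map to a depth image mentioned above can be sketched as follows. This is a minimal illustration that assumes a rectified binocular pair and the standard pinhole relation depth = focal length × baseline / disparity. The function name `disparity_to_depth` and the parameters `focal_length_px` and `baseline_m` are hypothetical and do not come from the disclosure, which does not specify how the disparity map is converted.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) to a depth map (metres).

    Assumes a rectified binocular pair. Pixels with non-positive
    disparity have no valid match and are left at 0 (no measurement).
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.zeros_like(disparity)
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Toy 2x2 disparity map; the zero entry stands for an unmatched pixel.
disp = np.array([[50.0, 25.0],
                 [0.0, 100.0]])
depth = disparity_to_depth(disp, focal_length_px=1000.0, baseline_m=0.1)
print(depth)  # depths of 2.0, 4.0, 0.0 (invalid) and 1.0 metres
```

Treating zero as "no measurement" is a convention chosen for this sketch so that the same marker can be used for void regions in the later steps.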
[0075] It should be noted that in some embodiments, other methods can also be used to generate the first depth image of the binocular image to be processed. For example, a neural network can be used: the binocular image to be processed can be input into a binocular depth estimation network to obtain the first depth image. The disclosure is not limited to these methods and does not elaborate on them further.
[0076] In step 102, a coarse-grained completion operation is performed on the point cloud to be processed based on semantic information, and the point cloud after the coarse-grained completion operation is projected into a two-dimensional (2D) space to obtain the second depth image corresponding to the point cloud to be processed.
[0077] In some embodiments, the semantic information may include the location of the target object in the binocular image to be processed. For example, taking a weld in a welded workpiece as the target object, the semantic information may include the location of the weld in the binocular image to be processed.
[0078] In some embodiments, after obtaining the semantic information of the target object, a coarse-grained completion operation can be performed on the missing parts (also called void regions or point cloud void regions) of the point cloud to be processed based on the semantic information. For example, point cloud data belonging to the same direction as the target object can be found based on the semantic information. Because point clouds with the same semantic information have high local structural similarity, this similarity can be used to coarsely complete the missing parts. For example, taking the weld in a welded workpiece as the target object, point cloud data belonging to the same weld bead can be found based on the semantic information of the weld, and a coarse-grained completion operation can be performed on the missing parts of the weld point cloud based on the found data. The point cloud after the coarse-grained completion operation is projected into 2D space to obtain a second depth image. In this embodiment of the present disclosure, coarsely completing the missing parts with semantically matched point clouds of high local structural similarity yields point cloud information with a similar structure without the complex operation of reconstructing the surface, and the semantic information improves the accuracy of point cloud completion.
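The projection of the coarsely completed point cloud into 2D space can be illustrated with a short sketch. It assumes a pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy` and points already expressed in the camera frame. These names, the function `project_to_depth_image`, and the nearest-point-per-pixel rule are assumptions made for illustration, since the disclosure does not state how the projection is carried out. The coarse completion itself is not shown.

```python
import numpy as np

def project_to_depth_image(points, fx, fy, cx, cy, height, width):
    """Project Nx3 camera-frame points to a depth image.

    When several points fall on the same pixel, the nearest one is kept.
    Pixels that receive no point are set to 0 (void).
    """
    points = np.asarray(points, dtype=np.float64)
    in_front = points[:, 2] > 0               # drop points behind the camera
    x, y, z = points[in_front].T
    u = np.round(fx * x / z + cx).astype(int)  # column index
    v = np.round(fy * y / z + cy).astype(int)  # row index
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)            # unbuffered per-pixel minimum
    depth[np.isinf(depth)] = 0.0
    return depth

# Two points share the centre pixel; the nearer one (z = 1) wins.
cloud = np.array([[0.0, 0.0, 1.0],
                  [0.0, 0.0, 2.0],
                  [0.1, 0.0, 1.0]])
second_depth = project_to_depth_image(cloud, fx=10.0, fy=10.0, cx=2.0, cy=2.0,
                                      height=5, width=5)
```

`np.minimum.at` is used instead of plain indexed assignment because the result of assigning to repeated indices is not guaranteed by NumPy.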
[0079] In some embodiments, the incomplete point cloud to be processed can be a point cloud including a target object, and the target object in the binocular image to be processed and the target object in the incomplete point cloud to be processed can be the same target object. For example, the target object can be a weld in a welded workpiece: the weld can be laser-scanned by a laser scanner to obtain the incomplete point cloud to be processed, and photographed by a binocular camera to obtain the binocular image to be processed.
[0080] In step 103, the first depth image and the second depth image are fused to obtain a depth-completed image.
[0081] In some embodiments, a one-to-one correspondence between pixels in the first depth image and the second depth image can be found based on calibration information. A fusion operation is then performed based on this correspondence to obtain a depth-completed image. The calibration information may include, but is not limited to, parameter information of the device that captured the binocular image to be processed (such as a binocular camera) and parameter information of the device that captured the incomplete point cloud to be processed (such as a laser scanner).
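One common way to obtain such a pixel correspondence is to bring the scanner's points into the camera's coordinate frame before projecting them, so that both depth images share one pixel grid. The sketch below assumes that the calibration information takes the form of an extrinsic rotation matrix and translation vector. The names `scanner_to_camera`, `rotation` and `translation` are hypothetical, and the disclosure does not specify the form of the calibration parameters.

```python
import numpy as np

def scanner_to_camera(points_scanner, rotation, translation):
    """Map Nx3 points from the scanner frame to the camera frame.

    Applies p_cam = R @ p_scanner + t to every row of the input.
    """
    points_scanner = np.asarray(points_scanner, dtype=np.float64)
    rotation = np.asarray(rotation, dtype=np.float64)
    translation = np.asarray(translation, dtype=np.float64)
    return points_scanner @ rotation.T + translation

# Hypothetical extrinsics: 90-degree rotation about z, then a 0.5 shift in x.
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.5, 0.0, 0.0])
points_cam = scanner_to_camera(np.array([[1.0, 0.0, 2.0]]), R, t)
print(points_cam)  # [[0.5 1.  2. ]]
```

Once the points are in the camera frame, projecting them with the same intrinsics as the first depth image places both images on the same pixel grid.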
[0082] In some embodiments, the aforementioned "fusion operation" can be an image overlay process, such as directly adding the two depth images, but it is not limited to this. For example, the two depth images can be fused using their respective weights, and this fusion can be, for example, a weighted summation. In some embodiments, the weight values of the two depth images can be learned through deep learning, and the two depth images can be fused based on the learned weight values, with the fused image being used as the depth-completed image.
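The weighted summation described above can be sketched as follows. The fixed scalar weights `w_first` and `w_second` stand in for weights that, per the text, may instead be learned. Treating 0 as "no measurement" and renormalising the weights over the valid inputs at each pixel are assumptions of this sketch and are not stated in the disclosure.

```python
import numpy as np

def fuse_depth(first_depth, second_depth, w_first=0.5, w_second=0.5):
    """Fuse two aligned depth images by a per-pixel weighted sum.

    A value of 0 marks a missing measurement. Weights are renormalised
    over the inputs that are valid at each pixel, so a pixel covered by
    only one image keeps that image's depth.
    """
    first = np.asarray(first_depth, dtype=np.float64)
    second = np.asarray(second_depth, dtype=np.float64)
    valid_first = first > 0
    valid_second = second > 0

    weight_sum = w_first * valid_first + w_second * valid_second
    weighted = w_first * first * valid_first + w_second * second * valid_second

    fused = np.zeros_like(first)
    covered = weight_sum > 0
    fused[covered] = weighted[covered] / weight_sum[covered]
    return fused

first = np.array([[2.0, 3.0],
                  [0.0, 0.0]])
second = np.array([[4.0, 0.0],
                   [5.0, 0.0]])
fused = fuse_depth(first, second, w_first=0.25, w_second=0.75)
print(fused)  # 3.5 where both are valid, 3.0 and 5.0 where one is, 0.0 otherwise
```

Without the renormalisation, a plain weighted sum would pull the depth toward zero wherever one of the two images has a void.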
[0083] In some embodiments, the first depth image can be a depth image obtained by processing the binocular image to be processed with a pre-trained binocular depth estimation network. This binocular depth estimation network can ensure global consistency of depth information between the first and second depth images, and can also ensure geometric consistency between the two depth images for point clouds with the same semantic information. Based on this characteristic, after the first and second depth images are obtained, a fusion operation is performed on them. That is, the coarsely completed regions of the second depth image are adjusted using the first depth image, thereby obtaining a complete and accurate depth-completed image of the target object.
[0084] In step 104, the depth-completed image is back-projected to obtain a completed 3D point cloud containing the target object.
[0085] In some embodiments, the aforementioned "back projection" may refer to mapping from 2D (two-dimensional) space to 3D (three-dimensional) space. For example, in response to obtaining a depth-completed image, the depth-completed image can be mapped from 2D space to 3D space to obtain a completed 3D point cloud containing the target object.
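The mapping from 2D space back to 3D space can be sketched with the inverse of the pinhole projection. As before, the intrinsics `fx`, `fy`, `cx`, `cy` and the function name `back_project` are hypothetical, and the convention that 0 marks a void pixel is an assumption of this sketch.

```python
import numpy as np

def back_project(depth, fx, fy, cx, cy):
    """Back-project a depth image to an Nx3 point cloud in the camera frame.

    Only pixels with positive depth produce a point.
    """
    depth = np.asarray(depth, dtype=np.float64)
    v, u = np.nonzero(depth > 0)   # row and column indices of valid pixels
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# A 3x3 depth image with a single valid pixel at row 1, column 2.
completed_depth = np.zeros((3, 3))
completed_depth[1, 2] = 2.0
completed_cloud = back_project(completed_depth, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print(completed_cloud)  # [[1. 0. 2.]]
```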
[0086] In some embodiments, the point cloud completion method based on semantic information disclosed herein can be applied to weld seam recognition tasks. For example, during weld seam recognition, the method can be used to complete the incomplete point cloud. Figure 1B shows an example image of the weld seam of a workpiece captured by a 3D camera, in which region F exhibits a point cloud void in the imaging result. Figure 1C shows the result after completing the incomplete point cloud using the point cloud completion method based on semantic information disclosed herein. It can be seen that the method yields complete and accurate weld seam point cloud information, thereby significantly improving the generalization ability, welding efficiency, and welding quality of welding robots.
[0087] In the above embodiments, by combining semantic information, the depth image of the binocular image to be processed, and the depth image corresponding to the incomplete point cloud, complete and accurate point cloud completion can be achieved, which can significantly improve the accuracy and effect of point cloud completion. By way of example, this can be applied to weld seam recognition scenarios, improving weld seam recognition performance and thus enhancing the generalization ability, welding efficiency, and welding quality of welding robots.
[0088] Figure 2 is a flowchart illustrating the point cloud completion method based on semantic information provided in this embodiment of the present disclosure. As shown in Figure 2, the point cloud completion method based on semantic information may include, but is not limited to, the following steps 201 to 204.
[0089] In step 201, based on the pre-trained binocular depth estimation network, semantic analysis and depth information extraction are performed on the binocular image to be processed to obtain the semantic information of the target object in the binocular image to be processed and the first depth image of the binocular image to be processed.
[0090] For example, the binocular image to be processed can be input into a pre-trained binocular depth estimation network for semantic analysis and depth information extraction to obtain the semantic information of the target object in the binocular image and a first depth image of the binocular image. In some embodiments, the input to the binocular depth estimation network can be the binocular image to be processed, and its output can include the semantic information of the target object and the first depth image. The binocular depth estimation network has learned the mappings from the binocular image to the semantic information of the target object and to the first depth image.
[0091] In some embodiments, the above-mentioned binocular depth estimation network may be pre-trained, as shown in FIG3. The training method of the above-mentioned binocular depth estimation network may include, but is not limited to, the following steps 301 to 303.
[0092] In step 301, the binocular image samples are input into the binocular depth estimation network for semantic analysis and depth information extraction to obtain the semantic information of the target object samples in the binocular image samples and the first depth image of the binocular image samples.
[0093] In step 302, based on the semantic information of the target object sample, the first depth image of the binocular image sample, and the second depth image corresponding to the incomplete point cloud sample, and combined with the model loss function adopted by the binocular depth estimation network, the first model loss value is determined.
[0094] In some embodiments, the aforementioned incomplete point cloud sample may be a point cloud including a target object sample, wherein the target object sample in the binocular image sample and the target object sample in the incomplete point cloud sample are the same target object.
[0095] For example, the semantic information of the target object sample, the first depth image of the binocular image sample, and the second depth image corresponding to the incomplete point cloud sample can be substituted into the model loss function used by the binocular depth estimation network to calculate the loss value, and the calculated loss value is determined as the first model loss value of the binocular depth estimation network.
[0096] In some embodiments, the model loss function used by the above-mentioned binocular depth estimation network can be composed of a first loss function, a second loss function, and a third loss function. The first loss function is used to constrain the scale of the first depth image to maintain consistency with the scale of the second depth image; the second loss function, incorporating semantic information, is used to constrain the geometric structure consistency of the same semantic pixels in the first and second depth images in local space; and the third loss function is used to constrain the first and second depth images to maintain consistency in global geometric structure.
[0097] For example, the model loss function used by a stereo depth estimation network can be expressed as: $Loss = \lambda_s L_s + \lambda_r L_r + \lambda_n L_n$
[0098] Where Loss is the model loss function used by the stereo depth estimation network; $L_s$ is the first loss function and $\lambda_s$ is the weight of the first loss function; $L_r$ is the second loss function and $\lambda_r$ is the weight of the second loss function; $L_n$ is the third loss function and $\lambda_n$ is the weight of the third loss function. For example, the first loss function $L_s$ can be expressed as:
[0099] Where d1(i) is the depth value of the i-th pixel in the first depth image, d2(i) is the depth value of the i-th pixel in the second depth image, and m is the number of pixels in the first and second depth images. The first loss function $L_s$ can be used to constrain the scale consistency between the first depth image and the second depth image.
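As a non-limiting illustration, the scale constraint can be sketched in Python. The expression for $L_s$ is not reproduced in this text, so the mean absolute difference used below is an assumption chosen only because it is consistent with the symbols d1(i), d2(i), and m defined above; the function name `scale_loss` and the flattened-list representation of the depth images are likewise hypothetical.

```python
def scale_loss(d1, d2):
    """Assumed form of L_s: mean absolute depth difference over m pixels.

    d1, d2: flattened depth images (lists of floats) of equal length m.
    """
    if len(d1) != len(d2) or not d1:
        raise ValueError("depth images must be non-empty and the same size")
    m = len(d1)
    # Average the per-pixel absolute difference between the two depth images.
    return sum(abs(a - b) for a, b in zip(d1, d2)) / m


first_depth = [1.0, 2.0, 3.0]
second_depth = [1.0, 2.5, 2.0]
loss_s = scale_loss(first_depth, second_depth)
```

Other per-pixel penalties (for example, a squared or logarithmic difference) would fit the same definitions equally well.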
[0100] For example, the second loss function $L_r$ can be expressed as:
[0101] Where d2(i) is the depth value of the i-th pixel in the second depth image, and d2(j) is the depth value of the j-th pixel in the second depth image. $L_r$ expresses that the depth value of the i-th pixel should be as consistent as possible with the depth values of the semantically labeled pixels within a distance r of it, where i is a semantically labeled pixel and N is the number of semantically labeled pixels. $\omega_j$ denotes the weight, which is calculated as follows: Where $p_{ix}$ and $p_{iy}$ represent the position of the i-th pixel in the second depth image, $p_{jx}$ and $p_{jy}$ represent the position of the j-th pixel in the second depth image, and σ is the standard deviation of the Gaussian function. The second loss function $L_r$, combined with semantic information, is used to constrain the geometric consistency of pixels with the same semantics in local space.
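As a non-limiting illustration, the local semantic constraint can be sketched in Python. Neither the expression for $L_r$ nor the expression for $\omega_j$ is reproduced in this text. The sketch assumes an isotropic Gaussian weight on the pixel-position distance and a weighted absolute depth difference averaged over the N semantically labeled pixels; the names `gaussian_weight` and `local_semantic_loss`, and the boolean mask representation of the semantic information, are hypothetical.

```python
import math


def gaussian_weight(p_i, p_j, sigma):
    """Assumed weight: Gaussian of the distance between pixel positions."""
    dx = p_i[0] - p_j[0]
    dy = p_i[1] - p_j[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def local_semantic_loss(depth, mask, r, sigma):
    """Assumed form of L_r on one depth image.

    depth: 2D list of depth values; mask: 2D list of bools marking
    semantically labeled pixels; r: neighborhood radius in pixels.
    """
    pixels = [(x, y) for y, row in enumerate(mask)
              for x, flag in enumerate(row) if flag]
    n = len(pixels)
    if n == 0:
        return 0.0
    total = 0.0
    for xi, yi in pixels:
        for xj, yj in pixels:
            if (xi, yi) == (xj, yj):
                continue
            if math.hypot(xi - xj, yi - yj) > r:
                continue  # only neighbors within distance r contribute
            w = gaussian_weight((xi, yi), (xj, yj), sigma)
            total += w * abs(depth[yi][xi] - depth[yj][xj])
    return total / n


depth = [[1.0, 1.0, 2.0]]
mask = [[True, True, True]]
loss_r = local_semantic_loss(depth, mask, r=1.0, sigma=1.0)
```

Pixels outside the mask are ignored entirely, which is how the semantic information restricts the constraint to pixels of the same semantic class in this sketch.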
[0102] For example, the third loss function $L_n$ can be expressed as:
[0103] Where n1(i) is the normal constraint of the i-th pixel in the first depth image, and n2(i) is the normal constraint of the i-th pixel in the second depth image. The third loss function $L_n$ is mainly used to constrain the global geometric structure consistency between the first depth image and the second depth image.
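As a non-limiting illustration, the normal-based constraint can be sketched in Python. The expression for $L_n$ is not reproduced in this text. The sketch assumes that n1(i) and n2(i) are unit surface normals estimated from each depth image by forward finite differences, and that the loss is the mean of one minus their cosine similarity; the names `normals_from_depth` and `normal_loss` are hypothetical.

```python
import math


def normals_from_depth(depth):
    """Estimate unit normals from a 2D depth grid by forward differences."""
    h, w = len(depth), len(depth[0])
    normals = []
    for y in range(h - 1):
        row = []
        for x in range(w - 1):
            dzdx = depth[y][x + 1] - depth[y][x]
            dzdy = depth[y + 1][x] - depth[y][x]
            norm = math.sqrt(dzdx * dzdx + dzdy * dzdy + 1.0)
            row.append((-dzdx / norm, -dzdy / norm, 1.0 / norm))
        normals.append(row)
    return normals


def normal_loss(depth1, depth2):
    """Assumed form of L_n: mean (1 - cosine similarity) of the normals."""
    n1 = normals_from_depth(depth1)
    n2 = normals_from_depth(depth2)
    total, count = 0.0, 0
    for row1, row2 in zip(n1, n2):
        for a, b in zip(row1, row2):
            total += 1.0 - (a[0] * b[0] + a[1] * b[1] + a[2] * b[2])
            count += 1
    if count == 0:
        raise ValueError("depth images must be at least 2x2")
    return total / count


flat = [[1.0, 1.0], [1.0, 1.0]]
tilted = [[1.0, 2.0], [1.0, 2.0]]
loss_n = normal_loss(tilted, flat)
```

Because normals depend only on depth differences, this term is unaffected by a constant depth offset, which is why it complements a separate scale constraint.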
[0104] Therefore, it can be seen that the advantage of the model loss function constraint used by the binocular depth estimation network is that it constrains the global consistency of depth information while ensuring the geometric consistency of point clouds with the same semantic information, thereby further improving the completion accuracy of point cloud hole regions.
[0105] In some embodiments, the second depth image corresponding to the aforementioned incomplete point cloud sample can be obtained based on the semantic information of the target object sample. For example, a coarse-grained completion operation can be performed on the incomplete point cloud sample based on the semantic information of the target object sample, and the point cloud after the coarse-grained completion operation can be projected into 2D space to obtain the second depth image corresponding to the incomplete point cloud sample. Optional implementations are similar to those for obtaining the second depth image corresponding to the incomplete point cloud to be processed, as described elsewhere in this document, and will not be repeated here.
[0106] In step 303, the stereo depth estimation network is trained based on the first model loss value.
[0107] For example, the gradient descent algorithm can be used to process the first model loss value, and the parameters in the stereo depth estimation network can be updated according to the result of the gradient descent processing until the model training termination condition is met, thereby obtaining the trained stereo depth estimation network. For example, the model training termination condition can be that the first model loss value meets a first preset condition, such as the first model loss value being less than or equal to a threshold; or, the model training termination condition can be that the number of model training iterations reaches a preset number.
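As a non-limiting illustration, the two termination conditions described above (loss at or below a threshold, or a preset number of iterations reached) can be sketched in Python. A scalar parameter and a toy quadratic loss stand in for the network and its loss; the function name `train` and all parameter names and default values are hypothetical.

```python
def train(loss_and_grad, w, lr=0.1, loss_threshold=1e-6, max_iters=1000):
    """Plain gradient descent with two termination conditions.

    loss_and_grad: callable returning (loss, gradient) for parameter w.
    Returns (w, loss, number_of_updates, reason_for_stopping).
    """
    updates = 0
    while True:
        loss, grad = loss_and_grad(w)
        if loss <= loss_threshold:
            return w, loss, updates, "loss_threshold"
        if updates >= max_iters:
            return w, loss, updates, "max_iters"
        w -= lr * grad  # gradient descent parameter update
        updates += 1


def quadratic(w):
    # Toy loss (w - 3)^2 and its gradient, standing in for the model loss.
    return (w - 3.0) ** 2, 2.0 * (w - 3.0)


w_final, final_loss, n_updates, reason = train(quadratic, 0.0)
```

With a small `max_iters`, the same loop stops on the iteration count instead, which corresponds to the second termination condition.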
[0108] Therefore, the trained binocular depth estimation network can be obtained through the above steps 301 to 303.
[0109] In step 202, a coarse-grained completion operation is performed on the point cloud to be processed based on semantic information, and the point cloud after the coarse-grained completion operation is projected into a two-dimensional space to obtain the second depth image corresponding to the point cloud to be processed.
[0110] For example, after obtaining the semantic information of the target object in the binocular image to be processed, the semantic information can be used to perform a coarse-grained completion operation on the point cloud to be processed, and the point cloud after the coarse-grained completion operation can be projected into 2D space to obtain the second depth image corresponding to the point cloud.
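As a non-limiting illustration, projecting a point cloud into 2D space to form a depth image can be sketched in Python. The disclosure does not specify the camera model; the sketch assumes a pinhole model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, uses 0.0 to mark pixels with no data, and keeps the nearest point when several points fall on the same pixel.

```python
def project_to_depth(points, fx, fy, cx, cy, width, height):
    """Project 3D points (x, y, z) in the camera frame to a depth image."""
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue  # points behind the camera cannot be projected
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # Keep the closest surface when several points share a pixel.
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth


cloud = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
depth_image = project_to_depth(cloud, fx=1.0, fy=1.0, cx=1.0, cy=1.0,
                               width=3, height=3)
```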
[0111] In some embodiments, performing the coarse-grained completion operation on the point cloud to be processed based on semantic information may include the following steps: obtaining candidate point clouds for point cloud completion from the point cloud to be processed based on semantic information, and performing the coarse-grained completion operation on the hole regions in the point cloud to be processed based on the candidate point clouds. In some embodiments, obtaining candidate point clouds for point cloud completion from the point cloud to be processed based on semantic information may include the following steps: determining the bounding rectangle of the target object from the point cloud to be processed based on semantic information; obtaining the point cloud within the bounding rectangle from the point cloud to be processed; determining the hole regions in the point cloud within the bounding rectangle; and determining the point clouds within the bounding rectangle that are located around the hole regions and lie in the same direction as the target object as the candidate point clouds for point cloud completion.
[0112] For example, as shown in Figure 4, taking the weld seam in the workpiece as the target object, the bounding rectangle of the weld seam can be determined (or calculated) from the incomplete point cloud to be processed based on the semantic information of the weld seam. The point cloud within the bounding rectangle is then obtained from the incomplete point cloud to be processed. Because the pixel values in the hole region of the weld seam differ significantly from the pixel values in the other regions of the weld seam, the hole region (or hole edge) in the point cloud within the bounding rectangle can be determined based on the pixel values in the first depth image. The points within the bounding rectangle that surround the hole region and lie in the same direction as the weld seam (or weld bead) are then extracted and determined as the candidate point cloud for point cloud completion. Based on the candidate point cloud, the ICP (Iterative Closest Point) algorithm is used to coarsely complete the point cloud of the hole region, yielding the coarsely completed point cloud.
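As a non-limiting illustration, the candidate selection and coarse filling steps can be sketched in Python on a depth grid. This sketch deliberately simplifies the procedure above: it works on a 2D depth grid rather than on raw 3D points, assumes the weld seam runs along the image rows, and replaces ICP with linear interpolation between the nearest valid candidates on each side of a hole. The function names and the use of 0.0 to mark holes are hypothetical.

```python
def bounding_rectangle(mask):
    """Axis-aligned bounding rectangle (x0, y0, x1, y1) of a boolean mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, flag in enumerate(row) if flag]
    if not ys:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def coarse_fill_along_rows(depth, mask):
    """Fill holes (depth 0.0) inside the mask's bounding rectangle.

    Candidates are the valid pixels on the same row as the hole, i.e.
    pixels around the hole along the assumed weld direction.
    """
    filled = [row[:] for row in depth]
    rect = bounding_rectangle(mask)
    if rect is None:
        return filled
    x0, y0, x1, y1 = rect
    for y in range(y0, y1 + 1):
        valid = [x for x in range(x0, x1 + 1) if depth[y][x] > 0]
        for x in range(x0, x1 + 1):
            if depth[y][x] > 0:
                continue
            left = [c for c in valid if c < x]
            right = [c for c in valid if c > x]
            if left and right:
                a, b = left[-1], right[0]
                t = (x - a) / (b - a)
                filled[y][x] = (1 - t) * depth[y][a] + t * depth[y][b]
            elif left:
                filled[y][x] = depth[y][left[-1]]
            elif right:
                filled[y][x] = depth[y][right[0]]
    return filled


weld_mask = [[False] * 5, [True] * 5, [False] * 5]
holed_depth = [[0.0] * 5, [1.0, 0.0, 0.0, 4.0, 4.0], [0.0] * 5]
coarse_depth = coarse_fill_along_rows(holed_depth, weld_mask)
```

Only pixels inside the bounding rectangle of the semantic mask are touched, so regions outside the target object keep their original values.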
[0113] In step 203, the first depth image and the second depth image are fused to obtain a depth-completed image.
[0114] In some embodiments, a depth-completed image can be obtained by fusing the first depth image of the binocular image to be processed and the second depth image corresponding to the incomplete point cloud based on a pre-trained deep fusion network. For example, the first depth image and the second depth image can be input into the pre-trained deep fusion network for fusion to obtain the depth-completed image output by the deep fusion network. In some embodiments, the input to the deep fusion network includes the first depth image of the binocular image to be processed and the second depth image corresponding to the incomplete point cloud, and the output of the deep fusion network is the depth-completed image. The deep fusion network has learned the mapping from the first depth image of the binocular image and the second depth image corresponding to the incomplete point cloud to the depth-completed image.
[0115] In some embodiments, the deep fusion network described above may be pre-trained. As shown in Figure 5, the training method of the deep fusion network may include, but is not limited to, the following steps 501 to 504.
[0116] In step 501, the first depth image of the binocular image sample and the second depth image corresponding to the incomplete point cloud sample are input into the deep fusion network for a fusion operation to obtain the first depth-completed image.
[0117] In some embodiments, the above-mentioned "fusion operation" can be an image superposition process, such as directly adding the two depth images, but it is not limited to this. For example, the deep fusion network can be expressed as: depth_fusion = ω1·depth(1) + ω2·depth(2), where depth(1) is the first depth image and ω1 is the weight of the first depth image in the fusion operation; depth(2) is the second depth image and ω2 is the weight of the second depth image in the fusion operation. ω1 and ω2 are learnable parameters of the deep fusion network.
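As a non-limiting illustration, the weighted fusion and the learning of ω1 and ω2 can be sketched in Python. The forward pass follows the formula above. The update step is an assumption: it uses a mean squared error against a hypothetical target depth image purely to show that the two weights can be adjusted by gradient descent, whereas the disclosure supervises the deep fusion network through the back-projected point cloud. All function names are hypothetical.

```python
def fuse_depth(depth1, depth2, w1, w2):
    """depth_fusion = w1 * depth(1) + w2 * depth(2), applied per pixel."""
    return [[w1 * a + w2 * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(depth1, depth2)]


def fusion_loss_and_grads(depth1, depth2, target, w1, w2):
    """Mean squared error of the fused image and its gradients (assumed loss)."""
    loss = g1 = g2 = 0.0
    n = 0
    for r1, r2, rt in zip(depth1, depth2, target):
        for a, b, t in zip(r1, r2, rt):
            e = w1 * a + w2 * b - t
            loss += e * e
            g1 += 2.0 * e * a
            g2 += 2.0 * e * b
            n += 1
    return loss / n, g1 / n, g2 / n


d1 = [[1.0, 2.0]]
d2 = [[3.0, 4.0]]
fused = fuse_depth(d1, d2, 0.5, 0.5)

# One gradient descent step on the two learnable weights.
target = [[2.0, 3.0]]
w1, w2, lr = 1.0, 0.0, 0.01
loss_before, g1, g2 = fusion_loss_and_grads(d1, d2, target, w1, w2)
w1, w2 = w1 - lr * g1, w2 - lr * g2
loss_after, _, _ = fusion_loss_and_grads(d1, d2, target, w1, w2)
```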
[0118] In some embodiments, the aforementioned incomplete point cloud sample may be a point cloud including a target object sample, and the target object sample in the aforementioned binocular image sample and the target object sample in the incomplete point cloud sample may be the same target object.
[0119] In step 502, the first depth-completed image is back-projected to obtain a completed 3D point cloud containing the target object sample.
[0120] For example, the first depth-completed image can be mapped from 2D space to 3D space to obtain a completed 3D point cloud containing a sample of the target object.
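As a non-limiting illustration, mapping a depth image from 2D space back to 3D space can be sketched in Python. The disclosure does not specify the camera model; the sketch assumes a pinhole model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` and treats a depth of 0.0 as a pixel with no data.

```python
def back_project(depth, fx, fy, cx, cy):
    """Convert a depth image into a list of 3D points (x, y, z)."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:  # skip pixels that carry no depth
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


completed_depth = [[0.0, 2.0], [0.0, 0.0]]
cloud = back_project(completed_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

This is the inverse of a pinhole projection with the same intrinsics, so each valid pixel yields exactly one 3D point.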
[0121] In step 503, the second model loss value is determined based on the completed 3D point cloud containing the target object sample, the complete point cloud associated with the incomplete point cloud sample, and the target object sample label corresponding to the point cloud.
[0122] For example, a second model loss value for the deep fusion network can be determined based on the completed 3D point cloud containing target object samples, the complete point cloud associated with the incomplete point cloud samples, and the target object sample labels corresponding to the point clouds, combined with the model loss function of the deep fusion network. For example, the model loss function of the deep fusion network can be the cross-entropy function, but is not limited to it; for example, the model loss function of the deep fusion network can be the L1 loss function (also known as the mean absolute error).
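As a non-limiting illustration, the L1 option for the second model loss can be sketched in Python. The sketch assumes the completed point cloud and the complete point cloud are paired point by point (for example, because both come from the same pixel grid), and assumes the target object sample label is a per-point boolean that restricts the loss to target points; both assumptions and the function name `l1_point_cloud_loss` are hypothetical.

```python
def l1_point_cloud_loss(pred, target, labels=None):
    """Mean absolute coordinate error over the labeled, paired points."""
    if len(pred) != len(target):
        raise ValueError("point clouds must be paired one-to-one")
    if labels is None:
        labels = [True] * len(pred)
    total, count = 0.0, 0
    for p, t, keep in zip(pred, target, labels):
        if keep:
            total += sum(abs(a - b) for a, b in zip(p, t))
            count += len(p)
    return total / count if count else 0.0


completed = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0)]
ground_truth = [(0.0, 0.0, 1.5), (1.0, 0.0, 1.0)]
loss_all = l1_point_cloud_loss(completed, ground_truth)
loss_target_only = l1_point_cloud_loss(completed, ground_truth, [True, False])
```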
[0123] In step 504, the deep fusion network is trained based on the second model loss value until the model training termination condition is met.
[0124] For example, the gradient descent algorithm can be used to process the second model loss value, and the parameters in the deep fusion network can be updated according to the result of the gradient descent processing until the model training termination condition is met, thereby obtaining the trained deep fusion network. For example, the model training termination condition can be that the second model loss value meets a second preset condition, such as the second model loss value being less than or equal to a threshold; or, the model training termination condition can be that the number of model training iterations reaches a preset number.
[0125] Therefore, the trained deep fusion network can be obtained through the above steps 501 to 504.
[0126] In step 204, the depth-completed image is back-projected to obtain a completed 3D point cloud containing the target object.
[0127] The optional implementation of step 204 can be found in the optional implementation of step 104 in Figure 1A and other related parts in the embodiments involved in Figure 1A, which will not be repeated here.
[0128] In the above embodiments, this disclosure proposes a point cloud completion preprocessing method that incorporates semantic information by introducing a binocular depth estimation network. By leveraging semantic information, coarse-grained point cloud completion is achieved, which can further improve the convergence and completion effect of subsequent models. Furthermore, this disclosure provides a loss function structure based on semantic information for training the neural network for point cloud completion. By combining a loss function with global and local consistency, the accuracy and robustness of point cloud completion can be further improved.
[0129] Figure 6 is a schematic flowchart of the point cloud completion model training method provided in an embodiment of this disclosure. Figure 7 is an example flowchart of the point cloud completion model training method provided in an embodiment of this disclosure. In some embodiments, the point cloud completion model may include a binocular depth estimation network and a deep fusion network. As shown in Figures 6 and 7, the point cloud completion model training method may include, but is not limited to, the following steps.
[0130] In step 601, the binocular image samples are input into the binocular depth estimation network for semantic analysis and depth information extraction to obtain the semantic information of the target object samples in the binocular image samples and the first depth image of the binocular image samples.
[0131] For example, taking the weld seam in the workpiece as the target object sample, a binocular image sample is obtained. The binocular image sample includes a first image (i.e., RGB1 as shown in Figure 7) and a second image (i.e., RGB2 as shown in Figure 7). RGB1 and RGB2 are input into the binocular depth estimation network for semantic analysis and depth information extraction to obtain the semantic information of the weld seam in the binocular image sample and the first depth image of the binocular image sample.
[0132] In step 602, a coarse-grained completion operation is performed on the incomplete point cloud sample based on semantic information, and the point cloud after the coarse-grained completion operation is projected onto a two-dimensional space to obtain the second depth image corresponding to the incomplete point cloud sample.
[0133] For example, an incomplete point cloud sample is obtained. This incomplete point cloud sample is a point cloud that includes a target object sample. The target object sample in the binocular image sample and the target object sample in the incomplete point cloud sample are the same target object. As shown in Figure 7, semantic information is used to perform point cloud completion preprocessing (i.e., the coarse-grained point cloud completion operation) on the hole regions in the incomplete point cloud sample, and the point cloud after point cloud completion preprocessing is projected into 2D space to obtain the second depth image corresponding to the incomplete point cloud sample.
[0134] In step 603, based on semantic information, the first depth image, and the second depth image, and combined with the model loss function used by the binocular depth estimation network, the first model loss value is determined.
[0135] For example, as shown in Figure 7, depth mutual supervision is used to train the binocular depth estimation network. The depth mutual supervision is carried out through the model loss function used by the binocular depth estimation network: the semantic information, the first depth image, and the second depth image are combined with this model loss function to determine the first model loss value. Optional implementations can be found in the optional implementations of step 302 in Figure 3 and other related parts of the embodiments involved in Figure 3, which will not be elaborated here.
[0136] In step 604, the first depth image and the second depth image are input into the depth fusion network for fusion operation to obtain the first depth-completed image.
[0137] The optional implementation of step 604 can be found in the optional implementation of step 501 in Figure 5 above, and other related parts in the embodiments involved in Figure 5, which will not be repeated here.
[0138] In step 605, the first depth-completed image is back-projected to obtain a completed 3D point cloud containing the target object sample.
[0139] The optional implementation of step 605 can be found in the optional implementation of step 502 in Figure 5 above, and other related parts in the embodiments involved in Figure 5, which will not be repeated here.
[0140] In step 606, the second model loss value is determined based on the completed 3D point cloud containing the target object sample, the complete point cloud associated with the incomplete point cloud sample, and the target object sample label corresponding to the point cloud.
[0141] For example, taking the weld seam of the workpiece as the target object sample, as shown in Figure 7, after the completed 3D point cloud containing the weld seam is obtained, one-to-one supervision of the point cloud can be performed based on the complete point cloud associated with the incomplete point cloud sample and the weld seam label corresponding to the point cloud, and the second model loss value of the deep fusion network can be determined in combination with the loss function.
[0142] The optional implementation of step 606 can be found in the optional implementation of step 503 in Figure 5 above, and other related parts in the embodiments involved in Figure 5, which will not be repeated here.
[0143] In step 607, the stereo depth estimation network is trained based on the first model loss value, and the deep fusion network is trained based on the second model loss value.
[0144] In some embodiments, the stereo depth estimation network and the deep fusion network can be trained together, or they can be trained separately; this disclosure does not specifically limit this. For example, upon obtaining the first model loss value of the stereo depth estimation network, the stereo depth estimation network can be trained based on this first model loss value. After training the stereo depth estimation network, its parameters are fixed, and then the deep fusion network is trained.
[0145] In the above embodiments, this disclosure proposes a point cloud completion preprocessing method that incorporates semantic information by introducing a binocular depth estimation network. By leveraging semantic information, coarse-grained point cloud completion is achieved, which can further improve the convergence and completion effect of subsequent models. Furthermore, this disclosure provides a loss function structure based on semantic information for training the neural network for point cloud completion. By combining a loss function with global and local consistency, the accuracy and robustness of point cloud completion can be further improved.
[0146] Figure 8 is a flowchart illustrating the end-to-end point cloud completion method provided in this embodiment of the present disclosure. As shown in Figure 8, the end-to-end point cloud completion method may include, but is not limited to, the following steps.
[0147] In step 801, the binocular image to be processed and the incomplete point cloud to be processed are acquired.
[0148] In some embodiments, the incomplete point cloud to be processed can be a point cloud including the target object. The target object in the binocular image to be processed and the target object in the incomplete point cloud to be processed can be the same target object.
[0149] In step 802, the binocular image to be processed and the incomplete point cloud to be processed are input into the point cloud completion model to obtain a completed 3D point cloud containing the target object.
[0150] In some embodiments, the point cloud completion model may be pre-trained. The training method for the point cloud completion model can be found in the relevant description of any embodiment of the point cloud completion model training method described above, and will not be repeated here.
[0151] In the above embodiments, end-to-end point cloud completion can be achieved, resulting in complete and accurate point cloud completion, which can significantly improve the accuracy and effect of point cloud completion. This can be applied, by way of example, to weld seam recognition scenarios, improving weld seam recognition performance and thus enhancing the generalization ability, welding efficiency, and welding quality of welding robots.
[0152] Figure 9 is a block diagram of a point cloud completion device based on semantic information provided in an embodiment of this disclosure. As shown in Figure 9, the point cloud completion device based on semantic information may include: a semantic analysis module 901, a completion preprocessing module 902, a fusion module 903, and a back projection module 904.
[0153] The semantic analysis module 901 is used to perform image processing on the stereo image to be processed, and to obtain the semantic information of the target object in the stereo image to be processed and the first depth image of the stereo image to be processed.
[0154] In some embodiments, the semantic analysis module 901 is used to: perform semantic analysis and depth information extraction on the stereo image to be processed based on a pre-trained stereo depth estimation network, to obtain the semantic information of the target object in the stereo image to be processed and the first depth image of the stereo image to be processed; wherein, the input of the stereo depth estimation network is the stereo image to be processed, and the output of the stereo depth estimation network includes the semantic information of the target object and the first depth image, and the stereo depth estimation network has learned the mapping relationship between the stereo image and the semantic information of the target object and the first depth image respectively.
[0155] In some embodiments, the binocular depth estimation network is trained as follows: binocular image samples are input into the binocular depth estimation network for semantic analysis and depth information extraction to obtain semantic information of the target object sample in the binocular image sample and a first depth image of the binocular image sample; based on the semantic information of the target object sample, the first depth image of the binocular image sample, and the second depth image corresponding to the incomplete point cloud sample, a first model loss value is determined in combination with the model loss function adopted by the binocular depth estimation network; the incomplete point cloud sample is a point cloud including the target object sample, and the target object sample in the binocular image sample and the target object sample in the incomplete point cloud sample are the same target object; the binocular depth estimation network is trained based on the first model loss value.
[0156] In some embodiments, the model loss function used by the binocular depth estimation network is composed of a first loss function, a second loss function, and a third loss function; wherein, the first loss function is used to constrain the scale of the first depth image and the scale of the second depth image to be consistent; the second loss function, combined with semantic information, is used to constrain the geometric structure consistency of the same semantic pixels in the first depth image and the second depth image in the local space; and the third loss function is used to constrain the first depth image and the second depth image to be consistent in global geometry.
[0157] The completion preprocessing module 902 is used to perform coarse-grained completion operations on the point cloud to be processed based on semantic information, and project the point cloud after coarse-grained completion operations onto a two-dimensional space to obtain a second depth image corresponding to the point cloud. The point cloud to be processed is a point cloud including the target object, and the target object in the stereo image to be processed and the target object in the point cloud to be processed are the same target object.
[0158] In some embodiments, the completion preprocessing module 902 is configured to: obtain candidate point clouds for point cloud completion from the incomplete point cloud to be processed based on semantic information; and perform coarse-grained completion operations on the hole regions in the incomplete point cloud to be processed based on the candidate point clouds. In some embodiments, the completion preprocessing module 902 is configured to: determine the bounding rectangle of the target object from the incomplete point cloud to be processed based on semantic information; obtain the point cloud within the bounding rectangle from the incomplete point cloud to be processed, and determine the hole regions in the point cloud within the bounding rectangle; and determine the point clouds within the bounding rectangle that are located around the hole regions and lie in the same direction as the target object as candidate point clouds for point cloud completion.
[0159] The fusion module 903 is used to perform a fusion operation on the first depth image and the second depth image to obtain a depth-completed image. In some embodiments, the fusion module 903 is used to: perform a fusion operation on the first depth image of the binocular image to be processed and the second depth image corresponding to the incomplete point cloud based on a pre-trained deep fusion network to obtain a depth-completed image; wherein the input of the deep fusion network includes the first depth image of the binocular image to be processed and the second depth image corresponding to the incomplete point cloud, the output of the deep fusion network is the depth-completed image, and the deep fusion network has learned the mapping from the first depth image of the binocular image and the second depth image corresponding to the incomplete point cloud to the depth-completed image.
[0160] In some embodiments, the deep fusion network is trained as follows: a first depth image of a binocular image sample and a second depth image corresponding to an incomplete point cloud sample are input into the deep fusion network for fusion to obtain a first depth-completed image; the incomplete point cloud sample is a point cloud including a target object sample, and the target object sample in the binocular image sample and the target object sample in the incomplete point cloud sample are the same target object; the first depth-completed image is back-projected to obtain a completed 3D point cloud containing the target object sample; a second model loss value is determined based on the completed 3D point cloud containing the target object sample, the complete point cloud associated with the incomplete point cloud sample, and the target object sample label corresponding to the point cloud; and the deep fusion network is trained based on the second model loss value.
[0161] The back projection module 904 is used to back project the depth-completed image to obtain a completed 3D point cloud containing the target object.
[0162] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.
[0163] Figure 10 is a block diagram of a point cloud completion model training device provided in an embodiment of this disclosure. In some embodiments, the point cloud completion model may include, but is not limited to, a binocular depth estimation network and a deep fusion network. As shown in Figure 10, the point cloud completion model training device may include: a first acquisition module 1001, a completion preprocessing module 1002, a first determination module 1003, a fusion module 1004, a back projection module 1005, a second determination module 1006, and a training module 1007.
[0164] The first acquisition module 1001 is used to input the binocular image samples into the binocular depth estimation network for semantic analysis and depth information extraction, so as to obtain the semantic information of the target object samples in the binocular image samples and the first depth image of the binocular image samples.
[0165] The completion preprocessing module 1002 is used to perform a coarse-grained completion operation on the incomplete point cloud sample based on the semantic information, and project the point cloud after the coarse-grained completion operation onto a two-dimensional space to obtain a second depth image corresponding to the incomplete point cloud sample. The incomplete point cloud sample is a point cloud including the target object sample, and the target object sample in the stereo image sample and the target object sample in the incomplete point cloud sample are the same target object.
[0166] The first determining module 1003 is used to determine a first model loss value based on the semantic information, the first depth image, and the second depth image, combined with the model loss function adopted by the stereo depth estimation network.
[0167] In some embodiments, the model loss function used by the binocular depth estimation network is composed of a first loss function, a second loss function, and a third loss function; wherein, the first loss function is used to constrain the scale of the first depth image and the scale of the second depth image to be consistent; the second loss function, combined with semantic information, is used to constrain the geometric structure consistency of the same semantic pixels in the first depth image and the second depth image in the local space; and the third loss function is used to constrain the first depth image and the second depth image to be consistent in global geometry.
[0168] The fusion module 1004 is used to input the first depth image and the second depth image into the depth fusion network for fusion operation to obtain the first depth-completed image.
[0169] The back projection module 1005 is used to back project the first depth-completed image to obtain a completed 3D point cloud containing the target object sample.
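Back projection can be sketched as the inverse of a pinhole projection. The intrinsics fx, fy, cx, cy, the convention that a zero depth marks an empty pixel, and the function name back_project are assumptions made for this illustration and are not taken from this disclosure.

```python
import numpy as np

def back_project(depth, fx, fy, cx, cy):
    """Convert a depth image into an (N, 3) camera-frame point cloud.

    Assumed pinhole model; pixels with zero depth are skipped.
    """
    v, u = np.nonzero(depth > 0)      # row (v) and column (u) of valid pixels
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

Each valid pixel of the depth-completed image yields exactly one 3D point under this sketch.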
[0170] The second determination module 1006 is used to determine a second model loss value based on the completed 3D point cloud containing the target object sample, the complete point cloud associated with the incomplete point cloud sample, and the target object sample label corresponding to the point cloud.
[0171] The training module 1007 is used to train the binocular depth estimation network based on the first model loss value and to train the depth fusion network based on the second model loss value.
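One common way to compare a completed point cloud with a reference complete point cloud is the Chamfer distance. The sketch below shows that metric only as an assumed stand-in; the paragraph above does not name the metric actually used, and how the target object sample label enters the loss is likewise not specified here.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance (squared Euclidean) between (N, 3) and (M, 3) point sets."""
    # Pairwise squared distances, shape (N, M); suitable for small clouds only
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    # Mean nearest-neighbour distance in both directions
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

Under the further assumption that the label is a boolean mask over the reference points, the comparison could be restricted to the target object with chamfer_distance(completed, complete[mask]).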
[0172] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.
[0173] Figure 11 is a block diagram of an end-to-end point cloud completion device provided in an embodiment of this disclosure. As shown in Figure 11, the end-to-end point cloud completion device may include: a first acquisition module 1101 and a second acquisition module 1102.
[0174] The first acquisition module 1101 is used to acquire the binocular image to be processed and the incomplete point cloud to be processed. The incomplete point cloud to be processed is a point cloud including the target object, and the target object in the binocular image to be processed and the target object in the incomplete point cloud to be processed are the same target object.
[0175] The second acquisition module 1102 is used to input the binocular image to be processed and the incomplete point cloud to be processed into the point cloud completion model to obtain a completed 3D point cloud containing the target object.
[0176] In some embodiments, the point cloud completion model may be pre-trained. The training method for the point cloud completion model can be found in the relevant description of any embodiment of the point cloud completion model training method described above, and will not be repeated here.
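The data flow through the point cloud completion model, as described for the training device above (depth estimation, coarse-grained completion, projection, fusion, back projection), can be sketched as a simple composition of stages. The class name, field names, and the use of plain callables are hypothetical choices for illustration; the actual networks and operations are those described in the method embodiments.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass
class PointCloudCompletionModel:
    # Each stage is injected as a callable so the wiring can be shown
    # without committing to any particular network implementation.
    depth_net: Callable[[Any, Any], Tuple[Any, Any]]  # (left, right) -> (semantics, first depth image)
    coarse_complete: Callable[[Any, Any], Any]        # (incomplete cloud, semantics) -> coarse cloud
    project: Callable[[Any], Any]                     # coarse cloud -> second depth image
    fusion_net: Callable[[Any, Any], Any]             # (first, second depth) -> depth-completed image
    back_project: Callable[[Any], Any]                # depth-completed image -> completed cloud

    def __call__(self, left, right, incomplete_cloud):
        semantics, depth1 = self.depth_net(left, right)
        coarse = self.coarse_complete(incomplete_cloud, semantics)
        depth2 = self.project(coarse)
        fused = self.fusion_net(depth1, depth2)
        return self.back_project(fused)
```

Keeping the stages as separate callables mirrors the module split in the device embodiments, where each module is responsible for exactly one of these steps.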
[0177] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.
[0178] According to embodiments of this disclosure, this disclosure also provides an electronic device and a readable storage medium.
[0179] Figure 12 is a block diagram of an electronic device according to an embodiment of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.
[0180] As shown in Figure 12, the electronic device includes one or more processors 1201, a memory 1202, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected via different buses and can be mounted on a common motherboard or otherwise as required. The processors can process instructions executed within the electronic device, including instructions stored in or on memory to display graphical information of a GUI on an external input / output device (such as a display device coupled to the interface). In other embodiments, multiple processors and / or multiple buses can be used with multiple memories and multiple memory modules, if desired. Similarly, multiple electronic devices can be connected, each providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multiprocessor system). Figure 12 uses a single processor 1201 as an example.
[0181] The memory 1202 is the non-transitory computer-readable storage medium provided in this disclosure. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the methods described in any of the above embodiments of this disclosure. The non-transitory computer-readable storage medium of this disclosure stores computer instructions for causing a computer to perform the methods described in any of the above embodiments of this disclosure.
[0182] The memory 1202, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions / modules corresponding to the semantic information-based point cloud completion method, the point cloud completion model training method, and the end-to-end point cloud completion method in the embodiments of this disclosure. The processor 1201 executes various server functions and data processing by running the non-transitory software programs, instructions, and modules stored in the memory 1202, thereby implementing the semantic information-based point cloud completion method, the point cloud completion model training method, and the end-to-end point cloud completion method in the above method embodiments.
[0183] The memory 1202 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on the use of the electronic device. Furthermore, the memory 1202 may include high-speed random access memory and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 1202 may optionally include memory remotely located relative to the processor 1201, and these remote memories can be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0184] The electronic device may also include an input device 1203 and an output device 1204. The processor 1201, the memory 1202, the input device 1203, and the output device 1204 may be connected by a bus or by other means; Figure 12 takes a bus connection as an example.
[0185] The input device 1203 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device. Examples include a touch screen, a keypad, a mouse, a trackpad, a touchpad, one or more mouse buttons, a trackball, and a joystick. The output device 1204 may include a display device, an auxiliary lighting device (e.g., an LED), and a haptic feedback device (e.g., a vibration motor). The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.
[0186] Various implementations of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and / or combinations thereof. These various implementations may include: implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transferring data and instructions to the storage system, the at least one input device, and the at least one output device.
[0187] These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor and can be implemented using high-level procedural and / or object-oriented programming languages, and / or assembly / machine languages. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, device, and / or apparatus (e.g., a magnetic disk, an optical disk, a memory, or a programmable logic device (PLD)) used to provide machine instructions and / or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals. The term “machine-readable signal” refers to any signal used to provide machine instructions and / or data to a programmable processor.
[0188] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0189] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or middleware components (e.g., application servers), or frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), the Internet, and blockchain networks.
[0190] Computer systems can include clients and servers. Clients and servers are generally remote from each other and typically interact via communication networks. The client-server relationship is created by computer programs that run on the respective computers and have a client-server relationship with each other. A server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service ecosystem that addresses the shortcomings of traditional physical hosts and virtual private server (VPS) services, such as high management difficulty and weak business scalability. A server can also be a server of a distributed system or a server incorporating blockchain technology.
[0191] In the description of this disclosure, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this disclosure. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
[0192] Furthermore, in the description of this disclosure, "multiple" means at least two, such as two, three, etc., unless otherwise expressly and specifically limited.
[0193] Any process or method description in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing custom logic functions or processes, and the scope of preferred embodiments of this disclosure includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which embodiments of this disclosure pertain.
[0194] The logic and / or steps represented in the flowchart or otherwise described herein can, for example, be considered an ordered list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection having one or more wires (electronic device), a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Alternatively, the computer-readable medium may be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and can then be stored in a computer memory.
[0195] It should be understood that various parts of this disclosure can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0196] Those skilled in the art will understand that all or part of the steps of the methods described in the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
[0197] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
[0198] The storage medium mentioned above can be a read-only memory, a disk, or an optical disk, etc. Although embodiments of the present disclosure have been shown and described above, it is to be understood that the above embodiments are exemplary and should not be construed as limiting the present disclosure. Those skilled in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present disclosure.
Claims
1. A point cloud completion method based on semantic information, comprising: performing image processing on a binocular image to be processed to obtain semantic information of a target object in the binocular image to be processed and a first depth image of the binocular image to be processed; performing, based on the semantic information, a coarse-grained completion operation on an incomplete point cloud to be processed, and projecting the point cloud after the coarse-grained completion operation onto a two-dimensional space to obtain a second depth image corresponding to the incomplete point cloud to be processed, wherein the incomplete point cloud to be processed is a point cloud including the target object, and the target object in the binocular image to be processed and the target object in the incomplete point cloud to be processed are the same target object; performing a fusion operation on the first depth image and the second depth image to obtain a depth-completed image; and back-projecting the depth-completed image to obtain a completed 3D point cloud containing the target object.
2. The method as described in claim 1, wherein performing, based on the semantic information, the coarse-grained completion operation on the incomplete point cloud to be processed comprises: obtaining, based on the semantic information, a candidate point cloud for point cloud completion from the incomplete point cloud to be processed; and performing, based on the candidate point cloud, the coarse-grained completion operation on a hole region in the incomplete point cloud to be processed.
3. The method as described in claim 2, wherein obtaining, based on the semantic information, the candidate point cloud for point cloud completion from the incomplete point cloud to be processed comprises: determining, based on the semantic information, a bounding rectangle of the target object in the incomplete point cloud to be processed; obtaining the point cloud within the bounding rectangle from the incomplete point cloud to be processed, and determining a hole region in the point cloud within the bounding rectangle; and determining, as the candidate point cloud for point cloud completion, the points within the bounding rectangle that are located around the hole region and are in the same direction as the target object.
4. The method according to any one of claims 1-3, wherein performing image processing on the binocular image to be processed to obtain the semantic information of the target object in the binocular image to be processed and the first depth image of the binocular image to be processed comprises: performing, based on a pre-trained binocular depth estimation network, semantic analysis and depth information extraction on the binocular image to be processed to obtain the semantic information of the target object in the binocular image to be processed and the first depth image of the binocular image to be processed; wherein the input of the binocular depth estimation network is the binocular image to be processed, the output of the binocular depth estimation network includes the semantic information of the target object and the first depth image, and the binocular depth estimation network has learned the mapping relationships between the binocular image and, respectively, the semantic information of the target object and the first depth image.
5. The method of claim 4, wherein the binocular depth estimation network is trained in the following way: inputting a binocular image sample into the binocular depth estimation network for semantic analysis and depth information extraction to obtain semantic information of a target object sample in the binocular image sample and a first depth image of the binocular image sample; determining a first model loss value based on the semantic information of the target object sample, the first depth image of the binocular image sample, and a second depth image corresponding to an incomplete point cloud sample, combined with the model loss function adopted by the binocular depth estimation network, wherein the incomplete point cloud sample is a point cloud including the target object sample, and the target object sample in the binocular image sample and the target object sample in the incomplete point cloud sample are the same target object; and training the binocular depth estimation network based on the first model loss value.
6. The method of claim 5, wherein the model loss function used by the binocular depth estimation network is composed of a first loss function, a second loss function, and a third loss function; wherein the first loss function is used to constrain the scale of the first depth image and the scale of the second depth image to remain consistent; the second loss function, combined with the semantic information, is used to constrain the geometric structure consistency, in local space, of pixels with the same semantics in the first depth image and the second depth image; and the third loss function is used to constrain the first depth image and the second depth image to remain consistent in global geometry.
7. The method according to any one of claims 1-6, wherein performing the fusion operation on the first depth image and the second depth image to obtain the depth-completed image comprises: fusing, based on a pre-trained depth fusion network, the first depth image of the binocular image to be processed and the second depth image corresponding to the incomplete point cloud to be processed to obtain the depth-completed image; wherein the input of the depth fusion network includes the first depth image of the binocular image to be processed and the second depth image corresponding to the incomplete point cloud to be processed, the output of the depth fusion network is the depth-completed image, and the depth fusion network has learned the mapping relationship from the first depth image of the binocular image and the second depth image corresponding to the incomplete point cloud to the depth-completed image.
8. The method of claim 7, wherein the depth fusion network is trained in the following way: inputting a first depth image of a binocular image sample and a second depth image corresponding to an incomplete point cloud sample into the depth fusion network for a fusion operation to obtain a first depth-completed image, wherein the incomplete point cloud sample is a point cloud including a target object sample, and the target object sample in the binocular image sample and the target object sample in the incomplete point cloud sample are the same target object; back-projecting the first depth-completed image to obtain a completed 3D point cloud containing the target object sample; determining a second model loss value based on the completed 3D point cloud containing the target object sample, the complete point cloud associated with the incomplete point cloud sample, and the target object sample label corresponding to the point cloud; and training the depth fusion network based on the second model loss value.
9. A method for training a point cloud completion model, wherein the point cloud completion model includes a binocular depth estimation network and a depth fusion network, and the training method comprises: inputting a binocular image sample into the binocular depth estimation network for semantic analysis and depth information extraction to obtain semantic information of a target object sample in the binocular image sample and a first depth image of the binocular image sample; performing, based on the semantic information, a coarse-grained completion operation on an incomplete point cloud sample, and projecting the point cloud after the coarse-grained completion operation onto a two-dimensional space to obtain a second depth image corresponding to the incomplete point cloud sample, wherein the incomplete point cloud sample is a point cloud including the target object sample, and the target object sample in the binocular image sample and the target object sample in the incomplete point cloud sample are the same target object; determining a first model loss value based on the semantic information, the first depth image, and the second depth image, combined with the model loss function used by the binocular depth estimation network; inputting the first depth image and the second depth image into the depth fusion network for a fusion operation to obtain a first depth-completed image; back-projecting the first depth-completed image to obtain a completed 3D point cloud containing the target object sample; determining a second model loss value based on the completed 3D point cloud containing the target object sample, the complete point cloud associated with the incomplete point cloud sample, and the target object sample label corresponding to the point cloud; and training the binocular depth estimation network based on the first model loss value, and training the depth fusion network based on the second model loss value.
10. The method of claim 9, wherein the model loss function used by the binocular depth estimation network is composed of a first loss function, a second loss function, and a third loss function; wherein the first loss function is used to constrain the scale of the first depth image and the scale of the second depth image to remain consistent; the second loss function, combined with the semantic information, is used to constrain the geometric structure consistency, in local space, of pixels with the same semantics in the first depth image and the second depth image; and the third loss function is used to constrain the first depth image and the second depth image to remain consistent in global geometry.
11. An end-to-end point cloud completion method, comprising: acquiring a binocular image to be processed and an incomplete point cloud to be processed, wherein the incomplete point cloud to be processed is a point cloud including a target object, and the target object in the binocular image to be processed and the target object in the incomplete point cloud to be processed are the same target object; and inputting the binocular image to be processed and the incomplete point cloud to be processed into a point cloud completion model to obtain a completed 3D point cloud containing the target object; wherein the point cloud completion model is a model trained based on the method described in claim 9 or 10.
12. A point cloud completion device based on semantic information, comprising: a semantic analysis module, used to perform image processing on a binocular image to be processed to obtain semantic information of a target object in the binocular image to be processed and a first depth image of the binocular image to be processed; a completion preprocessing module, used to perform a coarse-grained completion operation on an incomplete point cloud to be processed based on the semantic information, and to project the point cloud after the coarse-grained completion operation onto a two-dimensional space to obtain a second depth image corresponding to the incomplete point cloud to be processed, wherein the incomplete point cloud to be processed is a point cloud including the target object, and the target object in the binocular image to be processed and the target object in the incomplete point cloud to be processed are the same target object; a fusion module, used to perform a fusion operation on the first depth image and the second depth image to obtain a depth-completed image; and a back projection module, used to back-project the depth-completed image to obtain a completed 3D point cloud containing the target object.
13. A point cloud completion model training device, wherein the point cloud completion model includes a binocular depth estimation network and a depth fusion network, and the training device comprises: a first acquisition module, used to input a binocular image sample into the binocular depth estimation network for semantic analysis and depth information extraction, so as to obtain semantic information of a target object sample in the binocular image sample and a first depth image of the binocular image sample; a completion preprocessing module, used to perform a coarse-grained completion operation on an incomplete point cloud sample based on the semantic information, and to project the point cloud after the coarse-grained completion operation onto a two-dimensional space to obtain a second depth image corresponding to the incomplete point cloud sample, wherein the incomplete point cloud sample is a point cloud including the target object sample, and the target object sample in the binocular image sample and the target object sample in the incomplete point cloud sample are the same target object; a first determination module, used to determine a first model loss value based on the semantic information, the first depth image, and the second depth image, combined with the model loss function adopted by the binocular depth estimation network; a fusion module, used to input the first depth image and the second depth image into the depth fusion network for a fusion operation to obtain a first depth-completed image; a back projection module, used to back-project the first depth-completed image to obtain a completed 3D point cloud containing the target object sample; a second determination module, used to determine a second model loss value based on the completed 3D point cloud containing the target object sample, the complete point cloud associated with the incomplete point cloud sample, and the target object sample label corresponding to the point cloud; and a training module, used to train the binocular depth estimation network based on the first model loss value and to train the depth fusion network based on the second model loss value.
14. An end-to-end point cloud completion device, comprising: a first acquisition module, used to acquire a binocular image to be processed and an incomplete point cloud to be processed, wherein the incomplete point cloud to be processed is a point cloud including a target object, and the target object in the binocular image to be processed and the target object in the incomplete point cloud to be processed are the same target object; and a second acquisition module, used to input the binocular image to be processed and the incomplete point cloud to be processed into a point cloud completion model to obtain a completed 3D point cloud containing the target object; wherein the point cloud completion model is a model trained based on the method described in claim 9 or 10.
15. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method of any one of claims 1-8, 9-10, and 11.
16. A storage medium storing instructions, wherein, when the instructions are executed on an electronic device, the electronic device is caused to perform the method of any one of claims 1-8, 9-10, and 11.
17. A computer program product comprising a computer program, wherein, when the computer program is executed by a processor, the steps of the method according to any one of claims 1-8, 9-10, and 11 are implemented.