An open-vocabulary 3D object detection method based on geometric perception and multi-modal feature alignment
By optimizing the 3D bounding box pseudo-labels and multimodal feature alignment through geometric perception, the problem of insufficient generalization ability of 3D object detection methods on unseen categories is solved, achieving higher localization accuracy and classification accuracy, and enhancing the practical application capability of the model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-04-17
- Publication Date
- 2026-06-12
AI Technical Summary
Existing 3D object detection methods generalize poorly in complex, open real-world environments, with low detection accuracy and robustness on unseen categories in particular. This is mainly because missing points and erroneous outliers in RGB-D point clouds make the 3D bounding box pseudo-labels inaccurate, which degrades the model's spatial localization and classification.
By introducing geometric perception and multimodal feature alignment methods, the size and orientation of 3D bounding box pseudo-labels are optimized. Point cloud and image features are fused with text features during the classification stage to construct a more discriminative feature representation, thereby achieving accurate localization and classification of new categories of objects.
It improves the model's localization accuracy and classification accuracy for new object categories, enhances its generalization and robustness in real-world scenarios, and particularly improves its detection performance on the SUN RGB-D and ScanNet datasets.
Figure CN122196640A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of open-vocabulary object detection, and specifically relates to an open-vocabulary 3D object detection method based on geometric perception and multimodal feature alignment.
Background Technology
[0002] 3D point cloud object detection is a fundamental task in scene understanding and has made significant progress in recent years, finding wide application in fields such as autonomous driving and robotics. Existing 3D object detection methods typically rely on a predefined set of object categories, learning feature representations for these categories from large amounts of labeled point cloud data. These methods achieve high detection accuracy and robustness on previously seen categories, but their generalization ability to unseen categories is limited. This limitation restricts their applicability in complex, open real-world environments.
[0003] To overcome the insufficient generalization ability of existing 3D object detection methods, open-vocabulary object detection, whose goal is to enable models to generalize to unseen categories, has become a research hotspot in this field. In particular, vision-language models (VLMs, such as CLIP) pre-trained on large-scale image-text pairs have driven the rapid development of open-vocabulary 2D object detection, demonstrating strong generalization ability, robustness, and zero-shot detection capability. However, transferring these capabilities to the 3D domain is not easy, mainly because large-scale paired point cloud-text data are lacking and training VLMs on 3D scenes is extremely costly. To address this issue, existing research exploits the natural correspondence between images and point clouds, together with the aligned image-text feature representations learned by VLMs, and uses images as an intermediate modality connecting point clouds and text to transfer knowledge from 2D vision-language models to open-vocabulary 3D object detection.
[0004] Nevertheless, the significant differences between the 2D and 3D modalities limit the effectiveness of such knowledge transfer. Unlike 2D object detection, 3D object detection requires precise spatial localization in addition to object classification, and therefore relies on accurate 3D bounding box supervision. To generalize to unseen categories, existing methods typically project the detection results of open-vocabulary 2D object detection models into 3D space to obtain 3D bounding box pseudo-labels. However, RGB-D point clouds often suffer from severe missing points and erroneous outliers, mainly because of the inherent limitations of depth sensors and errors in the geometric projection process. Specifically: (1) depth sensors rely on infrared structured light or stereo vision, and their measurements are easily affected by highly reflective, transparent, or textureless surfaces (such as glass, mirrors, or white walls), resulting in large-area depth loss; (2) the effective depth range of the sensor is limited, and depth quantization introduces significant noise and measurement bias in distant areas; (3) at target boundaries with abrupt depth changes, the sensor's depth estimation capability is limited, leading to floating points and spatial misalignment in edge areas; (4) scene-related factors such as self-occlusion and fine structures further prevent some areas from being observed, because a single viewpoint can only capture the visible surface of an object. These limitations of RGB-D point clouds pose a major challenge to accurate 3D perception and representation learning. The resulting 3D bounding box pseudo-labels deviate significantly from manual annotations in key parameters such as size, orientation, and position. Such deviations not only hinder the model from learning accurate spatial localization but also reduce the discriminative power of point cloud features among different target categories, thereby increasing the difficulty of classification.
Summary of the Invention
[0005] To address the problems existing in the prior art, this invention proposes an open vocabulary 3D object detection method based on geometric perception and multimodal feature alignment. The method includes: acquiring point cloud data, inputting the point cloud data into a trained open vocabulary 3D object detection model, and obtaining the detection result.
[0006] The open-vocabulary 3D object detection model perceives objects in space in two phases: localization and classification. In the localization phase, a 3D bounding box pseudo-label size optimization module is constructed to correct oversized or undersized bounding boxes, resulting in appropriately sized pseudo-labels, and a 3D bounding box pseudo-label orientation correction module is used to correct the orientation of the pseudo-labels. In the classification phase, instance-level point cloud data, image data, and text prompt data are acquired. Feature extraction is performed on the point cloud data, image data, and text prompt data to obtain instance-level point cloud features, image features, and text features, respectively. The instance-level point cloud features and image features are fused to obtain instance-level fused features. The fused features and text features are combined to construct fused feature-text feature pairs, and contrastive learning is performed to align the fused features and text features in the feature space. The similarity between the fused features and text features is then calculated to obtain the category of the new object, thus classifying the new object category and completing the open-vocabulary 3D object detection.
[0007] The beneficial effects of this invention are:
[0008] This invention optimizes the bounding box pseudo-labels of new object categories by introducing prior geometric information, and constructs a module that aligns fused point cloud-image features with text features, to achieve localization and classification of new object categories. Compared to existing methods, this invention achieves higher localization accuracy on the SUN RGB-D and ScanNet datasets. For new object categories, this invention can achieve accurate localization and classification in 3D space without new category labels, enhancing the model's generalization and robustness in real-world scenarios.
Attached Figure Description
[0009] Figure 1 is an overall structural diagram of the present invention;
[0010] Figure 2 is a diagram of the 3D bounding box pseudo-label size optimization of the present invention;
[0011] Figure 3 is a diagram of the 3D bounding box pseudo-label orientation optimization of the present invention;
[0012] Figure 4 is an overall flowchart of the present invention;
[0013] Figure 5 is a visual comparison chart of the present invention;
[0014] Figure 6 is a visualization of the results of the present invention.
Detailed Implementation
[0015] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0016] To address the problem of poor localization accuracy and inaccurate classification of new object categories in open-vocabulary 3D object detection scenarios, leading to insufficient generalization performance of the model, this invention studies the open-vocabulary 3D object detection problem from both localization and classification perspectives. It proposes an open-vocabulary 3D object detection method based on geometric perception and multimodal feature alignment. This method utilizes geometric structural information constraints to optimize the size, position, and orientation of the bounding boxes for new object categories, enhancing the model's ability to localize these new categories. Furthermore, it fuses instance-level point cloud features and image features to construct a more discriminative feature representation. Aligning the fused features with text features improves the accuracy and stability of the model's classification of new object categories. In addition, this invention enhances text prompts using scene context information and the unique functionality of objects, enriching the semantic information of the text prompts. This invention utilizes geometric structural information and constructs a more discriminative feature representation to improve the accuracy and generalization performance of open-vocabulary 3D object detection.
[0017] This invention proposes an open-vocabulary 3D object detection method based on geometric perception and multimodal feature alignment, improving the model's ability to locate and classify new object categories from two dimensions: localization and classification. In the localization stage, geometric structural information constraints are introduced to optimize the size, position, and orientation of the pseudo-labels of the bounding boxes of new object categories, thereby enhancing the model's ability to locate these new categories. In the classification stage, instance-level image features and point cloud features are fused to enhance the discriminative power of the features. Then, the fused features are aligned with text features across modalities to achieve the classification of new object categories.
[0018] As shown in Figure 4, the specific process of the open-vocabulary 3D object detection method based on geometric perception and multimodal feature alignment is as follows. Open-vocabulary 3D object detection perceives objects in space in two stages: localization and classification. Localization stage: A 3D bounding box pseudo-label size optimization module is constructed to correct oversized or undersized bounding boxes, resulting in reasonably sized pseudo-labels. A 3D bounding box pseudo-label orientation correction module is used to correct the orientation of the pseudo-labels. Classification stage: After the class-agnostic localization stage has been trained, the model can locate new categories of objects in the scene. To further classify these new categories, instance-level point cloud data, image data, and text prompt data are acquired. Feature extraction is performed on the point cloud data, image data, and text prompt data to obtain instance-level point cloud features, image features, and text features. Instance-level point cloud features and image features are fused to obtain instance-level fused features. These fused features are then combined with text features to construct fused feature-text feature pairs, which are trained through contrastive learning: fused features and text features belonging to the same object class are treated as positive sample pairs, and all others as negative sample pairs. In the feature space, the distance between positive sample pairs is reduced while the distance between negative sample pairs is increased, aligning the fused features and text features. During the inference phase, the similarity between the fused features and text features is calculated to determine the category of the new object, thus achieving object classification. After training through the localization and classification stages, the model can detect new object categories in space, achieving open-vocabulary 3D object detection.
[0019] In terms of localization, this invention designs a geometry-guided 3D bounding box optimization module: First, a target mask in the image is obtained by inference using a pre-trained vision-language model. Then, it is projected onto 3D space using the camera's intrinsic and extrinsic parameter matrices to obtain a point cloud mask. Based on this mask, initial 3D bounding box pseudo-labels are estimated. However, due to severe incompleteness and outlier noise in the RGB-D point cloud, these pseudo-labels often suffer from inaccurate dimensions and orientation deviations. To mitigate the impact of outlier noise, this invention incorporates prior information on the dimensions of common indoor objects. Specifically, typical object dimensions are extracted using GPT-4 as prior constraints on object dimensions. For bounding boxes that are too large due to outliers, the point cloud of the instance object is iteratively downsampled to filter out outliers until the bounding box dimensions match the prior dimensions. For bounding boxes that are too small due to missing point clouds, the 3D bounding boxes are completed based on the prior dimensions to better approximate the true dimensions of the target. Furthermore, due to the incompleteness of the point cloud causing deviations in the orientation of the bounding boxes, this invention utilizes a point cloud surface with relatively complete geometric information along the camera's line of sight as a geometric reference to correct the bounding box orientation, making it more closely resemble the geometry of real objects. In the classification stage, this invention proposes a point cloud-image fusion and text alignment module. This module no longer uses the image as an intermediate bridge connecting the point cloud and text, but directly constructs point cloud-text pairs. 
To alleviate the problem of decreased feature discriminativeness caused by incomplete point cloud and poor quality of bounding box pseudo-labels, it further fuses instance-level point cloud (geometric information) and instance-level image (texture and color information) features. The fused features have stronger discriminative power, and cross-modal feature alignment is achieved through contrastive learning in the feature space, enabling the classification of new object categories. In addition, this invention incorporates typical contextual scenes and object functionality into the text prompts, further enhancing multimodal alignment and semantic discriminative capabilities.
[0020] This invention divides open-vocabulary 3D object detection into two stages: localization and classification, as shown in Figure 1. In the localization phase, to mitigate the size deviation of pseudo-labels, a 3D bounding box pseudo-label size optimization module is designed: prior size information obtained from a large language model is introduced, and the relevant point cloud is progressively downsampled until the bounding box size converges within the prior constraint range. To correct the orientation error of pseudo-labels, a 3D bounding box pseudo-label orientation correction module is designed: the bounding box orientation is adjusted to bring it closer to a reliable point cloud surface. In the classification phase, to alleviate the loss of feature discriminativeness caused by incomplete point clouds and bounding box pseudo-label deviations, a module that aligns point cloud-image fused features with text features is introduced. Instead of using the image as an intermediate modality connecting the point cloud and text, fused feature-text feature pairs are constructed directly, and the features of the two modalities are aligned in a shared feature space. For text prompt design, the expressiveness and discriminativeness of the text description are improved by combining the typical scenes and unique functional attributes of the target.
[0021] Given a point cloud frame $P$, its paired RGB image $I$, and the associated text $T$, the 2D and 3D bounding boxes are denoted $B^{2D}$ and $B^{3D}$, and the 2D and 3D masks are denoted $M^{2D}$ and $M^{3D}$. First, the 2D bounding boxes $B^{2D}$ are obtained in the image $I$ using the pre-trained open-vocabulary 2D detector OV-DINO, and SAM is then used to generate the corresponding masks $M^{2D}$. Based on the projection relationship between the image and the point cloud, the 2D mask $M^{2D}$ is mapped into 3D space with the projection matrix to obtain $M^{3D}$. Subsequently, mimicking manual annotation, the smallest enclosing cuboid is fitted to $M^{3D}$ to generate the initial 3D bounding box pseudo-label $B^{3D}$. However, because the point cloud is incomplete and contains outliers, the estimated pseudo-labels $B^{3D}$ deviate significantly in key parameters such as size, position, and orientation, which can mislead the subsequent training process.
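The pseudo-label generation step above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, a depth map registered to the RGB image, and an axis-aligned box in place of the oriented cuboid. The function names are illustrative only.

```python
import numpy as np

def backproject_mask(depth, mask, fx, fy, cx, cy):
    """Lift the masked depth pixels to 3D camera coordinates (pinhole model)."""
    depth = np.asarray(depth, dtype=float)
    valid = np.asarray(mask, dtype=bool) & (depth > 0)  # skip missing depth
    v, u = np.nonzero(valid)                            # pixel rows, columns
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def fit_axis_aligned_box(points):
    """Return (center, size) of the tightest axis-aligned box around the points."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    return (lo + hi) / 2.0, hi - lo
```

Pixels with zero depth are dropped before lifting, which is exactly where the missing-point problem described above enters: the fitted box can only cover what the sensor observed.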
[0022] In open-vocabulary scenarios, initial 3D bounding box pseudo-labels are primarily generated by projecting a 2D mask onto 3D space using a projection matrix. However, limited by the imaging mechanism of RGB-D sensors, the acquired point cloud data often suffers from two serious biases: first, outliers around the object cause the estimated bounding box to be much larger than the actual physical size; second, occlusion or incomplete scanning leads to severe point cloud incompleteness, resulting in the loss of some geometric structural information, and the estimated bounding box cannot completely cover the object. To address this problem, this invention proposes a bounding box pseudo-label size optimization strategy that combines geometric prior constraints from a large language model with iterative downsampling.
[0023] Because annotated point cloud data are lacking, the model cannot learn the physical scale of unknown categories through supervised learning. This invention uses the commonsense reasoning ability of large language models (such as GPT-4) to extract typical geometric priors for each target category. For each category $c$, the length, width, and height prior is defined as $S_c = (l_c, w_c, h_c)$. This size prior provides a physical "anchor" for subsequent pseudo-label optimization. As shown in Figure 2, for a 3D bounding box pseudo-label $B^{3D}$ that is too large because of outliers, the target point cloud is iteratively downsampled to eliminate the outliers, and the bounding box is re-estimated after each iteration until the relative error between the updated size $\hat{S}$ and the prior size $S_c$ is less than the preset threshold $\epsilon$. The outliers are thereby filtered out, and the bounding box size converges to a reasonable range consistent with the prior size constraint, as shown in Equation (1-1).
[0024] $$\hat{B}^{3D} = \mathcal{F}\left(B^{3D}\right) \quad \text{s.t.} \quad \frac{\lVert \hat{S} - S_c \rVert}{\lVert S_c \rVert} < \epsilon \tag{1-1}$$
[0025] where $\mathcal{F}(\cdot)$ denotes the outlier filtering operation and $\hat{B}^{3D}$ denotes the 3D bounding box pseudo-label after size optimization.
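The size optimization described above can be sketched as follows. The sketch makes several assumptions that the text does not specify: outliers are removed by repeatedly dropping a fixed fraction of the points farthest from the median centre (scaled by the prior), the box is axis-aligned, and undersized dimensions are padded up to the prior. The names `tol`, `drop_frac`, and `max_iter` are hypothetical parameters.

```python
import numpy as np

def box_size(points):
    """Extent of the axis-aligned box enclosing the points."""
    return points.max(axis=0) - points.min(axis=0)

def refine_box_size(points, prior_size, tol=0.2, drop_frac=0.05, max_iter=50):
    """Trim outliers while the box exceeds the prior, then pad undersized dims."""
    pts = np.asarray(points, dtype=float)
    prior = np.asarray(prior_size, dtype=float)
    for _ in range(max_iter):
        rel_err = (box_size(pts) - prior) / prior
        if np.all(rel_err < tol) or len(pts) <= 3:
            break  # no dimension is oversized any more
        # Distance from the median centre, scaled by the prior so that
        # elongated objects are not trimmed along their long axis first.
        dist = np.linalg.norm((pts - np.median(pts, axis=0)) / prior, axis=1)
        keep = max(3, int(len(pts) * (1.0 - drop_frac)))
        pts = pts[np.argsort(dist)[:keep]]
    center = (pts.max(axis=0) + pts.min(axis=0)) / 2.0
    size = np.maximum(box_size(pts), prior)  # complete dims below the prior
    return center, size, pts
```

A call such as `refine_box_size(instance_points, prior_size=(2.0, 1.6, 0.5))` returns the refined centre, the refined size, and the retained points; the prior values here are placeholders, not values from the source.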
[0026] In point clouds acquired by RGB-D sensors, target objects are often partially occluded or certain material surfaces fail to reflect depth information, resulting in severely incomplete point clouds. When the point cloud is incomplete, geometric information is missing, leading to significant deviations in the orientation of the estimated bounding box. Typically, point cloud surfaces acquired along the camera's line of sight are more complete and denser, providing reliable geometric cues for orientation estimation. This invention optimizes the orientation of the initially estimated bounding box pseudo-labels by aligning them with reliable surfaces of the target point cloud, ensuring that the bounding boxes better match the spatial geometry of the target, thereby improving the quality of the pseudo-labels.
[0027] As shown in Figure 3, from the bird's-eye view (BEV), the green bounding box is correctly oriented, while the red bounding box is obtained by fitting the minimum-area rectangle to the target point cloud. Because the target point cloud is incomplete, the orientation of the red bounding box deviates significantly. To correct it, the distance from each point in the target point cloud to the bounding box along the camera's line of sight is calculated in the BEV. First, principal component analysis (PCA) is used to estimate the initial orientation $\theta_0$ of the 3D bounding box; then the bounding box is rotated iteratively with step size $\Delta\theta$ until the whole search range is covered, and the total distance $D(\theta_k)$ from all points to the bounding box along the camera's line of sight is calculated for each candidate orientation $\theta_k$. Finally, the candidate orientation that minimizes $D$ is taken as the optimized bounding box orientation. This method requires no additional training and corrects the bounding box orientation directly from geometric properties. The formal expressions are shown in Equations (1-2) and (1-3):
[0028] First, the initial orientation $\theta_0$ is estimated using PCA. Then a discrete iterative search is performed with step size $\Delta\theta$:
[0029] $$\theta_k = \theta_0 + k\,\Delta\theta, \quad k = 0, 1, \ldots, K \tag{1-2}$$
[0030] Finally, the optimal orientation $\theta^{*}$ is obtained:
[0031] $$\theta^{*} = \arg\min_{\theta_k} D(\theta_k) \tag{1-3}$$
[0032] where $D(\theta_k)$ denotes the total distance, under candidate orientation $\theta_k$, between all points in the target point cloud and the bounding box along the camera's line of sight.
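The orientation search above can be sketched in the BEV plane as follows. Several choices here are assumptions rather than details from the text: the box for each candidate angle is the tightest rectangle at that angle, the distance is measured from each point toward the camera until the ray leaves the rectangle, and the search spans 90 degrees because a rectangle's orientation repeats every 90 degrees. The parameters `step_deg` and `span_deg` are hypothetical.

```python
import numpy as np

def pca_angle(xy):
    """Angle (radians) of the principal axis of 2D points."""
    centered = xy - xy.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    major = vecs[:, -1]  # eigenvector of the largest eigenvalue
    return np.arctan2(major[1], major[0])

def sight_distance(xy, theta, to_camera):
    """Total distance from the points to the box boundary, moving toward the camera."""
    axes = np.array([[np.cos(theta), np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])
    local = xy @ axes.T              # point coordinates in the box frame
    d = axes @ to_camera             # ray direction in the box frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    bound = np.where(d > 0, hi, lo)  # face the ray can exit through, per axis
    safe_d = np.where(d == 0, 1.0, d)
    t = np.where(d == 0, np.inf, (bound - local) / safe_d)
    return t.min(axis=1).sum()

def correct_orientation(xy, view_dir, step_deg=1.0, span_deg=90.0):
    """Pick the candidate angle whose box hugs the camera-facing surface best."""
    xy = np.asarray(xy, dtype=float)
    view_dir = np.asarray(view_dir, dtype=float)
    to_camera = -view_dir / np.linalg.norm(view_dir)
    theta0 = pca_angle(xy)
    candidates = theta0 + np.deg2rad(np.arange(0.0, span_deg, step_deg))
    costs = [sight_distance(xy, th, to_camera) for th in candidates]
    return candidates[int(np.argmin(costs))]
```

When only two faces of an object are visible, the points form an L shape in BEV and PCA alone gives a skewed axis; the search then selects the angle at which the visible faces coincide with the camera-facing sides of the rectangle.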
[0033] Next, the localization part of the 3D object detection model 3DETR is trained using 3D bounding box pseudo-labels optimized for size and orientation, enabling the model to have class-independent localization capabilities.
[0034] Inspired by CLIP, this invention directly constructs instance-level point cloud-text mappings. Specifically, it first uses a pre-trained 2D open-vocabulary detector, OV-DINO, to infer the bounding boxes of target objects and their corresponding category labels on RGB images. Then, based on the camera's intrinsic and extrinsic projection matrices, these instance-level image-text correspondences are mapped into the 3D scene to obtain instance-level point cloud-text category mappings. This yields point cloud-text pair data, enabling open-vocabulary classification through contrastive learning during the classification stage without relying on RGB images as an intermediate bridge.
[0035] During the inference phase, given an instance-level point cloud $P_{ins}$ and the set of text prompts $\{T_i\}_{i=1}^{C}$ for the categories to be detected, the normalized 3D point cloud feature $f^{3D}$ and the text features $f^{text}_i$ are extracted using a point cloud encoder and a text encoder, respectively. The similarity between the instance-level point cloud feature and each text prompt is calculated using cosine similarity, as shown in formula (1-4):
[0036] $$s_i = \frac{f^{3D} \cdot f^{text}_i}{\lVert f^{3D} \rVert \, \lVert f^{text}_i \rVert} \tag{1-4}$$
[0037] The text category with the highest similarity score is selected as the final category prediction, as shown in formula (1-5):
[0038] $$\hat{c} = \arg\max_{i} s_i \tag{1-5}$$
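The inference rule above, cosine similarity followed by an arg-max over the candidate categories, can be sketched as below. The feature vectors are assumed to come from encoders that are not shown here, and the function names are illustrative.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def classify(instance_feat, text_feats, class_names):
    """Return the class whose text feature is most similar to the instance feature."""
    f = l2_normalize(np.asarray(instance_feat, dtype=float))
    t = l2_normalize(np.asarray(text_feats, dtype=float))
    scores = t @ f  # cosine similarity to each text prompt
    return class_names[int(np.argmax(scores))], scores
```

Because the candidate set is just a list of text features, adding a new category at inference time only requires encoding one more prompt.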
[0039] Because point clouds are generally incomplete, even after optimization with bounding box pseudo-labels, unreliable 3D bounding box pseudo-labels may still be estimated for severely incomplete target point clouds, which can seriously mislead model training during the classification stage. To solve this problem, this invention utilizes fused point cloud-image features to enhance the discriminative power between categories.
[0040] Specifically, for an instance-level point cloud and its predicted 3D bounding box, the corresponding image patch is cropped using the projection matrix. Point cloud features $f^{3D}$ are then extracted with a 3D encoder, image patch features $f^{2D}$ with the CLIP image encoder, and text features $f^{text}$ with the CLIP text encoder. On this basis, a gating network adaptively fuses the 2D and 3D features across modalities to produce the fused features $f^{fuse}$. Finally, through contrastive learning, the fused features $f^{fuse}$ are aligned with the corresponding text features $f^{text}$ in the feature space, enabling open-vocabulary classification. The contrastive loss is computed as shown in formula (1-6):
[0041] $$\mathcal{L}(z, z') = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{\lvert P(i) \rvert} \sum_{j \in P(i)} \log \frac{\exp\left(z_i \cdot z'_j / \tau\right)}{\sum_{k=1}^{N} \exp\left(z_i \cdot z'_k / \tau\right)} \tag{1-6}$$
[0042] where $\mathcal{L}$ is the contrastive loss, $N$ is the batch size, $z_i$ and $z'_j$ are the feature representations of the sample pairs to be aligned, $\lvert P(i) \rvert$ is the number of positive samples for sample $i$, and $\tau$ is a temperature hyperparameter used to scale the similarity scores.
[0043] Finally, the cross-modal contrastive loss between the fused features $f^{fuse}$ and the text features $f^{text}$ is shown in Equation (1-7):
[0044] $$\mathcal{L}_{cross} = \mathcal{L}\left(f^{fuse}, f^{text}\right) \tag{1-7}$$
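A minimal NumPy sketch of the two steps above, gated fusion and contrastive alignment, is given below. The gate form (a sigmoid over a linear map of the concatenated features), the weight shapes, and the default temperature are assumptions; the text only states that a gating network fuses the features and that pairs of the same class are positives.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f3d, f2d, W, b):
    """Blend features per dimension: gate * f3d + (1 - gate) * f2d.

    f3d, f2d: (N, D) arrays; W: (2D, D) weights; b: (D,) bias (assumed shapes).
    """
    gate = sigmoid(np.concatenate([f3d, f2d], axis=1) @ W + b)
    return gate * f3d + (1.0 - gate) * f2d

def contrastive_loss(fused, text, labels, tau=0.07):
    """Same-label (fused, text) pairs are positives; all other pairs are negatives."""
    z = fused / np.linalg.norm(fused, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = z @ t.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]).astype(float)
    per_anchor = -(pos * log_prob).sum(axis=1) / pos.sum(axis=1)
    return per_anchor.mean()
```

In training, `W` and `b` would be learned jointly with the encoders in an autodiff framework; the NumPy version only shows the forward computation.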
[0045] In open-vocabulary 3D object detection tasks, relying solely on isolated category labels often fails to capture the rich information of target objects. To address this bottleneck, this invention proposes a text prompt enhancement strategy driven by a large language model. Different semantic entities are strongly associated with their environments; for example, beds are typically found in bedrooms, and sofas are usually placed in living rooms. Furthermore, different objects generally possess unique functional attributes. This invention extends text prompts to structured descriptions that include scene context information and target functional attribute information. Specifically, leveraging the zero-shot commonsense reasoning capabilities of GPT-4, a semantically richer text prompt is generated using the designed template: “A photo of a typical {class}, which is commonly found in {scene}, and is often used for {affordance}.” This enhanced text prompt enriches the discriminative feature representation of single-category labels. In the cross-modal alignment stage, by constructing this semantically rich “fusion feature-text feature pair,” the model can achieve more robust feature alignment within the feature space, improving its cross-category generalization ability and semantic alignment robustness in open scenes. Comparison and visualization results are shown in Figure 5 and Figure 6.
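Filling the template above is a plain string substitution, sketched here. The placeholder is named `cls` because `class` is a reserved word in Python, and the affordance value in the usage example is a hypothetical illustration; in the described method the scene and affordance would be generated by the language model.

```python
TEMPLATE = ("A photo of a typical {cls}, which is commonly found in {scene}, "
            "and is often used for {affordance}.")

def build_prompt(cls, scene, affordance):
    """Fill the enhanced prompt template for one category."""
    return TEMPLATE.format(cls=cls, scene=scene, affordance=affordance)

# Hypothetical entry: the scene follows the bed/bedroom example in the text,
# while the affordance is an assumed value.
prompt = build_prompt("bed", "bedrooms", "sleeping")
```

The resulting string would then be passed to the text encoder in place of the bare category name.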
[0046] The above-described embodiments further illustrate the purpose, technical solution, and advantages of the present invention. It should be understood that the above-described embodiments are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made to the present invention within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. An open-vocabulary 3D object detection method based on geometric perception and multimodal feature alignment, characterized in that the method includes: acquiring point cloud data, inputting the point cloud data into a trained open-vocabulary 3D object detection model, and obtaining the detection results; the open-vocabulary 3D object detection model perceives objects in space in two stages: localization and classification; in the localization stage, a 3D bounding box pseudo-label size optimization module is constructed to correct oversized or undersized bounding boxes, resulting in appropriately sized pseudo-labels, and a 3D bounding box pseudo-label orientation correction module is used to correct the orientation of the pseudo-labels; in the classification stage, instance-level point cloud data, image data, and text prompt data are acquired; feature extraction is performed on the point cloud data, image data, and text prompt data respectively to obtain instance-level point cloud features, image features, and text features; the instance-level point cloud features and image features are fused to obtain instance-level fused features; the fused features and text features are used to construct fused feature-text feature pairs, and contrastive learning is performed to align the fused features and text features in the feature space; the similarity between the fused features and text features is calculated to obtain the category of the new object, thereby classifying the new object category and completing the open-vocabulary 3D object detection.
2. The open-vocabulary 3D object detection method based on geometric perception and multimodal feature alignment according to claim 1, characterized in that processing the point cloud data using the 3D bounding box pseudo-label size optimization module includes: acquiring point cloud data and using a large language model to obtain the size priors of new object categories; defining a length, width, and height prior set for each category; using the prior information in the prior sets as anchors; for oversized bounding box pseudo-labels, iteratively downsampling the target point cloud based on the anchors, filtering out outliers that do not belong to the detected object, and re-estimating the bounding box after each iteration until the difference between the updated bounding box size and the size prior is less than a preset threshold, at which point the bounding box is retained; for undersized bounding boxes, extending the bounding box size to fall within the size prior range.
3. The open-vocabulary 3D object detection method based on geometric perception and multimodal feature alignment according to claim 1, characterized in that processing the point cloud data using the 3D bounding box pseudo-label orientation correction module includes: when the point cloud is incomplete, aligning the initially estimated bounding box with the surface of the target point cloud to correct its orientation, ensuring that the bounding box conforms to the spatial geometry of the target.
4. The open-vocabulary 3D object detection method based on geometric perception and multimodal feature alignment according to claim 1, characterized in that feature extraction for the point cloud data, image data, and text data includes: extracting instance-level point cloud features from instance-level point cloud data using the encoder of a 3D object detection model; extracting instance-level image features from instance-level image data using the CLIP image encoder; and extracting candidate category text features from candidate category text prompt data using the CLIP text encoder.
5. The open-vocabulary 3D object detection method based on geometric perception and multimodal feature alignment according to claim 1, characterized in that constructing the instance-level point cloud-image fusion features includes: fusing the point cloud features and image features using a gating network to obtain instance-level point cloud-image fusion features.
6. The open-vocabulary 3D object detection method based on geometric perception and multimodal feature alignment according to claim 1, characterized in that calculating the similarity between the fused features and text features includes: $s = \frac{f^{fuse} \cdot f^{text}}{\lVert f^{fuse} \rVert \, \lVert f^{text} \rVert}$; where $f^{fuse}$ denotes the fused features, $f^{text}$ denotes the text features, and $s$ denotes the similarity score.
7. The open-vocabulary 3D object detection method based on geometric perception and multimodal feature alignment according to claim 1, characterized in that the training loss function for the open-vocabulary 3D object detection model is a cross-modal contrastive loss.
8. The open-vocabulary 3D object detection method based on geometric perception and multimodal feature alignment according to claim 7, characterized in that the cross-modal contrastive loss is expressed as: $\mathcal{L}_{cross} = \mathcal{L}\left(f^{fuse}, f^{text}\right)$; $\mathcal{L}(z, z') = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{\lvert P(i) \rvert} \sum_{j \in P(i)} \log \frac{\exp\left(z_i \cdot z'_j / \tau\right)}{\sum_{k=1}^{N} \exp\left(z_i \cdot z'_k / \tau\right)}$; where $f^{fuse}$ denotes the fused features, $f^{text}$ denotes the text features, $\mathcal{L}$ is the contrastive loss, $N$ is the batch size, $z_i$ and $z'_j$ are the feature representations of the sample pairs to be aligned, $\lvert P(i) \rvert$ is the number of positive samples, and $\tau$ is a temperature hyperparameter.