A 3D visual task processing method and system based on multi-modal interaction enhancement

By combining a voxel-keypoint interaction method with a text-guided sampling mechanism, the problems of fine-grained information loss and low computational efficiency in 3D visual localization are solved, achieving high-precision and efficient 3D visual task processing. It is applicable to tasks such as 3D visual localization, 3D referential segmentation, 3D scene understanding, and autonomous driving.

CN122199786A · Pending · Publication Date: 2026-06-12 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHEJIANG UNIV
Filing Date
2026-01-22
Publication Date
2026-06-12


Abstract

This application discloses a 3D visual task processing method and system based on multimodal interaction enhancement. First, 3D point cloud data and natural language text are obtained. Second, the 3D point cloud data is voxelized, and multi-scale voxel features are aggregated onto a set of initial keypoints by a voxel-point feature extraction network. Then, a cross-modal correlation calculation is carried out between the semantic features of the natural language text and the initial keypoint features to generate sampling weights, and differentiable sampling and feature recombination are performed on the initial keypoint features to obtain target-aware keypoint features. Finally, the target-aware keypoint features and the features of the natural language text are deeply fused to obtain multimodal fusion features, which are processed by the prediction head corresponding to the task. The application uses sparse voxel convolution to extract features, avoiding the information loss caused by excessive downsampling in traditional point-based methods and significantly improving the perception of small objects and complex scenes.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and artificial intelligence, and in particular to a 3D vision task processing method and system based on multimodal interaction enhancement.

Background Technology

[0002] 3D visual grounding aims to locate specific target objects in complex 3D scenes based on natural language descriptions. It is a key task connecting 3D visual perception and natural language understanding, and has broad application prospects in fields such as embodied intelligent robots, autonomous driving, and augmented reality. For example, a home service robot needs to understand the instruction "go get the red cup on the left side of the sofa" and accurately identify the object's location; an autonomous vehicle needs to understand instructions from passengers or a remote command center, such as "park in the temporary parking space under the blue billboard on the right side of the intersection ahead" or "watch out for the red truck changing lanes on the left front," and accurately identify the 3D location of the corresponding vehicle or area to support planning and control decisions.

[0003] Existing 3D visual localization methods face two main challenges in scene feature extraction:

[0004] Point-based methods: Methods that use a backbone network such as PointNet++ typically require drastic downsampling of the original point cloud (e.g., from 50,000 points down to 1024 or 2048 keypoints) to meet the computational demands of the subsequent complex multimodal interaction modules. This aggressive downsampling leads to a significant loss of fine-grained geometric information, making it difficult for the model to locate small objects or distinguish between objects with similar appearances.

[0005] Voxel-based methods: While sparse voxel convolutions can preserve high-resolution scene details well, standard voxel networks generate a huge number of voxels (tens of thousands or even hundreds of thousands) during the decoding stage. Directly using such high-resolution voxel features for Transformer-based deep attention interactions with text is computationally infeasible, leading to memory overflow and excessively high inference latency.
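To make the scale of this bottleneck concrete, the following back-of-the-envelope sketch compares the memory needed for one dense attention score matrix over voxel tokens with that over keypoint tokens. It assumes dense self-attention among the visual tokens, a single head, a single layer, and 32-bit floats; these are illustrative assumptions and not figures stated in the application.

```python
def attention_matrix_bytes(num_tokens: int, bytes_per_value: int = 4) -> int:
    """Memory for one dense num_tokens x num_tokens attention score matrix."""
    return num_tokens * num_tokens * bytes_per_value

# Token counts taken from the orders of magnitude mentioned in the text.
voxel_tokens = 100_000    # "hundreds of thousands" of voxels, lower end
keypoint_tokens = 1_024   # typical keypoint count after downsampling

voxel_bytes = attention_matrix_bytes(voxel_tokens)
keypoint_bytes = attention_matrix_bytes(keypoint_tokens)

print(f"voxel tokens:    {voxel_bytes / 1e9:.1f} GB per attention matrix")
print(f"keypoint tokens: {keypoint_bytes / 1e6:.1f} MB per attention matrix")
```

Under these assumptions the voxel case needs roughly four orders of magnitude more memory per attention matrix than the keypoint case, before counting multiple heads and layers.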

[0006] Therefore, how to balance the high-resolution detail preservation capability of voxel representation with the computational efficiency of point representation, and how to effectively integrate text semantics to filter out irrelevant background interference so as to accurately locate target objects in visual tasks, is a technical problem that urgently needs to be solved in the field of 3D visual localization.

Summary of the Invention

[0007] To address the problems of fine-grained information loss and low computational efficiency in the aforementioned background technologies, this invention proposes a 3D vision task processing method and system based on multimodal interaction enhancement. This invention combines the fine-grained perception capabilities of voxel convolution with the efficient interaction capabilities of compact keypoint features, and introduces a text-guided focusing sampling mechanism, thereby achieving high-precision and high-efficiency 3D multimodal task processing.

[0008] The present invention provides a 3D vision task processing method based on multimodal interaction enhancement, comprising:

[0009] Acquire 3D point cloud data and natural language text;

[0010] The 3D point cloud data is voxelized, and the multi-scale voxel features are aggregated into a set of initial key points through a voxel-point feature extraction network to obtain initial key point features containing scene geometric information.

[0011] Cross-modal correlation calculation is performed using the semantic features of the natural language text and the initial keypoint features to generate sampling weights that reflect the degree of correlation between each keypoint and the text;

[0012] Based on the sampling weights, perform soft sampling and feature recombination on the initial keypoint features to obtain target-aware keypoint features focused on text-related regions;

[0013] The target perception key point features are deeply fused with the features of the natural language text to obtain multimodal fusion features;

[0014] Depending on the task type, the multimodal fusion features are processed using the corresponding prediction head to fulfill the corresponding task requirements.

[0015] The present invention provides a 3D vision task processing system based on multimodal interaction enhancement, comprising:

[0016] The data acquisition module is used to acquire 3D point cloud data and natural language text;

[0017] The voxel-point feature extraction module is used to voxelize the 3D point cloud data and aggregate multi-scale voxel features into a set of initial key points through the voxel-point feature extraction network to obtain initial key point features.

[0018] The cross-modal interaction guidance module is used to perform cross-modal correlation calculation using the semantic features of the natural language text and the initial key point features, generate sampling weights, perform soft sampling and feature reorganization on the initial key point features, and output target-aware key point features.

[0019] The multimodal fusion module is used to deeply fuse the features of the target perception key points with the features of the natural language text to obtain multimodal fusion features;

[0020] The task execution module is used to call the corresponding prediction head according to the task type, process the multimodal fusion features, and output 3D bounding boxes, 3D segmentation masks or answer text to complete 3D visual localization, 3D referential segmentation or 3D question answering tasks.

[0021] The beneficial effects of this invention are:

[0022] High-fidelity feature representation: This invention proposes a visual localization architecture based on voxel-keypoint interaction, which uses sparse voxel convolution to extract features, avoiding the information loss caused by excessive downsampling in traditional point-based methods and significantly improving the perception of small objects and complex scenes.

[0023] Efficient multimodal interaction: By distilling voxel features into a compact keypoint representation, the bottleneck of excessive computation in the multimodal interaction stage of voxel-based methods is overcome, achieving a balance between accuracy and efficiency.

[0024] Task-aware feature focusing: By introducing a text-guided keypoint sampling module, the limitations of traditional spatial uniform sampling are overcome. The model can actively "focus" on relevant regions based on language instructions, effectively suppressing background noise and interference, and significantly improving the accuracy of localization and segmentation.

[0025] High versatility: This architecture provides a general form of 3D vision-language feature representation, which is not only applicable to 3D visual localization, 3D referential segmentation, 3D scene understanding and 3D question answering, but can also empower downstream tasks such as autonomous driving, embodied intelligence, human-computer interaction and augmented reality, and has a wide range of application value.

Attached Figure Description

[0026] Figure 1 compares the overall flowchart of the method of this invention with existing methods;

[0027] Figure 2 is a schematic diagram of the structure of the voxel-keypoint interaction backbone network of this invention;

[0028] Figure 3 is a schematic diagram of the text-guided keypoint soft sampling module of this invention;

[0029] Figure 4 shows the results of the present invention on 3D visual localization tasks, together with a schematic diagram of the distribution of the sampled keypoints;

[0030] Figure 5 is a schematic diagram showing the comparative test results on representative scene examples selected in the embodiments of the present invention.

Detailed Implementation

[0031] The present invention will be further described below with reference to the accompanying drawings and specific embodiments.

[0032] This application provides a 3D visual localization method based on text-guided voxel-keypoint interaction enhancement, comprising the following steps:

[0033] Step S1: Obtain the 3D point cloud data to be processed and the text query statement describing the target object;

[0034] Step S2: Perform voxelization on the 3D point cloud data, and use a voxel-point feature extraction network to encode the scene, aggregating sparse voxel features onto a set of initial key points to obtain initial key point features containing scene geometric information.

[0035] Step S3: Execute the text-guided multimodal interaction and sampling steps, and use the semantic features of the text query statement to perform cross-modal correlation calculation on the initial key point features to generate sampling weights that reflect the degree of correlation between key points and text descriptions.

[0036] Step S4: Based on the sampling weights, perform differentiable soft sampling and feature recombination on the initial key point features to generate target perception key point features distributed in the text-related regions, so as to filter background noise in the scene;

[0037] Step S5: Deeply fuse the target perception key point features with the features of the text query statement to obtain multimodal fusion key point features;

[0038] Step S6: Based on the multimodal fusion features, the 3D bounding box of the target object is regressed by the localization prediction head to complete the 3D visual localization.

[0039] Preferably, the voxel-point feature extraction network in step S2 includes a sparse voxel convolution branch and a voxel set abstraction module;

[0040] The sparse voxel convolution branch is used to construct a multi-scale voxel feature pyramid to preserve the fine-grained geometry of the scene.

[0041] The voxel set abstraction module is used to aggregate neighborhood voxel features from different levels of the voxel feature pyramid, using the initial keypoint as the anchor point, and transforms the high-resolution voxel representation into a compact keypoint representation, which is beneficial for efficient interaction with text features.

[0042] Preferably, step S3 specifically involves:

[0043] First, a cross-attention mechanism computes the attention map between the text query features and the initial keypoint features to obtain text-enhanced visual features.

[0044] Then, the text-enhanced visual features are processed by a self-attention layer, and the unnormalized importance score of each initial keypoint is predicted through a fully connected layer.

[0045] Preferably, step S4 specifically involves:

[0046] A reparameterization mechanism is introduced to transform the non-normalized importance score into a differentiable soft sampling weight;

[0047] The initial keypoint features are weighted and aggregated using the soft sampling weights to output a set of keypoint features that mainly contain semantic information of the target object and have target perception capabilities.

[0048] Based on the above concept, Example 1 is given: 3D Visual Grounding (3D VG) method.

[0049] As shown in Figure 1 and Figure 2, this embodiment provides a visual localization method, the specific steps of which are as follows:

[0050] Step 1: Data Preprocessing and Voxel Feature Extraction

[0051] The input is point cloud data with N points (for RGB-D cameras, each point includes XYZ coordinates and RGB color; for LiDAR, each point includes XYZ coordinates and reflection intensity), together with a free-form text description (such as "find the white refrigerator on the left side of the table").

[0052] The point cloud is divided into a voxel grid and input into a sparse voxel convolutional network. The network consists of multiple sparse convolutional blocks, each of which downsamples through convolutions with a stride of 2, generating a voxel feature pyramid at different resolutions. At the same time, the deep voxel features are compressed along the Z-axis to form bird's-eye view (BEV) features, which serve as supplementary global context. For the text description, a pre-trained language model such as RoBERTa extracts the text features.
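The data flow of this step can be illustrated with a deliberately simplified, dependency-free sketch. It is not a sparse convolution implementation: the hypothetical `downsample` function uses max pooling over coarser cells as a stand-in for a learned stride-2 sparse convolution, and `to_bev` uses a max over Z as a stand-in for the Z-axis compression. All function names and the toy data are assumptions made for illustration.

```python
from collections import defaultdict

def voxelize(points, voxel_size):
    """Group (x, y, z, *features) points into sparse voxels; store the mean feature."""
    buckets = defaultdict(list)
    for x, y, z, *feat in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets[key].append(feat)
    return {k: [sum(c) / len(fs) for c in zip(*fs)] for k, fs in buckets.items()}

def downsample(voxels, stride=2):
    """Stand-in for a stride-2 sparse convolution: max-pool voxels sharing a coarser cell."""
    buckets = defaultdict(list)
    for (ix, iy, iz), feat in voxels.items():
        buckets[(ix // stride, iy // stride, iz // stride)].append(feat)
    return {k: [max(c) for c in zip(*fs)] for k, fs in buckets.items()}

def to_bev(voxels):
    """Stand-in for Z-axis compression: max over all voxels sharing the same (ix, iy)."""
    buckets = defaultdict(list)
    for (ix, iy, _), feat in voxels.items():
        buckets[(ix, iy)].append(feat)
    return {k: [max(c) for c in zip(*fs)] for k, fs in buckets.items()}

# Toy point cloud: (x, y, z, one feature channel). Only non-empty voxels are stored.
points = [
    (0.1, 0.1, 0.1, 1.0),
    (0.2, 0.1, 0.1, 3.0),
    (0.6, 0.1, 0.1, 5.0),
    (1.7, 1.2, 0.9, 7.0),
]
pyramid = [voxelize(points, voxel_size=0.5)]
for _ in range(2):                      # two stride-2 stages -> three pyramid levels
    pyramid.append(downsample(pyramid[-1]))
bev = to_bev(pyramid[-1])               # BEV features from the deepest level

print([len(level) for level in pyramid], bev)
```

Each stage halves the grid resolution, so the number of non-empty voxels shrinks from level to level while the finest level keeps the original detail.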

[0053] Step 2: Feature aggregation from voxels to keypoints

[0054] The voxel feature pyramid is enormous. In order to enable it to efficiently complete multimodal interaction with text features, this embodiment introduces a voxel-keypoint aggregation mechanism, which aggregates the massive voxel features into highly compact keypoint representations to bridge scene features with subsequent point-based multimodal interaction modules.

[0055] In a preferred example:

[0056] First, a set of initial keypoints (e.g., 1024) is selected from the original point cloud using farthest point sampling (FPS). Then, using the voxel set abstraction module, a query radius is set for each keypoint at each level of the voxel pyramid, and the features of the non-empty voxels within that radius are aggregated.

[0057] The aggregation formula is as follows:

[0058]

[0059] where the symbols in the formula denote the coordinates of the keypoint and the voxel features, respectively.

[0060] The aggregated features are passed through a multilayer perceptron (MLP) and a max pooling operation to form the multi-scale geometric features of the keypoint.
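The aggregation formula itself is not reproduced in this text. As a hedged sketch only, one common form of voxel set abstraction that matches the description (a radius query around each keypoint, followed by an MLP and max pooling) can be written as follows. All symbols are assumptions introduced here for illustration: $p_i$ is the coordinate of keypoint $i$, $v_j^{(l)}$ and $f_j^{(l)}$ are the center coordinate and feature of the $j$-th non-empty voxel at pyramid level $l$, and $r_l$ is the query radius at that level.

```latex
S_i^{(l)} = \left\{ \left[ f_j^{(l)} ;\; v_j^{(l)} - p_i \right] \;\middle|\; \left\| v_j^{(l)} - p_i \right\|_2 < r_l \right\},
\qquad
f_i^{(l)} = \max_{s \in S_i^{(l)}} \mathrm{MLP}(s)
```

The max is taken element-wise over the set, so $f_i^{(l)}$ has a fixed dimension regardless of how many voxels fall inside the radius.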

[0061] Finally, the features from each level are concatenated with the BEV features to form a detail-rich keypoint feature vector.

[0062] Compared to the massive voxel feature pyramid, these keypoint feature vectors are very compact and information-rich, enabling highly efficient interaction with text features in the subsequent stages.
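The voxel-to-keypoint aggregation of Step 2 can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the network described in the application: the learned MLP is omitted (features are max-pooled directly), the BEV concatenation is left out, and the function names, radii, and toy data are hypothetical.

```python
import math

def farthest_point_sampling(points, k):
    """Greedy FPS: start at index 0, then repeatedly pick the point farthest from the chosen set."""
    chosen = [0]
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen

def aggregate(keypoint, voxel_centers, voxel_feats, radius):
    """Max-pool the features of the non-empty voxels within `radius` of the keypoint."""
    near = [f for c, f in zip(voxel_centers, voxel_feats) if math.dist(c, keypoint) < radius]
    if not near:                                   # no voxel in range: zero feature
        return [0.0] * len(voxel_feats[0])
    return [max(col) for col in zip(*near)]

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
keypoint_idx = farthest_point_sampling(points, k=3)

# Two pyramid levels: (voxel centers, voxel features, query radius). Toy values.
levels = [
    ([(0.5, 0.0, 0.0), (9.5, 0.0, 0.0), (10.5, 0.0, 0.0)], [[1.0, 2.0], [3.0, 0.0], [0.0, 4.0]], 1.0),
    ([(1.0, 0.0, 0.0), (9.0, 0.0, 0.0)], [[5.0], [6.0]], 4.0),
]

# Concatenate the per-level aggregated features for every keypoint.
keypoint_feats = [
    [v for centers, feats, r in levels for v in aggregate(points[i], centers, feats, r)]
    for i in keypoint_idx
]
print(keypoint_idx, keypoint_feats)
```

The output has one fixed-length vector per keypoint, however many voxels each level contains, which is what makes the later attention with text features tractable.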

[0063] Step 3: Keypoint Sampling for Text Guidance

[0064] To filter out background irrelevant to the text description, this embodiment designs a text-guided keypoint sampling module, which uses the text features to guide the selection of keypoints; its structure is shown in Figure 3. This module aims to redistribute the keypoints through a soft sampling mechanism, allowing the keypoints and their features to cluster around the areas and objects described in the text, thereby effectively improving the model's target awareness and visual localization performance. The specific steps are as follows:

[0065] First, correlation calculation is performed: cross-attention between the keypoint features and the text features yields text-enhanced visual features. Second, weight prediction is performed: based on the text-enhanced visual features, a self-attention layer and a fully connected layer predict the relevance weight of each keypoint;

[0066] Next, differentiable keypoint soft sampling is performed: the relevance weights are transformed into soft sampling weights using the Gumbel-Softmax technique:

[0067]

[0068] where the formula involves the Gumbel noise and the temperature coefficient;
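The formula itself is not reproduced in this text. As a hedged sketch, the standard Gumbel-Softmax relaxation applied to per-keypoint scores takes the form below. The symbols are assumptions introduced here for illustration: $w_i$ is the unnormalized relevance weight of keypoint $i$, $g_i$ is Gumbel noise, $\tau$ is the temperature coefficient, and $K$ is the number of initial keypoints.

```latex
\tilde{w}_i = \frac{\exp\!\left( (w_i + g_i) / \tau \right)}{\sum_{j=1}^{K} \exp\!\left( (w_j + g_j) / \tau \right)},
\qquad
g_i = -\log\!\left( -\log u_i \right), \quad u_i \sim \mathrm{Uniform}(0, 1)
```

A smaller $\tau$ pushes the weights toward a one-hot selection, while a larger $\tau$ makes them more uniform; in both cases the expression stays differentiable with respect to $w_i$.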

[0069] Finally, keypoint sampling is completed based on the soft sampling weights: the original keypoint features are weighted and summed to output 256 target-aware keypoint features.
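The soft sampling step can be sketched in plain Python as follows. This is an illustration under assumptions, not the module described in the application: it assumes that every output keypoint draws fresh Gumbel noise over the same per-keypoint scores, and the function names, toy features, and scores are hypothetical.

```python
import math
import random

def gumbel_softmax(scores, tau, rng):
    """Turn unnormalized scores into soft sampling weights that sum to 1."""
    noisy = []
    for s in scores:
        u = max(rng.random(), 1e-12)           # guard against log(0)
        g = -math.log(-math.log(u))            # Gumbel(0, 1) noise
        noisy.append((s + g) / tau)
    m = max(noisy)                             # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def soft_sample(features, scores, num_out, tau=1.0, seed=0):
    """Produce num_out features, each a weighted sum of all input keypoint features."""
    rng = random.Random(seed)
    dim = len(features[0])
    sampled = []
    for _ in range(num_out):
        w = gumbel_softmax(scores, tau, rng)   # fresh noise per output keypoint (assumption)
        sampled.append([sum(wi * f[d] for wi, f in zip(w, features)) for d in range(dim)])
    return sampled

# Four initial keypoints with 2-D features; the first one is strongly text-relevant.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [5.0, 5.0]]
scores = [50.0, 0.0, 0.0, 0.0]

weights = gumbel_softmax(scores, tau=0.5, rng=random.Random(0))
sampled = soft_sample(features, scores, num_out=2, tau=0.5)
print(weights, sampled)
```

Because each output is a convex combination of the input features, a keypoint with a dominant score pulls the sampled features toward its own feature, which mirrors the "focusing" behaviour described in the text while keeping the operation differentiable.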

[0070] As shown in Figure 4, the green "seed points" in the fourth column are the sampling points obtained through FPS, which are scattered throughout the scene. After processing by this module, as shown by the red "text-guided key points" in the fifth column, the spatial distribution of the keypoints changes from uniform to tightly focused around the object described in the text (such as "the chair next to the sofa"). Clustering the keypoint features near the ground-truth target greatly improves the target awareness of the keypoints and thus enhances the visual localization performance of the model.

[0071] Step 4: Target Regression and Prediction

[0072] The target-aware keypoint features are input into a multimodal fusion decoder (such as a Transformer decoder) to predict the center coordinates, size, and orientation of the target object, and the final predicted 3D bounding box of the target is output. The results are shown in Figure 4 and Figure 5, where the target bounding box (green) appears in the results labeled "Invention".

[0073] This application also provides a 3D referring expression segmentation (3D RES) method based on multimodal interaction enhancement, including the following steps:

[0074] Step S1: Obtain the 3D point cloud data to be processed and the text query statement describing the target object;

[0075] Step S2: Perform voxelization on the 3D point cloud data, and use a voxel-point feature extraction network to encode the scene, aggregating sparse voxel features onto a set of initial key points to obtain initial key point features containing scene geometric information.

[0076] Step S3: Execute the text-guided multimodal interaction and sampling steps, and use the semantic features of the text query statement to perform cross-modal correlation calculation on the initial key point features to generate sampling weights that reflect the degree of correlation between key points and text descriptions.

[0077] Step S4: Based on the sampling weights, perform differentiable soft sampling and feature recombination on the initial key point features to generate target perception key point features distributed in the text-related regions, so as to filter background noise in the scene;

[0078] Step S5: Deeply fuse the target perception key point features with the features of the text query statement to obtain multimodal fusion key point features;

[0079] Step S6': Use the segmentation prediction head to process the multimodal fusion features, predict the probability mask of each target perception key point belonging to the target object, and generate the 3D segmentation result of the target object.

[0080] Based on the above concept, Example 2 is given: 3D Referring Expression Segmentation (3D RES).

[0081] This embodiment adds a segmentation prediction branch to Example 1. After obtaining the multimodal fusion features, it not only regresses the bounding box, but also uses a segmentation head to predict the probability mask of each keypoint belonging to the target object, generating a 3D segmentation result of the target object.

[0082] Because this embodiment preserves high-resolution feature extraction at the voxel level, it handles edge details well in segmentation tasks and generates fine-grained target masks. The results are shown in Figure 4 and Figure 5, where the point cloud segmentation mask (red) appears in the results labeled "Invention". Furthermore, this technology is also applicable to autonomous driving tasks such as traffic element localization or unstructured road area segmentation under specific commands in outdoor scenes. For example, based on the command "drive in the leftmost left-turn lane," the model needs to segment the point cloud of the corresponding lane area to assist in high-precision map construction or vehicle navigation.

[0083] This application also provides a 3D question answering (3D QA) method based on multimodal interaction enhancement, including the following steps:

[0084] Step S1: Obtain the 3D point cloud data to be processed and the text query statement describing the target object; wherein the text query statement is a question text about the 3D scene;

[0085] Step S2: Perform voxelization on the 3D point cloud data, and use a voxel-point feature extraction network to encode the scene, aggregating sparse voxel features onto a set of initial key points to obtain initial key point features containing scene geometric information.

[0086] Step S3: Execute the text-guided multimodal interaction and sampling steps, and use the semantic features of the text query statement to perform cross-modal correlation calculation on the initial key point features to generate sampling weights that reflect the degree of correlation between key points and text descriptions.

[0087] Step S4: Based on the sampling weights, perform differentiable soft sampling and feature recombination on the initial key point features to generate target perception key point features distributed in the text-related regions, so as to filter background noise in the scene;

[0088] Step S5'': The target perception key point features are used as the visual context after text filtering, and input with the features of the question text into the question answering decoder for fusion reasoning;

[0089] Step S6'': Output the answer to the question text through the classification head or generation head.

[0090] Based on the above concept, Example 3 is given: 3D Visual Question Answering (3D VQA).

[0091] This embodiment demonstrates the application of the system in visual question answering tasks.

[0092] Input a 3D scene point cloud and question text (e.g., "How many black chairs are in the room?"). The processing procedure is as follows:

[0093] Scene features are extracted using sparse voxel convolution; a text-guided keypoint sampling module is used to focus on relevant object instances in the scene based on the question text ("black chair");

[0094] The focused key point features are input into the QA inference module (such as a Transformer-based decoder).

[0095] The output layer performs classification to obtain an answer (such as "3") or generates a natural language response.

[0096] This application also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor executes the program to implement the steps of a 3D vision task processing method based on multimodal interaction enhancement.

[0097] This application also provides a computer-readable storage medium storing a computer program thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of a 3D vision task processing method based on multimodal interaction enhancement.

[0098] In summary, the core advantage of this application lies in the fact that, based on high-resolution voxel features, the text-guided keypoint sampling module can help QA models quickly locate visual cues related to the question in cluttered 3D scenes, thereby improving the accuracy of the answer.

[0099] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A 3D vision task processing method based on multimodal interaction enhancement, characterized in that it comprises: acquiring 3D point cloud data and natural language text; voxelizing the 3D point cloud data, and aggregating multi-scale voxel features onto a set of initial keypoints through a voxel-point feature extraction network to obtain initial keypoint features containing scene geometric information; performing cross-modal correlation calculation using the semantic features of the natural language text and the initial keypoint features to generate sampling weights that reflect the degree of correlation between each keypoint and the text; based on the sampling weights, performing soft sampling and feature recombination on the initial keypoint features to obtain target-aware keypoint features focused on text-related regions; deeply fusing the target-aware keypoint features with the features of the natural language text to obtain multimodal fusion features; and, depending on the task type, processing the multimodal fusion features using the corresponding prediction head to fulfill the corresponding task requirements.

2. The method according to claim 1, characterized in that the voxel-point feature extraction network includes a sparse voxel convolution branch and a voxel set abstraction module; the sparse voxel convolution branch is used to construct a multi-scale voxel feature pyramid; and the voxel set abstraction module uses the initial keypoints as anchor points to aggregate neighborhood voxel features across levels, transforming the high-resolution voxel representation into a compact keypoint representation.

3. The method according to claim 2, characterized in that the cross-modal correlation calculation adopts a cross-attention mechanism, which first obtains text-enhanced visual features, and then predicts the unnormalized importance score of each keypoint through a self-attention layer and a fully connected layer.

4. The method according to claim 3, characterized in that the soft sampling transforms the unnormalized importance scores into soft sampling weights through a reparameterization mechanism, and then weights and aggregates the initial keypoint features to generate the target-aware keypoint features.

5. The method according to claim 1, characterized in that the deep fusion employs concatenation, addition, or cross-attention to map the target-aware keypoint features and the text features into the same semantic space, thereby obtaining the multimodal fusion features.

6. The method according to any one of claims 1 to 5, characterized in that, when the task type is 3D visual localization, the prediction head is a 3D bounding box regression head, which outputs the center, size and orientation of the target object.

7. The method according to any one of claims 1 to 5, characterized in that, when the task type is 3D referential segmentation, the prediction head is a segmentation head that outputs a probability mask for each keypoint belonging to the target object.

8. The method according to any one of claims 1 to 5, characterized in that, when the task type is 3D question answering, the natural language text is the question text, the prediction head is a classification or generation head, and the corresponding answer is output.

9. A 3D vision task processing system based on multimodal interaction enhancement, characterized in that it comprises: a data acquisition module, used to acquire 3D point cloud data and natural language text; a voxel-point feature extraction module, used to voxelize the 3D point cloud data and aggregate multi-scale voxel features onto a set of initial keypoints through the voxel-point feature extraction network to obtain initial keypoint features; a cross-modal interaction guidance module, used to perform cross-modal correlation calculation using the semantic features of the natural language text and the initial keypoint features, generate sampling weights, perform soft sampling and feature recombination on the initial keypoint features, and output target-aware keypoint features; a multimodal fusion module, used to deeply fuse the target-aware keypoint features with the features of the natural language text to obtain multimodal fusion features; and a task execution module, used to call the corresponding prediction head according to the task type, process the multimodal fusion features, and output 3D bounding boxes, 3D segmentation masks or answer text to complete 3D visual localization, 3D referential segmentation or 3D question answering tasks.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 8.