A semantically driven single-image 3D multi-object detection method for ATS
By employing an ATS-oriented, semantically driven approach that combines natural language processing with a multi-head self-attention mechanism, multi-object 3D detection is achieved from a single image. This addresses the high cost and insufficient semantic understanding of existing 3D vision detection technologies, improves detection accuracy and efficiency, and is applicable to fields such as intelligent transportation and autonomous driving.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHANGAN UNIV
- Filing Date
- 2025-01-24
- Publication Date
- 2026-06-23
AI Technical Summary
In existing technologies, 2D visual detection cannot capture the depth and spatial relationships of 3D objects, high-cost sensors such as LiDAR are difficult to deploy widely, and existing methods lack semantic understanding in multi-object scenes, which affects the reliability and applicability of detection.
We employ an ATS-based semantic-driven approach, extracting the 3D bounding boxes and 2D projections of objects through natural language processing and a 3D detector. By combining a multi-head self-attention mechanism and a cross-modal semantic alignment strategy, we achieve accurate detection and localization of multiple targets, reducing our reliance on expensive sensors.
It improves the accuracy and efficiency of 3D object detection, reduces computational complexity, and enhances adaptability in complex traffic scenarios, making it suitable for fields such as intelligent transportation, autonomous driving, and augmented reality.
Abstract
Description
Technical Field
[0001] This invention relates to the field of video detection technology, and in particular to a semantically driven 3D multi-object detection method for ATS based on a single image.
Background Technology
[0002] With the rapid development of artificial intelligence (AI) technology, enabling machines to understand and connect natural language and visual information has become a key challenge in human-computer interaction and scene understanding. This technology has broad application potential in multiple fields, driving the intelligent development of various industries. In the field of autonomous driving, vehicles can identify and locate target objects on the road through natural language commands, optimize driving strategies, and improve driving safety and efficiency. Drivers can use voice commands to instruct vehicles to identify obstacles, predict pedestrian behavior, and plan routes, thereby achieving a higher level of autonomous driving. In the field of intelligent robots, robots can perform complex tasks based on natural language commands, understand diverse contexts, and react accordingly. This technology can enhance the application scenarios of robots, such as home services and industrial manufacturing, improving work efficiency and flexibility. Furthermore, in intelligent traffic management, real-time monitoring and analysis technologies can help management departments optimize traffic flow and reduce congestion and accidents. Through cross-modal data recognition, traffic conditions can be rapidly transmitted, improving decision-making efficiency and optimizing resource allocation.
[0003] Existing detection technologies are mainly divided into two major directions: 2D vision detection and 3D vision detection. However, existing technologies have many drawbacks, including:
[0004] (1) In 2D vision detection, datasets mainly focus on 2D images. However, real-world scenes are inherently 3D, and 2D information alone is insufficient to capture object depth, spatial relationships, and structural geometry.
[0005] (2) 3D vision detection relies on high-cost sensors such as LiDAR or millimeter-wave radar. Although these sensors can provide high-precision 3D point cloud data, they are expensive and require substantial computing resources. In resource-constrained environments (such as consumer devices or small robotic systems), these methods are difficult to deploy in practice, which limits their wider applicability.
[0006] (3) Mono3DVG has explored language-guided monocular 3D detection, but its scope is limited to single object detection. However, in real-world scenarios, there are usually multiple objects, and the semantic relationships between these objects are crucial for comprehensive scene understanding. Insufficient semantic understanding of multiple objects may lead to insufficient scene understanding capabilities, thereby affecting the reliability of practical applications.
[0007] This invention addresses the problem that LiDAR-based 3D visual positioning equipment is expensive and limited in application by proposing a text-guided, multimodal, multi-target 3D perception technology for ATS (autonomous transportation systems). Combined with natural language descriptions, this technology enables accurate detection and localization of multiple ATS vehicles within a single image. The method extracts the 3D bounding boxes and 2D projections of the ATS vehicles using an advanced 3D detector, and extracts linguistic features using natural language processing. By fusing 2D image information with 3D geometric information, the method achieves a complete target representation. Furthermore, this invention employs a selective matching module that uses a multi-head self-attention mechanism and a cross-modal semantic alignment strategy to correlate linguistic and visual features, thereby improving detection accuracy and efficiency. This method eliminates reliance on expensive sensors, lowering the technical threshold and enhancing adaptability to complex traffic scenarios and vehicle status perception. It has broad application prospects, particularly in intelligent transportation, autonomous driving, and augmented reality.
[0008] Compared with existing technologies, this invention demonstrates superior multimodal fusion and cross-modal semantic understanding capabilities. It is an efficient, low-cost, and easily deployable 3D target detection solution that can provide strong support for the perception and decision-making of autonomous transportation systems.
Summary of the Invention
[0009] In view of this, the present invention provides a semantically driven 3D multi-object detection method for ATS based on a single image.
[0010] To solve the above-mentioned technical problems, the present invention adopts the following technical solution:
[0011] A semantically driven 3D multi-object detection method for ATS-based single-image processing includes the following steps:
[0012] Step 1: Process the input RGB image to extract the 3D bounding boxes of each object in the image; generate all potential 2D projections of the 3D objects in the scene;
[0013] Step 2: Using natural language processing techniques, extract keywords, phrases, and their semantic information from the description to form feature information $P_t$ representing the language description;
[0014] Step 3: Merge the 2D image information and 3D geometric information of the object to obtain a complete object representation $f_a$;
[0015] Step 4: Associate the linguistic features extracted from the language description with the detected 3D objects to capture the semantic correspondence between the text and the visual modality;
[0016] Step 5: Filter the targets based on the generated matching scores to obtain all targets that match the natural language description.
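As an illustration of the "2D projection" in Step 1, the following minimal sketch projects the eight corners of a 3D box onto the image plane and takes the enclosing 2D box. It assumes a pinhole camera and an axis-aligned box in camera coordinates; the function names and the intrinsics `fx`, `fy`, `u0`, `v0` are hypothetical, and the source does not specify which 3D detector or camera model is used.

```python
from itertools import product


def box3d_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box in camera coordinates."""
    cx, cy, cz = center
    w, h, l = size
    return [
        (cx + sx * w / 2, cy + sy * h / 2, cz + sz * l / 2)
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]


def project_to_2d_box(corners, fx, fy, u0, v0):
    """Project 3D corners with a pinhole model and return the enclosing 2D box."""
    us, vs = [], []
    for x, y, z in corners:
        if z <= 0:
            raise ValueError("corner lies behind the camera")
        us.append(fx * x / z + u0)
        vs.append(fy * y / z + v0)
    return min(us), min(vs), max(us), max(vs)


# Assumed toy values: a 2 m x 2 m x 4 m box centered 10 m in front of the camera.
corners = box3d_corners(center=(0.0, 0.0, 10.0), size=(2.0, 2.0, 4.0))
bbox_2d = project_to_2d_box(corners, fx=1000.0, fy=1000.0, u0=640.0, v0=360.0)
print(bbox_2d)  # (515.0, 235.0, 765.0, 485.0)
```

The resulting 2D box is what a later step could use to crop the object region from the RGB image.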
[0017] Preferably, step three includes the following steps:
[0018] Step 3.1: For each 3D object, use its corresponding 2D bounding box to crop the object region from the original RGB image and obtain the corresponding 2D image information;
[0019] Step 3.2: Using the 2D image information, extract the visual features $f_v$ of the 3D object through a pre-trained network; the size of $f_v$ is 768 × the number of patches in the image. The visual features $f_v$ are then processed by a multi-head self-attention mechanism to obtain refined features $f'_v$: attention weights are calculated between different parts of the image features to effectively capture the relationships between different regions within an object;
[0020] Step 3.3: For each 3D object, using its corresponding 3D geometric information, construct a text description through a designed fixed template, input it into a pre-trained language model, and encode it into a text embedding vector $f_t$;
[0021] Step 3.4: Guided by image features, relevant semantic information is extracted from 3D text features and fused with visual features to form a joint representation of the object's appearance and geometric attributes, thus obtaining the complete object representation $f_a$.
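The fixed-template description of Step 3.3 can be sketched as follows. The source does not disclose the template wording or the geometric fields, so the `Box3D` fields and the `TEMPLATE` string are purely hypothetical; the point is only that every object's geometry is rendered into the same sentence pattern before being passed to a language model.

```python
from dataclasses import dataclass


@dataclass
class Box3D:
    # Hypothetical container; the field names are illustrative, not from the source.
    category: str
    x: float
    y: float
    z: float
    length: float
    width: float
    height: float
    yaw_deg: float


# Assumed template wording; the actual template is not given in the source.
TEMPLATE = (
    "A {category} located at ({x:.1f}, {y:.1f}, {z:.1f}) m, "
    "{length:.1f} m long, {width:.1f} m wide, {height:.1f} m high, "
    "heading {yaw_deg:.0f} degrees."
)


def describe(box: Box3D) -> str:
    """Fill the fixed template with one object's 3D geometry."""
    return TEMPLATE.format(**vars(box))


text = describe(Box3D("car", 1.5, 0.8, 12.0, 4.2, 1.8, 1.5, 90.0))
print(text)
```

The returned string would then be encoded by a pre-trained language model to give the text embedding $f_t$; that encoding step is omitted here.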
[0022] Preferably, step 3.4 includes the following:
[0023] A dual-head attention mechanism is adopted. The refined image features $f'_v$ are treated as the query Q, and the 3D text features $f_t$ provide the key K and the value V. Cross-attention is calculated as shown in the following formula:
[0024]
[0025] where $Q \in \mathbb{R}^{L \times D}$ and $K, V \in \mathbb{R}^{1 \times D}$;
[0026] The query-key attention map $A_{tt}$ is calculated from the above Q and K and used to weight and aggregate the value V, yielding a visual- and 3D-text-aware query $Q'$, as shown in the following formula:
[0027]
[0028] where D is the length of the feature vector, i.e., the dimension of each query, key, or value vector, and $Q' \in \mathbb{R}^{L \times D}$.
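Formulas (1) and (2) are not reproduced in the text. One plausible reading, assuming standard scaled dot-product attention, is sketched below. The projection matrices $W_Q$, $W_K$, $W_V$ are assumptions that do not appear in the source; only the shapes of $Q$, $K$, $V$, $Q'$ and the role of $D$ are taken from the surrounding paragraphs.

```latex
% Assumed form of formula (1): the linear projections W_Q, W_K, W_V are hypothetical
Q = f'_v W_Q, \qquad K = f_t W_K, \qquad V = f_t W_V

% Assumed form of formula (2): scaled dot-product attention
A_{tt} = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D}}\right) \in \mathbb{R}^{L \times 1},
\qquad
Q' = A_{tt} V \in \mathbb{R}^{L \times D}
```

The source does not state the axis over which the softmax normalizes. Since $K$ has a single row, normalizing over the key axis would make every weight equal to 1, so normalizing over the $L$ query positions may be what is intended; this remains an assumption.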
[0029] Preferably, step four includes the following steps:
[0030] Step 4.1: Apply a bidirectional attention mechanism to achieve initial fusion between language and object features, allowing them to complement and enhance each other; the language features guide the model to focus on the visual aspects of the object related to the description, while the visual features of the object enrich the semantic information of the language description. The language features $P_t$ and the object features $f_a$ alternately serve as query, key, and value, as follows:
[0031] $\mathrm{O2T} = \mathrm{MHCA}(P_t, f_a, f_a), \quad \mathrm{T2O} = \mathrm{MHCA}(f_a, P_t, P_t)$ (3)
[0032] where $\mathrm{O2T} \in \mathbb{R}^{C \times D}$ and $\mathrm{T2O} \in \mathbb{R}^{L \times D}$;
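The role swap in formula (3) can be illustrated with a minimal sketch. It uses single-head scaled dot-product attention without learned projections as a simplified stand-in for MHCA, which is an assumption; the toy feature values are invented. It shows that the output always has one row per query, which is why O2T has C rows and T2O has L rows.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, key, value):
    """Single-head scaled dot-product attention: one output row per query row."""
    d = len(query[0])
    out = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in key]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, value)) for j in range(d)])
    return out


# Invented toy features: C = 3 text tokens, L = 2 object tokens, D = 4 channels.
p_t = [[0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
f_a = [[0.5, 0.5, 0.5, 0.5], [1.0, 0.0, 0.0, 1.0]]

o2t = cross_attention(p_t, f_a, f_a)  # text queries attend to objects: C x D
t2o = cross_attention(f_a, p_t, p_t)  # object queries attend to text: L x D
print(len(o2t), len(o2t[0]), len(t2o), len(t2o[0]))  # 3 4 2 4
```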
[0033] Step 4.2: The fused object and language features O2T and T2O are first concatenated to obtain $x_{input}$. The input features are fed into the module for adaptive fusion, effectively capturing the interactions between cross-modal features. On $x_{input}$, channel mixing is first performed with a 1×1 convolution, and the SiLU activation function is then applied to obtain $x_{mixed}$. The processed feature $x_{mixed}$ is then fed to the forward and backward modules, yielding $S_{forward}$ and $S_{backward}$ respectively. These modules work in parallel, capturing contextual information from different directions in the feature sequence.
[0034] Preferably, step 4.2 includes the following:
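[0035] $x_{input} = \mathrm{Concat}(\mathrm{MLP}(\mathrm{T2O}), \mathrm{MLP}(\mathrm{O2T}))$ (4)
[0036] $x_{mixed} = \mathrm{SiLU}(\mathrm{Conv}_{1 \times 1}(x_{input}))$ (5)
[0037] $S_{forward} = \mathrm{SSM}(x_{mixed})$ (6)
[0038] $S_{backward} = \mathrm{Flip}(\mathrm{SSM}(\mathrm{Flip}(x_{mixed})))$ (7).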
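The forward and backward scans of formulas (6) and (7) can be sketched with a toy state space recurrence. The source refers to a selective state space model but gives no parameterization, so the fixed scalar coefficients `a` and `b` below are a deliberate simplification and an assumption; a selective SSM would make them depend on the input.

```python
def ssm_scan(x, a=0.5, b=1.0):
    """Scalar linear state space scan: h[t] = a * h[t-1] + b * x[t]."""
    h, out = 0.0, []
    for value in x:
        h = a * h + b * value
        out.append(h)
    return out


def flip(seq):
    """Reverse the sequence order."""
    return seq[::-1]


x_mixed = [1.0, 2.0, 3.0, 4.0]              # invented 1-channel feature sequence
s_forward = ssm_scan(x_mixed)               # formula (6): context from the left
s_backward = flip(ssm_scan(flip(x_mixed)))  # formula (7): context from the right
print(s_forward)   # [1.0, 2.5, 4.25, 6.125]
print(s_backward)  # [3.25, 4.5, 5.0, 4.0]
```

Flipping before and after the scan keeps the backward output aligned position by position with the forward output, so the two can be combined per token.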
[0039] Preferably, step five includes the following:
[0040] Binary cross-entropy loss (BCELoss) is used to supervise classification and matching performance. BCELoss measures the difference between the predicted probability of the target class and its true label. The cosine similarity between each object feature and the linguistic feature is then calculated, and the similarity score is input into the contrastive loss function to improve the model's ability to distinguish between matching and non-matching objects.
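The two supervision signals described above can be sketched as follows. The binary cross-entropy and cosine similarity are standard definitions; the contrastive loss is not specified in the source, so the margin-based form below (and its `margin` value) is a hypothetical choice, similar in spirit to a cosine embedding loss.

```python
import math


def bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(sims, labels, margin=0.2):
    """Assumed margin form: pull matches toward 1, push non-matches below margin."""
    terms = [(1.0 - s) if y == 1 else max(0.0, s - margin) for s, y in zip(sims, labels)]
    return sum(terms) / len(terms)


text_feat = [1.0, 0.0]
object_feats = [[2.0, 0.0], [0.0, 3.0]]  # invented: first object matches, second does not
labels = [1, 0]

sims = [cosine_similarity(f, text_feat) for f in object_feats]
loss_contrastive = contrastive_loss(sims, labels)
loss_bce = bce_loss([0.5, 0.5], labels)
print(sims, loss_contrastive, loss_bce)
```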
[0041] The present invention achieves the following technical effects compared to the prior art:
[0042] (1) This invention improves the accuracy and efficiency of retrieval and identification by integrating a traffic large model intelligent traffic event identification method based on a selective state space model cross-modal retrieval technology. Compared with existing cross-modal retrieval methods, this invention not only achieves higher identification accuracy, but also significantly reduces computational complexity.
[0043] (2) The present invention uses a selective filtering module and a selective alignment module based on a state space model to optimize and update the representation learned by the model, which solves the problems of invalid semantic alignment and high computational complexity caused by global allocation in the traditional cross-modal attention mechanism;
[0044] (3) By performing implicit and selective fine-grained matching between images and text, the model can effectively filter out irrelevant information and enhance the perception and alignment of key features, thereby improving the accuracy and speed of cross-modal retrieval and greatly improving the ability to accurately identify and locate traffic incidents.
Detailed Implementation
[0045] The technical solutions in the embodiments of the present invention will be clearly and completely described below. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0046] This invention discloses a semantically driven 3D multi-object detection method for ATS (autonomous transportation system) based on a single image, comprising the following steps:
[0047] Step 1: Process the input RGB image using an advanced 3D detector to extract the 3D bounding boxes of each object in the image; the detector generates all potential 2D projections of these 3D objects in the scene;
[0048] Step 2: Transform the natural language description into machine-processable feature information; using natural language processing techniques, extract keywords, phrases, and their semantic information from the description to form feature information $P_t$ representing the language description;
[0049] Step 3: Merge the 2D image information and 3D geometric information of the object to obtain a complete object representation $f_a$;
[0050] The specific implementation method is as follows:
[0051] Step 3.1: For each 3D object, use its corresponding 2D bounding box to crop the object region from the original RGB image and obtain the corresponding 2D image information;
[0052] Step 3.2: Using the 2D image information, extract the visual features $f_v$ of the 3D object through a pre-trained network; the size of $f_v$ is 768 × the number of patches in the image. The visual features $f_v$ are then processed by a multi-head self-attention mechanism to obtain refined features $f'_v$: attention weights are calculated between different parts of the image features to effectively capture the relationships between different regions within an object;
[0053] By focusing on key regions while suppressing irrelevant information, the expressive power of features is enhanced, thereby improving the overall feature representation.
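The multi-head self-attention over patch features can be illustrated with a minimal sketch. It omits the learned query, key, value, and output projections that a real layer would have, which is a simplifying assumption, and uses invented toy features with 4 channels rather than the 768 stated in the source.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of equal-length vectors."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)])
    return out


def multi_head_self_attention(tokens, num_heads):
    """Split channels into heads, attend per head, then concatenate the heads."""
    d = len(tokens[0])
    if d % num_heads:
        raise ValueError("channel count must be divisible by num_heads")
    step = d // num_heads
    heads = [
        self_attention([t[h * step:(h + 1) * step] for t in tokens])
        for h in range(num_heads)
    ]
    return [sum((head[i] for head in heads), []) for i in range(len(tokens))]


# Toy stand-in for f_v: 3 patches with 4 channels each (invented values).
f_v = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
f_v_refined = multi_head_self_attention(f_v, num_heads=2)
print(len(f_v_refined), len(f_v_refined[0]))  # 3 4
```

Each head computes its own attention weights over the patches, so different heads can emphasize different region-to-region relationships within the same object crop.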
[0054] Step 3.3: For each 3D object, using its corresponding 3D geometric information, a text description is constructed through a designed fixed template. This text description is input into a pre-trained language model, which encodes it into a text embedding vector $f_t$;
[0055] Step 3.4: Guided by image features, relevant semantic information is selectively extracted from 3D text features and fused with visual features to form a joint representation of the object's appearance and geometric attributes, thus obtaining the complete object representation $f_a$;
[0056] The specific implementation method is as follows:
[0057] A dual-head attention mechanism is adopted. The refined image features $f'_v$ are treated as the query (Q), and the 3D text features $f_t$ provide the key (K) and the value (V). Cross-attention is calculated as follows:
[0058]
[0059] where $Q \in \mathbb{R}^{L \times D}$ and $K, V \in \mathbb{R}^{1 \times D}$.
[0060] The query-key attention map $A_{tt}$ is calculated from the above Q and K and used to weight and aggregate the value V, yielding a visual- and 3D-text-aware query $Q'$, as shown in the following formula:
[0061]
[0062] where D is the length of the feature vector, i.e., the dimension of each query, key, or value vector, and $Q' \in \mathbb{R}^{L \times D}$;
[0063] Step 4: Associate the linguistic features extracted from the language description with the detected 3D objects to capture the semantic correspondence between the text and the visual modality;
[0064] The specific implementation method is as follows:
[0065] Step 4.1: For a given description, selectively focus on relevant parts, applying a bidirectional attention mechanism to achieve initial fusion between language and object features, allowing them to complement and enhance each other; the language features guide the model to focus on the visual aspects of the object related to the description, while the object's visual features enrich the semantic information of the language description. The language features $P_t$ and the object features $f_a$ are used alternately as query, key, and value.
[0066] The process is as follows:
[0067] $\mathrm{O2T} = \mathrm{MHCA}(P_t, f_a, f_a), \quad \mathrm{T2O} = \mathrm{MHCA}(f_a, P_t, P_t)$ (3)
[0068] where $\mathrm{O2T} \in \mathbb{R}^{C \times D}$ and $\mathrm{T2O} \in \mathbb{R}^{L \times D}$;
[0069] Step 4.2: The fused object and language features (O2T and T2O) are first concatenated to obtain $x_{input}$. The input features are fed into the module for adaptive fusion, effectively capturing the interactions between cross-modal features. On $x_{input}$, channel mixing is first performed with a 1×1 convolution, and the SiLU activation function is then applied to obtain $x_{mixed}$. The processed feature $x_{mixed}$ is then fed to the forward and backward modules, yielding $S_{forward}$ and $S_{backward}$ respectively. These modules work in parallel, capturing contextual information from different directions in the feature sequence.
[0070] The process is as follows:
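[0071] $x_{input} = \mathrm{Concat}(\mathrm{MLP}(\mathrm{T2O}), \mathrm{MLP}(\mathrm{O2T}))$ (4)
[0072] $x_{mixed} = \mathrm{SiLU}(\mathrm{Conv}_{1 \times 1}(x_{input}))$ (5)
[0073] $S_{forward} = \mathrm{SSM}(x_{mixed})$ (6)
[0074] $S_{backward} = \mathrm{Flip}(\mathrm{SSM}(\mathrm{Flip}(x_{mixed})))$ (7)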
[0075] Step 5: Filter the targets based on the generated matching scores to obtain all targets that match the natural language description;
[0076] Binary cross-entropy loss (BCELoss) is used to supervise classification and matching performance;
[0077] BCELoss measures the difference between the predicted probability of the target class and its true label. The cosine similarity between each object feature and the linguistic feature is then calculated, and the similarity score is input into the contrastive loss function to improve the model's ability to distinguish between matching and non-matching objects.
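At inference time, the filtering of Step 5 amounts to keeping every candidate whose matching score is high enough. The sketch below assumes a simple fixed threshold; the value 0.5 and the function name are hypothetical, since the source does not state how the scores are thresholded.

```python
def select_targets(scores, threshold=0.5):
    """Return indices of every candidate whose matching score passes the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


matching_scores = [0.91, 0.12, 0.67, 0.40]  # invented per-object matching scores
selected = select_targets(matching_scores)
print(selected)  # [0, 2]
```

Because the result is a list rather than a single argmax, a description may match zero, one, or several objects, which is what distinguishes the multi-target setting from single-object grounding.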
[0078] Example 1
[0079] Experimental conditions:
[0080] The experiment was conducted using the PyTorch deep learning framework on a platform equipped with an NVIDIA GeForce RTX 3090 GPU.
[0081] The dataset is divided into training, validation, and test sets in a 3:1:1 ratio.
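A 3:1:1 split can be sketched as follows. Whether the split is random, and with what seed, is not stated in the source; the shuffling and the `seed` parameter here are assumptions.

```python
import random


def split_3_1_1(items, seed=0):
    """Shuffle and split items into train/val/test parts in a 3:1:1 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)  # assumed: random split with a fixed seed
    n_train = len(items) * 3 // 5
    n_val = len(items) // 5
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


train, val, test = split_3_1_1(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```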
[0082] The methods compared in the experiment are as follows:
[0083] One type is a feature fusion module based on the attention mechanism, referred to as AFM in the experiment. It is used to perform feature fusion between 3D point cloud and visual features. Based on the self-attention mechanism, the model can automatically learn to weight and fuse 3D point cloud data and 2D visual feature data. That is, attention scores are calculated between point cloud features and image features to determine which parts of point cloud and image information are most important for target localization and ignore irrelevant information.
[0084] One model is based on the Transformer architecture, referred to as AFM3DVG-Transformer in the experiment. It is used to achieve visually guided target localization on 3D point clouds. It solves the key problem of fusion between 3D point clouds and natural language description by self-attention and cross-modal relationship modeling. It establishes multi-scale and multi-relational associations between point clouds and language. The model finally generates a segmentation result or localization result of a target point cloud region.
[0085] One method is a 3D object detection method based on natural language description, referred to as ScanRefer in the experiment. It uses natural language description to accurately locate target objects in 3D point clouds, extracts point cloud features and language features separately, and effectively fuses point cloud and language features using a cross-modal attention mechanism. The attention mechanism is used to generate fused multimodal features, and the point cloud features of each object are enhanced, which can better represent the correlation with the language description. This method achieves accurate localization of target objects in complex 3D scenes.
[0086] One is a 3D vision detection model based on multi-view learning and the Transformer architecture, referred to as Multi-View Transformer in the experiment. It solves the target localization task in 3D point cloud scenes. It captures the global and local features of the scene through multi-view projection, and uses Transformer to realize the modeling of intramodal and intermodal relationships. It dynamically learns the complex semantic relationship between language description and point cloud features. Multi-layer Transformer comprehensively captures the spatial relationships and contextual information between objects in the point cloud.
[0087] The last method is a 3D visual detection method based on multimodal feature fusion, referred to as Multi3DRefer in the experiment. It accurately locates natural language descriptions to multiple target objects in a 3D scene, extracts point cloud features and language features separately, and uses a multimodal feature fusion module to align language descriptions with point cloud features through a cross-modal attention mechanism. At the same time, it dynamically captures the association between language descriptions and multiple targets in the scene. A scene relationship modeling module is also designed to construct a language-guided global relationship matrix by combining the spatial position and semantic relationship between objects, accurately modeling the contextual information between objects, and realizing the localization and classification of targets.
[0088] In addition, this invention creates two datasets, MM3DRefer and MT3DRefer, which provide a large number of natural language descriptions, corresponding 3D object annotations, and multi-object relationship information, thus addressing the shortcomings of existing datasets for multi-object 3D visual grounding tasks.
[0089] Experiment content:
[0090] According to the specific embodiments of the present invention, the detection evaluation metrics on the test sets of the MM3DRefer and MT3DRefer datasets are calculated and compared with those of the AFM, AFM3DVG-Transformer, ScanRefer, Multi-View Transformer, and Multi3DRefer methods. The results are shown in Table 1 and Table 2, where ↑ indicates that higher is better and ↓ indicates that lower is better.
[0091] Table 1 Evaluation metrics for the MM3DRefer dataset detection
[0092]
[0093]
[0094] Table 2 Evaluation metrics for the MT3DRefer dataset detection
[0095]
[0096] As can be seen from Tables 1 and 2, this invention employs a selective fusion module and a selective interaction module. The synergy of these two modules enables the cross-modal matching architecture to capture the semantic correspondence between text descriptions and object visual features more accurately, maintaining strong baseline performance even in the presence of object detection errors. It therefore achieves significant retrieval results, verifying the advancement of this invention.
[0097] The above description is merely a preferred embodiment of the present invention and does not constitute any limitation on the technical scope of the present invention. Therefore, any minor modifications, equivalent changes, and alterations made to the above embodiments based on the technical essence of the present invention shall still fall within the scope of the technical solution of the present invention.
Claims
1. A semantically driven 3D multi-object detection method for ATS (autonomous transportation system) based on a single image, characterized in that it includes the following steps:
Step 1: Process the input RGB image to extract the 3D bounding boxes of each object in the image; generate all potential 2D projections of the 3D objects in the scene;
Step 2: Using natural language processing techniques, extract keywords, phrases, and their semantic information from the description to form feature information $P_t$ representing the language description;
Step 3: Merge the 2D image information and 3D geometric information of the object to obtain a complete object representation $f_a$;
Step 4: Associate the linguistic features extracted from the language description with the detected 3D objects to capture the semantic correspondence between the text and the visual modality;
Step 5: Filter the targets based on the generated matching scores to obtain all targets that match the natural language description;
Step 3 includes the following steps:
Step 3.1: For each 3D object, use its corresponding 2D bounding box to crop the object region from the original RGB image and obtain the corresponding 2D image information;
Step 3.2: Using the 2D image information, extract the visual features $f_v$ of the 3D object through a pre-trained network; the size of $f_v$ is 768 × the number of patches in the image. The visual features $f_v$ are processed by a multi-head self-attention mechanism to obtain refined image features $f'_v$: attention weights are calculated between different parts of the image features to effectively capture the relationships between different regions within an object;
Step 3.3: For each 3D object, using its corresponding 3D geometric information, construct a text description through a designed fixed template, input it into a pre-trained language model, and encode it as a text embedding vector $f_t$;
Step 3.4: Guided by image features, relevant semantic information is extracted from 3D text features and fused with visual features to form a joint representation of the object's appearance and geometric attributes, thus obtaining the complete object representation $f_a$;
Step 4 includes the following steps:
Step 4.1: Apply a bidirectional attention mechanism to achieve initial fusion between language and object features, allowing them to complement and enhance each other. The language features guide the model to focus on the visual aspects of the object related to the description, while the visual features of the object enrich the semantic information of the language description. The feature information $P_t$ representing the language description and the object representation $f_a$ alternately serve as query, key, and value, as follows:
$\mathrm{O2T} = \mathrm{MHCA}(P_t, f_a, f_a), \quad \mathrm{T2O} = \mathrm{MHCA}(f_a, P_t, P_t)$ (3)
where $\mathrm{O2T} \in \mathbb{R}^{C \times D}$ and $\mathrm{T2O} \in \mathbb{R}^{L \times D}$;
Step 4.2: The fused object and language features O2T and T2O are first concatenated to obtain $x_{input}$. The input features are fed into the module for adaptive fusion, effectively capturing the interactions between cross-modal features. On $x_{input}$, channel mixing is first performed with a 1×1 convolution, and the SiLU activation function is then applied to obtain $x_{mixed}$. The processed feature $x_{mixed}$ is then fed to the forward and backward modules, yielding $S_{forward}$ and $S_{backward}$ respectively. These modules work in parallel, capturing contextual information from different directions in the feature sequence;
Step 5 includes the following: Binary cross-entropy loss (BCELoss) is used to supervise classification and matching performance. BCELoss measures the difference between the predicted probability of the target class and its true label. The cosine similarity between each object feature and the linguistic feature is then calculated, and the similarity score is input into the contrastive loss function to improve the model's ability to distinguish between matching and non-matching objects;
Step 3.4 includes the following: A dual-head attention mechanism is adopted. The refined image features $f'_v$ are treated as the query Q, and the 3D text features $f_t$ provide the key K and the value V. Cross-attention is calculated according to formula (1), where $Q \in \mathbb{R}^{L \times D}$ and $K, V \in \mathbb{R}^{1 \times D}$. The query-key attention map $A_{tt}$ is calculated from the above Q and K and used to weight and aggregate the value V, yielding a visual- and 3D-text-aware query $Q'$ according to formula (2), where D is the length of the feature vector, that is, the dimension of each query, key, or value vector, and $Q' \in \mathbb{R}^{L \times D}$;
Step 4.2 includes the following:
$x_{input} = \mathrm{Concat}(\mathrm{MLP}(\mathrm{T2O}), \mathrm{MLP}(\mathrm{O2T}))$ (4)
$x_{mixed} = \mathrm{SiLU}(\mathrm{Conv}_{1 \times 1}(x_{input}))$ (5)
$S_{forward} = \mathrm{SSM}(x_{mixed})$ (6)
$S_{backward} = \mathrm{Flip}(\mathrm{SSM}(\mathrm{Flip}(x_{mixed})))$ (7)